News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Bin$

Started by jj2007, June 20, 2008, 01:39:06 AM

Previous topic - Next topic

hutch--

OK,

I have untangled JJs code and put it into the test bed. It passed data by registers so I modified both sinsi's algo and NightWares and get these timings. My own algo has run out of legs so I did not bother.

By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster. TThe technique of creating the table dynamcally in memory could be useful for a WORD version with a 64k table, far too big for initialised data but no big deal in terms of allocated memory. It would make possible WORD sized reads and writes which shold be nearly twice as fast.


01001001100101100000001011010010 sinsi_ex
01001001100101100000001011010010 NightWare
01001001100101100000001011010010 hutch_ex2
01001001100101100000001011010010 JJ BIN$

593 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
593 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
454 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$

NightWare timing average = 593
sinsi_ex  timing average = 453
hutch_ex2 timing average = 656
JJ BIN$   timing average = 500
Press any key to continue ...




[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: NightWare on June 24, 2008, 01:29:06 AM
stalls reduced  :red :

Your version:
nwDw2Bin PROC
mov BYTE PTR [esi+32], 0
mov ecx, 32 ; 28 would produce garbage
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
mov DWORD PTR [esi+ecx], edx
jnz Label1
ret
nwDw2Bin ENDP


My edits:
nwDw2BinJJ PROC
mov BYTE PTR [esi+32], 0
mov ecx, 28 ; works (JJ)
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
; shr eax, 4 moved up to reduce stalls (NightWare)
mov DWORD PTR [esi+ecx+4], edx ; JJ: +4 is one byte longer but faster
; sub ecx, DWORD moved up to reduce stalls (NightWare)
jnz Label1 ; jns Label1
ret
nwDw2BinJJ ENDP


Timings:
53 cycles timing BIN$            139 bytes      625 LAMPs
61 cycles timing pbin2           147 bytes      740 LAMPs
89 cycles timing nwDw2Bin        101 bytes      894 LAMPs
74 cycles timing nwDw2BinJJ      102 bytes      747 LAMPs
304 cycles timing Dword2Bin2     59 bytes       2335 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

The difference seems in the factor 28/32. Hmmm...

[attachment deleted by admin]

jj2007

Quote from: hutch-- on June 24, 2008, 09:15:32 AM
By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster.

NightWare timing average = 595
sinsi_ex  timing average = 492
hutch_ex2 timing average = 861
JJ BIN$   timing average = 625

My version, now with OPTION PROLOGUE:NONE for BIN$ but not including sinsi's algo with the external bintable; pbin2 is sinsi plus generated table:
47 cycles timing BIN$            139 bytes      554 LAMPs
59 cycles timing pbin2           147 bytes      715 LAMPs
88 cycles timing nwDw2Bin        101 bytes      884 LAMPs
67 cycles timing nwDw2BinJJ      102 bytes      677 LAMPs
299 cycles timing Dword2Bin2     59 bytes       2297 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

The modified version of NightWare, see previous post, looks also promising (EDIT:), especially since the 101/102 bytes include already the 64 bytes of .data BinTable

hutch--

JJ,

If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code. With sinsi's code I reordered some of the instructions to increase the instruction count between memory read and writes to prevent or at least slow down the read after write stalls and it got faster for doing so.

For small non-critical code I would use NightWare's original as it is small and only has a 64 byte table but with the speed advantage of sinsi's code, it would have to be the one to use for performance reasons.

While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library. Compliments on your BIN$ macro and the algo it calls, it does perform well.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version

That does the trick, it seems:
48 cycles timing BIN$            139 bytes      566 LAMPs
57 cycles timing pbin2           147 bytes      691 LAMPs
80 cycles timing nwDw2Bin        101 bytes      804 LAMPs
69 cycles timing nwDw2BinJJ      102 bytes      697 LAMPs
49 cycles timing NightWare       234 bytes      750 LAMPs
296 cycles timing Dword2Bin2     56 bytes       2215 LAMPs



[attachment deleted by admin]

jj2007

Quote from: hutch-- on June 24, 2008, 09:44:26 AM
Compliments on your BIN$ macro and the algo it calls, it does perform well.

Thanxalot, Hutch. Your version of BIN$ still has a little bug (ok in dw2binFinal.zip posted above) :

BIN$ MACRO dwArg:REQ, tgt:=<0>
  cc INSTR <dwArg>, <eax>   ;; to avoid a mov eax, eax
  if cc
   mov eax, dwArg         ;; eax passes value to translate
  endif
  if @SizeStr(tgt) gt 1                  ;; BUG: "if tgt" won't work correctly
   mov edx, tgt         ;; dest buffer - if 0, use Dw2BinBuffer
  else
   mov edx, offset Dw2BinBuffer
  endif
  call Dword2Bin
  EXITM <edx>
ENDM

Quote
While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library.

BIN$ accepts one required argument and returns a pointer to the result.  I chose to pass the value in eax, since all Win32 API's return values in that register. As it stands, it looks foolproof...

invoke GetKeyState, VK_CAPITAL
invoke MessageBox, 0, BIN$(eax), chr$("Caps Lock in Bit 0:"), MB_OK

MichaelW

This is 27 bytes and 141 cycles on a P3.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff db 40 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

dw2bin proc dwNumber:DWORD, pszString:DWORD
    mov eax, [esp+8]
    mov ecx, 31
  @@:
    xor edx, edx
    shr DWORD PTR [esp+4], 1
    adc edx, '0'
    mov [eax+ecx], dl
    dec ecx
    jns @B
    ret 8
dw2bin endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke dw2bin, 0, ADDR buff
    print ADDR buff,13,10
    invoke dw2bin, 01010101h, ADDR buff
    print ADDR buff,13,10
    invoke dw2bin, -1, ADDR buff
    print ADDR buff,13,10,13,10

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke dw2bin, 01010101h, ADDR buff
    counter_end
    print ustr$(eax)," cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.
eschew obfuscation

jj2007

Quote from: MichaelW on June 24, 2008, 10:26:00 AM
For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.

Never satisfied? Your wish is my command...

   invoke MessageBox, 0, BIN$(10011001100110011001100110011001b),
   chr$("BIN$ plain:"), MB_OK

   invoke MessageBox, 0, BIN$(10011001100110011001100110011001b, f),
   chr$("BIN$ formatted:"), MB_OK

Fortunately the timings are not affected, but wow, that cost me almost a hundred LAMPs :wink

47 cycles timing BIN$            180 bytes      631 LAMPs
57 cycles timing pbin2           147 bytes      691 LAMPs
92 cycles timing nwDw2Bin        101 bytes      925 LAMPs
67 cycles timing nwDw2BinJJ      102 bytes      677 LAMPs
49 cycles timing NightWare       234 bytes      750 LAMPs
297 cycles timing Dword2Bin2     56 bytes       2223 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

[attachment deleted by admin]

hutch--

Michael,

I plugged it in and tested it and the result is correct but its very slow against the rest. It also effected the timings of the other test algos but was running about 10 time slower than the others. It may just be a very bad stall on my PIV.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

NightWare

#39
Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code.
hmm, last shr eax,4 is useless  :wink, i've also unrolled the code and made few change :
; unrolled version
mov BYTE PTR [esi+32],0
mov ecx,32

mov edx,eax
and edx,00000000Fh
shr eax,2
mov edx,DWORD PTR [BinaryTable+edx*4]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch ; 0Fh * 4
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx

mov edx,eax
and edx,00000003Ch
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mixing both should give a good result...

EDIT : something like :; mov ecx,value
; mov edx,buffer
; no need to pudh/pop esi/edi

mov BYTE PTR [edx+32],0

mov eax,ecx
and eax,00000000Fh
shr ecx,2
mov eax,DWORD PTR [BinaryTable+eax*4]
mov DWORD PTR [edx+28],eax

mov eax,ecx
and eax,00000003Ch ; 0Fh * 4
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+24],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+20],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+16],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+12],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+8],eax

mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+4],eax

and ecx,00000003Ch
mov ecx,DWORD PTR [BinaryTable+ecx]
mov DWORD PTR [edx],ecx

hutch--

I maved this topic to the LAB so it would not get lost in a hurry.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

The same but faster: :lol
        mov BYTE PTR [edx+32],0

        mov eax, ecx
        and ecx, 00000000Fh
        mov ecx, DWORD PTR [BinaryTable+ecx*4]
        shr eax, 2
mov DWORD PTR [edx+28],ecx

mov ecx, eax
and eax, 00000003Ch ; 0Fh * 4
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
        mov DWORD PTR [edx+24],eax

mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
        mov DWORD PTR [edx+20],ecx

mov ecx, eax
and eax, 00000003Ch
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
        mov DWORD PTR [edx+16],eax

mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
        mov DWORD PTR [edx+12],ecx

mov ecx, eax
and eax, 00000003Ch
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
        mov DWORD PTR [edx+8],eax

mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
        mov DWORD PTR [edx+4], ecx

and eax, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx],ecx



NightWare


jj2007

Quote from: lingo on June 27, 2008, 01:54:26 AM
The same but faster: :lol

Not so easy to get stable timings, but I would vote for the Nightware/Lingo algo:

69 cycles timing BIN$            180 bytes      926 LAMPs
128 cycles timing pbin2          147 bytes      1552 LAMPs
124 cycles timing nwDw2Bin       101 bytes      1246 LAMPs
103 cycles timing nwDw2BinJJ     102 bytes      1040 LAMPs
84 cycles timing NightWare       204 bytes      1200 LAMPs
89 cycles timing BinLingo        204 bytes      1271 LAMPs

48 cycles timing BIN$            180 bytes      644 LAMPs
87 cycles timing pbin2           147 bytes      1055 LAMPs
115 cycles timing nwDw2Bin       101 bytes      1156 LAMPs
71 cycles timing nwDw2BinJJ      102 bytes      717 LAMPs
42 cycles timing NightWare       204 bytes      600 LAMPs
75 cycles timing BinLingo        204 bytes      1071 LAMPs

48 cycles timing BIN$            180 bytes      644 LAMPs
57 cycles timing pbin2           147 bytes      691 LAMPs
121 cycles timing nwDw2Bin       101 bytes      1216 LAMPs
72 cycles timing nwDw2BinJJ      102 bytes      727 LAMPs
42 cycles timing NightWare       204 bytes      600 LAMPs
42 cycles timing BinLingo        204 bytes      600 LAMPs

53 cycles timing BIN$            180 bytes      711 LAMPs
59 cycles timing pbin2           147 bytes      715 LAMPs
88 cycles timing nwDw2Bin        101 bytes      884 LAMPs
83 cycles timing nwDw2BinJJ      102 bytes      838 LAMPs
53 cycles timing NightWare       204 bytes      757 LAMPs
43 cycles timing BinLingo        204 bytes      614 LAMPs

Just for fun, here is one with the original library function:
48 cycles timing BIN$            180 bytes      644 LAMPs
58 cycles timing pbin2           147 bytes      703 LAMPs
55 cycles timing NightWare       204 bytes      786 LAMPs
42 cycles timing BinLingo        204 bytes      600 LAMPs
82 cycles timing dw2bin_ex       2140 bytes     3793 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

Full source attached, with some switches.

[attachment deleted by admin]

drizz

here's my attempt  :8):

;; for 32 bytes
bintab equ BinaryTable
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
DwToBin proc dwValue:dword,pBuffer:dword
mov eax,[esp+4];dwValue
mov edx,[esp+8];pBuffer
push edi
push esi
push ebx
mov ebx,1111b
mov ecx,11110000b
mov edi,111100000000b
mov esi,1111000000000000b
and ecx,eax
and edi,eax
shr ecx,4
and esi,eax
shr edi,8
and ebx,eax
shr esi,12
mov ebx,[ebx*4+bintab]
mov ecx,[ecx*4+bintab]
mov edi,[edi*4+bintab]
mov esi,[esi*4+bintab]
shr eax,16
mov [edx+7*4],ebx
mov [edx+6*4],ecx
mov [edx+5*4],edi
mov [edx+4*4],esi
mov ebx,1111b
mov ecx,11110000b
mov edi,111100000000b
mov esi,1111000000000000b
and ecx,eax
and edi,eax
shr ecx,4
and esi,eax
shr edi,8
and ebx,eax
shr esi,12
mov ebx,[ebx*4+bintab]
mov ecx,[ecx*4+bintab]
mov edi,[edi*4+bintab]
mov esi,[esi*4+bintab]
mov [edx+3*4],ebx
mov [edx+2*4],ecx
mov [edx+1*4],edi
mov [edx+0*4],esi
mov byte ptr [edx+32],0
pop ebx
pop esi
pop edi
ret 2*4
DwToBin endp
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF

and here's my regular function
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
;returns size in eax
DwToBin proc dwValue:dword,pBuffer:dword;buff max 33bytes
push edi
mov ecx,31
mov edx,[esp+4][4];dwValue
mov edi,[esp+8][4];pBuffer
bsr eax,edx
jz @2
sub ecx,eax
shl edx,cl
mov ecx,eax
@1: add edx,edx
inc edi
mov al,'0' shr 1
adc al,al
dec ecx
mov [edi-1],al
jns @1
mov [edi],dl
mov eax,edi
pop edi
sub eax,[esp+8];pBuffer
ret 2*4
@2: mov word ptr [edi],'0'
mov eax,1
pop edi
ret 2*4
DwToBin endp

OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
The truth cannot be learned ... it can only be recognized.