Bin$

hutch-- · June 24, 2008, 09:15:32 AM

OK,

I have untangled JJs code and put it into the test bed. It passed data by registers so I modified both sinsi's algo and NightWares and get these timings. My own algo has run out of legs so I did not bother.

By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster. TThe technique of creating the table dynamcally in memory could be useful for a WORD version with a 64k table, far too big for initialised data but no big deal in terms of allocated memory. It would make possible WORD sized reads and writes which shold be nearly twice as fast.

Code Select


01001001100101100000001011010010 sinsi_ex
01001001100101100000001011010010 NightWare
01001001100101100000001011010010 hutch_ex2
01001001100101100000001011010010 JJ BIN$

593 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
593 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
454 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$

NightWare timing average = 593
sinsi_ex  timing average = 453
hutch_ex2 timing average = 656
JJ BIN$   timing average = 500
Press any key to continue ...

[attachment deleted by admin]

jj2007 · June 24, 2008, 09:22:06 AM

Quote from: NightWare on June 24, 2008, 01:29:06 AM
stalls reduced :red :

Your version:

Code Select

nwDw2Bin PROC
	mov BYTE PTR [esi+32], 0
	mov ecx, 32	; 28 would produce garbage
Label1:
	mov edx, eax
	and edx, 00000000Fh
	shr eax, 4
	mov edx, DWORD PTR [BinaryTable+edx*4]
	sub ecx, DWORD
	mov DWORD PTR [esi+ecx], edx
	jnz Label1
ret
nwDw2Bin ENDP

My edits:

Code Select

nwDw2BinJJ PROC
	mov BYTE PTR [esi+32], 0
	mov ecx, 28		 ; works (JJ)
Label1:
	mov edx, eax
	and edx, 00000000Fh
	shr eax, 4
	mov edx, DWORD PTR [BinaryTable+edx*4]
	sub ecx, DWORD
	; shr eax, 4 moved up to reduce stalls (NightWare)
	mov DWORD PTR [esi+ecx+4], edx	; JJ: +4 is one byte longer but faster
	; sub ecx, DWORD moved up to reduce stalls (NightWare)
	jnz Label1	; jns Label1
ret
nwDw2BinJJ ENDP

Timings:
53 cycles timing BIN$ 139 bytes 625 LAMPs
61 cycles timing pbin2 147 bytes 740 LAMPs
89 cycles timing nwDw2Bin 101 bytes 894 LAMPs
74 cycles timing nwDw2BinJJ 102 bytes 747 LAMPs
304 cycles timing Dword2Bin2 59 bytes 2335 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

The difference seems in the factor 28/32. Hmmm...

[attachment deleted by admin]

jj2007 · June 24, 2008, 09:37:23 AM

Quote from: hutch-- on June 24, 2008, 09:15:32 AM
By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster.

NightWare timing average = 595
sinsi_ex timing average = 492
hutch_ex2 timing average = 861
JJ BIN$ timing average = 625

My version, now with OPTION PROLOGUE:NONE for BIN$ but not including sinsi's algo with the external bintable; pbin2 is sinsi plus generated table:
47 cycles timing BIN$ 139 bytes 554 LAMPs
59 cycles timing pbin2 147 bytes 715 LAMPs
88 cycles timing nwDw2Bin 101 bytes 884 LAMPs
67 cycles timing nwDw2BinJJ 102 bytes 677 LAMPs
299 cycles timing Dword2Bin2 59 bytes 2297 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

The modified version of NightWare, see previous post, looks also promising (EDIT:), especially since the 101/102 bytes include already the 64 bytes of .data BinTable

hutch-- · June 24, 2008, 09:44:26 AM

JJ,

If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code. With sinsi's code I reordered some of the instructions to increase the instruction count between memory read and writes to prevent or at least slow down the read after write stalls and it got faster for doing so.

For small non-critical code I would use NightWare's original as it is small and only has a 64 byte table but with the speed advantage of sinsi's code, it would have to be the one to use for performance reasons.

While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library. Compliments on your BIN$ macro and the algo it calls, it does perform well.

jj2007 · June 24, 2008, 10:00:14 AM

Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version

That does the trick, it seems:
48 cycles timing BIN$ 139 bytes 566 LAMPs
57 cycles timing pbin2 147 bytes 691 LAMPs
80 cycles timing nwDw2Bin 101 bytes 804 LAMPs
69 cycles timing nwDw2BinJJ 102 bytes 697 LAMPs
49 cycles timing NightWare 234 bytes 750 LAMPs
296 cycles timing Dword2Bin2 56 bytes 2215 LAMPs

[attachment deleted by admin]

jj2007 · June 24, 2008, 10:17:14 AM

Quote from: hutch-- on June 24, 2008, 09:44:26 AM
Compliments on your BIN$ macro and the algo it calls, it does perform well.

Thanxalot, Hutch. Your version of BIN$ still has a little bug (ok in dw2binFinal.zip posted above) :

BIN$ MACRO dwArg:REQ, tgt:=<0>
cc INSTR <dwArg>, <eax>   ;; to avoid a mov eax, eax
if cc
   mov eax, dwArg         ;; eax passes value to translate
endif
if @SizeStr(tgt) gt 1 ;; BUG: "if tgt" won't work correctly
   mov edx, tgt         ;; dest buffer - if 0, use Dw2BinBuffer
else
   mov edx, offset Dw2BinBuffer
endif
call Dword2Bin
EXITM <edx>
ENDM

Quote
While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library.

BIN$ accepts one required argument and returns a pointer to the result. I chose to pass the value in eax, since all Win32 API's return values in that register. As it stands, it looks foolproof...

Code Select

	invoke GetKeyState, VK_CAPITAL
	invoke MessageBox, 0, BIN$(eax), chr$("Caps Lock in Bit 0:"), MB_OK

MichaelW · June 24, 2008, 10:26:00 AM

This is 27 bytes and 141 cycles on a P3.

Code Select


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      buff db 40 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

dw2bin proc dwNumber:DWORD, pszString:DWORD
    mov eax, [esp+8]
    mov ecx, 31
  @@:
    xor edx, edx
    shr DWORD PTR [esp+4], 1
    adc edx, '0'
    mov [eax+ecx], dl
    dec ecx
    jns @B
    ret 8
dw2bin endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke dw2bin, 0, ADDR buff
    print ADDR buff,13,10
    invoke dw2bin, 01010101h, ADDR buff
    print ADDR buff,13,10
    invoke dw2bin, -1, ADDR buff
    print ADDR buff,13,10,13,10

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      invoke dw2bin, 01010101h, ADDR buff
    counter_end
    print ustr$(eax)," cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.

jj2007 · June 24, 2008, 12:54:47 PM

Quote from: MichaelW on June 24, 2008, 10:26:00 AM
For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.

Never satisfied? Your wish is my command...

   invoke MessageBox, 0, BIN$(10011001100110011001100110011001b),
   chr$("BIN$ plain:"), MB_OK

   invoke MessageBox, 0, BIN$(10011001100110011001100110011001b, f),
   chr$("BIN$ formatted:"), MB_OK

Fortunately the timings are not affected, but wow, that cost me almost a hundred LAMPs :wink

47 cycles timing BIN$ 180 bytes 631 LAMPs
57 cycles timing pbin2 147 bytes 691 LAMPs
92 cycles timing nwDw2Bin 101 bytes 925 LAMPs
67 cycles timing nwDw2BinJJ 102 bytes 677 LAMPs
49 cycles timing NightWare 234 bytes 750 LAMPs
297 cycles timing Dword2Bin2 56 bytes 2223 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

[attachment deleted by admin]

hutch-- · June 24, 2008, 02:33:07 PM

Michael,

I plugged it in and tested it and the result is correct but its very slow against the rest. It also effected the timings of the other test algos but was running about 10 time slower than the others. It may just be a very bad stall on my PIV.

NightWare · June 24, 2008, 09:31:39 PM

Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code.

hmm, last shr eax,4 is useless :wink, i've also unrolled the code and made few change :

Code Select

; unrolled version
		mov BYTE PTR [esi+32],0
		mov ecx,32

		mov edx,eax
		and edx,00000000Fh
		shr eax,2
		mov edx,DWORD PTR [BinaryTable+edx*4]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

		mov edx,eax
		and edx,00000003Ch	; 0Fh * 4
		shr eax,4
		mov edx,DWORD PTR [BinaryTable+edx]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

		mov edx,eax
		and edx,00000003Ch
		shr eax,4
		mov edx,DWORD PTR [BinaryTable+edx]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

		mov edx,eax
		and edx,00000003Ch
		shr eax,4
		mov edx,DWORD PTR [BinaryTable+edx]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

		mov edx,eax
		and edx,00000003Ch
		shr eax,4
		mov edx,DWORD PTR [BinaryTable+edx]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

		mov edx,eax
		and edx,00000003Ch
		shr eax,4
		mov edx,DWORD PTR [BinaryTable+edx]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

		mov edx,eax
		and edx,00000003Ch
		shr eax,4
		mov edx,DWORD PTR [BinaryTable+edx]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

		mov edx,eax
		and edx,00000003Ch
		mov edx,DWORD PTR [BinaryTable+edx]
		sub ecx,DWORD
		mov DWORD PTR [esi+ecx],edx

mixing both should give a good result...

EDIT : something like :

Code Select

;		mov ecx,value
;		mov edx,buffer
; no need to pudh/pop esi/edi

		mov BYTE PTR [edx+32],0

		mov eax,ecx
		and eax,00000000Fh
		shr ecx,2
		mov eax,DWORD PTR [BinaryTable+eax*4]
		mov DWORD PTR [edx+28],eax

		mov eax,ecx
		and eax,00000003Ch	; 0Fh * 4
		shr ecx,4
		mov eax,DWORD PTR [BinaryTable+eax]
		mov DWORD PTR [edx+24],eax

		mov eax,ecx
		and eax,00000003Ch
		shr ecx,4
		mov eax,DWORD PTR [BinaryTable+eax]
		mov DWORD PTR [edx+20],eax

		mov eax,ecx
		and eax,00000003Ch
		shr ecx,4
		mov eax,DWORD PTR [BinaryTable+eax]
		mov DWORD PTR [edx+16],eax

		mov eax,ecx
		and eax,00000003Ch
		shr ecx,4
		mov eax,DWORD PTR [BinaryTable+eax]
		mov DWORD PTR [edx+12],eax

		mov eax,ecx
		and eax,00000003Ch
		shr ecx,4
		mov eax,DWORD PTR [BinaryTable+eax]
		mov DWORD PTR [edx+8],eax

		mov eax,ecx
		and eax,00000003Ch
		shr ecx,4
		mov eax,DWORD PTR [BinaryTable+eax]
		mov DWORD PTR [edx+4],eax

		and ecx,00000003Ch
		mov ecx,DWORD PTR [BinaryTable+ecx]
		mov DWORD PTR [edx],ecx

hutch-- · June 27, 2008, 12:17:44 AM

I maved this topic to the LAB so it would not get lost in a hurry.

lingo · June 27, 2008, 01:54:26 AM

The same but faster: :lol

Code Select


        mov 	BYTE PTR [edx+32],0
		
        mov 	eax, 	ecx
        and 	ecx, 	00000000Fh
        mov 	ecx, 	DWORD PTR [BinaryTable+ecx*4]
        shr	eax, 	2
	mov 	DWORD PTR [edx+28],ecx

	mov 	ecx, 	eax
	and 	eax, 	00000003Ch	; 0Fh * 4
	mov 	eax, 	DWORD PTR [BinaryTable+eax]
	shr 	ecx,	4
        mov 	DWORD PTR [edx+24],eax

	mov 	eax, 	ecx
	and 	ecx, 	00000003Ch
	mov 	ecx, 	DWORD PTR [BinaryTable+ecx]
	shr 	eax, 	4
        mov 	DWORD PTR [edx+20],ecx

	mov 	ecx, 	eax
	and 	eax, 	00000003Ch
	mov 	eax, 	DWORD PTR [BinaryTable+eax]
	shr 	ecx, 	4
        mov 	DWORD PTR [edx+16],eax

	mov 	eax,	ecx
	and 	ecx,	00000003Ch
	mov 	ecx,	DWORD PTR [BinaryTable+ecx]
	shr 	eax,	4
        mov 	DWORD PTR [edx+12],ecx

	mov 	ecx,	eax
	and	eax,	00000003Ch
	mov 	eax,	DWORD PTR [BinaryTable+eax]
	shr 	ecx,	4
        mov 	DWORD PTR [edx+8],eax

	mov 	eax,	ecx
	and 	ecx,	00000003Ch
	mov 	ecx,	DWORD PTR [BinaryTable+ecx]
	shr 	eax,	4
        mov 	DWORD PTR [edx+4], ecx

	and 	eax,	00000003Ch
	mov 	ecx,	DWORD PTR [BinaryTable+eax]
	mov 	DWORD PTR [edx],ecx

NightWare · June 27, 2008, 02:33:12 AM

Quote from: lingo on June 27, 2008, 01:54:26 AM
The same but faster: :lol

:U

jj2007 · June 27, 2008, 08:03:22 AM

Quote from: lingo on June 27, 2008, 01:54:26 AM
The same but faster: :lol

Not so easy to get stable timings, but I would vote for the Nightware/Lingo algo:

69 cycles timing BIN$ 180 bytes 926 LAMPs
128 cycles timing pbin2 147 bytes 1552 LAMPs
124 cycles timing nwDw2Bin 101 bytes 1246 LAMPs
103 cycles timing nwDw2BinJJ 102 bytes 1040 LAMPs
84 cycles timing NightWare 204 bytes 1200 LAMPs
89 cycles timing BinLingo 204 bytes 1271 LAMPs

48 cycles timing BIN$ 180 bytes 644 LAMPs
87 cycles timing pbin2 147 bytes 1055 LAMPs
115 cycles timing nwDw2Bin 101 bytes 1156 LAMPs
71 cycles timing nwDw2BinJJ 102 bytes 717 LAMPs
42 cycles timing NightWare 204 bytes 600 LAMPs
75 cycles timing BinLingo 204 bytes 1071 LAMPs

48 cycles timing BIN$ 180 bytes 644 LAMPs
57 cycles timing pbin2 147 bytes 691 LAMPs
121 cycles timing nwDw2Bin 101 bytes 1216 LAMPs
72 cycles timing nwDw2BinJJ 102 bytes 727 LAMPs
42 cycles timing NightWare 204 bytes 600 LAMPs
42 cycles timing BinLingo 204 bytes 600 LAMPs

53 cycles timing BIN$ 180 bytes 711 LAMPs
59 cycles timing pbin2 147 bytes 715 LAMPs
88 cycles timing nwDw2Bin 101 bytes 884 LAMPs
83 cycles timing nwDw2BinJJ 102 bytes 838 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
43 cycles timing BinLingo 204 bytes 614 LAMPs

Just for fun, here is one with the original library function:
48 cycles timing BIN$ 180 bytes 644 LAMPs
58 cycles timing pbin2 147 bytes 703 LAMPs
55 cycles timing NightWare 204 bytes 786 LAMPs
42 cycles timing BinLingo 204 bytes 600 LAMPs
82 cycles timing dw2bin_ex 2140 bytes 3793 LAMPs

LAMPs = Lean And Mean Points = cycles * sqrt(size)

Full source attached, with some switches.

[attachment deleted by admin]

drizz · June 27, 2008, 03:04:18 PM

here's my attempt :8):

Code Select


;; for 32 bytes
bintab equ BinaryTable
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
DwToBin proc dwValue:dword,pBuffer:dword
	mov eax,[esp+4];dwValue
	mov edx,[esp+8];pBuffer
	push edi
	push esi
	push ebx
	mov ebx,1111b
	mov ecx,11110000b
	mov edi,111100000000b
	mov esi,1111000000000000b
	and ecx,eax
	and edi,eax
	shr ecx,4
	and esi,eax
	shr edi,8
	and ebx,eax
	shr esi,12
	mov ebx,[ebx*4+bintab]
	mov ecx,[ecx*4+bintab]
	mov edi,[edi*4+bintab]
	mov esi,[esi*4+bintab]
	shr eax,16
	mov [edx+7*4],ebx
	mov [edx+6*4],ecx
	mov [edx+5*4],edi
	mov [edx+4*4],esi
	mov ebx,1111b
	mov ecx,11110000b
	mov edi,111100000000b
	mov esi,1111000000000000b
	and ecx,eax
	and edi,eax
	shr ecx,4
	and esi,eax
	shr edi,8
	and ebx,eax
	shr esi,12
	mov ebx,[ebx*4+bintab]
	mov ecx,[ecx*4+bintab]
	mov edi,[edi*4+bintab]
	mov esi,[esi*4+bintab]
	mov [edx+3*4],ebx
	mov [edx+2*4],ecx
	mov [edx+1*4],edi
	mov [edx+0*4],esi
	mov byte ptr [edx+32],0
	pop ebx
	pop esi
	pop edi
	ret 2*4
DwToBin endp
OPTION PROLOGUE:PROLOGUEDEF 
OPTION EPILOGUE:EPILOGUEDEF

and here's my regular function

Code Select

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
;returns size in eax
DwToBin proc dwValue:dword,pBuffer:dword;buff max 33bytes
	push edi
	mov ecx,31
	mov edx,[esp+4][4];dwValue
	mov edi,[esp+8][4];pBuffer
	bsr eax,edx
	jz @2
	sub ecx,eax
	shl edx,cl
	mov ecx,eax
@1:	add edx,edx
	inc edi
	mov al,'0' shr 1
	adc al,al
	dec ecx
	mov [edi-1],al
	jns @1
	mov [edi],dl
	mov eax,edi
	pop edi
	sub eax,[esp+8];pBuffer
	ret 2*4
@2:	mov word ptr [edi],'0'
	mov eax,1
	pop edi
	ret 2*4
DwToBin endp

OPTION PROLOGUE:PROLOGUEDEF 
OPTION EPILOGUE:EPILOGUEDEF

News:

Bin$