I hope I am not reinventing the wheel, but I didn't find a Bin$ macro in the library. Str$ and Hex$ are there but Bin$ is not, so I rolled my own. Here it is, for comments and fine-tuning.
include \masm32\include\masm32rt.inc
Bin$ MACRO dwArg:REQ
cc INSTR <dwArg>, <eax> ;; to avoid a mov eax, eax
if cc eq 0
mov eax, dwArg
endif
call Dword2Bin
EXITM <edx>
ENDM
.code
MyApp1 db "Dword2Bin: press Control to see GetKeyState (VK_CONTROL)", 0
MyApp2 db "Dword2Bin: press CAPS to see GetKeyState (VK_CAPITAL)", 0
MyApp3 db "Dword2Bin: binary representation of GetKeyState (VK_CAPITAL)", 0
start: ; direct call: 0FC0FC0AAh = 11111100000011111100000010101010b = -66076502 decimal
invoke MessageBox, 0, Bin$(0FC0FC0AAh), addr MyApp1, MB_OK
; indirect call, using a result in eax:
invoke GetKeyState, VK_CONTROL
invoke MessageBox, 0, Bin$(eax), addr MyApp2, MB_OK
invoke GetKeyState, VK_CAPITAL
invoke MessageBox, 0, Bin$(eax), addr MyApp3, MB_OK
exit
Dword2Bin proc uses ebx
ifndef Dw2BinBuffer
.data?
Dw2BinBuffer db 36 dup (?) ; 32 digits + null terminator = 33 bytes, padded to 36 for alignment
.code
endif
mov edx, offset Dw2BinBuffer+32
xor ecx, ecx
mov [edx], cl ; null terminator
add ecx, 31
@@: xor ebx, ebx
dec edx ; on exit, edx will point to the string
sar eax, 1 ; we work from right to left
adc ebx, 48 ; +48: Ascii 0 or, with carry, +49 : Ascii 1
mov [edx], bl
dec ecx ; decrement counter
jge @B ; loop until ecx goes negative (dec updates SF/ZF/OF but leaves CF alone)
ret
Dword2Bin endp
end start
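For anyone who wants to check the expected strings without an assembler at hand, here is a rough Python model of the same right-to-left loop. It is only a sketch: the name dword2bin_model is made up for illustration, and Python's shift and mask stand in for the sar/adc pair.

```python
def dword2bin_model(value):
    """Model of the Dword2Bin loop: peel off bit 0, store '0' or '1', move left."""
    value &= 0xFFFFFFFF          # treat the argument as an unsigned 32-bit dword
    buf = bytearray(33)          # 32 digits plus the null terminator
    pos = 32                     # like edx = offset Dw2BinBuffer+32
    buf[pos] = 0                 # null terminator
    for _ in range(32):          # ecx counts 31 down to 0 in the original
        pos -= 1                 # dec edx
        carry = value & 1        # sar eax, 1 moves bit 0 into the carry flag
        value >>= 1
        buf[pos] = 48 + carry    # adc ebx, 48 gives ASCII '0' or '1'
    return buf[:32].decode("ascii")

print(dword2bin_model(0x0FC0FC0AA))   # the value used in the direct call above
```

The model ignores registers and flags entirely; it is only meant for comparing results.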
jj2007,
Your code works fine, it correctly handled everything I threw at it.
The procedure in the masm32 library is dw2bin_ex.
Here is a macro I wrote that uses it:
BinString MACRO n:REQ
IFNDEF szBinStringBuffer
.DATA
szBinStringBuffer BYTE 40 DUP(0)
.CODE
ENDIF
INVOKE dw2bin_ex, n, ADDR szBinStringBuffer
mov eax, OFFSET szBinStringBuffer
EXITM <eax>
ENDM
Thanks for pointing me to dw2bin_ex, Greg.
I had a look at dw2bin_ex now and must admit it's very fast - about 140 cycles as compared to 460 for my own. So I tried to optimise my code, and Dword2Bin2 does it in 335 cycles - see attachment, benchmarks welcome.
You may ask "why roll your own if the lib routine is three times as fast?". Well, the only justification is that dw2bin_ex performs badly on the LeanAndMeanPoints test, where LMP=cycles * sqrt(size):
139 cycles timing dw2bin_ex size 2140 bytes 139*sqrt(2140)=6384
462 cycles timing Dword2Bin size 31 bytes 462*sqrt(31)=2572
335 cycles timing Dword2Bin2 size 57 bytes 335*sqrt(57)=2529
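The score is easy to recompute by hand. A minimal Python sketch, assuming the result is rounded to the nearest integer (the helper name lamps is made up here):

```python
import math

def lamps(cycles, size):
    """Lean-and-mean score as defined above: cycles * sqrt(size), rounded."""
    return round(cycles * math.sqrt(size))

# Figures taken from the list above.
for name, cycles, size in [("Dword2Bin", 462, 31), ("Dword2Bin2", 335, 57)]:
    print(name, lamps(cycles, size))
```

Because size enters only as a square root, a routine four times as large needs to be merely twice as fast to break even.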
Here is the winning code, for fine-tuning and suggestions. What struck me is that:
- apparently it is a little bit faster when I activate the SizeTest switch
- it is a lot faster with the extra NOP !
Any explanation?
Dword2Bin2 proc uses ebx
if SizeTest
mov ecx, offset gL2
sub ecx, offset pStart
add ecx, 4
mov cs2, ecx
endif
pStart:
mov edx, offset Dw2BinBuffer
xor ecx, ecx
mov [edx+32], cl ; null terminator
add ecx, 7 ; 31
; 4* setc bl plus add ebx, 30303030h is a lot slower
@@: sub ebx, ebx ; 2 cycle
; ## this nop looks superfluous but leads to a decrease of up to 78 cycles ...! #################
nop
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1
sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1
sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1
sal ebx, 8 ; 2 cycles - we write from right to left
sar eax, 1 ; 2 cycles - we read from left to right
adc ebx, 48 ; 1 +48: Ascii 0 or, with carry, +49 : Ascii 1
mov [edx+4*ecx], ebx
dec ecx ; 1 decrement counter
; sub ecx, 1 ; 1 but in practice 9 cycles slower
jge @B ; loop until ecx goes negative (dec updates SF/ZF/OF but leaves CF alone)
ret
Dword2Bin2 endp
gL2: ; global label for getting code size
jj,
Give this version a blast, I removed the stack frame and tweaked a couple of mnemonics and it's about 12% faster on my old PIV. The algo is actually designed for streaming and its size is secondary to its intended task. The real limit on the speed of this algo is the number of memory accesses. It is probably possible to make a faster version but it would be an interesting table design and may be a lot larger again.
Timings
00000000000000000000010011010010 dw2bin_ex2
00000000000000000000010011010010 original
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
766 dw2bin_ex
671 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
734 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
672 dw2bin_ex2
750 dw2bin_ex
dw2bin_ex2 timing average = 671
dw2bin_ex timing average = 750
Press any key to continue ...
Source
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
comment * -----------------------------------------------------
Build this template with
"CONSOLE ASSEMBLE AND LINK"
----------------------------------------------------- *
dw2bin_ex2 PROTO :DWORD,:DWORD
EXTERNDEF bintable:DWORD
.code
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
call main
inkey
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
lcnt equ <50000000>
main proc
LOCAL buf1[64]:BYTE
LOCAL buf2[64]:BYTE
LOCAL ptr1 :DWORD
LOCAL ptr2 :DWORD
LOCAL tot1 :DWORD
LOCAL tot2 :DWORD
mov ptr1, ptr$(buf1)
mov ptr2, ptr$(buf2)
mov tot1, 0
mov tot2, 0
invoke dw2bin_ex2,1234,ptr1
print ptr1," dw2bin_ex2",13,10
invoke dw2bin_ex,1234,ptr2
print ptr2," original",13,10,13,10
push esi
REPEAT 8
; =====================================
invoke GetTickCount
push eax
mov esi, lcnt
@@:
invoke dw2bin_ex2,1234,ptr1
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tot1, eax
print str$(eax)," dw2bin_ex2",13,10
; =====================================
invoke GetTickCount
push eax
mov esi, lcnt
@@:
invoke dw2bin_ex,1234,ptr1
sub esi, 1
jnz @B
invoke GetTickCount
pop ecx
sub eax, ecx
add tot2, eax
print str$(eax)," dw2bin_ex",13,10
; =====================================
ENDM
shr tot1, 3
shr tot2, 3
print chr$(13,10),"dw2bin_ex2 timing average = "
print str$(tot1),13,10
print "dw2bin_ex timing average = "
print str$(tot2),13,10
pop esi
ret
main endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
dw2bin_ex2 proc var:DWORD,buffer:DWORD
push ebx
mov ebx, OFFSET bintable
mov edx, [esp+12]
movzx eax, BYTE PTR [esp+8][3]
mov ecx, [ebx+eax*8]
mov [edx], ecx
mov ecx, [ebx+eax*8+4]
mov [edx+4], ecx
movzx eax, BYTE PTR [esp+8][2]
mov ecx, [ebx+eax*8]
mov [edx+8], ecx
mov ecx, [ebx+eax*8+4]
mov [edx+12], ecx
movzx eax, BYTE PTR [esp+8][1]
mov ecx, [ebx+eax*8]
mov [edx+16], ecx
mov ecx, [ebx+eax*8+4]
mov [edx+20], ecx
movzx eax, BYTE PTR [esp+8]
mov ecx, [ebx+eax*8]
mov [edx+24], ecx
mov ecx, [ebx+eax*8+4]
mov [edx+28], ecx
mov BYTE PTR [edx+32], 0
pop ebx
ret 8
dw2bin_ex2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
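The table-driven idea in dw2bin_ex2 can be modelled in a few lines of Python. This is a sketch under an assumption: the layout of bintable is inferred from the [ebx+eax*8] addressing above (256 entries of 8 ASCII digits), not taken from the library source, and the names BINTABLE and dw2bin_table_model are made up.

```python
# Assumed layout: 256 entries of 8 ASCII digits each, 2048 bytes in total.
BINTABLE = b"".join(format(n, "08b").encode("ascii") for n in range(256))

def dw2bin_table_model(value):
    """Copy one 8-byte table entry per input byte, most significant byte first."""
    value &= 0xFFFFFFFF
    out = bytearray()
    for shift in (24, 16, 8, 0):
        index = (value >> shift) & 0xFF              # like movzx eax, BYTE PTR [esp+8][n]
        out += BINTABLE[index * 8:index * 8 + 8]     # two dword copies in the asm version
    return out.decode("ascii")

print(dw2bin_table_model(1234))
```

Four lookups and eight dword stores replace 32 shift-and-carry steps, which is where the speed comes from and also where the 2 KB of data goes.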
Quote from: hutch-- on June 22, 2008, 12:58:54 AM
Give this version a blast, I removed the stack frame and tweaked a couple of mnemonics and it's about 12% faster on my old PIV. The algo is actually designed for streaming and its size is secondary to its intended task.
That explains everything :bg
Was there ever a more compact dw2bin_no_ex version? As said above, dw2bin_ex performs poorly on the LeanAndMeanPoints test, where LAMPs = cycles * sqrt(size):
139 cycles timing dw2bin_ex size 2140 bytes 139*sqrt(2140)=6384 LAMPs, poor
462 cycles timing Dword2Bin size 31 bytes 462*sqrt(31)=2572 LAMPs, almost good
335 cycles timing Dword2Bin2 size 57 bytes 335*sqrt(57)=2529 LAMPs, good
Imho there should be a compact medium fast dw2bin_lean_and_mean version. The ex is great for streaming, yes, but I checked my 0.6 MB Dashboard source for BIN$ and found a mere 7 entries, all of them in non-critical loops. Many small rarely used functions should be optimised for size, while frequently used algos should focus on speed. An economist's 2 cents worth :wink
I got no reply re the pretty significant speed oddities mentioned above - stalls??
:bg
JJ,
I use a different criterion, there is the FAST test, then the FASTER test and then there is the EVEN FASTER test in most algo design.
It does make sense to have a smaller version for general purpose work so it would be worth the effort to make a super reliable one that is fast enough in most instances.
JJ,
Just a quick look at your last posted algo and the comment about some stalls.
The SETx byte will be slow due to partial read of a 32 bit register.
Your other slow instructions are SAL SAR and ADC.
See if you can break it down into lower level primitives even if it's a bit longer in byte count.
A little bit faster -
00000000000000000000010011010010 dw2bin_ex2
00000000000000000000010011010010 pbin
687 dw2bin_ex2
407 pbin
687 dw2bin_ex2
391 pbin
687 dw2bin_ex2
406 pbin
688 dw2bin_ex2
406 pbin
688 dw2bin_ex2
390 pbin
688 dw2bin_ex2
406 pbin
688 dw2bin_ex2
390 pbin
688 dw2bin_ex2
406 pbin
dw2bin_ex2 timing average = 687
pbin timing average = 400
align 16
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
pbin PROC PUBLIC num:DWORD,buf:DWORD
push ebx
push esi
push edi
mov ecx,16[esp]
mov edx,20[esp]
mov edi,offset bintable
movzx eax,cl
mov ebx,[edi+eax*8]
mov esi,[edi+eax*8+4]
mov 24[edx],ebx
mov 28[edx],esi
movzx eax,ch
mov ebx,[edi+eax*8]
mov esi,[edi+eax*8+4]
shr ecx,16
mov 16[edx],ebx
mov 20[edx],esi
movzx eax,cl
mov ebx,[edi+eax*8]
mov esi,[edi+eax*8+4]
mov 8[edx],ebx
mov 12[edx],esi
movzx eax,ch
mov ebx,[edi+eax*8]
mov esi,[edi+eax*8+4]
sub eax,eax
mov [edx],ebx
mov 4[edx],esi
mov 32[edx],al
pop edi
pop esi
pop ebx
ret 8
pbin ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
sinsi,
:U
Quote from: hutch-- on June 23, 2008, 02:59:38 AM
I use a different criterion, there is the FAST test, then the FASTER test and then there is the EVEN FASTER test in most algo design.
It does make sense to have a smaller version for general purpose work so it would be worth the effort to make a super reliable one that is fast enough in most instances.
So you put more emphasis on the mean but we agree basically that lean and mean is good :bg
Quote from: hutch-- on June 23, 2008, 06:30:08 AM
sinsi,
:U
60 cycles timing pbin 2140 bytes 2776 LAMPs
82 cycles timing dw2bin_ex 2140 bytes 3793 LAMPs
299 cycles timing Dword2Bin2 57 bytes 2257 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
The choice gets harder with Sinsi's LAMPs approaching mine...
I wrote a little macro for measuring LAMPs:
MinLampSize = 200 ; "default size" in case of external APIs, e.g. crt_strstr
LAMP$ MACRO cycles:REQ, csize:REQ ; Usage: print LAMP$(cycles, size), " LAMPs", 13, 10
ffree ST(7) ;; free the stack for pushing
push csize
fild dword ptr [esp] ;; move into ST (0)
pop eax
fsqrt
mov eax, cycles
.if eax==0
add eax, MinLampSize
.endif
push eax
fild dword ptr [esp] ;; push cycles on FPU
fmul ST, ST(1) ;; multiply with sqrt(size)
fistp dword ptr [esp]
pop eax
invoke dwtoa, eax, addr LampsBuffer
EXITM <offset LampsBuffer>
ENDM
.data?
LampsBuffer dd 4 dup (?)
Sinsi, the result for your code is faked because I couldn't find this bit:
mov edi, offset bintable ; where defined??
Attached code is complete except for this line. There is a UsePbin = 0 switch on top, if you manage to activate it, that would be great.
JJ,
It's a separate module in the masm32 library. One VERY LARGE table. :bg
Quote from: jj2007 on June 23, 2008, 09:06:16 AM
Sinsi, the result for your code is faked because I couldn't find this bit:
mov edi, offset bintable ; where defined??
Sorry, I used hutch's code which includes masm32rt.inc. The file is \masm32\m32lib\bintbl.asm
Is there any way of cutting the size of the table down? Maybe somehow using the high bit to cut it down to 128? I can't think now, too pissed... :bg
[edit]
Tried your code with UsePbin=1 and got
33 cycles timing pbin 0 bytes 0 LAMPs
53 cycles timing dw2bin_ex 2140 bytes 2452 LAMPs
164 cycles timing Dword2Bin 31 bytes 913 LAMPs
130 cycles timing Dword2Bin2 57 bytes 981 LAMPs
[/edit]
Quote from: sinsi on June 23, 2008, 09:33:34 AM
Sorry, I used hutch's code which includes masm32rt.inc. The file is \masm32\m32lib\bintbl.asm
include \masm32\include\masm32rt.inc
Line 1 of my code ;-)
My version of masm32rt.inc has 3217 bytes, dated 25.05.2005. No bintbl in there, and ml chokes when I tried to add the include manually. So I copied the whole table into my source and ran it again:
59 cycles timing pbin 2145 bytes 2733 LAMPs
79 cycles timing dw2bin_ex 2140 bytes 3655 LAMPs
648 cycles timing Dword2Bin 31 bytes 3608 LAMPs
304 cycles timing Dword2Bin2 57 bytes 2295 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
SizeTest is on (and costs some cycles)
55 cycles timing pbin 2145 bytes 2547 LAMPs :clap:
79 cycles timing dw2bin_ex 2140 bytes 3655 LAMPs
644 cycles timing Dword2Bin 31 bytes 3586 LAMPs
299 cycles timing Dword2Bin2 57 bytes 2257 LAMPs :cheekygreen:
LAMPs = Lean And Mean Points = cycles * sqrt(size)
SizeTest is off (means I added the previously calculated sizes by hand)
Seriously: Masm is attractive because it produces fast and compact code. Especially newbies are impressed when they have to start counting in bytes rather than kBytes or MBytes again. So a "LAMP" style optimisation rule makes sense when designing a big library. Those who need a fast routine for streaming should find it, those who want to boast with compact and still fast code should be served, too.
Where on earth does "LAMPs = Lean And Mean Points = cycles * sqrt(size)" come from?
Quote from: sinsi on June 23, 2008, 10:14:39 AM
Where on earth does "LAMPs = Lean And Mean Points = cycles * sqrt(size)" come from?
Northern Italy, on the shores of Lago Maggiore.
After some reflection, we have a new winner now:
56 cycles timing pbin 149 bytes 684 LAMPs :cheekygreen:
80 cycles timing dw2bin_ex 2140 bytes 3701 LAMPs :naughty:
303 cycles timing Dword2Bin2 57 bytes 2288 LAMPs :clap:
JJ,
sinsi's code still uses the bintable so it is larger than you have listed. The table adds 2k of data.
Quote from: hutch-- on June 23, 2008, 02:16:50 PM
JJ,
sinsi's code still uses the bintable so it larger than you have listed. The table adds 2k of data.
No, it doesn't.
Sinsi, does the sub eax, eax have a function? I timed it repeatedly with and without but could not find a real difference.
mov esi,[edi+eax*8+4]
sub eax, eax <--
mov [edx],ebx
Quote from: sinsi on June 23, 2008, 09:33:34 AM
Is there any way of cutting the size of the table down? Maybe somehow using the high bit to cut it down to 128? I can't think now, too pissed... :bg
hmm, my attempt :
.DATA
ALIGN 4
BinaryTable DWORD "0000","1000","0100","1100","0010","1010","0110","1110"
DWORD "0001","1001","0101","1101","0011","1011","0111","1111"
.CODE
ALIGN 16
;
; Syntax :
; mov eax,valeur
; mov esi,String address
; call nwDw2Bin
;
nwDw2Bin PROC
mov BYTE PTR [esi+32],0
mov ecx,28
Label1: mov edx,eax
and edx,00000000Fh
mov edx,DWORD PTR [BinaryTable+edx*4]
shr eax,4
mov DWORD PTR [esi+ecx],edx
sub ecx,DWORD
jns Label1
ret
nwDw2Bin ENDP
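A note on why the table literals look mirrored, followed by a small Python model of the loop. As far as I know, MASM stores a DWORD string literal with its first character in the highest byte, so "1000" ends up in memory as the characters 0,0,0,1; treat that as an assumption to verify with a debugger. The names MASM_LITERALS, NIBBLES and nw_dw2bin_model are made up for this sketch.

```python
# The literals exactly as written in the BinaryTable source above.
MASM_LITERALS = ["0000", "1000", "0100", "1100", "0010", "1010", "0110", "1110",
                 "0001", "1001", "0101", "1101", "0011", "1011", "0111", "1111"]

# Assumed in-memory view: each literal reversed, so entry n reads as the digits of n.
NIBBLES = [s[::-1].encode("ascii") for s in MASM_LITERALS]

def nw_dw2bin_model(value):
    """One 4-byte table copy per nibble, filling the buffer from the right."""
    value &= 0xFFFFFFFF
    buf = bytearray(33)                    # buf[32] stays 0 as the terminator
    for offset in range(28, -1, -4):       # ecx = 28, 24, ..., 0
        buf[offset:offset + 4] = NIBBLES[value & 0xF]
        value >>= 4                        # shr eax, 4
    return buf[:32].decode("ascii")

print(nw_dw2bin_model(0xF700FF00))
```

With only 16 entries the table costs 64 bytes, at the price of eight lookups per dword instead of four.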
Welcome on board!
Testing correctness of results for BIN$, pbin2 and nwDw2Bin
11110111000000001111111100000000 BIN$
11110111000000001111111100000000 pbin2
11110111000000001111111100000000 nwDw2Bin
11110111000000001111111100000000 original value
00001000111111110000000011111111
00001000111111110000000011111111
00001000111111110000000011111111 nwDw2Bin
00001000111111110000000011111111
00100011111111000000001111111100
00100011111111000000001111111100
00100011111111000000001111111100 nwDw2Bin
00100011111111000000001111111100
10101010101010101010101010101010
10101010101010101010101010101010
10101010101010101010101010101010 nwDw2Bin
10101010101010101010101010101010
00000000000000000000000000000000
00000000000000000000000000000000
00000000000000000000000000000000 nwDw2Bin
GetKeyState (VK_CAPITAL), i.e. Caps Lock
40 cycles timing BIN$ 139 bytes 472 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
67 cycles timing nwDw2Bin 101 bytes 673 LAMPs
128 cycles timing dw2bin_ex 2140 bytes 5921 LAMPs
351 cycles timing Dword2Bin2 57 bytes 2650 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
Just as a note, I am buried in the middle of a mountain of work at the moment so I don't have enough time to track this topic in real detail, but it would be very useful to get a conclusion that produced two separate algos, one for streaming like the table based masm32 lib version and a much smaller one for a non "_ex" procedure.
Something I remember is an algorithm of this type that Michael Webster wrote a couple of years ago that would handle byte, word and dword sized binary conversions, but I don't know where it is any longer.
Quote from: jj2007 on June 23, 2008, 03:09:17 PM
Quote from: hutch-- on June 23, 2008, 02:16:50 PM
JJ,
sinsi's code still uses the bintable so it larger than you have listed. The table adds 2k of data.
No, it doesn't.
mov edi,offset bintable
yes it does.
Quote
Sinsi, does the sub eax, eax have a function? I timed it repeatedly with and without but could not find a real difference.
mov esi,[edi+eax*8+4]
sub eax, eax <--
mov [edx],ebx
sub eax,eax
mov [edx],ebx
mov 4[edx],esi
mov 32[edx],al ;<---
I was always told to use a register...old habits die hard.
stalls reduced :red :
mov BYTE PTR [esi+32],0
mov ecx,32
Label1: mov edx,eax
and edx,00000000Fh
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx*4]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
jnz Label1
NightWare,
I have just had a quick play and formatted your algo into a library module. I wonder how much slower it would be with arguments passed on the stack rather than in registers ?
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.486 ; maximum processor model
.model flat, stdcall ; memory model & calling convention
option casemap :none ; case sensitive
; Syntax :
; mov eax, DWORD value
; mov esi, String address
; call nwDw2Bin
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
align 16
; ------------------------------------
; NightWare's DWORD to bin string algo
; ------------------------------------
nwDw2Bin PROC
push esi
.data
BinaryTable DWORD "0000","1000","0100","1100","0010","1010","0110","1110"
DWORD "0001","1001","0101","1101","0011","1011","0111","1111"
.code
mov BYTE PTR [esi+32], 0
mov ecx, 32
lbl0:
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, 4
mov DWORD PTR [esi+ecx], edx
jnz lbl0
pop esi
ret
nwDw2Bin ENDP
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
It got bigger but it also got faster.
Results
00000000000000000000010011010010 sinsi_ex
00000000000000000000010011010010 NightWare
00000000000000000000010011010010 hutch_ex2
656 NightWare
547 sinsi_ex
657 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
657 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
657 hutch_ex2
672 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2
656 NightWare
547 sinsi_ex
656 hutch_ex2
NightWare timing average = 658
sinsi_ex timing average = 547
hutch_ex2 timing average = 656
Press any key to continue ...
Algo
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
; ---------------------------------------------
; Modified NightWare's DWORD to bin string algo
; ---------------------------------------------
NightWare PROC value:DWORD,buffer:DWORD
.data
align 16
BinaryTable DWORD "0000","1000","0100","1100","0010","1010","0110","1110"
DWORD "0001","1001","0101","1101","0011","1011","0111","1111"
.code
push esi
push edi
mov eax, [esp+4][8] ;; value
mov esi, [esp+8][8] ;; buffer
mov BYTE PTR [esi+32], 0
mov ecx, 32
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-4], edx
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-8], edx
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-12], edx
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-16], edx
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-20], edx
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-24], edx
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-28], edx
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
mov DWORD PTR [esi+ecx-32], edx
pop edi
pop esi
ret 8
NightWare ENDP
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
I apologise that the attachments I supplied have apparently become so confusing that nobody dares to read them any more. As said above, Sinsi's modified code no longer needs a table, and in the final "BIN$" version it is also by far the fastest of all previous versions (including NightWare's code). So there is no longer a need for separate fast vs compact codes.
Cheers, JJ
Here is a tweak of sinsi's version, basically some instruction re-ordering and swapping the names of some registers.
00000000000000000000010011010010 sinsi_ex
00000000000000000000010011010010 NightWare
00000000000000000000010011010010 hutch_ex2
657 NightWare
516 sinsi_ex
656 hutch_ex2
656 NightWare
516 sinsi_ex
657 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2
657 NightWare
516 sinsi_ex
656 hutch_ex2
656 NightWare
516 sinsi_ex
657 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2
656 NightWare
515 sinsi_ex
656 hutch_ex2
NightWare timing average = 656
sinsi_ex timing average = 515
hutch_ex2 timing average = 656
Press any key to continue ...
The algo
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
sinsi_ex proc var:DWORD,buffer:DWORD
mov ecx,4[esp]
push ebx
mov edx,12[esp]
push esi
movzx esi,cl
push edi
mov edi,offset bintable
mov ebx,[edi+esi*8]
mov eax,[edi+esi*8+4]
movzx esi,ch
mov 24[edx],ebx
mov 28[edx],eax
mov ebx,[edi+esi*8]
mov eax,[edi+esi*8+4]
shr ecx,16
movzx esi,cl
mov 16[edx],ebx
mov 20[edx],eax
mov ebx,[edi+esi*8]
mov eax,[edi+esi*8+4]
movzx esi,ch
mov BYTE PTR 32[edx], 0
mov 8[edx],ebx
mov 12[edx],eax
mov ecx,[edi+esi*8]
mov eax,[edi+esi*8+4]
pop edi
pop esi
pop ebx
mov [edx],ecx
mov 4[edx],eax
ret 8
sinsi_ex endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
JJ, would you just post your fast algo so I can put it into a test piece ?
Cycles Code size including tables etc.
47 cycles timing BIN$ 139 bytes
56 cycles timing pbin2 147 bytes
107 cycles timing nwDw2Bin 101 bytes
107 cycles timing dw2bin_ex 2140 bytes
402 cycles timing Dword2Bin2 57 bytes
Usage:
invoke MessageBox, 0, BIN$(11111000001111100000111111110000b), chr$("A binary string:"), MB_OK
or
mov xxx, BIN$(11111000001111100000111111110000b)
or
mov edx, offset Dw2BinBuffer
mov eax, 11111000001111100000111111110000b
call Dword2Bin
invoke MessageBox, 0, addr Dw2BinBuffer, chr$("A binary string:"), MB_OK
The macro:
BIN$ MACRO dwArg:REQ, tgt:=<0>
cc INSTR <dwArg>, <eax> ;; to avoid a mov eax, eax
if cc eq 0
mov eax, dwArg ;; eax passes value to translate
endif
if @SizeStr(tgt) gt 1
mov edx, tgt ;; dest buffer - if 0, use Dw2BinBuffer
else
mov edx, offset Dw2BinBuffer
endif
call Dword2Bin
EXITM <edx>
ENDM
.data?
Dw2BinBuffer dd 9 dup (?) ; 32 digits + null terminator = 33 bytes, padded to 36 for alignment
Dw2BinTable dd 2048/4 dup (?) ; former bintable now in .data? section
The proc
align 4
Dword2Bin proc
ifndef Dw2BinBuffer ; credits to Sinsi of the Masm32 Forum
.data?
Dw2BinBuffer dd 36/4 dup (?) ; 32 digits + null terminator = 33 bytes, padded to 36 for alignment
Dw2BinTable dd 2048/4 dup (?) ; old bintable
.code
endif
push ebx
push esi
push edi
mov edi, offset Dw2BinTable ; old \masm32\m32lib\bintbl.asm
mov ecx, [edi]
.if ecx==0 ; seems you have to initialise your bintable...
push eax ; save value
push edx ; save destination
mov edx, offset Dw2BinTable+2048 ; destination bintable
mov esi, 63 ; outer loop counter
mov edi, 0FCFDFEFFh ; seed for creating table
btInit:
mov eax, edi
sub edi, 04040404h
mov ecx, 31 ; inner loop counter
@@: xor ebx, ebx
dec edx ; on exit, edx will point to the string
sar eax, 1 ; we work from right to left
adc ebx, 48 ; +48: Ascii 0 or, with carry, +49 : Ascii 1
mov [edx], bl
dec ecx ; decrement inner counter
jge @B ; loop until ecx goes negative (dec leaves CF alone)
dec esi ; decrement outer counter
jge btInit ; loop until esi goes negative
mov edi, edx
pop edx ; get destination
pop eax ; get value
.endif
movzx ecx, al ; the value to translate
mov ebx, [edi+ecx*8]
mov esi, [edi+ecx*8+4]
mov 24[edx], ebx
mov 28[edx], esi
movzx ecx, ah
mov ebx, [edi+ecx*8]
mov esi, [edi+ecx*8+4]
shr eax, 16
mov 16[edx], ebx
mov 20[edx], esi
movzx ecx, al
mov ebx, [edi+ecx*8]
mov esi, [edi+ecx*8+4]
mov 8[edx], ebx
mov 12[edx], esi
movzx ecx, ah
mov ebx, [edi+ecx*8]
mov esi, [edi+ecx*8+4]
mov [edx], ebx
mov 4[edx], esi
mov 32[edx], ch ; null terminator
pop edi
pop esi
pop ebx
ret ; not: 8
Dword2Bin endp
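The table initialisation is the least obvious part, so here is a Python model of just that step. It is a sketch that mirrors the btInit loop above (seed 0FCFDFEFFh, minus 04040404h per round, 64 rounds of 32 bits written backwards from the end of the table); the name build_table is made up.

```python
def build_table():
    """Fill a 2048-byte table back to front, four byte values per outer round."""
    table = bytearray(2048)
    pos = 2048                       # edx = offset Dw2BinTable+2048
    seed = 0xFCFDFEFF                # packs the byte values 255, 254, 253, 252
    for _ in range(64):              # esi counts 63 down to 0
        value = seed
        seed = (seed - 0x04040404) & 0xFFFFFFFF   # next round: 251, 250, 249, 248 ...
        for _ in range(32):          # ecx counts 31 down to 0
            pos -= 1                 # dec edx
            table[pos] = 48 + (value & 1)   # sar eax, 1 / adc ebx, 48
            value >>= 1
    return bytes(table)

table = build_table()
print(table[5 * 8:5 * 8 + 8].decode("ascii"))   # entry for byte value 5
```

After the last round the first dword of the table holds the characters "0000", which is non-zero, so the .if ecx==0 check skips the initialisation on every later call.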
Quote from: sinsi on June 23, 2008, 11:59:48 PM
mov edi,offset bintable
yes it does.
Sorry, I forgot to say that I ruthlessly eliminated the table from your code ;-)
sub eax,eax
mov [edx],ebx
mov 4[edx],esi
mov 32[edx],al ;<---
I was always told to use a register...old habits die hard.
Quote
movzx eax,ch
mov ebx,[edi+eax*8]
mov esi,[edi+eax*8+4]
; sub eax, eax
mov [edx],ebx
mov 4[edx],esi
mov 32[edx],ah ; al is non-zero
If you agree... :bg
Quote from: NightWare on June 24, 2008, 01:29:06 AM
stalls reduced :red :
Thanks. The effect is not very clear, maybe the increased counter compensates part of the gain:
mov ecx, 28+4 ; +4 because sub ecx was shifted up??
EDIT:
I just saw that you changed the jns to jnz at the end of the loop.
If I use mov ecx, 32 and jnz, as in your last post, I get garbage.
If I use mov ecx, 28 and jnz, code produces correct results and is faster.
EDIT (2):
Forget Edit (1) and see next post below.
nwDw2Bin PROC
mov BYTE PTR [esi+32],0
mov ecx, 28
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax,4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
; shr eax,4 moved up to reduce stalls
mov DWORD PTR [esi+ecx+4], edx
; sub ecx,DWORD moved up to reduce stalls
jnz Label1 ; ex jns Label1
ret
Quote from: jj2007 on June 24, 2008, 08:28:27 AM
Sorry, I forgot to say that I ruthlessly eliminated the table from your code ;-)
Being ruthless is what asm is all about...
Quote from: jj2007 on June 24, 2008, 08:28:27 AM
If you agree... :bg
Everyone hates a smartarse. :bdg
Anyway, it's hard to see through beer goggles. :dazzled:
OK,
I have untangled JJs code and put it into the test bed. It passed data by registers so I modified both sinsi's algo and NightWares and get these timings. My own algo has run out of legs so I did not bother.
By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster. The technique of creating the table dynamically in memory could be useful for a WORD version with a 64k table, far too big for initialised data but no big deal in terms of allocated memory. It would make possible WORD sized reads and writes which should be nearly twice as fast.
01001001100101100000001011010010 sinsi_ex
01001001100101100000001011010010 NightWare
01001001100101100000001011010010 hutch_ex2
01001001100101100000001011010010 JJ BIN$
593 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
593 NightWare
453 sinsi_ex
657 hutch_ex2
500 JJ BIN$
594 NightWare
454 sinsi_ex
656 hutch_ex2
500 JJ BIN$
594 NightWare
453 sinsi_ex
656 hutch_ex2
500 JJ BIN$
NightWare timing average = 593
sinsi_ex timing average = 453
hutch_ex2 timing average = 656
JJ BIN$ timing average = 500
Press any key to continue ...
Quote from: NightWare on June 24, 2008, 01:29:06 AM
stalls reduced :red :
Your version:
nwDw2Bin PROC
mov BYTE PTR [esi+32], 0
mov ecx, 32 ; 28 would produce garbage
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
mov DWORD PTR [esi+ecx], edx
jnz Label1
ret
nwDw2Bin ENDP
My edits:
nwDw2BinJJ PROC
mov BYTE PTR [esi+32], 0
mov ecx, 28 ; works (JJ)
Label1:
mov edx, eax
and edx, 00000000Fh
shr eax, 4
mov edx, DWORD PTR [BinaryTable+edx*4]
sub ecx, DWORD
; shr eax, 4 moved up to reduce stalls (NightWare)
mov DWORD PTR [esi+ecx+4], edx ; JJ: +4 is one byte longer but faster
; sub ecx, DWORD moved up to reduce stalls (NightWare)
jnz Label1 ; jns Label1
ret
nwDw2BinJJ ENDP
Timings:
53 cycles timing BIN$ 139 bytes 625 LAMPs
61 cycles timing pbin2 147 bytes 740 LAMPs
89 cycles timing nwDw2Bin 101 bytes 894 LAMPs
74 cycles timing nwDw2BinJJ 102 bytes 747 LAMPs
304 cycles timing Dword2Bin2 59 bytes 2335 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
The difference seems in the factor 28/32. Hmmm...
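Tracing which buffer offsets each loop variant actually writes may explain the 28/32 factor. The sketch below is a plain Python model of the counter logic only (the helper name trace and its parameters are made up), so treat it as a hand check rather than a measurement.

```python
def trace(start, bias, sub_first, keep_looping):
    """Return the buffer offsets written by one pass through the loop."""
    ecx = start
    offsets = []
    while True:
        if sub_first:
            ecx -= 4                      # sub ecx, DWORD before the store
            offsets.append(ecx + bias)    # mov [esi+ecx+bias], edx
        else:
            offsets.append(ecx + bias)    # store first, then sub ecx, DWORD
            ecx -= 4
        if not keep_looping(ecx):
            return offsets

jns = lambda ecx: ecx >= 0    # loop while the result is not negative
jnz = lambda ecx: ecx != 0    # loop while the result is not zero

print("original  (28, store, sub, jns):", trace(28, 0, False, jns))
print("reordered (32, sub, store, jnz):", trace(32, 0, True, jnz))
print("edited    (28, sub, store+4, jnz):", trace(28, 4, True, jnz))
```

By this trace the first two variants store eight dwords at offsets 28 down to 0, while the mov ecx, 28 / [esi+ecx+4] / jnz variant stores only seven and never touches offsets 0 to 3. If that holds, the speed gain would simply be one iteration less, and the leftmost four digits would be whatever an earlier call left in the buffer.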
Quote from: hutch-- on June 24, 2008, 09:15:32 AM
By the results sinsi's code is clearly fastest while NightWare's code is the smallest as it does not dynamically create a table in memory. JJ's result is a good one and it could be made faster.
NightWare timing average = 595
sinsi_ex timing average = 492
hutch_ex2 timing average = 861
JJ BIN$ timing average = 625
My version, now with OPTION PROLOGUE:NONE for BIN$ but not including sinsi's algo with the external bintable; pbin2 is sinsi plus generated table:
47 cycles timing BIN$ 139 bytes 554 LAMPs
59 cycles timing pbin2 147 bytes 715 LAMPs
88 cycles timing nwDw2Bin 101 bytes 884 LAMPs
67 cycles timing nwDw2BinJJ 102 bytes 677 LAMPs
299 cycles timing Dword2Bin2 59 bytes 2297 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
The modified version of NightWare, see previous post, looks also promising (EDIT:), especially since the 101/102 bytes include already the 64 bytes of .data BinTable
JJ,
If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code. With sinsi's code I reordered some of the instructions to increase the instruction count between memory read and writes to prevent or at least slow down the read after write stalls and it got faster for doing so.
For small non-critical code I would use NightWare's original as it is small and only has a 64 byte table but with the speed advantage of sinsi's code, it would have to be the one to use for performance reasons.
While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library. Compliments on your BIN$ macro and the algo it calls, it does perform well.
Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version
That does the trick, it seems:
48 cycles timing BIN$ 139 bytes 566 LAMPs
57 cycles timing pbin2 147 bytes 691 LAMPs
80 cycles timing nwDw2Bin 101 bytes 804 LAMPs
69 cycles timing nwDw2BinJJ 102 bytes 697 LAMPs
49 cycles timing NightWare 234 bytes 750 LAMPs
296 cycles timing Dword2Bin2 56 bytes 2215 LAMPs
Quote from: hutch-- on June 24, 2008, 09:44:26 AM
Compliments on your BIN$ macro and the algo it calls, it does perform well.
Thanxalot, Hutch. Your version of BIN$ still has a little bug (ok in dw2binFinal.zip posted above) :
BIN$ MACRO dwArg:REQ, tgt:=<0>
cc INSTR <dwArg>, <eax> ;; to avoid a mov eax, eax
if cc eq 0
mov eax, dwArg ;; eax passes value to translate
endif
if @SizeStr(tgt) gt 1 ;; BUG: "if tgt" won't work correctly
mov edx, tgt ;; dest buffer - if 0, use Dw2BinBuffer
else
mov edx, offset Dw2BinBuffer
endif
call Dword2Bin
EXITM <edx>
ENDM
Quote
While algos of short instruction count benefit from register passing, they are not truly general purpose so they are not all that useful for a library.
BIN$ accepts one required argument and returns a pointer to the result. I chose to pass the value in eax, since all Win32 APIs return values in that register. As it stands, it looks foolproof...
invoke GetKeyState, VK_CAPITAL
invoke MessageBox, 0, BIN$(eax), chr$("Caps Lock in Bit 0:"), MB_OK
This is 27 bytes and 141 cycles on a P3.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
buff db 40 dup(0)
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
dw2bin proc dwNumber:DWORD, pszString:DWORD
mov eax, [esp+8]
mov ecx, 31
@@:
xor edx, edx
shr DWORD PTR [esp+4], 1
adc edx, '0'
mov [eax+ecx], dl
dec ecx
jns @B
ret 8
dw2bin endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke dw2bin, 0, ADDR buff
print ADDR buff,13,10
invoke dw2bin, 01010101h, ADDR buff
print ADDR buff,13,10
invoke dw2bin, -1, ADDR buff
print ADDR buff,13,10,13,10
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
invoke dw2bin, 01010101h, ADDR buff
counter_end
print ustr$(eax)," cycles",13,10,13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.
Quote from: MichaelW on June 24, 2008, 10:26:00 AM
For me 32-bit binary strings are hard to read and interpret, because as you move in from the ends it becomes increasingly difficult to know which bit position you are looking at. I think a more reasonable format would include spaces between the bytes.
Never satisfied? Your wish is my command...
invoke MessageBox, 0, BIN$(10011001100110011001100110011001b), chr$("BIN$ plain:"), MB_OK
invoke MessageBox, 0, BIN$(10011001100110011001100110011001b, f), chr$("BIN$ formatted:"), MB_OK
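The formatted BIN$ itself was in the attachment, so the following is only a rough sketch of what "spaces between the bytes" could look like. Dw2BinSpaced and buf are hypothetical names, and this is a plain bit loop rather than the table-driven algo that BIN$ uses.

```masm
include \masm32\include\masm32rt.inc

.data?
buf db 36 dup (?)               ; 32 digits + 3 spaces + terminator

.code
; Hypothetical helper: eax = value, edx = buffer of at least 36 bytes
Dw2BinSpaced proc
    push edi
    mov  edi, edx
    mov  ecx, 32                ; bits still to emit
nextbit:
    shl  eax, 1                 ; top bit goes to the carry flag
    mov  byte ptr [edi], '0'    ; mov leaves the flags alone
    adc  byte ptr [edi], 0      ; '0' + carry = '0' or '1'
    inc  edi
    dec  ecx
    jz   done
    test ecx, 7                 ; 24, 16 or 8 bits left = byte boundary
    jnz  nextbit
    mov  byte ptr [edi], ' '
    inc  edi
    jmp  nextbit
done:
    mov  byte ptr [edi], 0      ; zero terminator
    pop  edi
    ret
Dw2BinSpaced endp

start:
    mov  eax, 10011001100110011001100110011001b
    mov  edx, offset buf
    call Dw2BinSpaced
    print ADDR buf, 13, 10      ; 10011001 10011001 10011001 10011001
    exit
end start
```

The loop is slow compared to the algos timed here; it is only meant to show the layout, with the most significant byte on the left.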
Fortunately the timings are not affected, but wow, that cost me almost a hundred LAMPs :wink
47 cycles timing BIN$ 180 bytes 631 LAMPs
57 cycles timing pbin2 147 bytes 691 LAMPs
92 cycles timing nwDw2Bin 101 bytes 925 LAMPs
67 cycles timing nwDw2BinJJ 102 bytes 677 LAMPs
49 cycles timing NightWare 234 bytes 750 LAMPs
297 cycles timing Dword2Bin2 56 bytes 2223 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
Michael,
I plugged it in and tested it, and the result is correct, but it's very slow against the rest. It also affected the timings of the other test algos, and was running about 10 times slower than the others. It may just be a very bad stall on my PIV.
Quote from: hutch-- on June 24, 2008, 09:44:26 AM
If you look at the code I posted I unrolled NightWare's version to stabilise its timing and remove the loop code.
hmm, last shr eax,4 is useless :wink, i've also unrolled the code and made a few changes:
; unrolled version
mov BYTE PTR [esi+32],0
mov ecx,32
mov edx,eax
and edx,00000000Fh
shr eax,2
mov edx,DWORD PTR [BinaryTable+edx*4]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mov edx,eax
and edx,00000003Ch ; 0Fh * 4
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mov edx,eax
and edx,00000003Ch
shr eax,4
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mov edx,eax
and edx,00000003Ch
mov edx,DWORD PTR [BinaryTable+edx]
sub ecx,DWORD
mov DWORD PTR [esi+ecx],edx
mixing both should give a good result...
EDIT : something like :
; mov ecx,value
; mov edx,buffer
; no need to push/pop esi/edi
mov BYTE PTR [edx+32],0
mov eax,ecx
and eax,00000000Fh
shr ecx,2
mov eax,DWORD PTR [BinaryTable+eax*4]
mov DWORD PTR [edx+28],eax
mov eax,ecx
and eax,00000003Ch ; 0Fh * 4
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+24],eax
mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+20],eax
mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+16],eax
mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+12],eax
mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+8],eax
mov eax,ecx
and eax,00000003Ch
shr ecx,4
mov eax,DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx+4],eax
and ecx,00000003Ch
mov ecx,DWORD PTR [BinaryTable+ecx]
mov DWORD PTR [edx],ecx
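The unrolled listings rely on BinaryTable, which is not shown in this part of the thread. Here is a looped sketch of the same nibble-lookup idea, assuming the table is simply 16 entries of four ASCII digits (64 bytes). TableDw2Bin and buf are names invented for the example.

```masm
include \masm32\include\masm32rt.inc

.data
; Assumed layout: entry n holds the four binary digits of nibble n
BinaryTable db "0000","0001","0010","0011","0100","0101","0110","0111"
            db "1000","1001","1010","1011","1100","1101","1110","1111"

.data?
buf db 36 dup (?)

.code
; Hypothetical helper: eax = value, edx = buffer of at least 33 bytes
TableDw2Bin proc
    push ebx
    mov  byte ptr [edx+32], 0   ; zero terminator
    mov  ecx, 28                ; offset of the rightmost 4-digit group
@@: mov  ebx, eax
    and  ebx, 0Fh               ; low nibble selects the table entry
    mov  ebx, dword ptr [BinaryTable+ebx*4]
    mov  [edx+ecx], ebx         ; write four digits at once
    shr  eax, 4                 ; next nibble
    sub  ecx, 4                 ; move one group to the left
    jns  @B                     ; stop once the offset goes negative
    pop  ebx
    ret
TableDw2Bin endp

start:
    mov  eax, 0FC0FC0AAh
    mov  edx, offset buf
    call TableDw2Bin
    print ADDR buf, 13, 10      ; 11111100000011111100000010101010
    exit
end start
```

The unrolled versions do exactly this eight times without the loop overhead, and pre-scale the index (and 3Ch after shifting by 2) so that no *4 is needed in the address.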
I moved this topic to the LAB so it would not get lost in a hurry.
The same but faster: :lol
mov BYTE PTR [edx+32],0
mov eax, ecx
and ecx, 00000000Fh
mov ecx, DWORD PTR [BinaryTable+ecx*4]
shr eax, 2
mov DWORD PTR [edx+28],ecx
mov ecx, eax
and eax, 00000003Ch ; 0Fh * 4
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
mov DWORD PTR [edx+24],eax
mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
mov DWORD PTR [edx+20],ecx
mov ecx, eax
and eax, 00000003Ch
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
mov DWORD PTR [edx+16],eax
mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
mov DWORD PTR [edx+12],ecx
mov ecx, eax
and eax, 00000003Ch
mov eax, DWORD PTR [BinaryTable+eax]
shr ecx, 4
mov DWORD PTR [edx+8],eax
mov eax, ecx
and ecx, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+ecx]
shr eax, 4
mov DWORD PTR [edx+4], ecx
and eax, 00000003Ch
mov ecx, DWORD PTR [BinaryTable+eax]
mov DWORD PTR [edx],ecx
Quote from: lingo on June 27, 2008, 01:54:26 AM
The same but faster: :lol
Not so easy to get stable timings, but I would vote for the Nightware/Lingo algo:
69 cycles timing BIN$ 180 bytes 926 LAMPs
128 cycles timing pbin2 147 bytes 1552 LAMPs
124 cycles timing nwDw2Bin 101 bytes 1246 LAMPs
103 cycles timing nwDw2BinJJ 102 bytes 1040 LAMPs
84 cycles timing NightWare 204 bytes 1200 LAMPs
89 cycles timing BinLingo 204 bytes 1271 LAMPs
48 cycles timing BIN$ 180 bytes 644 LAMPs
87 cycles timing pbin2 147 bytes 1055 LAMPs
115 cycles timing nwDw2Bin 101 bytes 1156 LAMPs
71 cycles timing nwDw2BinJJ 102 bytes 717 LAMPs
42 cycles timing NightWare 204 bytes 600 LAMPs
75 cycles timing BinLingo 204 bytes 1071 LAMPs
48 cycles timing BIN$ 180 bytes 644 LAMPs
57 cycles timing pbin2 147 bytes 691 LAMPs
121 cycles timing nwDw2Bin 101 bytes 1216 LAMPs
72 cycles timing nwDw2BinJJ 102 bytes 727 LAMPs
42 cycles timing NightWare 204 bytes 600 LAMPs
42 cycles timing BinLingo 204 bytes 600 LAMPs
53 cycles timing BIN$ 180 bytes 711 LAMPs
59 cycles timing pbin2 147 bytes 715 LAMPs
88 cycles timing nwDw2Bin 101 bytes 884 LAMPs
83 cycles timing nwDw2BinJJ 102 bytes 838 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
43 cycles timing BinLingo 204 bytes 614 LAMPs
Just for fun, here is one with the original library function:
48 cycles timing BIN$ 180 bytes 644 LAMPs
58 cycles timing pbin2 147 bytes 703 LAMPs
55 cycles timing NightWare 204 bytes 786 LAMPs
42 cycles timing BinLingo 204 bytes 600 LAMPs
82 cycles timing dw2bin_ex 2140 bytes 3793 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
Full source attached, with some switches.
here's my attempt :8):
;; for 32 bytes
bintab equ BinaryTable
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
DwToBin proc dwValue:dword,pBuffer:dword
mov eax,[esp+4];dwValue
mov edx,[esp+8];pBuffer
push edi
push esi
push ebx
mov ebx,1111b
mov ecx,11110000b
mov edi,111100000000b
mov esi,1111000000000000b
and ecx,eax
and edi,eax
shr ecx,4
and esi,eax
shr edi,8
and ebx,eax
shr esi,12
mov ebx,[ebx*4+bintab]
mov ecx,[ecx*4+bintab]
mov edi,[edi*4+bintab]
mov esi,[esi*4+bintab]
shr eax,16
mov [edx+7*4],ebx
mov [edx+6*4],ecx
mov [edx+5*4],edi
mov [edx+4*4],esi
mov ebx,1111b
mov ecx,11110000b
mov edi,111100000000b
mov esi,1111000000000000b
and ecx,eax
and edi,eax
shr ecx,4
and esi,eax
shr edi,8
and ebx,eax
shr esi,12
mov ebx,[ebx*4+bintab]
mov ecx,[ecx*4+bintab]
mov edi,[edi*4+bintab]
mov esi,[esi*4+bintab]
mov [edx+3*4],ebx
mov [edx+2*4],ecx
mov [edx+1*4],edi
mov [edx+0*4],esi
mov byte ptr [edx+32],0
pop ebx
pop esi
pop edi
ret 2*4
DwToBin endp
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
and here's my regular function
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
;returns size in eax
DwToBin proc dwValue:dword,pBuffer:dword;buff max 33bytes
push edi
mov ecx,31
mov edx,[esp+4][4];dwValue
mov edi,[esp+8][4];pBuffer
bsr eax,edx
jz @2
sub ecx,eax
shl edx,cl
mov ecx,eax
@1: add edx,edx
inc edi
mov al,'0' shr 1
adc al,al
dec ecx
mov [edi-1],al
jns @1
mov [edi],dl
mov eax,edi
pop edi
sub eax,[esp+8];pBuffer
ret 2*4
@2: mov word ptr [edi],'0'
mov eax,1
pop edi
ret 2*4
DwToBin endp
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
Not the same but faster again :lol:
mov eax, esp
lea esp, [edx+32]
mov BYTE PTR [edx+32],0
mov edx, eax
shld eax, ecx, 30
and ecx, 00000000Fh
push DWORD PTR [BinaryTable+ecx*4]
shld ecx, eax, 28
and eax, 00000003Ch
push DWORD PTR [BinaryTable+eax]
shld eax, ecx, 28
and ecx, 3Ch
push DWORD PTR [BinaryTable+ecx]
mov ecx, eax
and eax, 3Ch
push DWORD PTR [BinaryTable+eax]
shr ecx, 4
mov eax, ecx
and ecx, 3Ch
push DWORD PTR [BinaryTable+ecx]
shr eax, 4
mov ecx, eax
and eax, 3Ch
push DWORD PTR [BinaryTable+eax]
shr ecx, 4
mov eax, ecx
and ecx, 3Ch
push DWORD PTR [BinaryTable+ecx]
shr eax, 4
and eax, 3Ch
push DWORD PTR [BinaryTable+eax]
mov esp, edx
ret
My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1
37 cycles timing BIN$ 180 bytes 496 LAMPs
42 cycles timing pbin2 147 bytes 509 LAMPs
36 cycles timing nwDw2Bin 101 bytes 362 LAMPs
32 cycles timing nwDw2BinJJ 102 bytes 323 LAMPs
30 cycles timing NightWare 204 bytes 428 LAMPs
27 cycles timing BinLingo 187 bytes 369 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
Here is a SIMD version (uses SSSE3). I've tested it with unaligned and aligned moves (movdqu/movdqa).
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
align 16
_qword proc ;var:DWORD,buffer:DWORD
;var in edx
;buffer in ecx
.data
align 16
shufmsk db 8 dup (1)
db 8 dup (0)
bitmsk db 128,64,32,16,8,4,2,1
db 128,64,32,16,8,4,2,1
ascmsk db 16 dup (31h)
.code
movdqa xmm3,OWORD ptr [shufmsk]
movdqa xmm4,OWORD ptr [bitmsk]
movdqa xmm5,OWORD ptr [ascmsk]
pinsrw xmm0,edx,0
shr edx,16
pxor xmm2,xmm2
pinsrw xmm1,edx,0
pshufb xmm1,xmm3
pand xmm1,xmm4
pcmpeqb xmm1,xmm2
paddsb xmm1,xmm5
movdqu OWORD ptr [ecx],xmm1
align 8
pshufb xmm0,xmm3
pand xmm0,xmm4
pcmpeqb xmm0,xmm2
paddsb xmm0,xmm5
movdqu OWORD ptr [ecx+16],xmm0
mov BYTE ptr [ecx+32],0
ret
_qword endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
my results(Core2Duo):
with movdqu:
NightWare timing average = 347
sinsi_ex timing average = 401
hutch_ex2 timing average = 680
JJ BIN$ timing average = 386
mqword timing average = 376
with movdqa:
NightWare timing average = 337
sinsi_ex timing average = 401
hutch_ex2 timing average = 676
JJ BIN$ timing average = 376
mqword timing average = 241
Quote from: lingo on June 27, 2008, 04:10:59 PM
Not the same but faster again :lol:
My Results:
Intel Core 2 E8500,4000 MHz (9.5 x 421),Vista64-SP1
37 cycles timing BIN$ 180 bytes 496 LAMPs
42 cycles timing pbin2 147 bytes 509 LAMPs
36 cycles timing nwDw2Bin 101 bytes 362 LAMPs
32 cycles timing nwDw2BinJJ 102 bytes 323 LAMPs
30 cycles timing NightWare 204 bytes 428 LAMPs
27 cycles timing BinLingo 187 bytes 369 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
And shorter, too. Will have to study some of these exotic lingoish opcodes ;-)
However, my Celeron does not seem to like it much. Interesting how big the differences between processors are in this case.
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
61 cycles timing nwDw2Bin 101 bytes 613 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
76 cycles timing BinLingo 204 bytes 1085 LAMPs **** previous version
C:\MASM32\GFA2MASM>bl
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
60 cycles timing nwDw2Bin 101 bytes 603 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs **** latest version
Quote from: drizz on June 27, 2008, 03:04:18 PM
here's my attempt :8):
Good start, Drizz!
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
60 cycles timing nwDw2Bin 101 bytes 603 LAMPs
65 cycles timing nwDw2BinJJ 102 bytes 656 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
46 cycles timing b2aDrizzAt 235 bytes 705 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
i have ported my idea to mmx, so more people can test it :U
mmx_dw2bin proc ;var:DWORD,buffer:DWORD
;var in edx
;buffer in ecx
.data
align 16
bitmsk db 128,64,32,16,8,4,2,1
ascmsk db 8 dup (031h)
.code
bswap edx
movq mm6,QWORD ptr [bitmsk]
movq mm7,QWORD ptr [ascmsk]
pxor mm5,mm5
movd mm0,edx
punpcklbw mm0,mm0
movq mm1,mm0
pshufw mm1,mm0,0
pand mm1,mm6
pcmpeqb mm1,mm5
paddsb mm1,mm7
movq QWORD ptr [ecx],mm1
movq mm2,mm0
pshufw mm2,mm0,001010101y
pand mm2,mm6
pcmpeqb mm2,mm5
paddsb mm2,mm7
movq QWORD ptr [ecx+8],mm2
movq mm1,mm0
pshufw mm1,mm0,010101010y
pand mm1,mm6
pcmpeqb mm1,mm5
paddsb mm1,mm7
movq QWORD ptr [ecx+16],mm1
movq mm2,mm0
pshufw mm2,mm0,011111111y
pand mm2,mm6
pcmpeqb mm2,mm5
paddsb mm2,mm7
movq QWORD ptr [ecx+24],mm2
mov BYTE ptr [ecx+32],0
ret
mmx_dw2bin endp
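To see what the pand / pcmpeqb / padd sequence does, here is a cut-down sketch that converts a single byte to eight digits. It reuses the bitmsk and ascmsk data from the listing; buf and the test value 0A5h are made up for the example, and paddb is used instead of paddsb (both give the same result for these values).

```masm
include \masm32\include\masm32rt.inc
.686
.mmx

.data
align 8
bitmsk db 128,64,32,16,8,4,2,1  ; one bit per byte, MSB first
ascmsk db 8 dup (31h)           ; ASCII '1' in every byte
buf    db 9 dup (0)             ; 8 digits + terminator

.code
start:
    mov       eax, 0A5h             ; 10100101b
    movd      mm0, eax
    punpcklbw mm0, mm0              ; bytes 0-1 = A5h
    punpcklwd mm0, mm0              ; bytes 0-3 = A5h
    punpckldq mm0, mm0              ; all 8 bytes = A5h
    pand      mm0, QWORD ptr [bitmsk]   ; isolate one bit per byte
    pxor      mm1, mm1
    pcmpeqb   mm0, mm1              ; 0FFh where the bit is clear, 0 where set
    paddb     mm0, QWORD ptr [ascmsk]   ; 31h-1 = '0', 31h+0 = '1'
    movq      QWORD ptr [buf], mm0
    emms                            ; leave MMX state before calling anything else
    print ADDR buf, 13, 10          ; 10100101
    exit
end start
```

The full routine does the same for all four bytes, and the only real work is getting each byte replicated into its own register.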
results on Core2Duo:
37 cycles timing BIN$ 180 bytes 496 LAMPs
43 cycles timing pbin2 147 bytes 521 LAMPs
36 cycles timing nwDw2Bin 101 bytes 362 LAMPs
42 cycles timing nwDw2BinJJ 102 bytes 424 LAMPs
30 cycles timing NightWare 204 bytes 428 LAMPs
27 cycles timing BinLingo 187 bytes 369 LAMPs
34 cycles timing b2aDrizzAt 235 bytes 521 LAMPs
24 cycles timing mmx_dw2bin 177 bytes 319 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it :U
Very interesting. Here are my timings:
40 cycles timing BIN$ 180 bytes 537 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
62 cycles timing nwDw2Bin 101 bytes 623 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
45 cycles timing b2aDrizzAt 235 bytes 690 LAMPs
179 cycles timing mmx_dw2bin 129 bytes 2033 LAMPs
(sorry, I played a bad trick:
mov ecx, offset Dw2BinBuffer
inc ecx
call mmx_dw2bin
... which misaligns the target)
Without this bad trick, your code performs indeed excellently:
40 cycles timing BIN$ 180 bytes 537 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
65 cycles timing nwDw2Bin 101 bytes 653 LAMPs
65 cycles timing nwDw2BinJJ 102 bytes 656 LAMPs
54 cycles timing NightWare 204 bytes 771 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
46 cycles timing b2aDrizzAt 235 bytes 705 LAMPs
35 cycles timing mmx_dw2bin 129 bytes 398 LAMPs
Fast and short, congratulations!
mmx is very sensitive to misalignment, but in the case of a BIN$ macro we can safely assume that we are able to align the target, so imho we have a winning code here :cheekygreen:
EDIT: I add the modified code; for consistency with the other algos, I exchanged the variable and destination registers as follows:
;var in edx NEW: eax
;buffer in ecx NEW: edx
Quote from: qWord on June 27, 2008, 09:16:36 PM
i have ported my idea to mmx, so more people can test it :U
Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.
Quote
Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.
You're right, I'd forgotten :lol
Quote from: jj2007 on June 27, 2008, 11:19:23 PM
Of minor practical relevance: pshufw needs xmm (=SSE1), not mmx.
It's not that hard to replace pshufw. Anyway credits go to qWord! :U
movq mm6,QWORD ptr [bitmsk]
movq mm7,QWORD ptr [ascmsk]
pxor mm5,mm5
movd mm0,edx
punpcklbw mm0,mm0
punpcklwd mm1,mm0
movq mm2,mm0
punpckhwd mm3,mm0
punpcklwd mm0,mm0
punpckhwd mm1,mm1
punpckhwd mm2,mm2
punpckhwd mm3,mm3
punpckldq mm0,mm0
punpckhdq mm1,mm1
punpckldq mm2,mm2
punpckhdq mm3,mm3
pand mm0,mm6
pand mm1,mm6
pand mm2,mm6
pand mm3,mm6
pcmpeqb mm0,mm5
pcmpeqb mm1,mm5
pcmpeqb mm2,mm5
pcmpeqb mm3,mm5
paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7
movq [ecx+24],mm0
movq [ecx+16],mm1
movq [ecx+8],mm2
movq [ecx],mm3
mov BYTE ptr [ecx+32],0
edit: removed bswap
edit2: removed some more instructions
Looks good. The old qWord version seems to be an edge faster, see attachment qw.exe
Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.
yes i know, but sse1 is P3 and above, this modified version will now work on all MMX capable cpus (e.g. 13 year old pentium).
Quote from: drizz on June 28, 2008, 09:28:36 AM
Quote from: jj2007 on June 28, 2008, 09:08:12 AM
The old qWord version seems to be an edge faster.
yes i know, but sse1 is P3 and above, this modified version will now work on all MMX capable cpus (e.g. 13 year old pentium).
That was the point indeed - for a general purpose library this is clearly the better solution. And in terms of LAMPs it beats the hell out of the other algos. My BIN$ algo (inspired by Sinsi) is pretty fast on the Celeron but sucks on real Pentiums.
EDIT:
Timings on a Celeron
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
61 cycles timing nwDw2Bin 101 bytes 613 LAMPs
64 cycles timing nwDw2BinJJ 102 bytes 646 LAMPs
54 cycles timing NightWare 204 bytes 771 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
46 cycles timing b2aDrizzAt 235 bytes 705 LAMPs
40 cycles timing mmx_dw2bin 132 bytes 460 LAMPs (new Drizz mmx variant)
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
61 cycles timing nwDw2Bin 101 bytes 613 LAMPs
66 cycles timing nwDw2BinJJ 102 bytes 667 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
45 cycles timing b2aDrizzAt 235 bytes 690 LAMPs
36 cycles timing mmx_dw2bin 129 bytes 409 LAMPs (old qWord xmm variant)
LAMPs = Lean And Mean Points = cycles * sqrt(size)
Quote
The old qWord version seems to be an edge faster
interesting, on my Core2Duo drizz's version is 2~3 clocks faster
sse1:
24 cycles timing mmx_dw2bin 129 bytes 273 LAMPs
drizz's:
22 cycles timing mmx_dw2bin 132 bytes 253 LAMPs
EDIT: sorry, i forgot to delete bswap and adjust the pshufw instructions =>
sse1:
18 cycles timing mmx_dw2bin
drizz's:
22 cycles timing mmx_dw2bin
Quote from: qWord on June 28, 2008, 09:47:41 AM
Quote
The old qWord version seems to be an edge faster
interesting, on my Core2Duo drizzs version is 2~3 clocks faster
Could you do me a favour and time the CAT$ macro (http://www.masm32.com/board/index.php?topic=9437.0)?
Sorry, it was a false statement by me, see my previous post
Quote from: qWord on June 28, 2008, 10:22:49 AM
Sorry, it was a false statement by me, see my previous post
No problem; I was just curious how the Core2Duo performs on the CAT$ algo.
I know it's a bit late, but I came across this topic and thought I could give this a shot =P
dw2binstr proc
;Value - EAX
;Buffer - EBX
mov edx, eax ;EDX holds the value
mov ecx, 8 ;Bits per byte
@@:
mov eax, edx
and eax, 01010101h ;Filters the low bit of every byte
or eax, 30303030h ;ASCII conversion
mov byte ptr [ebx+31], al ;Placing the bytes into the buffer
mov byte ptr [ebx+23], ah
ror eax, 16
mov byte ptr [ebx+15], al
mov byte ptr [ebx+07], ah
dec ebx ;Going backwards (buffer-wise)
ror edx,1 ;Setting the next set of bits
dec ecx ;Loop back
jnz @B
retn
dw2binstr endp
So... what do you think?
Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...
hey jj,
the following code might be interesting for you:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; ;
; sse1_dw2hex: converts a dword-value to an ;
; ASC-hex-string ;
; ;
; eax = dwValue ;
; edx = lpBuffer , should be aligned to 8 ;
; ;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
align 16
sse1_dw2hex proc ;var:DWORD,buffer:DWORD
.data
align 16
d2h_bitmsk dw 4 dup (0f00fh)
d2h_cmpmsk db 8 dup (9)
d2h_09_msk db 8 dup (030h)
d2h_AF_msk db 8 dup (7)
.code
;bswap eax ;<== insert for mmx only
movq mm4,QWORD ptr [d2h_bitmsk] ; |
movq mm5,QWORD ptr [d2h_cmpmsk] ; |
movq mm6,QWORD ptr [d2h_09_msk] ; |
movq mm7,QWORD ptr [d2h_AF_msk] ; |
; |
movd mm1,eax ; |
punpcklbw mm1,mm1 ; V
pshufw mm1,mm1,000011011y ;<== delete for mmx only
pand mm1,mm4
movq mm0,mm1
psrlw mm0,12
psllw mm1,8
por mm0,mm1
movq mm2,mm0
pcmpgtb mm2,mm5
pand mm2,mm7
paddb mm2,mm6
paddb mm2,mm0
movq QWORD ptr [edx],mm2
mov BYTE ptr [edx+8],0
ret
sse1_dw2hex endp
Quote from: jj2007 on July 03, 2008, 05:59:39 PM
Very short but a bit slow. But the design is interesting, maybe it could be tuned a little bit...
Hmm... weird...
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).
EDIT: My outputs:
30 cycles timing BIN$ 180 bytes 402 LAMPs
44 cycles timing pbin2 147 bytes 533 LAMPs
36 cycles timing nwDw2Bin 101 bytes 362 LAMPs
42 cycles timing nwDw2BinJJ 102 bytes 424 LAMPs
30 cycles timing NightWare 204 bytes 428 LAMPs
28 cycles timing BinLingo 187 bytes 383 LAMPs
34 cycles timing b2aDrizzAt 235 bytes 521 LAMPs
22 cycles timing mmx_dw2bin 132 bytes 253 LAMPs
75 cycles timing dw2binstr 49 bytes 525 LAMPs
32 Cycles
I'm beginning to wonder if it has to do with my CPU...
Quote from: DoomyD on July 03, 2008, 07:35:38 PM
Here's the build I used; for some reason it shows different results... Maybe it's the timing macro I use (the one from the sticky).
Shows 101 cycles for me, still slow compared to the 40 of the BIN$ and mmx_ variants. But it's indeed weird that I see 215 cycles on my puter, while your exe performs in 101; and you saw from my source that there is not much overhead.
I use timers.asm: \Masm32\macros\TIMERS.ASM 10095 bytes of 15.02.2005
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
61 cycles timing nwDw2Bin 101 bytes 613 LAMPs
65 cycles timing nwDw2BinJJ 102 bytes 656 LAMPs
54 cycles timing NightWare 204 bytes 771 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
45 cycles timing b2aDrizzAt 235 bytes 690 LAMPs
40 cycles timing mmx_dw2bin 132 bytes 460 LAMPs
215 cycles timing dw2binstr 49 bytes 1505 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
Quote from: qWord on July 03, 2008, 06:39:44 PM
hey jj,
the following code might be interesting for you:
How much improvement?
Look for mmx_dw2bin in the previously attached source, change the if 0 to if 1, and adapt your old code.
EDIT:
39 cycles timing BIN$ 180 bytes 523 LAMPs
23 cycles timing ssemmxonly 97 bytes 227 LAMPs
Could you PLEASE tune it a little bit? Say, 3 cycles less, just to get a round figure? :cheekygreen:
EDIT (2):
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red
EDIT (3):
39 cycles timing BIN$ 180 bytes 523 LAMPs
45 cycles timing pbin2 147 bytes 546 LAMPs
63 cycles timing nwDw2Bin 101 bytes 633 LAMPs
66 cycles timing nwDw2BinJJ 102 bytes 667 LAMPs
53 cycles timing NightWare 204 bytes 757 LAMPs
70 cycles timing BinLingo 187 bytes 957 LAMPs
45 cycles timing b2aDrizzAt 235 bytes 690 LAMPs
39 cycles timing QwordMmx 132 bytes 448 LAMPs xxxxxx
216 cycles timing dw2binstr 49 bytes 1512 LAMPs
350 cycles timing Dword2Bin2 56 bytes 2619 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
I renamed your dw2binstr to QwordMmx. Still the best LAMPs score :toothy
Attached the latest build with sources, as asm and rtf.
Question on the latter:
This displays just fine in (MS) WordPad and (jj) RichMasm (http://www.masm32.com/board/index.php?topic=9044.msg65608#msg65608), but (MS) Word has a serious problem with the (MS Windows) System font – they seem not as compatible as they should be... any ideas?
Quote from: DoomyD on July 03, 2008, 07:35:38 PM
75 cycles timing dw2binstr 49 bytes 525 LAMPs
32 Cycles
I'm beginning to wonder if it has to do with my CPU...
Nope, your CPU is fine, you are measuring different cycle counts on the same puter; so it's the code, not the CPU. Mind posting your source?
Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Still the best LAMPs score
nice to see :green
Quote from: jj2007 on July 03, 2008, 08:02:35 PM
Just discovered that we are comparing apples and oranges: your output is a hex$, as the name rightly suggests :red
sorry for confusing you :bg
i attached a file with a modified version of dw2hex
EDIT:
Quote
How much improvement?
a quick speed test shows that see2_dw2hex is approx. 4 times faster than dw2str from masm32.lib
Attached
[attachment deleted by admin]
Quote from: DoomyD on July 04, 2008, 05:10:48 AM
Attached
Mystery solved: Your code runs twice as fast because there is only one call in your timing loop...
counter_begin 100000h,HIGH_PRIORITY_CLASS
mov eax,00010010001101001010101111001101b ;1234ABCDh
mov ebx,offset str1
invoke dw2binstr
counter_end
My version:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov ecx, 11111000001111100000111111110000b
mov edx, offset Dw2BinBuffer
call dw2binstr
mov ecx, 00001111100000111110000011111111b
mov edx, offset Dw2BinBuffer
call dw2binstr
counter_end
P.S.: In the standard Masm32 installation, libs sit in \masm32\lib\
include \masm32\include\windows.inc
include \masm32\macros\timers.asm
include \masm32\macros\macros.asm
include \masm32\include\masm32.inc
includelib \masm32\lib\masm32.lib
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib
Finally, I found the time to take a closer look at it...
I modified drizz's modification, and squeezed another cycle out of it =) (although it takes 300,000,000 loops to actually see it :lol)
__QwordMmx proc
movq mm6, QWORD ptr [bitmsk]
movq mm7, QWORD ptr [ascmsk]
pxor mm5, mm5
movd mm0, eax
punpcklbw mm0, mm0
punpckhdq mm2, mm0
punpckldq mm0, mm0
punpckhwd mm0, mm0
punpckhwd mm2, mm2
punpckhdq mm1, mm0
punpckldq mm0, mm0
punpckhdq mm3, mm2
punpckldq mm2, mm2
punpckhdq mm1, mm1
punpckhdq mm3, mm3
pand mm0, mm6
pand mm1, mm6
pand mm2, mm6
pand mm3, mm6
pcmpeqb mm0, mm5
pcmpeqb mm1, mm5
pcmpeqb mm2, mm5
pcmpeqb mm3, mm5
paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7
movq [edx+24],mm0
movq [edx+16],mm1
movq [edx+8],mm2
movq [edx],mm3
mov BYTE ptr [edx+32],0
retn
__QwordMmx endp
Core 2 Duo x32:
25 - QwordMmx - qWord
22 - QwordMmx - drizz
21 - QwordMmx - new
loop count: 300000000
31 - QwordMmx - qWord
33 - QwordMmx - drizz
31 - QwordMmx - new
Celeron M ...
Noticed I can cap another line =)
qWordMmxOpt proc
movq mm7, QWORD ptr [ascmsk]
movq mm6, QWORD ptr [bitmsk]
pxor mm5, mm5
movd mm0, eax
punpcklbw mm0, mm0
punpckhdq mm2, mm0
punpcklwd mm0, mm0
punpckhwd mm2, mm2
punpckhdq mm1, mm0
punpckhdq mm3, mm2
punpckldq mm0, mm0
punpckhdq mm1, mm1
punpckldq mm2, mm2
punpckhdq mm3, mm3
pand mm0, mm6
pand mm1, mm6
pand mm2, mm6
pand mm3, mm6
pcmpeqb mm0, mm5
pcmpeqb mm1, mm5
pcmpeqb mm2, mm5
pcmpeqb mm3, mm5
paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7
movq [edx+24],mm0
movq [edx+16],mm1
movq [edx+08],mm2
movq [edx+00],mm3
mov BYTE ptr [edx+32],0
retn
qWordMmxOpt endp
Loop count: 2000000000
Method: timer_begin\timer_end:
10104 {00010010001101000101011001111000} QwordMmx(1) - qWord
10051 {00010010001101000101011001111000} QwordMmx(2) - qWord
9837 {00010010001101000101011001111000} _QwordMmx(1) - drizz
9885 {00010010001101000101011001111000} _QwordMmx(2) - drizz
9245 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
9191 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD
I timed it twice, the 2nd time with reduced loop count, but results are stable on a P4, 3.4 GHz:
Loop count: 2000000000
Method: timer_begin\timer_end:
11319 {00010010001101000101011001111000} QwordMmx(1) - qWord
11452 {00010010001101000101011001111000} QwordMmx(2) - qWord
14740 {00010010001101000101011001111000} _QwordMmx(1) - drizz
14475 {00010010001101000101011001111000} _QwordMmx(2) - drizz
13840 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
13833 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD
Loop count: 200000000
Method: timer_begin\timer_end:
1086 {00010010001101000101011001111000} QwordMmx(1) - qWord
1088 {00010010001101000101011001111000} QwordMmx(2) - qWord
1450 {00010010001101000101011001111000} _QwordMmx(1) - drizz
1455 {00010010001101000101011001111000} _QwordMmx(2) - drizz
1385 {00010010001101000101011001111000} __QwordMmx(1) - DoomyD
1387 {00010010001101000101011001111000} __QwordMmx(2) - DoomyD
My timings with your previous code on the Celeron M saw your code on par with qWord... which processor are you using?
Intel Core 2 Duo 6300 @ 1.8 GHz
Model: x86 615
I integrated your code into the package. LAMP-wise you made it, compliments... but NightWare has a little edge on speed, at least on my P4:
48 cycles timing BIN$ 180 bytes 644 LAMPs
59 cycles timing pbin2 147 bytes 715 LAMPs
86 cycles timing nwDw2Bin 101 bytes 864 LAMPs
69 cycles timing nwDw2BinJJ 102 bytes 697 LAMPs
42 cycles timing NightWare 204 bytes 600 LAMPs ****
86 cycles timing BinLingo 187 bytes 1176 LAMPs
50 cycles timing b2aDrizzAt 235 bytes 766 LAMPs
54 cycles timing MmxQword 132 bytes 620 LAMPs
52 cycles timing MmxDoomy 110 bytes 545 LAMPs ****
81 cycles timing dw2bin_ex 2140 bytes 3747 LAMPs
LAMPs = Lean And Mean Points = cycles * sqrt(size)
EDIT: Results for Celeron M - Doomy clearly in the lead (but BIN$ also ok for speed)
32 cycles timing BIN$ 180 bytes 429 LAMPs
42 cycles timing pbin2 147 bytes 509 LAMPs
54 cycles timing nwDw2Bin 101 bytes 543 LAMPs
57 cycles timing nwDw2BinJJ 102 bytes 576 LAMPs
39 cycles timing NightWare 204 bytes 557 LAMPs
36 cycles timing BinLingo 187 bytes 492 LAMPs
43 cycles timing b2aDrizzAt 235 bytes 659 LAMPs
33 cycles timing MmxQword 132 bytes 379 LAMPs
32 cycles timing MmxDoomy 110 bytes 336 LAMPs
60 cycles timing dw2bin_ex 2140 bytes 2776 LAMPs
3 cycles less for everybody - a bin$ is always 32 bytes long, so no need for poking a zero terminator. Celeron M timings:
30 cycles timing BIN$ 136 bytes 350 LAMPs
39 cycles timing NightWare 204 bytes 557 LAMPs
36 cycles timing BinLingo 187 bytes 492 LAMPs
33 cycles timing MmxQword 132 bytes 379 LAMPs
29 cycles timing MmxDoomy 106 bytes 299 LAMPs
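The point about the terminator can be sketched like this: if the buffer is static, the zero can be assembled in once and the conversion only ever touches the 32 digits. BinBuf and FillBin are hypothetical names, and the simple bit loop merely stands in for whichever fast algo is used.

```masm
include \masm32\include\masm32rt.inc

.data
BinBuf db 32 dup ('0'), 0       ; terminator assembled in once, never rewritten

.code
; Hypothetical helper: eax = value, writes only the 32 digits of BinBuf
FillBin proc
    mov  ecx, 31                ; rightmost digit first
@@: mov  dl, '0' shr 1          ; 18h, so that dl+dl = '0'
    shr  eax, 1                 ; lowest bit goes to the carry flag
    adc  dl, dl                 ; 18h + 18h + carry = '0' or '1'
    mov  [BinBuf+ecx], dl
    dec  ecx
    jns  @B
    ret
FillBin endp

start:
    mov  eax, 0FC0FC0AAh
    call FillBin
    print ADDR BinBuf, 13, 10   ; 11111100000011111100000010101010
    mov  eax, 1
    call FillBin
    print ADDR BinBuf, 13, 10   ; 00000000000000000000000000000001
    exit
end start
```

This only works when the macro owns the buffer. A caller-supplied buffer would still need its own terminator, or a known length.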
Quote
29 cycles timing BIN$ 136 bytes 338 LAMPs
45 cycles timing pbin2 144 bytes 540 LAMPs
49 cycles timing nwDw2Bin 101 bytes 492 LAMPs
43 cycles timing nwDw2BinJJ 102 bytes 434 LAMPs
28 cycles timing NightWare 200 bytes 396 LAMPs
27 cycles timing BinLingo 183 bytes 365 LAMPs
33 cycles timing b2aDrizzAt 235 bytes 506 LAMPs
21 cycles timing MmxQword 128 bytes 238 LAMPs
19 cycles timing MmxDoomy 106 bytes 196 LAMPs
52 cycles timing dw2bin_ex 2140 bytes 2406 LAMPs
Looking at the source though, I should point out that the algorithm uses the same data as qWord's mmx version; I'll include it separately.
By the way: I don't think it could be shortened any more than that. Here's my final code:
m_mmx_dw2bin macro Value:REQ, lpBuffer
LOCAL mmx_dw2bin_buffer
IFNDEF mmx_dw2bin_enabled
mmx_dw2bin_enabled equ <1>
.data
align 8
mmx_dw2bin_ascmsk db 8 dup (31h)
mmx_dw2bin_bitmsk db 80h,40h,20h,10h,08h,04h,02h,01h
ENDIF
.code
even
mov eax, Value
movq mm7, qword ptr [mmx_dw2bin_ascmsk]
movq mm6, qword ptr [mmx_dw2bin_bitmsk]
movd mm0, eax
punpcklbw mm0, mm0
punpckldq mm2, mm0
punpckhwd mm0, mm0
punpckldq mm1, mm0
punpckhwd mm2, mm2
punpckldq mm3, mm2
punpckhdq mm0, mm0
punpckhdq mm1, mm1
punpckhdq mm2, mm2
punpckhdq mm3, mm3
pandn mm0, mm6
pandn mm1, mm6
pandn mm2, mm6
pandn mm3, mm6
pcmpeqb mm0, mm6
pcmpeqb mm1, mm6
pcmpeqb mm2, mm6
pcmpeqb mm3, mm6
paddb mm0,mm7
paddb mm1,mm7
paddb mm2,mm7
paddb mm3,mm7
IFB <lpBuffer>
.data
mmx_dw2bin_buffer db 32 dup (0),0
align 4
.code
mov eax,offset mmx_dw2bin_buffer
ELSE
mov eax,lpBuffer
ENDIF
movq [eax+00],mm0
movq [eax+08],mm1
movq [eax+16],mm2
movq [eax+24],mm3
endm