I coded a binary string to dword conversion thinking I was going to need it for a project. It turned out I didn't, but I thought the result was worth posting.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.486 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
btodw PROTO :DWORD
bval MACRO lpstring ; binary string to unsigned 32 bit integer
invoke btodw, reparg(lpstring)
EXITM <eax>
ENDM
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
buffer db "1010101010101010",0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
print uhex$(bval(chr$("0")))
print chr$(13,10)
print uhex$(bval(chr$("00000001")))
print chr$(13,10)
print uhex$(bval(chr$("100")))
print chr$(13,10)
print uhex$(bval(chr$("10000001")))
print chr$(13,10)
print uhex$(bval(chr$("10101010")))
print chr$(13,10)
print uhex$(bval(chr$("11111111")))
print chr$(13,10)
print uhex$(bval(chr$("1000000000000000")))
print chr$(13,10)
print uhex$(bval(chr$("10000000000000000000000000000000")))
print chr$(13,10)
print uhex$(bval(chr$("11111111111111111111111111111111")))
print chr$(13,10)
count equ 10000000
call GetTickCount
push eax
mov ebx, count
@@:
invoke btodw, ADDR buffer
dec ebx
jnz @B
call GetTickCount
pop ebx
sub eax, ebx
print ustr$(eax)
print chr$(13,10,"Press enter to exit...",13,10)
invoke StdIn, ADDR buffer, 1
exit
btodw proc String:DWORD
; ----------------------------------------
; Convert binary string into dword value
; return value in eax
; ----------------------------------------
mov edx, [String]
xor eax, eax
xor ecx, ecx
jmp @f
fini:
mov eax, ecx
ret
ALIGN 4
@@:
mov al, [edx]
sub al, 30h
js fini
shl ecx, 1
add edx, 1
or ecx, eax
jmp @B
btodw endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Hi MichaelW,
Your work looks nice, I will give a try.
Nice. You could replace eax&ecx, so you would get rid of the final 'mov'. ;)
This would be at least little smaller:
btodw proc String:DWORD
mov edx, String
xor eax, eax
@@: movzx ecx, byte ptr [edx]
sub ecx, 30h
js fini
add eax, eax
add edx, 1
or eax, ecx
jmp @B
fini:
ret
btodw endp
Don't know how this would benchmark:
btodw proc String:DWORD
mov edx, String
xor eax, eax
@@: movzx ecx, byte ptr [edx]
add edx, 1
test ecx, ecx
je fini
lea eax, [eax*2 + ecx - 30h]
jmp @B
fini:
ret
btodw endp
Unless it was very important that this function be as fast as possible, I would probably exit on any character that was not 0 or 1 -- something like:
...
sub ecx, 30h
test ecx, -2
jnz fini
...
Petroizki,
I tried swaping eax and ecx and eliminating the mov in my procedure, but doing so slowed it down. Using lea to combine the shift and the add into a single instruction is a good idea that didn't occur to me. Your second procedure is the fastest on my P3 and it would probably be the fastest on the earlier processors, but I wonder how it would do on a P4.
Jibz,
I can see the sense in making all conversion procedures handle invalid characters as the dominant HLLs do, but I have always felt that just ending the conversion is a less than optimum method of handling invalid characters. An invalid character in a numeric string usually indicates an error, and I feel it should be handled as such. But, the HLL conventions are what they are... I could not think of any way to implement your suggestion in Petroizki's second procedure that did not involve adding two or more instructions to the loop.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.486 ; create 32 bit code
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\kernel32.lib
include \masm32\macros\macros.asm
btodw PROTO :DWORD
btodw0 PROTO :DWORD
btodw1 PROTO :DWORD
btodw2 PROTO :DWORD
btodw3 PROTO :DWORD
bval MACRO lpstring ; binary string to unsigned 32 bit integer
invoke btodw, reparg(lpstring)
EXITM <eax>
ENDM
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
buffer db "1010101010101010",0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
print uhex$(bval(chr$("0")))
print chr$(13,10)
print uhex$(bval(chr$("00000001")))
print chr$(13,10)
print uhex$(bval(chr$("100")))
print chr$(13,10)
print uhex$(bval(chr$("10000000")))
print chr$(13,10)
print uhex$(bval(chr$("10101010")))
print chr$(13,10)
print uhex$(bval(chr$("11111111")))
print chr$(13,10)
print uhex$(bval(chr$("1000000000000000")))
print chr$(13,10)
print uhex$(bval(chr$("10000000000000000000000000000000")))
print chr$(13,10)
print uhex$(bval(chr$("11111111111111111111111111111111")))
print chr$(13,10)
count equ 10000000
call GetTickCount
push eax
mov ebx, count
@@:
invoke btodw, ADDR buffer
dec ebx
jnz @B
call GetTickCount
pop ebx
sub eax, ebx
print ustr$(eax)
print chr$(13,10,13,10)
print uhex$(hval(chr$("12x34")))
print chr$(13,10)
print ustr$(val(chr$("12x34")))
print chr$(13,10,13,10,"Press enter to exit...",13,10)
invoke StdIn, ADDR buffer, 1
exit
btodw proc String:DWORD
mov edx, String
xor eax, eax
@@:
movzx ecx, byte ptr [edx]
add edx, 1
test ecx, 0FFFFFFCEh
jnz fini
test ecx, ecx
je fini
lea eax, [eax*2 + ecx - 30h]
jmp @B
fini:
ret
btodw endp
btodw0 proc String:DWORD
; ----------------------------------------
; Convert binary string into dword value
; return value in eax
; ----------------------------------------
mov edx, [String]
xor eax, eax
xor ecx, ecx
jmp @f
fini:
mov eax, ecx
ret
ALIGN 4
@@:
mov al, [edx]
sub al, 30h
js fini
shl ecx, 1
add edx, 1
or ecx, eax
jmp @B
btodw0 endp
btodw1 proc String:DWORD
mov edx, String
xor eax, eax
@@: movzx ecx, byte ptr [edx]
sub ecx, 30h
js fini
add eax, eax
add edx, 1
or eax, ecx
jmp @B
fini:
ret
btodw1 endp
btodw2 proc String:DWORD
mov edx, String
xor eax, eax
@@: movzx ecx, byte ptr [edx]
add edx, 1
test ecx, ecx
je fini
lea eax, [eax*2 + ecx - 30h]
jmp @B
fini:
ret
btodw2 endp
btodw3 proc String:DWORD
mov edx, String
xor ecx, ecx
@@: movzx eax, byte ptr [edx]
add edx, 1
test eax, eax
je fini
lea ecx, [ecx*2 + eax - 30h]
jmp @B
fini:
mov eax, ecx
ret
btodw3 endp
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
As posted (btodw0):
00000000
00000001
00000004
00000081
000000AA
000000FF
00008000
80000000
FFFFFFFF
1512
Petroizki 1 (btodw1):
00000000
00000001
00000004
00000081
000000AA
000000FF
00008000
80000000
FFFFFFFF
1783
Petroizki 2 (btodw2):
00000000
00000001
00000004
00000081
000000AA
000000FF
00008000
80000000
FFFFFFFF
1432
Petroizki 2, eax & ecx swapped, mov added (btodw3):
00000000
00000001
00000004
00000081
000000AA
000000FF
00008000
80000000
FFFFFFFF
1452
Petroizki 2 with my version of Jibz mod (btodw):
00000000
00000001
00000004
00000080
000000AA
000000FF
00008000
80000000
FFFFFFFF
2073
My contribution to this function is below. Notice that I read 4 bytes at a time. The loops are unrolled, but it would be easy to roll them up. By the way, how do you differentiate between '0', ' 00', '000', etc in the EAX value? Ratch
;*******************************************************************************
; ABS_ABDW: Converts a ASCII binary string into an arithmetic binary DWORD. *
; *
; Called by: PUSH <address of string> *
; CALL ABS_ABDW *
; *
; Returns: EAX=arithmetic binary value *
; *
; Notes: This subroutine conforms with WIN32 conventions regarding *
; register preservation and stack balancing. EBX,EBP,ESI,EDI are *
; not used. *
; *
; This subroutine makes the assumption that the string length will *
; not exceed 32 bytes, and will be terminated by a zero byte. *
; Furthermore, since it is a ASCII binary string, it is assumed *
; that the string will contain only ASCII zeros and ones. *
; *
; Coder: Ratch *
;*******************************************************************************
ABS_ABDW: ;it all begins here
MOV ECX,[ESP+DWORD] ;ECX=address of string
XOR EAX,EAX ;fresh beginning
@@:
MOV EDX,[ECX] ;EDX=4 bytes of string
TEST DL,DL ;is zero byte terminator?
JZ @F
SHL EAX,1 ;make room for next bit
SHR EDX,1 ;send 1 or 0 into the CF
ADC EAX,0 ;tack on CF to LSB of EAX
SHR EDX,7 ;move to the next byte
TEST DL,DL ;is zero byte terminator?
JZ @F
SHL EAX,1 ;make room for next bit
SHR EDX,1 ;send 1 or 0 into the CF
ADC EAX,0 ;tack on CF to LSB of EAX
SHR EDX,7 ;move to the next byte
TEST DL,DL ;is zero byte terminator?
JZ @F
SHL EAX,1 ;make room for next bit
SHR EDX,1 ;send 1 or 0 into the CF
ADC EAX,0 ;tack on CF to LSB of EAX
SHR EDX,7 ;move to the next byte
TEST DL,DL ;is zero byte terminator?
JZ @F
SHL EAX,1 ;make room for next bit
SHR EDX,1 ;send 1 or 0 into the CF
ADC EAX,0 ;tack on CF to LSB of EAX
SHR EDX,7 ;move to the next byte
ADD ECX,DWORD ;pointer to next word
JMP @B
@@:
RET DWORD ;return to sender
these are my timings on my P3 at work. I used RDTSC, and divided by the total number of loops. Other than that I just cut and pasted Michael's code.
btodw : 118
btodw0 : 83
btodw1 : 80
btodw2 : 68
btodw3 : 69
I am still trying other things. But I got it down to 57 cycles on my P3. The previous fastest way on my P3 was 68 cycles. Usually I wait to post my final code until I have the fastest possible version. But I figured I'd post this now in case it helps anyone see the light, and come up with another way that is a variation on my way. I just basically broke the inner loop up into handling two bytes at a time.
btodw4 proc String:DWORD
mov edx, String
xor eax, eax
align 4
@@:
movzx ecx, byte ptr [edx]
movzx esi, byte ptr [edx+1]
; add edx, 1
add edx,2
test ecx, ecx
je fini
lea eax, [eax*2 + ecx - 30h]
test esi, esi
je fini
lea eax, [eax*2 + esi - 30h]
jmp @B
fini:
ret
btodw4 endp
Here is something else I am trying. It still has a bug in it. I haven't figured out how to fixup it up properly at the end. But hopefully either someone will spot the fixup ( or I'll have time to do it), or someone will post a faster version using mine as a basis. I read a whole dword, and deal with 4 binary characters at a time. It runs in 29 cycles on my P3 at work.
btodw5 proc String:DWORD
mov edx, String
xor eax, eax
align 4
@@:
mov ecx, [edx]
add edx,4
sub ecx, 30303030h
shl eax,4 ;we deal with 4 bytes at a time, so we multiply by 16
or eax,ecx
;adjust for negative number from a 0 at the end
test ecx,80000000h
jnz fini
jmp @B
fini:
;need to do a fixup for EAX here
ret
btodw5 endp
Mark Larson,
Quote
I read a whole dword, and deal with 4 binary characters at a time.
Isn't that what the program I previously posted does? Ratch
mark,
Please note that your btodw4 uses esi, without restoring it. :eek
Mark, you don't compact the bits. So after your subtract you end up with (01010101h), not (0Fh)...
Easiest way to compact is to multiply by 01020408h, then shift right by 24, but this uses eax & edx :eek
Then you need to mask the unused bits, and shift eax accordingly (i.e. not always 4 if there are only 3 bits left you'll need to shift 3 :tdown ).
Mirno
As an aside, when I first saw the problem I mis-read it, and came up with an interesting version for strings that were always 32 characters long (i.e. don't deal with null terminators)...
mov edx, String ; EDX is our string pointer
mov eax, 1 ; Use 1, so we know when we're done
@@:
cmp BYTE PTR [edx], '1' ; CARRY flag is set for '0' and not for '1'
lea edx, [edx + 1] ; Increment EDX without touching the flags
rcl eax, 1 ; Add this bit to the result.
jnc @B ; If our marker hasn't fallen off, carry on
not eax ; Our result is inverted, so NOT to correct
ret ; ... And relax!
Mirno
Quote from: Ratch on January 05, 2005, 02:15:03 AM
Mark Larson,
Quote
I read a whole dword, and deal with 4 binary characters at a time.
Isn't that what the program I previously posted does? Ratch
What you did was read a whole dword at a time, but then handle it one byte at a time. I read a dword at a time, and then handle a dword at a time. See the difference? Also your code runs in 79 cycles on my P3. There were a few earlier posts that were faster than that.
Here's a repost of those previous timings:
btodw : 118
btodw0 : 83
btodw1 : 80
btodw2 : 68
btodw3 : 69
Quote from: Petroizki on January 05, 2005, 07:52:34 AM
mark,
Please note that your btodw4 uses esi, without restoring it. :eek
Thanks. I was just trying to get something up quick, without worrying about robustiness. You can add that later. That's the same reason I don't generally add comments to code while I am optimizing it.
Quote from: Mirno on January 05, 2005, 11:49:08 AM
Mark, you don't compact the bits. So after your subtract you end up with (01010101h), not (0Fh)...
Easiest way to compact is to multiply by 01020408h, then shift right by 24, but this uses eax & edx :eek
Then you need to mask the unused bits, and shift eax accordingly (i.e. not always 4 if there are only 3 bits left you'll need to shift 3 :tdown ).
Mirno
Thanks. I am actually trying something different at the moment. I had a bit of a brainstorm on the way home from work. I am hoping it is faster than the read and handle a dword at a time version.
Mirno, I didn't compact until the end. I didn't have time to take a close look at it to see if that introduced any bugs. I also didn't do any AND's at the end. The test value came out right. For the string that Michael provides it prints out a correct value. I really should do more testing to verify it works with different values. It runs in 30 cyclces on my P3 with doing the compaction only at the end. Modifying the compaction to doing it every loop would add a lot of overhead.
btodw5 proc String:DWORD
mov edx, String
xor eax, eax
align 4
@@:
mov ecx, [edx]
add edx,4
sub ecx, 30303030h
shl eax,4 ;we deal with 4 bytes at a time, so we multiply by 16
or eax,ecx
;adjust for negative number from a 0 at the end
test ecx,80000000h
jnz fini
jmp @B
fini:
;need to do a fixup for EAX here
mov esi,01020408h
mul esi ; result in EDX:EAX
shrd eax,edx,24
shr eax,24
ret
btodw5 endp
EDIT: It doesn't return the correct result. I made a boo-boo. I ran the wrong version when printing the result.
This appears to produce the correct results:
btodw5 proc uses ebx esi String:DWORD
mov esi, String
xor ebx, ebx
align 4
@@:
mov ecx, [esi]
sub ecx, 30303030h
test ecx, 80000000h
jnz dobytes
add esi, 4
bswap ecx
mov eax, 01020408h ; Good stuff Mirno :)
mul ecx
shl ebx, 4
shr eax, 24
or ebx, eax
jmp @B
dobytes:
movzx ecx, byte ptr [esi]
add esi, 1
test ecx, ecx
je fini
lea ebx, [ebx*2 + ecx - 30h]
jmp dobytes
fini:
mov eax, ebx
ret
btodw5 endp
But, for the cases where you are not handling a multiple of four bytes, I think it will be slower than this:
btodw4 proc uses ebx esi edi String:DWORD
mov edx, String
xor eax, eax
align 4
@@:
movzx ecx, byte ptr [edx]
movzx ebx, byte ptr [edx+1]
movzx esi, byte ptr [edx+2]
movzx edi, byte ptr [edx+3]
add edx, 4
test ecx, ecx
jz fini
lea eax, [eax*2 + ecx - 30h]
test ebx, ebx
jz fini
lea eax, [eax*2 + ebx - 30h]
test esi, esi
jz fini
lea eax, [eax*2 + esi - 30h]
test edi, edi
jz fini
lea eax, [eax*2 + edi - 30h]
jmp @B
fini:
ret
btodw4 endp
Mark,
Could you post the P4 clock counts for reference?
MichaelW,
I reworked your code to make it smaller, faster and fixed a bug. A test of 80808080H is needed to find the zero byte in a string such as BYTE '111111010',0,'111111'. Ratch
;-------------------------------------------------------------------------------
;*******************************************************************************
; ABS_ABDW: Converts a ASCII binary string into an arithmetic binary DWORD. *
; *
; Called by: PUSH <address of string> *
; CALL ABS_ABDW *
; *
; Returns: EAX=arithmetic binary value *
; *
; Notes: This subroutine conforms with WIN32 conventions regarding *
; register preservation and stack balancing. EBX,EBP,ESI,EDI are *
; preserved. *
; *
; This subroutine makes the assumption that the string length will *
; not exceed 32 bytes, and will be terminated by a zero byte. *
; Furthermore, since it is a ASCII binary string, it is assumed *
; that the string will contain only ASCII zeros and ones. *
; *
; Coder: Ratch *
;*******************************************************************************
ABS_ABDW: ;it all begins here
PUSH ESI ;guard register for Windows
XOR ECX,ECX ;clear the collector
MOV ESI,[ESP+2*DWORD] ;ESI=address of string
ALIGN 4 ;line it up
@@:
MOV EDX,[ESI] ;the whole word and nothing but the word
SUB EDX,'0000' ;convert to binary bytes
TEST EDX,80808080H ;any zero bytes?
JNZ @F ;yes, do one byte at a time
MOV EAX,08040201H ;compaction constant
MUL EDX ;compaction operation
SHL ECX,4 ;make room for a nibble
SHR EAX,24 ;line up nibble
ADD ESI,DWORD ;next DWORD
OR ECX,EAX ;tack on nibble
JMP @B ;do next DWORD
@@:
MOVZX EDX,BYTE PTR [ESI] ;one byte at a time
TEST EDX,EDX ;zero byte?
JE @F ;yes, all done
INC ESI ;advance pointer one byte
LEA ECX,[ECX*2+EDX-'0'] ;tack on nibble
JMP @B ;do next byte
@@:
MOV EAX,ECX ;hand off for Windows
POP ESI ;restore for Windows
RET DWORD ;return to sender and balance the stack
Ratch,
You are correct about needing to test with 80808080h. I ran some tests before I decided to use 80000000h, but I didn't test all possible combinations. I was not aware that reversing the compaction constant would reverse the order of the bytes at their final destination. Doing so allowed me to remove the bswap from my procedure, which did speed it up somewhat, but even so yours is faster by ~8%.
:U
The attachment contains the two best implementations (IMO) along with a new test app that uses my clock_ctr macros. I took the liberty of reformatting the code posted by Petroizki and Ratch (I hope neither of you will be offended).
[attachment deleted by admin]
Quote from: MichaelW on January 06, 2005, 12:03:17 AM
Mark,
Could you post the P4 clock counts for reference?
I've been bad. Played WOW ( World of Warcraft) all weekend ;). So didn't run anyting.
For the new code on my P3 the small version runs in 59 cycles ,and the fast version runs in 33 cycles.
When I cut and paste btodw_fast into the framework I already have for timing it runs in 40 cycles. My version sets the priority to REALTIME before doing any timings. I am not sure what the difference are. If I have time later I will post my code.
If I eliminate the code that compensates for the loop overhead, the cycle count goes from 39 to 45.
In the process of checking the above I noticed that the cycle counts for btodw_small increase from 59 to 72. The loop in the procedure is already aligned, and aligning the procedure, or the invoke that calls it, does not correct the problem. This leads me to suspect that something in the macros needs to be aligned.
Let me cut and paste the timing portion of my code. To calculate the 6 cycles of overhead that I subtract out below, I simply commented out the call to the procedure. You automated that part in your start macro.
I think I figured out your bug. Some people who ran your code posted negative times. You can't have a negative time unless the portion oof the start macro that calculates the loop overhead is calculating too large of a value.
When you get the macro's flushed out, we need to make that post a sticky. We get asked all the time for how to do timing code.
LOOP_CNT equ 10000000
invoke GetCurrentProcess
invoke SetPriorityClass,eax,REALTIME_PRIORITY_CLASS
cpuid
rdtsc
push edx
push eax
mov ecx,LOOP_CNT
align 4
outer_loop:
push ecx
invoke btodw_fast,ADDR buffer
pop ecx
dec ecx
jnz outer_loop
cpuid
rdtsc
pop esi
sub eax,esi
pop edi
sbb edx,edi
emms
mov esi,LOOP_CNT
div esi
;accumulates 6 cycles of overhead AFTER the division
sub eax,6
invoke wsprintf,ADDR time_msg,ADDR time_fmt,eax
invoke StdOut,ADDR time_msg
The only reason I can see for the loop overhead being wrong is alignment. And experimenting with adding variable numbers of NOPs at the start of the program, I determined that the result is definitely affected by the overall alignment. So I did essentially as Agner Fog did in his WTEST.ASM, and modified the clockctr_begin macro adding an ALIGN 16 in front of the test data, the reference loop, and the test loop. After doing so the cycle counts stabilized at 81 and 39, independent of the overall alignment. New code in the attachment.
[attachment deleted by admin]
Quote from: MichaelW on January 10, 2005, 10:29:29 PM
The only reason I can see for the loop overhead being wrong is alignment. And experimenting with adding variable numbers of NOPs at the start of the program, I determined that the result is definitely affected by the overall alignment. So I did essentially as Agner Fog did in his WTEST.ASM, and modified the clockctr_begin macro adding an ALIGN 16 in front of the test data, the reference loop, and the test loop. After doing so the cycle counts stabilized at 81 and 39, independent of the overall alignment. New code in the attachment.
WOO WOO. Good job. :) The 39 is only one cycle off what I got ( which was 40). So it sounds like you got it tweaked and working :)
Hi MichaelW,
The results on my PIV 2.66 GHz:
btodw_small : 63 clocks
btodw_fast : 61 clocks
Hi Vortex,
Interesting, a > 2:1 advantage on a P3 and a negligible advantage on a P4.
I had previously tried correcting what appeared to me to be dependency problems with the btodw_fast code (moving one instruction), and altering the alignment, but no improvement on a P3 so I discarded the changes. I have now added a btodw_fast_p4 procedure that includes these two changes. I also added a prompt at the start of the program, because a short delay after startup seems to improve the consistency of the results for the first test. Running on a P3 the clock cycle counts are 81, 39, 39. New code in the attachment.
[attachment deleted by admin]
Hi MichaelW,
Thanks for the new release, here the latest results:
1st run:61,62,70
2nd run:61,62,74
3rd run:60,63,68
Quote
1st run:61,62,70
2nd run:61,62,74
3rd run:60,63,68
Obviously, my optimizations weren't :bg I suspect the call to call variation in cycle counts indicates some serious problems with the code. I think the (original) fast version should be faster than the small version on any processor. I would like to know what the problems are, but unfortunately I gave my only P4 system to a nephew early last year.