It's an algo that was written and contributed about 10 years ago and I have never used it, but while testing a range of similar algorithms recently I found that the "udw2str" algorithm fails over 1 gig and should not be used. It will be replaced with an equate to another algorithm that has been exhaustively tested over the unsigned DWORD range.
wow - surprising it wasn't found sooner :bg
It may be unsound, but it's still good enough for my financial calculations.
:bg
Michael,
That just means you have not been working on your first billion yet. :P
Seriously though, the algo is broken and I have 3 others that are faster and exhaustively tested.
If you can be bothered, give this one a try.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
udw2str2 proc dwNumber:DWORD, pszString:DWORD
mov eax, [esp+4]
cmp eax, 9
ja @F
mov edx, 1
jmp nxt
@@:
cmp eax, 99
ja @F
mov edx, 2
jmp nxt
@@:
cmp eax, 999
ja @F
mov edx, 3
jmp nxt
@@:
cmp eax, 9999
ja @F
mov edx, 4
jmp nxt
@@:
cmp eax, 99999
ja @F
mov edx, 5
jmp nxt
@@:
cmp eax, 999999
ja @F
mov edx, 6
jmp nxt
@@:
cmp eax, 9999999
ja @F
mov edx, 7
jmp nxt
@@:
cmp eax, 99999999
ja @F
mov edx, 8
jmp nxt
@@:
cmp eax, 999999999
ja @F
mov edx, 9
jmp nxt
@@:
mov edx, 10
nxt:
mov eax, [esp+4] ; src
mov ecx, [esp+8] ; lpdest ; buffer address
push ebx
push esi
push edi
mov esi, 0CCCCCCCDh ; "magic number" multiplier for division by 10
mov ebx, 10
add ecx, edx ; add required buffer length to start address
mov BYTE PTR [ecx], 0 ; write the terminator
sub ecx, 1 ; decrement down to first write position
@@:
mul esi ; multiply by magic number
shrd eax, edx, 3 ; binary fractional "remainder" back in EAX
shr edx, 3 ; EDX = quotient
add eax, 1 ; precaution against occasional "underflow"
mov edi, edx ; save current quotient for next division
mul ebx ; x10 gets "decimal" remainder into EDX
add dl, 30h ; convert remainder to ascii
mov [ecx], dl ; insert it into buffer
sub ecx, 1 ; decrement buffer location
mov eax, edi ; retrieve current quotient
test eax, eax ; test if done
jz @F ; continue if not done
mul esi ; multiply by magic number
shrd eax, edx, 3 ; binary fractional "remainder" back in EAX
shr edx, 3 ; EDX = quotient
add eax, 1 ; precaution against occasional "underflow"
mov edi, edx ; save current quotient for next division
mul ebx ; x10 gets "decimal" remainder into EDX
add dl, 30h ; convert remainder to ascii
mov [ecx], dl ; insert it into buffer
sub ecx, 1 ; decrement buffer location
mov eax, edi ; retrieve current quotient
test eax, eax ; test if done
jnz @B ; continue if not done
@@:
pop edi
pop esi
pop ebx
mov eax, [esp+8]
ret 8
udw2str2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
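For anyone puzzled by the 0CCCCCCCDh "magic number": it is ceil(2^35 / 10), so an unsigned multiply followed by a 35-bit right shift (the MUL plus SHR EDX,3 pair in the listing above) gives an exact divide by 10 for every 32-bit value. A minimal C check of the trick (the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Reciprocal-multiply division by 10, as used by the MASM code above.
   0xCCCCCCCD == ceil(2^35 / 10), so (n * 0xCCCCCCCD) >> 35 == n / 10
   exactly, for every 32-bit unsigned n. */
static uint32_t div10_magic(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```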
i can do my taxes with five digits - and that includes pennies :lol
Quote from: dedndave on November 24, 2010, 02:38:19 AM
wow - surprising it wasn't found sooner :bg
lol, you must have been busy with the algo timings :lol
http://www.masm32.com/board/index.php?topic=14642.msg118452#msg118452
hutch, i tested it against mine...just a bit of fun :P
Quote
129 cycles for udw2str2
104 cycles for u32toa
use console, assemble and link to build the test
Looks good,
Here is the timing on my Core2 quad.
99 cycles for udw2str2
98 cycles for u32toa
Press any key to continue ...
Now the big question, have you done the exhaustive test over the whole unsigned range? What beat the old version was that it failed over 1 gig; the older version was about 30% faster once it was optimised.
i tested it by using msvcrt ultoa and my algo to create 2 strings which i then strcmp'ed, starting at 0 and ending at 4294967295 (inclusive). my algo's output matched ms ultoa over the full range so i'm certain it works for all unsigned values
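The exhaustive test described above is simple to reproduce. A portable C sketch of the same idea (names are mine, and snprintf stands in here for msvcrt's _ultoa as the reference converter):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* stand-in for the routine under test; the forum test compared the
   MASM routine's output against msvcrt ultoa the same way */
static char *my_u32toa(uint32_t n, char *buf)
{
    char tmp[11];
    int i = 0, j = 0;
    do { tmp[i++] = (char)('0' + n % 10); n /= 10; } while (n);
    while (i) buf[j++] = tmp[--i];
    buf[j] = '\0';
    return buf;
}

/* sweep a sub-range against the reference; the forum test swept the
   full range 0..4294967295 inclusive */
static int verify_range(uint32_t lo, uint32_t hi)
{
    char got[16], ref[16];
    for (uint32_t n = lo; ; n++) {
        snprintf(ref, sizeof ref, "%u", n);
        if (strcmp(my_u32toa(n, got), ref) != 0)
            return 0;                    /* mismatch found */
        if (n == hi)
            break;                       /* avoids wraparound at max */
    }
    return 1;
}
```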
Just ran the same test and it's correct over the entire range. :U
Will have a play as this algo has room to make it faster.
i'll be interested to see how much faster you can make it :U
I also tested against the CRT _ultoa function over the entire 32-bit range and found no differences in the returned strings. And I tested with a non-null-filled destination buffer and dumped the first 20 bytes of the buffer after the test to verify that the procedure is not writing to the buffer beyond the 11th byte.
This tweak is testing up OK over the full unsigned range. Removing the PUSHAD/POPAD pair dropped the time; I then removed the stack frame, unrolled it by 2, added the code to calculate the result length, and removed the reverse string loop at the end. In the benchmark I am using, the time dropped from 250 ms to 187 ms.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
ALIGN 16
; u32toa by brethren
utoa2 PROC, value:DWORD, szbuf:DWORD
push ebx
push esi
push edi
mov eax, [esp+4][12] ; value
mov esi, [esp+8][12] ; szbuf
cmp eax, 9
ja lb1
add esi, 1
jmp nxt
lb1:
cmp eax, 99
ja lb2
add esi, 2
jmp nxt
lb2:
cmp eax, 999
ja lb3
add esi, 3
jmp nxt
lb3:
cmp eax, 9999
ja lb4
add esi, 4
jmp nxt
lb4:
cmp eax, 99999
ja lb5
add esi, 5
jmp nxt
lb5:
cmp eax, 999999
ja lb6
add esi, 6
jmp nxt
lb6:
cmp eax, 9999999
ja lb7
add esi, 7
jmp nxt
lb7:
cmp eax, 99999999
ja lb8
add esi, 8
jmp nxt
lb8:
cmp eax, 999999999
ja lb9
add esi, 9
jmp nxt
lb9:
add esi, 10
nxt:
mov ebx, 0CCCCCCCDh
mov BYTE PTR [esi], 0
sub esi, 1
stlp:
mov ecx, eax
mul ebx
shr edx, 3
mov eax, edx
lea edx, [edx+edx*4]
lea edx, [edx+edx]
sub ecx, edx
add cl, '0'
mov [esi], cl
sub esi, 1
test eax, eax
jbe lpqt
mov ecx, eax
mul ebx
shr edx, 3
mov eax, edx
lea edx, [edx+edx*4]
lea edx, [edx+edx]
sub ecx, edx
add cl, '0'
mov [esi], cl
sub esi, 1
test eax, eax
ja stlp
lpqt:
pop edi
pop esi
pop ebx
ret 8
utoa2 ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
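The key change in utoa2 above is doing the digit count up front so the string can be written right to left from the terminator, which eliminates the reversal pass entirely. The same idea in C (a sketch with my own names, not the exact MASM logic):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* count digits first, write the terminator, then fill the buffer
   right-to-left - no reversal loop needed afterwards */
static char *u32toa_backfill(uint32_t n, char *buf)
{
    int len = 1;
    uint64_t p = 10;                       /* 64-bit so 10^10 doesn't wrap */
    while (n >= p) { len++; p *= 10; }     /* digit count: 1..10 */
    char *end = buf + len;
    *end = '\0';                           /* terminator goes in first */
    do { *--end = (char)('0' + n % 10); n /= 10; } while (n);
    return buf;
}
```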
a 20% speed increase when run on my pc :clap: btw i really like how you've eliminated the need for reversing the string :wink
Quote
131 cycles for udw2str2
104 cycles for u32toa
81 cycles for utoa2
the same improvements could be made to my dwtoa routine as it's virtually identical apart from the sign test
.code
dwtoa PROC, val:DWORD, buf:PTR BYTE
pushad
mov eax, val
mov ebx, 0CCCCCCCDh
mov esi, buf
test eax, eax
jge @F
mov BYTE PTR [esi], '-'
add esi, 1
neg eax
@@:
mov edi, esi
@@:
mov ecx, eax
mul ebx
shr edx, 3
mov eax, edx
lea edx, [edx+edx*4]
add edx, edx
sub ecx, edx
add cl, '0'
mov [esi], cl
add esi, 1
test eax, eax
ja @B
mov BYTE PTR [esi], NULL
sub esi, 1
@@:
mov al, [edi]
mov bl, [esi]
mov [edi], bl
mov [esi], al
add edi, 1
sub esi, 1
cmp edi, esi
jb @B
popad
ret
dwtoa ENDP
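For contrast with the backfill approach, the dwtoa above converts forward and then reverses in place, with a sign test first. A C sketch of that structure (my names; the negate handles INT_MIN correctly because the magnitude is kept unsigned):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* signed counterpart mirroring dwtoa: emit '-', negate, convert
   forward, then reverse in place (the step utoa2 avoids) */
static char *i32toa_reverse(int32_t v, char *buf)
{
    char *s = buf;
    uint32_t n = (uint32_t)v;
    if (v < 0) { *s++ = '-'; n = 0u - n; }   /* unsigned negate, safe for INT_MIN */
    char *d = s;
    do { *d++ = (char)('0' + n % 10); n /= 10; } while (n);
    *d = '\0';
    /* reverse the digits (not the sign) in place */
    for (char *e = d - 1; s < e; s++, e--) {
        char t = *s; *s = *e; *e = t;
    }
    return buf;
}
```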
brethren,
The same but faster coz I started with binary search: :wink
align 16
;u32toalin by lingo
db 11 Dup (0)
u32toalin PROC, value:DWORD, szbuf:DWORD
pop edx
pop eax
cmp eax, 99999
pop ecx
jbe L4
cmp eax, 9999999
push edx
jbe L2
cmp eax, 999999999
push ebx
ja L8
cmp eax, 99999999
ja L1
add ecx, 8
push esi
jne L9
L1 :
add ecx, 9
push esi
jne L9
L2:
cmp eax, 999999
push ebx
ja L3
add ecx, 6
push esi
jne L9
L3:
add ecx, 7
push esi
jne L9
L4:
cmp eax, 999
push edx
jbe L6
cmp eax, 9999
push ebx
ja L5
add ecx, 4
push esi
jne L9
L5:
add ecx, 5
push esi
jne L9
L6:
cmp eax, 99
push ebx
jbe L7
add ecx, 3
push esi
jne L9
L8:
push esi
add ecx, 10
L9:
mov byte ptr [ecx], 0
mov esi, 0CCCCCCCDh
@@:
mov ebx, eax
mul esi
shr edx, 3
sub ecx, 1
mov eax, edx
lea edx, [edx+edx*4-18h]
lea edx, [edx+edx]
sub ebx, edx
mov [ecx], bl
test eax, eax
je @f
mov ebx, eax
mul esi
shr edx, 3
sub ecx, 1
mov eax, edx
lea edx, [edx+edx*4-18h]
lea edx, [edx+edx]
sub ebx, edx
mov [ecx], bl
test eax, eax
jne @b
@@:
pop esi
mov eax, ecx
pop ebx
ret
L7:
cmp eax, 9
push esi
jbe L10
add ecx, 2
jne L9
L10:
add eax, 30h
pop esi
mov [ecx], ax
pop ebx
mov eax, ecx
ret
u32toalin ENDP
and results:
76 cycles for udw2str2-Hutch
65 cycles for u32toalin-Lingo
Press any key to continue ...
:bg
Use the later version posted in the Lab with this code then try string lengths from 1 to 10.
udw2str2 proc dwNumber:DWORD, pszString:DWORD
push ebx
push esi
mov eax, [esp+4][8] ; value
mov esi, [esp+8][8] ; szbuf
cmp eax, 9999
ja lb4
cmp eax, 999999
ja lb6
cmp eax, 9
ja lb1
add esi, 1
jmp nxt
lb1:
cmp eax, 99
ja lb2
add esi, 2
jmp nxt
lb2:
cmp eax, 999
ja lb3
add esi, 3
jmp nxt
lb3:
cmp eax, 9999
ja lb4
add esi, 4
jmp nxt
lb4:
cmp eax, 99999
ja lb5
add esi, 5
jmp nxt
lb5:
cmp eax, 999999
ja lb6
add esi, 6
jmp nxt
lb6:
cmp eax, 9999999
ja lb7
add esi, 7
jmp nxt
lb7:
cmp eax, 99999999
ja lb8
add esi, 8
jmp nxt
lb8:
cmp eax, 999999999
ja lb9
add esi, 9
jmp nxt
lb9:
add esi, 10
nxt:
mov ebx, 0CCCCCCCDh
mov BYTE PTR [esi], 0
sub esi, 1
stlp:
mov ecx, eax
mul ebx
shr edx, 3
mov eax, edx
lea edx, [edx+edx*4]
lea edx, [edx+edx-48]
sub ecx, edx
mov [esi], cl
sub esi, 1
test eax, eax
jbe lpqt
mov ecx, eax
mul ebx
shr edx, 3
mov eax, edx
lea edx, [edx+edx*4]
lea edx, [edx+edx-48]
sub ecx, edx
mov [esi], cl
sub esi, 1
test eax, eax
ja stlp
lpqt:
pop esi
pop ebx
ret 8
udw2str2 endp
wouldn't it make sense to test for longer values first ?
frequency of occurrence
32 bit integers converted to decimal strings
unsigned (number of characters = number of digits)
1 digit 10
2 digits 90
3 digits 900
4 digits 9000
5 digits 90000
6 digits 900000
7 digits 9000000
8 digits 90000000
9 digits 900000000
10 digits 3294967296
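The chart follows directly from the ranges: 9*10^(d-1) values need d digits, except the 1-digit bucket (which includes 0) and the 10-digit bucket, which is cut off at 2^32 rather than 10^10. A quick C check of the counts (helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* how many 32-bit unsigned values print with exactly d decimal digits */
static uint64_t count_with_digits(int d)
{
    uint64_t p10 = 1;
    for (int i = 1; i < d; i++) p10 *= 10;            /* 10^(d-1) */
    uint64_t lo = (d == 1) ? 0 : p10;                 /* first d-digit value */
    uint64_t hi = (d == 10) ? 4294967296ull           /* capped at 2^32 */
                            : p10 * 10;               /* one past the last */
    return hi - lo;
}
```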
:bg
Yes, but it was not any faster. I think Lingo has the right idea; middle branching on a basic tree layout made the first run of the algo faster but it does not seem to affect the sustain rate.
it would be faster overall !!!
if you wrote a test piece to take the time for each of the 4294967296 possible values,
added those times together, then divided by the same number, you'd see an improvement
(that's why i posted the frequency of occurrence chart)
yes - splitting it into groups reduces the number of CMPs/JMPs that are executed, i think
cmp eax,9999999
ja more_than_7_digits
cmp eax,9999
ja more_than_4_digits
;-------------------------------
cmp eax,999
ja have_4_digits
cmp eax,99
ja have_3_digits
cmp eax,9
ja have_2_digits
have_1_digit:
add esi,1
jmp nxt
have_2_digits:
add esi,2
jmp nxt
have_3_digits:
add esi,3
jmp nxt
have_4_digits:
add esi,4
jmp nxt
;-------------------------------
more_than_4_digits:
cmp eax,999999
ja have_7_digits
cmp eax,99999
ja have_6_digits
have_5_digit:
add esi,5
jmp nxt
have_6_digits:
add esi,6
jmp nxt
have_7_digits:
add esi,7
jmp nxt
;-------------------------------
more_than_7_digits:
cmp eax,999999999
ja have_10_digits
cmp eax,99999999
ja have_9_digits
have_8_digits:
add esi,8
jmp nxt
have_9_digits:
add esi,9
jmp nxt
have_10_digits:
add esi,10
;-------------------------------
nxt:
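The tree above never executes more than four compares for any input, versus up to nine in the linear chain, and the long values (which the frequency chart shows are the common case) take the fewest. The same structure as a C sketch (function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* digit count via the compare tree above: at most 4 compares,
   with 8..10 digit values (the most frequent) resolved first */
static int digits_tree(uint32_t n)
{
    if (n > 9999999) {                    /* 8..10 digits */
        if (n > 999999999) return 10;
        return (n > 99999999) ? 9 : 8;
    }
    if (n > 9999) {                       /* 5..7 digits */
        if (n > 999999) return 7;
        return (n > 99999) ? 6 : 5;
    }
    if (n > 999) return 4;                /* 1..4 digits */
    if (n > 99)  return 3;
    return (n > 9) ? 2 : 1;
}
```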
in this case, i had 9 digits max
i needed to scale the value - not just adjust a pointer
it might be better if i used 32-bit instructions for EBX and ECX, though
i know - i know - i can unroll it, too :bg
;determine the number of digits in the first dword
;ECX = 0 or 1
;ESI = active index pointer (offset of highest order accumulator dword)
;EBP = stack frame base pointer
Ascii0: mov cl,9 ;ECX = 9
mov eax,[esi] ;load the first dword
mov bl,3 ;BL = 3
push eax ;save the first dword
;scale the first dword by 1,000 to count digits
;we may make 0, 1, or 2 passes on this loop
Ascii1: cmp eax,1000000 ;is it >= 1,000,000 ?
jae Ascii2 ;yes - determine mod 3 count
sub cl,bl ;no - subtract 3 from count
imul eax,eax,1000 ;scale it by 1,000
cmp cl,bl ;last 3 digits ?
ja Ascii1 ;no - until it is
Dave,
here is a version that does the calculations top down with branch prediction backwards and it's still no faster.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
ALIGN 16
; u32toa by brethren
utoa2 PROC, value:DWORD, szbuf:DWORD
push ebx
push esi
mov eax, [esp+4][8] ; value
mov esi, [esp+8][8] ; szbuf
cmp eax, 1000000000
jb lb9
add esi, 10
nxt:
mov ebx, 0CCCCCCCDh
mov BYTE PTR [esi], 0
sub esi, 1
align 4
stlp:
mov ecx, eax
mul ebx
shr edx, 3
mov eax, edx
lea edx, [edx+edx*4]
lea edx, [edx+edx-48]
sub ecx, edx
mov [esi], cl
sub esi, 1
test eax, eax
jbe lpqt
mov ecx, eax
mul ebx
shr edx, 3
mov eax, edx
lea edx, [edx+edx*4]
lea edx, [edx+edx-48]
sub ecx, edx
mov [esi], cl
sub esi, 1
test eax, eax
ja stlp
lpqt:
pop esi
pop ebx
ret 8
align 4
lb1:
add esi, 1
jmp nxt
lb2:
cmp eax, 10
jb lb1
add esi, 2
jmp nxt
lb3:
cmp eax, 100
jb lb2
add esi, 3
jmp nxt
lb4:
cmp eax, 1000
jb lb3
add esi, 4
jmp nxt
lb5:
cmp eax, 10000
jb lb4
add esi, 5
jmp nxt
lb6:
cmp eax, 100000
jb lb5
add esi, 6
jmp nxt
lb7:
cmp eax, 1000000
jb lb6
add esi, 7
jmp nxt
lb8:
cmp eax, 10000000
jb lb7
add esi, 8
jmp nxt
lb9:
cmp eax, 100000000
jb lb8
add esi, 9
jmp nxt
utoa2 ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
well - it's no faster on your i7 :P
and - testing it with the same value over and over isn't going to reveal anything
i think the tree here is a good plan
http://www.masm32.com/board/index.php?topic=15427.msg126379#msg126379
i have to get my head into your test code - lol
"Use the later version posted in the Lab..."
I did it and my algo is still faster... :wink
results are:
69 cycles for udw2str2-Hutch
65 cycles for u32toalin-Lingo
Press any key to continue ...
lingos test4 on AMD Turion(tm) 64 X2 Mobile Technology TL-52
71 cycles for udw2str2-Hutch
77 cycles for u32toalin-Lingo
Press any key to continue ...
test2
80 cycles for udw2str2-Hutch
68 cycles for u32toalin-Lingo
Press any key to continue ...
surprisingly the only difference in the code in u32toalin between lingo's test2 and test4 (where's test3 :P) is:
test4
lea edx, [edx+edx*4-24]
lea edx, [edx+edx]
test2
lea edx, [edx+edx*4]
lea edx, [edx+edx-30h]
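The two forms are arithmetically identical: test4 folds the ASCII bias into the first LEA as 2*(5q - 24), test2 into the second as 2*5q - 30h, and both come to 10q - 48, so the following SUB leaves the digit already converted to ASCII. A C check of the equivalence (function names are mine; the unsigned wraparound matches the 32-bit asm):

```c
#include <assert.h>
#include <stdint.h>

/* test4: lea edx,[edx+edx*4-18h] ; lea edx,[edx+edx]   -> 2*(5q-24) */
static uint32_t form_test4(uint32_t q) { uint32_t t = q + q * 4 - 24; return t + t; }

/* test2: lea edx,[edx+edx*4] ; lea edx,[edx+edx-30h]   -> 2*5q-48 */
static uint32_t form_test2(uint32_t q) { uint32_t t = q + q * 4; return t + t - 0x30; }
```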
Quote from: brethren on November 25, 2010, 05:04:41 PM
surprisingly the only difference in the code between lingo's test2 and test4 (where's test3 :P) is:
Same size, same cycle count on the Celeron M. But Hutch's algo runs at 94 cycles instead of 104.
:bg
Dave,
I mainly use the Core2 Quad, the i7 Win64 is for watching TV while I am writing code on the other box. :P
It runs fine and is slightly faster here and there than the Core2 quad but I confess I don't like Win7 much.
The last algo I posted that does the lead calculation top down should have the lowest overhead for 10 digit numbers as it just falls straight through into the conversion. I have my doubts about the benchmarks posted as they do not sweep the range down to small digit counts; that is why I am doing the timings with the dialog benchmark in the Lab.
I played with middle-range branching into the comparison list but it only seems to favour samples that are near that length, and each leading branch slows the rest down. I could not get the algo faster when sweeping different length ranges that way, so I gave up on the idea. I still think your logic of doing the calculation top down makes the most sense as the number counts are much higher for the large values.
even the Core2 Quad is fast on so many things
you need to run it on your ShitBox :bg (closely emulates what the rest of us can afford - lol)
i still think the tree in reply #18 is the way to fly, Hutch