I rewrote speed optimized
"Binary-to-ASCII Decimal Conversion Suppressing Leading Zeros" algo
by AMD (see AMD Software Optimization Guide 25112.pdf)
As a result I have 88 instructions vs 101 instructions from AMD and my time is:
P4 3.6 GHz, 2GB RAM, Vista Ultimate with SP1
156 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
Press any key to exit...
P4 3.6 GHz, 2GB RAM, Windows XP Pro with SP2
123 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo
Press any key to exit...
Pls, delete your old zip and download new one again.
Regards,
Lingo
[attachment deleted by admin]
134 cycles, bin2ascii-AMD
126 cycles, bin2ascii-lingo
XP Pro SP2
Intel 2.8GHz
1.25GB RAM
try again and again until you receive better results... :wink
Me trying hard, too :wink
134 cycles, bin2ascii-AMD
127 cycles, bin2ascii-lingo
XP SP1, 512 MB Ram
Intel P3, Win2k SP4:
72 cycles, bin2ascii-AMD
62 cycles, bin2ascii-lingo
51 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo
AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)
Ossa
3.2 gig Prescott 800 meg fsb.
123 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo
Press any key to exit...
LATER
Here is the timing on my 2.8 gig Northwood.
133 cycles, bin2ascii-AMD
121 cycles, bin2ascii-lingo
Press any key to exit...
It appears the code in your latest version favours the later Intel hardware, I would be interested to see how it performed on both late AMD and Intel Core2 Duos.
Thanks Hutch,
My CPU is Prescott too, so
I receive the similar results :wink
To All: Pls, reload the last one zip file
Compliments, Lingo!
XP SP2, Pentium 4 3.4 GHz, 1.5 G RAM
221 cycles, bin2ascii-AMD
147 cycles, bin2ascii-lingo
149 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo
149 cycles, bin2ascii-AMD
97 cycles, bin2ascii-lingo
149 cycles, bin2ascii-AMD
97 cycles, bin2ascii-lingo
149 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo
51 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
(same setup as before).
AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)
that's worse than before for me (it was 43 cycles last time).
Ossa
The new version is a little slower on my P3:
72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
Pentium D 940 (3.2 GHz)
123 cycles, bin2ascii-AMD
95 cycles, bin2ascii-lingo
123 - AMD
96 - lingo
120 cycles, bin2ascii-AMD
96 cycles, bin2ascii-lingo
Celeron 2.4 GHz, 448 MB RAM, XP SP1
The P4 on my office box (SP2) is slower for AMD, see above
Sempron 3000+
51 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
Quote from: hutch-- on March 27, 2008, 11:11:39 AM
I would be interested to see how it performed on both late AMD and Intel Core2 Duos.
on vista, core2 duo 2ghz
92 cycles, bin2ascii-AMD
60 cycles, bin2ascii-lingo
i can't compile it, but i'd like to see the speed with this modified version of the algo from michealw/pdixon :
ALIGN 16
;
; syntax :
; eax = la valeur
; edi = adresse de la chaine de caractères à créer
; call uDw2Ascii
;
uDw2Ascii PROC
push ebx ;; empiler ebx
push edi ;; empiler edi
mov ecx,eax ;; copier eax dans ecx
; on va diviser eax par 100000 de manière optimisée
mov edx,2814749768 ;; placer 2814749768 (remplacement de 2814749767 et eax+1) dans edx
sub edi,5 ;; enlever 5 caractères à edi
mul edx ;; multiplier eax par 2814749768
shr edx,16 ;; décaler edx de 16 bits à droite
mov eax,100000 ;; placer 100000 dans eax
mov ebx,edx ;; copier edx dans ebx
mul edx ;; multiplier edx par 100000
sub ecx,eax ;; soustraire le résultat en eax à ecx
test ebx,ebx ;; fixer les flags de ebx
mov edx,ebx ;; replacer ebx dans edx
mov ebx,10 ;; placer 10 (notre multiplicateur décimale) dans ebx
jz Label01 ;; si c'est égal à 0, aller Label01
; sinon, on traite la partie supérieure du nombre
mov eax,429497 ;; on remplace eax par 429497
add edi,5 ;; ici, il y a plus de 5 caractères alors on rajoute les 5 qu'on avait enlevé
mul edx ;; multiplier edx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label02 ;; s'il existe un dépassement, aller Label02
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label03 ;; s'il existe un dépassement, aller Label03
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label04 ;; s'il existe un dépassement, aller Label04
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label05 ;; s'il existe un dépassement, aller Label05
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label06 ;; s'il existe un dépassement, aller Label06
dec edi ;; décrémenter l'adresse
; ici, on traite la partie inférieure du nombre
Label01: mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label07 ;; s'il existe un dépassement, aller Label07
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label08 ;; s'il existe un dépassement, aller Label08
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label09 ;; s'il existe un dépassement, aller Label09
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label10 ;; s'il existe un dépassement, aller Label10
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jmp Label11 ;; aller Label11
; ici, on va placer XXXXXXXXXX et le zéro final
Label02: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label03: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+1],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label04: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+2],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label05: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+3],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label06: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+4],dl ;; placer l'octet en dl à l'adresse en edi
mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
Label07: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+5],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label08: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+6],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label09: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+7],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label10: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+8],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label11: add dx,30h ;; ajouter le caractère de base "0" à la valeur en dx
mov WORD PTR [edi+9],dx ;; placer l'octet en dl et le 0 final en dh à l'adresse en edi
pop edi ;; désempiler edi
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
uDw2Ascii ENDP
P3:
123456789
123456789
123456789
72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
56 cycles, uDw2Ascii
[attachment deleted by admin]
Q6600
83 cycles, bin2ascii-AMD
72 cycles, bin2ascii-lingo
123456789
123456789
123456789
84 cycles, bin2ascii-AMD
70 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii
?
This version sets the process affinity mask after the first test, hopefully to limit the second test to a single core.
Requires Windows 2000 or later.
[attachment deleted by admin]
123456789
123456789
123456789
00001111
00001111
83 cycles, bin2ascii-AMD
73 cycles, bin2ascii-lingo
33 cycles, uDw2Ascii
84 cycles, bin2ascii-AMD
83 cycles, bin2ascii-lingo
47 cycles, uDw2Ascii
I remember having trouble with a timer overflow in a speed test, but can't test this because of 'timers.asm' - is that a part of masm32?
I don't know what version I've got - how can I tell?
Timers.asm is in TimingMacros.zip available here (http://www.masm32.com/board/index.php?topic=770.0). IIRC it was included in the macros directory for one release of the MASM32 package. Timers.asm was updated some time after the first post, but it has been stable for more than two years now.
These are with a counter of 10000000
bin2ascii
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
b2ascii_
123456789
123456789
123456789
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
31 cycles, uDw2Ascii
b2ascii_pam
123456789
123456789
123456789
00001111
00001111
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii
So it would seem that the higher cycle counts with the processor affinity mask set to 1 were due to some temporary effect, and the code was already running on only one core?
Thinking about this more, it makes sense that the code would run on only one core because there is only one thread involved. It would be interesting to see the effects of splitting a suitable algorithm across two or more cores.
Doesn't the mask limit a thread from running on more than one CPU?
Then again, 55/47/32 matches 55/47/32 (ref b2ascii_pam)
I've run these from a batch file (without the inkey) and get similar results (+/- a unicycle).
Q6600,2GB,XPHSP2, 20 processes - although having NFS ProStreet minimised didn't change it by more than 5.
I think we need to change timing code (e.g. different TSC's - lots of games have been caught out by that one).
Many thanks guys for testing.
NightWare,
"to see the speed with this modified version of the algo from michealw/pdixon"
I don't know anything about michealw/pdixon algo, but ...
your version is not tested and works wrong!!! :lol
Example: for 0BBC8D09Fh result is 31505è9998 vs 3150499999 from my algo
MichaelW,
"The new version is a little slower on my P3:"
Yes, because I have a few new instructions (cmp eax, -1; je @f; )
for properly computing of just one digit -> 0FFFFFFFFh. :wink
Regards,
Lingo
I added a validity printout, converted uDw2Ascii to english (My apologies to NightWare) using babelfish so I could read it.
I fixed the error (I think), converted it to the same call as the other routines for fairness, and saved as uDw2Ascii2.
AuthenticAMD Family 7 Model 10 Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
31505Φ9998 uDw2Ascii
3150499999 uDw2Ascii2
61 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
52 cycles, uDw2Ascii
59 cycles, uDw2Ascii2
Not much to choose from on an AMD. Surprising how much difference between uDw2Ascii and uDw2Ascii2 there is just caused by converting to the same call as the other test routines.
[attachment deleted by admin]
Jimg,
"I fixed the error (I think)..."
Your "tested" and "free of errors" algo is wrong too...
Example: for 186A0h-> result is 0000 vs 100000 from my algo
Regards,
Lingo
Quote from: lingo on March 29, 2008, 01:28:37 PM
your version is not tested and works wrong!!! :lol
:wink ok, originally it's a signed algo... i'm going to see if it's possible to fix it...
Quote from: Jimg on March 29, 2008, 06:02:25 PM
Surprising how much difference between uDw2Ascii and uDw2Ascii2 there is just caused by converting to the same call as the other test routines.
yep, invoke is always slower :wink
NightWare,
> yep, invoke is always slower
I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.
Quote from: hutch-- on March 29, 2008, 11:26:09 PM
> yep, invoke is always slower
I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.
Well, one is using the stack, the other is passing in registers (eax, edi). Naturally the stack method is slower:
- more ops to get them off the stack into regs
- using memory accesses where none are needed before
I do think it is surprising that you get *7* cycles slowdown, though.
[edit] Well, I guess there is little more slowdown for "ret 8" to sort the stack out too. [/edit]
Nightware- Can you reference a thread where this was originally presented? I can't seem to find it anywhere.
Lingo- Hi, how's it going? Nice to hear from you.
Quote from: hutch-- on March 29, 2008, 11:26:09 PM
I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.
push from mem = 3 cycles, vs mov reg,mem = 2 cycles
beside after that, you have to read again the values (1 or 2 cycles to put them in reg) in invoke method, and the values are already in the registers with "call", it's why i don't like invoke, even with esp use...
Quote from: Jimg on March 30, 2008, 12:18:23 AM
Nightware- Can you reference a thread where this was originally presented? I can't seem to find it anywhere.
if you speak about the original algo, if i remember well, i found it in donkey's library. but donkey's page appear not being accessible now...
I have only a vague memory of the thread. AFAIK Paul Dixon's current user name is dioxin. There is a version of the code here (http://www.masm32.com/board/index.php?topic=3051.0). I think the code was originally done as PowerBASIC inline asm, and I think this is the version that I started with, dated 2/14/2004:
num&=1234567890
k&=2^32/100000
k2&=2^32/10000
!mov esi,ptrans& ;point to answer
!mov ebx,10
!mov eax,num& ;get the number
!mov ecx,k& ;get 2^32/100000
!mul ecx ;div 100000
!mov edi,edx ;save top 5 digits
!mov eax,100000
!mul edx
!sub eax,num& ;get the number
!neg eax ;get the low 5 digits
!mov ecx,k2& ;about to divide 5 digit number by 10000
!mul ecx
!add edx,&h30 ;first digit is now in DL, convert to ASCII
!mov [esi+5],dl ;and store
!mul ebx ;multiply remainder by 10 to get next digit
!add edx,&h30 ;convert to ASCII
!mov [esi+6],dl ;and store
!mul ebx ;..etc..
!add edx,&h30
!mov [esi+7],dl
!mul ebx
!add edx,&h30
!mov [esi+8],dl
!mul ebx
!add edx,&h30
!mov [esi+9],dl
!mov eax,edi ;first 5 digits done, now get next 5 and do the same..
!mul ecx
!add edx,&h30
!mov [esi+0],dl
!mul ebx
!add edx,&h30
!mov [esi+1],dl
!mul ebx
!add edx,&h30
!mov [esi+2],dl
!mul ebx
!add edx,&h30
!mov [esi+3],dl
!mul ebx
!add edx,&h30
!mov [esi+4],dl
I think this code from 2/16/2004 was his MASM version:
_dwtoa proc dwValue:DWORD, lpBuffer:DWORD
push ebx
push esi
push edi
mov eax,dwValue
mov esi,lpBuffer
k4 = 429497 ;2^32/10000 rounded up
k9 = 2814749768 ;2^43/5^5 = 2814749767.28
mov ecx,k9 ;prepare to fast div by 100000
or eax,eax ;is number negative?
jns skip ;no, just leave it
neg eax ;yes, make it positive
mov byte ptr [esi],"-" ;put sign in place
inc esi ;leave room for sign
skip:
mov edi,eax ;save absolute value of number
; let's call it 1234567890
;#####fast div by 100000
mul ecx ;div 100000
mov ecx,100000 ;prepare to multiply up the top 5 digits
shr edx,16 ;shift top 5 digits into place edx=12345
mov eax,edx ;copy to eax to multiply up to real size again
mov ebx,edx ;save a copy of top 5 digits
mul ecx ;eax=1234500000
sub edi,eax ;subtract orignal number edi=0000067890
;##end fast div
mov eax,ebx ;get back high digits
mov ecx,k4 ;about to do div 10000 by reciprocal multiply
mov ebx,10 ;useful constant
or eax,eax ;if top 5 digit = 0 then skip that part
jz SkipTop5
;####################part 1
mul ecx ;div top 5 digits by 10000
jc digit1 ;if digit is not zero then process it,
; else try next digit
mul ebx ;multiply by 10 to get next digit into edx
jc digit2
mul ebx
jc digit3
mul ebx
jc digit4
mul ebx
jc digit5
SkipTop5:
mov eax,edi ;retrieve lower 5 digits
mul ecx ;div 10000
jc digit6
mul ebx ;multiply by 10 to get next digit in edx
jc digit7
mul ebx ;3rd digit
jc digit8
mul ebx ;4th digit
jc digit9
mul ebx ;5th digit
jmp digit10
;###############end part1
digit1:
add edx,30h ;top digit is left in edx, convert to ascii
mov [esi],dl ;store top digit
mul ebx ;multiply by 10 to get next digit into edx
inc esi
digit2:
add edx,30h ;convert to ascii
mov [esi],dl ;and store
mul ebx
inc esi
digit3:
add edx,30h ;convert to ascii
mov [esi],dl ;and store
mul ebx
inc esi
digit4:
add edx,30h ;convert to ascii
mov [esi],dl ;and store
mul ebx
inc esi
digit5:
add edx,30h ;convert to ascii
mov [esi],dl ;and store
mov eax,edi ;retrieve lower 5 digits
mul ecx
inc esi ;div 10000
digit6:
add edx,30h ;convert 1st digit to ascii
mov [esi],dl ;and store
mul ebx ;multiply by 10 to get next digit in edx
inc esi
digit7:
add edx,30h ;convert to ascii
mov [esi],dl ;and store
mul ebx ;3rd digit
inc esi
digit8:
add edx,30h
mov [esi],dl
mul ebx ;4th digit
inc esi
digit9:
add edx,30h
mov [esi],dl
mul ebx ;5th digit
inc esi
digit10:
add edx,30h
mov [esi],dx ;last digit, store dx not dl to give a
; zero termination for the result string
pop edi
pop esi
pop ebx
ret
_dwtoa endp
I couldn't get that one to work immediately, but here's results from another Paul Dixon routine from that thread-
AuthenticAMD Family 7 Model 10 Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
3150499999 udword
100000 bin2ascii-AMD
100000 bin2ascii-lingo
100000 udword
4294967294 bin2ascii-AMD
4294967294 bin2ascii-lingo
4294967294 udword
timing tests:
61 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
37 cycles, udword
Press any key to exit...
Of course speed costs, in this case, an extra 200 bytes for a lookup table.
He had one later that he claimed was faster still, but I couldn't get it to work.
[attachment deleted by admin]
GenuineIntel Family 6 Model 7 Stepping 3
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV FXSR MMX SSE
validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
3150499999 udword
100000 bin2ascii-AMD
100000 bin2ascii-lingo
100000 udword
4294967294 bin2ascii-AMD
4294967294 bin2ascii-lingo
4294967294 udword
timing tests:
72 cycles, bin2ascii-AMD
62 cycles, bin2ascii-lingo
51 cycles, udword
here, it should be fixed, (i haven't tested strings, just the divisions), and i know... i'm a lazy coder to fix this like that... but why makes things complicate, when there is easy ways... :bg
ALIGN 16
;
; syntax :
; mov eax,value
; mov edi,StringPointer
; call uDw2Ascii
;
uDw2Ascii PROC
push ebx ;; empiler ebx
push edi ;; empiler edi
mov ecx,eax ;; copier eax dans ecx
inc eax
jz SpecialCase
; on va diviser eax par 100000 de manière optimisée
; mov edx,2814749768 ;; placer 2814749768 (remplacement de 2814749767 et eax+1) dans edx
mov edx,2814749767
sub edi,5 ;; enlever 5 caractères à edi
mul edx ;; multiplier eax par 2814749768
shr edx,16 ;; décaler edx de 16 bits à droite
mov eax,100000 ;; placer 100000 dans eax
mov ebx,edx ;; copier edx dans ebx
mul edx ;; multiplier edx par 100000
sub ecx,eax ;; soustraire le résultat en eax à ecx
test ebx,ebx ;; fixer les flags de ebx
mov edx,ebx ;; replacer ebx dans edx
mov ebx,10 ;; placer 10 (notre multiplicateur décimale) dans ebx
jz Label01 ;; si c'est égal à 0, aller Label01
; sinon, on traite la partie supérieure du nombre
mov eax,429497 ;; on remplace eax par 429497
add edi,5 ;; ici, il y a plus de 5 caractères alors on rajoute les 5 qu'on avait enlevé
mul edx ;; multiplier edx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label02 ;; s'il existe un dépassement, aller Label02
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label03 ;; s'il existe un dépassement, aller Label03
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label04 ;; s'il existe un dépassement, aller Label04
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label05 ;; s'il existe un dépassement, aller Label05
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label06 ;; s'il existe un dépassement, aller Label06
dec edi ;; décrémenter l'adresse
; ici, on traite la partie inférieure du nombre
Label01: mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label07 ;; s'il existe un dépassement, aller Label07
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label08 ;; s'il existe un dépassement, aller Label08
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label09 ;; s'il existe un dépassement, aller Label09
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label10 ;; s'il existe un dépassement, aller Label10
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jmp Label11 ;; aller Label11
; ici, on va placer XXXXXXXXXX et le zéro final
Label02: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label03: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+1],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label04: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+2],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label05: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+3],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label06: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+4],dl ;; placer l'octet en dl à l'adresse en edi
mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
Label07: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+5],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label08: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+6],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label09: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+7],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label10: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+8],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label11: add dx,30h ;; ajouter le caractère de base "0" à la valeur en dx
mov WORD PTR [edi+9],dx ;; placer l'octet en dl et le 0 final en dh à l'adresse en edi
pop edi ;; désempiler edi
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
SpecialCase:
mov DWORD PTR [edi],"4924"
mov DWORD PTR [edi+4],"2769"
mov WORD PTR [edi+8],"59"
mov BYTE PTR [edi+10],0
pop edi ;; désempiler edi
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
uDw2Ascii ENDP
NightWare,
"but why makes things complicate, when there is easy ways... "
Your easy ways are just to get lingo's decision of the problem... :lol
but your algo is still slow..
Jimg,
Nice code, and fastest... :wink
P4 3.6 GHz, 2GB RAM, Windows XP Pro with SP2
123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
61 cycles, bi2ascii-lingo 2
83 cycles, uDw2Ascii - NightWare
47 cycles, udword-p.dixon
Press any key to exit...
Regards,
Lingo
[attachment deleted by admin]
I wasn't aware that nightware was contrasting stack parameters to register passing so in part the comment makes sense as long as the algo the regs are passed to can use the data in the regs that are passed without running out of regs to do the work. I have rarely ever found this to be the case and while it starts to make sense on a 64 bit machine with twice as many registers, it is problematic on a 32 bit machine with 8 gp registers where the stack pointer is rarely ever available and EBP is only available with no stack frame.
Nice thread :lol
I optimized AMD algo and two P.Dixon's algos and achieved the best time again: :wink
P4 3.6 GHz, 2GB RAM, Windows XP Pro with SP2
123456789
123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
61 cycles, bi2ascii-lingo 2
83 cycles, uDw2Ascii - NightWare
47 cycles, udword-p.dixon
37 cycles, udwordl-lingo 3
Press any key to exit...
Regards,
Lingo
[attachment deleted by admin]
P3:
72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
58 cycles, bi2ascii-lingo 2
56 cycles, uDw2Ascii - NightWare
50 cycles, udword-p.dixon
45 cycles, udwordl-lingo 3
Result on my old PIV. 2.8g Northwood, 800 meg fsb, ddr400.
123456789
123456789
123456789
123456789
123456789
123456789
145 cycles, bin2ascii-AMD
122 cycles, bin2ascii-lingo
143 cycles, bi2ascii-lingo 2
128 cycles, uDw2Ascii - NightWare
77 cycles, udword-p.dixon
74 cycles, udwordl-lingo 3
Press any key to exit...
Quote123456789
123456789
123456789
123456789
123456789
123456789
72 cycles, bin2ascii-AMD
61 cycles, bin2ascii-lingo
49 cycles, bi2ascii-lingo 2
48 cycles, uDw2Ascii - NightWare
42 cycles, udword-p.dixon
41 cycles, udwordl-lingo 3
Press any key to exit...
Nice job Lingo :U
123456789
123456789
123456789
123456789
123456789
123456789
63 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...
I have a question however. This is the end of your proc-
movzx edx, word ptr chartab[edx*2] ; look up 2 digits
mov [esi], dh ; but only write the 1 we need, supress the leading zero
add esi, 1 ; update pointer by 1
jnz ZS2 ; continue with pairs of digits to the end
udwordl endp
align 16
udword proc lpbuf:DWORD, nNumber:DWORD ;unsigned DWORD to ASCII
push edi ;save registers that need to be saved
push esi
mov esi, [esp+12] ; sptr
mov eax, [esp+16] ; x
Clearly esi will never be zero. Is jnz faster than a jmp?
AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)
51 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo
31 cycles, bi2ascii-lingo 2
28 cycles, uDw2Ascii - NightWare
25 cycles, udword-p.dixon
23 cycles, udwordl-lingo 3
Very nice.
Ossa
lingo-
I like your stack handling (now why didn't I think of that?).
Applying your stack handling to the original p.dixon routine cuts off 2 cycles with no other changes.
We're getting down into the noise level here, however.
AuthenticAMD Family 7 Model 10 Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
timing tests:
60 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
38 cycles, udword-p.dixon
35 cycles, udwordl-lingo3
36 cycles, udwordx2-p.dixon3
Press any key to exit...
Quote from: lingo on March 30, 2008, 05:50:54 AM
Your easy ways are just to get lingo's decision of the problem... :lol
in fact, i speaking of HOW i've fixed the code :lol
123456789
123456789
123456789
123456789
123456789
123456789
93 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
37 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
19 cycles, udwordl-lingo 3
Press any key to exit...
For what little it's worth, I tested a version that used the FPU to do the conversion to decimal, and while it worked OK, it was *very* slow. On my P3, just the necessary fild and fbstp instructions took 168 cycles.
fild not really slow, but look at fbstp in agner fog's instructions tables... :lol
and it's not only on p3
I tested them together because I was guessing that I could not get a valid cycle count from a thousand iterations of fbstp. Now that I try I get 106 cycles for fbstp alone, and 106 cycles for fild alone, so I'll assume that my guess was right.
:eek 106 cycles for fild alone ? for fbstp yes, it use the (slow) bcd technic on a ten byte value (it's why fpu is slow, everything is made with ten byte values...) but for fild it's abnormal, it should be between 2 and 12
I assume it's abnormal because for either one of the instructions used alone, 992 of the 1000 iterations generate an exception (in this case internal to the FPU).
1) Is there a collection of procs like this where the fastest get put inside a header for maybe the next version of MASM?
2) I see that MMX is mostly ignored from performance routines like this. Is this because of the challenge or because it's not practical to test MMX versions of these procs?
Oh, an my tests were:
GenuineIntel Family 15 Model 4 Stepping 7
FPU TSC CX8 CMOV CLFSH FXSR HTT MXX SSE SSE2 SSE3
Pentium D 2.8, 3.5gb RAM
122 cycles, bin2ascii-AMD
97 cycles, bin2ascii-lingo
64 cycles, bi2ascii-lingo 2
85 cycles, uDw2Ascii - NighwWare
48 cycles, udword-p.dixon
39 cycles, udwordl-lingo 3
Heh, I pulled out my own version of this from the old thread the other day to see if I could optimise it any more. That was a repeated-subtraction approach, hideous code length but had the advantage of being fastest for small numbers compared to Paul Dixon's approach, because it processes only the digits it needs to. To my chagrin, every trick I tried couldn't improve on the code speed at all and generally made it far worse! I tried passing params in registers to allow a "JMP proc ... JMP ebx" type of pseudo-call to remove the repeated loops, but this was terrible speedwise too. However bad the code looks, it was still beating Pauls' MUL-based approach for most values. I'll be interested to try the new Lingo code, tho.
FWIW, no-one has yet posted a QWORD-based solution (value to convert held in EDX:EAX). My old code, however bad, is still a benchmark there... I'd be interested to see if anyone can come up with a better algorithm for that.
thomas remkus,
MMX or SSE+ are usefull for parallel ops, but here you need to know a value, depending of another value, depending of another value, etc... so it's inappropriate here, (it's possible if you cut the value in 2*5 digits, but) after that, you will have to test/shift the bytes to remove useless zeros, and all thoses ops on 64/128 bits registers... not really efficient... :(
michaelw,
if i understand well, you try to rework your timing macro ? if you ask me, you loose your time, because of cpu hardware features... i explain : on p4- we had to do things by ourself, but on p4 and + the automatic harware data prefetcher has been implemented, so all thoses +1 extra uops due to mem access are removed, for example, if you take an algo with 40 cycles (where there is +9 uops due to mem access) and 1000 iterations in loop test you will obtain 40 cycles on P3 and 31.009 cycles on P4+ (not exactly that, but you've understood the concept). the cache is heavilly used, and automatic, on P4+. it depend of the usage you make of your algo. if the algo is used once with many other algos, then you will prefere the algo with the best result on P3. in the other hand, if like me you group the tasks to use recent cpu features, you will prefere the algo with the best result on P4+... it really depend of the usage. now make your own opinion...
a bit cosmetics and new time again: :wink
P4 3.6 GHz, 2GB RAM, Windows Vista with SP1
123456789
123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
61 cycles, bi2ascii-lingo 2
83 cycles, uDw2Ascii - NightWare
47 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...
Regards,
Lingo
[attachment deleted by admin]
lingo-
Your latest is consistently 5 cycles slower on my
AuthenticAMD Family 7 Model 10 Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE
Previous version:
63 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...
New Version:
62 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
40 cycles, udwordl-lingo 3
Press any key to exit...
AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)
52 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo
31 cycles, bi2ascii-lingo 2
28 cycles, uDw2Ascii - NightWare
25 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
3 cycles faster than last time.
123456789
123456789
123456789
123456789
123456789
123456789
93 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
36 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...
aproximatively same speed than the previous one, for me
"FWIW, no-one has yet posted a QWORD-based solution"
Here is P.Dixon's algo which combined 32 and 64 bit code with my modest attempt for improvement... :wink
P4 3.6 GHz, 2GB RAM, Vista Ultimate SP1
18446744073709551614
18446744073709551614
4294967294
4294967294
137 cycles, uqword-with 64 bit number - p.dixon
125 cycles, b2a3264-with 64 bit number - lingo
45 cycles, uqword-with 32 bit number - p.dixon
39 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit...
Regards,
Lingo
[attachment deleted by admin]
Quote18446744073709551614
18446744073709551614
4294967294
4294967294
87 cycles, uqword-with 64 bit number - p.dixon
77 cycles, b2a3264-with 64 bit number - lingo
43 cycles, uqword-with 32 bit number - p.dixon
44 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit...
core2duo 2.66Ghz 1333mhz FSB/4M - Winxp Pro - sp2
bin2ascii
------------
74 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
Press any key to exit...
lastbin2asci
-----------------
74 cycles, bin2ascii-AMD
63 cycles, bin2ascii-lingo
47 cycles, bi2ascii-lingo 2
46 cycles, uDw2Ascii - NightWare
29 cycles, udword-p.dixon
26 cycles, udwordl-lingo 3
Press any key to exit...
b2a3264
-------------
86 cycles, uqword-with 64 bit number - p.dixon
76 cycles, b2a3264-with 64 bit number - lingo
32 cycles, uqword-with 32 bit number - p.dixon
29 cycles, b2a3264-with 32 bit number - lingo
Press any key to exit..
Intel Core2Duo E6850, 3Ghz, 4 GB installed RAM, XP Pro 32bit SP3
---------------------------------------------
1st run:
123456789
123456789
123456789
123456789
123456789
123456789
56 cycles, bin2ascii-AMD
48 cycles, bin2ascii-lingo
37 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...
---------------------------------------------
2nd run:
123456789
123456789
123456789
123456789
123456789
123456789
84 cycles, bin2ascii-AMD
71 cycles, bin2ascii-lingo
54 cycles, bi2ascii-lingo 2
49 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...
---------------------------------------------
3rd run:
123456789
123456789
123456789
123456789
123456789
123456789
84 cycles, bin2ascii-AMD
73 cycles, bin2ascii-lingo
53 cycles, bi2ascii-lingo 2
34 cycles, uDw2Ascii - NightWare
22 cycles, udword-p.dixon
20 cycles, udwordl-lingo 3
Press any key to exit...
---------------------------------------------
CUL8'er.