News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Faster Bin2ASCII

Started by lingo, March 26, 2008, 01:07:37 PM

Previous topic - Next topic

NightWare

Quote from: hutch-- on March 27, 2008, 11:11:39 AM
I would be interested to see how it performed on both late AMD and Intel Core2 Duos.
on vista, core2 duo 2ghz
92 cycles, bin2ascii-AMD
60 cycles, bin2ascii-lingo


i can't compile it, but i'd like to see the speed with this modified version of the algo from michealw/pdixon :
ALIGN 16
;
; syntax :
; eax = la valeur
; edi = adresse de la chaine de caractères à créer
; call uDw2Ascii
;
uDw2Ascii PROC
push ebx ;; empiler ebx
push edi ;; empiler edi

mov ecx,eax ;; copier eax dans ecx
; on va diviser eax par 100000 de manière optimisée
mov edx,2814749768 ;; placer 2814749768 (remplacement de 2814749767 et eax+1) dans edx
sub edi,5 ;; enlever 5 caractères à edi
mul edx ;; multiplier eax par 2814749768
shr edx,16 ;; décaler edx de 16 bits à droite
mov eax,100000 ;; placer 100000 dans eax
mov ebx,edx ;; copier edx dans ebx
mul edx ;; multiplier edx par 100000
sub ecx,eax ;; soustraire le résultat en eax à ecx
test ebx,ebx ;; fixer les flags de ebx
mov edx,ebx ;; replacer ebx dans edx
mov ebx,10 ;; placer 10 (notre multiplicateur décimale) dans ebx
jz Label01 ;; si c'est égal à 0, aller Label01
; sinon, on traite la partie supérieure du nombre
mov eax,429497 ;; on remplace eax par 429497
add edi,5 ;; ici, il y a plus de 5 caractères alors on rajoute les 5 qu'on avait enlevé
mul edx ;; multiplier edx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label02 ;; s'il existe un dépassement, aller Label02
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label03 ;; s'il existe un dépassement, aller Label03
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label04 ;; s'il existe un dépassement, aller Label04
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label05 ;; s'il existe un dépassement, aller Label05
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label06 ;; s'il existe un dépassement, aller Label06
dec edi ;; décrémenter l'adresse
; ici, on traite la partie inférieure du nombre
Label01: mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label07 ;; s'il existe un dépassement, aller Label07
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label08 ;; s'il existe un dépassement, aller Label08
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label09 ;; s'il existe un dépassement, aller Label09
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label10 ;; s'il existe un dépassement, aller Label10
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jmp Label11 ;; aller Label11
; ici, on va placer XXXXXXXXXX et le zéro final
Label02: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label03: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+1],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label04: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+2],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label05: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+3],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label06: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+4],dl ;; placer l'octet en dl à l'adresse en edi
mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
Label07: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+5],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label08: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+6],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label09: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+7],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label10: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+8],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label11: add dx,30h ;; ajouter le caractère de base "0" à la valeur en dx
mov WORD PTR [edi+9],dx ;; placer l'octet en dl et le 0 final en dh à l'adresse en edi

pop edi ;; désempiler edi
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
uDw2Ascii ENDP

MichaelW

P3:

123456789
123456789
123456789
72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
56 cycles, uDw2Ascii



[attachment deleted by admin]
eschew obfuscation

sinsi

Q6600

83 cycles, bin2ascii-AMD
72 cycles, bin2ascii-lingo

123456789
123456789
123456789
84 cycles, bin2ascii-AMD
70 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii

?
Light travels faster than sound, that's why some people seem bright until you hear them.

MichaelW

This version sets the process affinity mask after the first test, hopefully to limit the second test to a single core.

Requires Windows 2000 or later.

[attachment deleted by admin]
eschew obfuscation

sinsi


123456789
123456789
123456789
00001111
00001111
83 cycles, bin2ascii-AMD
73 cycles, bin2ascii-lingo
33 cycles, uDw2Ascii
84 cycles, bin2ascii-AMD
83 cycles, bin2ascii-lingo
47 cycles, uDw2Ascii

I remember having trouble with a timer overflow in a speed test, but can't test this because of 'timers.asm' - is that a part of masm32?
I don't know what version I've got - how can I tell?
Light travels faster than sound, that's why some people seem bright until you hear them.

MichaelW

Timers.asm is in TimingMacros.zip available here. IIRC it was included in the macros directory for one release of the MASM32 package. Timers.asm was updated some time after the first post, but it has been stable for more than two years now.
eschew obfuscation

sinsi

#21
These are with a counter of 10000000

bin2ascii
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo

b2ascii_
123456789
123456789
123456789
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
31 cycles, uDw2Ascii

b2ascii_pam
123456789
123456789
123456789
00001111
00001111
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii
55 cycles, bin2ascii-AMD
47 cycles, bin2ascii-lingo
32 cycles, uDw2Ascii

Light travels faster than sound, that's why some people seem bright until you hear them.

MichaelW

So it would seem that the higher cycle counts with the processor affinity mask set to 1 were due to some temporary effect, and the code was already running on only one core?

Thinking about this more, it makes sense that the code would run on only one core because there is only one thread involved. It would be interesting to see the effects of splitting a suitable algorithm across two or more cores.

eschew obfuscation

sinsi

Doesn't the mask limit a thread from running on more than one CPU?
Then again, 55/47/32 matches 55/47/32 (ref b2ascii_pam)

I've run these from a batch file (without the inkey) and get similar results (+/- a unicycle).
Q6600,2GB,XPHSP2, 20 processes - although having NFS ProStreet minimised didn't change it by more than 5.

I think we need to change timing code (e.g. different TSC's - lots of games have been caught out by that one).
Light travels faster than sound, that's why some people seem bright until you hear them.

lingo

Many thanks guys for testing.

NightWare,

"to see the speed with this modified version of the algo from michealw/pdixon"

I don't know anything about michealw/pdixon algo, but ...

your version is not tested and works wrong!!!  :lol

Example: for 0BBC8D09Fh result is 31505è9998 vs 3150499999 from my algo


MichaelW,

"The new version is a little slower on my P3:"

Yes, because I have a few new instructions (cmp eax, -1; je @f; )
for properly computing of  just one digit -> 0FFFFFFFFh.  :wink

Regards,
Lingo

Jimg

#25
I added a validity printout, converted uDw2Ascii to english (My apologies to NightWare) using babelfish so I could read it.
I fixed the error (I think), converted it to the same call as the other routines for fairness, and saved as uDw2Ascii2.

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
31505Φ9998 uDw2Ascii
3150499999 uDw2Ascii2

61 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
52 cycles, uDw2Ascii
59 cycles, uDw2Ascii2

Not much to choose from on an AMD.  Surprising  how much difference between uDw2Ascii and uDw2Ascii2 there is just caused by converting to the same call as the other test routines.

[attachment deleted by admin]

lingo

Jimg,

"I fixed the error (I think)..."

Your "tested" and "free of errors" algo is wrong too...
Example: for 186A0h-> result is 0000 vs 100000 from my algo
Regards,
Lingo

NightWare

Quote from: lingo on March 29, 2008, 01:28:37 PM
your version is not tested and works wrong!!!  :lol
:wink ok, originally it's a signed algo... i'm going to see if it's possible to fix it...

Quote from: Jimg on March 29, 2008, 06:02:25 PM
Surprising how much difference between uDw2Ascii and uDw2Ascii2 there is just caused by converting to the same call as the other test routines.
yep, invoke is always slower  :wink

hutch--

NightWare,

> yep, invoke is always slower

I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Ossa

Quote from: hutch-- on March 29, 2008, 11:26:09 PM
> yep, invoke is always slower

I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.

Well, one is using the stack, the other is passing in registers (eax, edi). Naturally the stack method is slower:


  • more ops to get them off the stack into regs
  • using memory accesses where none are needed before

I do think it is surprising that you get *7* cycles slowdown, though.

[edit] Well, I guess there is little more slowdown for "ret 8" to sort the stack out too. [/edit]
Website (very old): ossa.the-wot.co.uk