News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Faster Bin2ASCII

Started by lingo, March 26, 2008, 01:07:37 PM

Previous topic - Next topic

Jimg

Nightware-  Can you reference a thread where this was originally presented?  I can't seem to find it anywhere.

Lingo-  Hi, how's it going?  Nice to hear from you.

NightWare

Quote from: hutch-- on March 29, 2008, 11:26:09 PM
I would be interested to hear your reasoning behind this statement, the built in invoke operator is very well known in what it does and in most instances it delivers identical code. Exceptions are things like nested function calls and calculations performed during the push sequence before the call.
push from mem = 3 cycles, vs mov reg,mem = 2 cycles
beside after that, you have to read again the values (1 or 2 cycles to put them in reg) in invoke method, and the values are already in the registers with "call", it's why i don't like invoke, even with esp use...

Quote from: Jimg on March 30, 2008, 12:18:23 AM
Nightware- Can you reference a thread where this was originally presented? I can't seem to find it anywhere.
if you speak about the original algo, if i remember well, i found it in donkey's library. but donkey's page appear not being accessible now...

MichaelW

I have only a vague memory of the thread. AFAIK Paul Dixon's current user name is dioxin. There is a version of the code here. I think the code was originally done as PowerBASIC inline asm, and I think this is the version that I started with, dated 2/14/2004:

num&=1234567890
k&=2^32/100000
k2&=2^32/10000

!mov esi,ptrans&    ;point to answer
!mov ebx,10

!mov eax,num&       ;get the number
!mov ecx,k&         ;get 2^32/100000
!mul ecx            ;div 100000
!mov edi,edx        ;save top 5 digits

!mov eax,100000
!mul edx

!sub eax,num&         ;get the number
!neg eax              ;get the low 5 digits



!mov ecx,k2&        ;about to divide 5 digit number by 10000
!mul ecx

!add edx,&h30       ;first digit is now in DL, convert to ASCII
!mov [esi+5],dl     ;and store

!mul ebx            ;multiply remainder by 10 to get next digit
!add edx,&h30       ;convert to ASCII
!mov [esi+6],dl     ;and store

!mul ebx         ;..etc..
!add edx,&h30
!mov [esi+7],dl

!mul ebx
!add edx,&h30
!mov [esi+8],dl

!mul ebx
!add edx,&h30
!mov [esi+9],dl


!mov eax,edi     ;first 5 digits done, now get next 5 and do the same..
!mul ecx

!add edx,&h30
!mov [esi+0],dl

!mul ebx
!add edx,&h30
!mov [esi+1],dl

!mul ebx
!add edx,&h30
!mov [esi+2],dl

!mul ebx
!add edx,&h30
!mov [esi+3],dl

!mul ebx
!add edx,&h30
!mov [esi+4],dl

I think this code from 2/16/2004 was his MASM version:

_dwtoa proc dwValue:DWORD, lpBuffer:DWORD
   
    push ebx
    push esi
    push edi
    mov eax,dwValue
    mov esi,lpBuffer
   
    k4 = 429497             ;2^32/10000  rounded up
    k9 = 2814749768         ;2^43/5^5 = 2814749767.28

    mov ecx,k9              ;prepare to fast div by 100000
    or eax,eax              ;is number negative?
    jns skip                ;no, just leave it
    neg eax                 ;yes, make it positive
    mov byte ptr [esi],"-"  ;put sign in place
    inc esi                 ;leave room for sign
skip:
    mov edi,eax             ;save absolute value of number
                            ; let's call it 1234567890
    ;#####fast div by 100000
    mul ecx                 ;div 100000
    mov ecx,100000          ;prepare to multiply up the top 5 digits
    shr edx,16              ;shift top 5 digits into place edx=12345
    mov eax,edx             ;copy to eax to multiply up to real size again
    mov ebx,edx             ;save a copy of top 5 digits
    mul ecx                 ;eax=1234500000
    sub edi,eax             ;subtract orignal number  edi=0000067890
    ;##end fast div
    mov eax,ebx             ;get back high digits
    mov ecx,k4              ;about to do div 10000 by reciprocal multiply
    mov ebx,10              ;useful constant

    or eax,eax              ;if top 5 digit = 0 then skip that part
    jz SkipTop5

    ;####################part 1
    mul ecx                 ;div top 5 digits by 10000
    jc digit1               ;if digit is not zero then process it,
                            ; else try next digit
    mul ebx                 ;multiply by 10 to get next digit into edx
    jc digit2

    mul ebx
    jc digit3

    mul ebx
    jc digit4

    mul ebx
    jc digit5
SkipTop5:
    mov eax,edi             ;retrieve lower 5 digits
    mul ecx                 ;div 10000
    jc digit6

    mul ebx                 ;multiply by 10 to get next digit in edx
    jc digit7

    mul ebx                 ;3rd digit
    jc digit8

    mul ebx                 ;4th digit
    jc digit9

    mul ebx                 ;5th digit
    jmp digit10

    ;###############end part1

digit1:
    add edx,30h             ;top digit is left in edx, convert to ascii
    mov [esi],dl            ;store top digit

    mul ebx                 ;multiply by 10 to get next digit into edx
    inc esi
digit2:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx
    inc esi
digit3:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx
    inc esi
digit4:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx
    inc esi
digit5:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mov eax,edi             ;retrieve lower 5 digits
    mul ecx
    inc esi                 ;div 10000
digit6:
    add edx,30h             ;convert 1st digit to ascii
    mov [esi],dl            ;and store

    mul ebx                 ;multiply by 10 to get next digit in edx
    inc esi
digit7:
    add edx,30h             ;convert to ascii
    mov [esi],dl            ;and store

    mul ebx                 ;3rd digit
    inc esi
digit8:
    add edx,30h
    mov [esi],dl

    mul ebx                 ;4th digit
    inc esi
digit9:
    add edx,30h
    mov [esi],dl

    mul ebx                 ;5th digit
    inc esi
digit10:
    add edx,30h
    mov [esi],dx            ;last digit, store dx not dl to give a
                            ; zero termination for the result string
    pop edi
    pop esi
    pop ebx
    ret
_dwtoa endp

eschew obfuscation

Jimg

I couldn't get that one to work immediately, but here's results from another Paul Dixon routine from that thread-

AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
3150499999 udword

100000 bin2ascii-AMD
100000 bin2ascii-lingo
100000 udword

4294967294 bin2ascii-AMD
4294967294 bin2ascii-lingo
4294967294 udword


timing tests:
61 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
37 cycles, udword

Press any key to exit...


Of course speed costs, in this case, an extra 200 bytes for a lookup table.

He had one later that he claimed was faster still, but I couldn't get it to work.

[attachment deleted by admin]

MichaelW


GenuineIntel  Family 6  Model 7  Stepping 3
Intel Brand String: NA, processor is Pentium III
Features: FPU TSC CX8 CMOV FXSR MMX SSE

validity tests:
3150499999 bin2ascii-AMD
3150499999 bin2ascii-lingo
3150499999 udword

100000 bin2ascii-AMD
100000 bin2ascii-lingo
100000 udword

4294967294 bin2ascii-AMD
4294967294 bin2ascii-lingo
4294967294 udword


timing tests:
72 cycles, bin2ascii-AMD
62 cycles, bin2ascii-lingo
51 cycles, udword
eschew obfuscation

NightWare

here, it should be fixed, (i haven't tested strings, just the divisions), and i know... i'm a lazy coder to fix this like that... but why makes things complicate, when there is easy ways...  :bg
ALIGN 16
;
; syntax :
; mov eax,value
; mov edi,StringPointer
; call uDw2Ascii
;
uDw2Ascii PROC
push ebx ;; empiler ebx
push edi ;; empiler edi

mov ecx,eax ;; copier eax dans ecx
inc eax
jz SpecialCase
; on va diviser eax par 100000 de manière optimisée
; mov edx,2814749768 ;; placer 2814749768 (remplacement de 2814749767 et eax+1) dans edx
mov edx,2814749767
sub edi,5 ;; enlever 5 caractères à edi
mul edx ;; multiplier eax par 2814749768
shr edx,16 ;; décaler edx de 16 bits à droite
mov eax,100000 ;; placer 100000 dans eax
mov ebx,edx ;; copier edx dans ebx
mul edx ;; multiplier edx par 100000
sub ecx,eax ;; soustraire le résultat en eax à ecx
test ebx,ebx ;; fixer les flags de ebx
mov edx,ebx ;; replacer ebx dans edx
mov ebx,10 ;; placer 10 (notre multiplicateur décimale) dans ebx
jz Label01 ;; si c'est égal à 0, aller Label01
; sinon, on traite la partie supérieure du nombre
mov eax,429497 ;; on remplace eax par 429497
add edi,5 ;; ici, il y a plus de 5 caractères alors on rajoute les 5 qu'on avait enlevé
mul edx ;; multiplier edx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label02 ;; s'il existe un dépassement, aller Label02
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label03 ;; s'il existe un dépassement, aller Label03
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label04 ;; s'il existe un dépassement, aller Label04
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label05 ;; s'il existe un dépassement, aller Label05
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label06 ;; s'il existe un dépassement, aller Label06
dec edi ;; décrémenter l'adresse
; ici, on traite la partie inférieure du nombre
Label01: mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
jc Label07 ;; s'il existe un dépassement, aller Label07
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label08 ;; s'il existe un dépassement, aller Label08
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label09 ;; s'il existe un dépassement, aller Label09
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jc Label10 ;; s'il existe un dépassement, aller Label10
dec edi ;; décrémenter l'adresse
mul ebx ;; multiplier eax par 10
jmp Label11 ;; aller Label11
; ici, on va placer XXXXXXXXXX et le zéro final
Label02: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label03: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+1],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label04: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+2],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label05: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+3],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label06: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+4],dl ;; placer l'octet en dl à l'adresse en edi
mov eax,429497 ;; on remplace eax par 429497
mul ecx ;; multiplier ecx par 429497 (pour pouvoir extraire correctement les 5 décimales supérieures)
Label07: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+5],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label08: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+6],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label09: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+7],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label10: add dl,30h ;; ajouter le caractère de base "0" à la valeur en edx
mov BYTE PTR [edi+8],dl ;; placer l'octet en dl à l'adresse en edi
mul ebx ;; multiplier eax par 10
Label11: add dx,30h ;; ajouter le caractère de base "0" à la valeur en dx
mov WORD PTR [edi+9],dx ;; placer l'octet en dl et le 0 final en dh à l'adresse en edi

pop edi ;; désempiler edi
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)

SpecialCase:
mov DWORD PTR [edi],"4924"
mov DWORD PTR [edi+4],"2769"
mov WORD PTR [edi+8],"59"
mov BYTE PTR [edi+10],0

pop edi ;; désempiler edi
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
uDw2Ascii ENDP


lingo

NightWare,

"but why makes things complicate, when there is easy ways... "

Your easy ways are just to get lingo's decision of the problem... :lol
but your algo is still slow..

Jimg,
Nice code, and fastest... :wink
P4 3.6 GHz, 2GB RAM,  Windows XP Pro with SP2 

123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92   cycles, bin2ascii-lingo
61   cycles, bi2ascii-lingo 2
83   cycles, uDw2Ascii - NightWare
47   cycles, udword-p.dixon
Press any key to exit...


Regards,
Lingo 

[attachment deleted by admin]

hutch--

I wasn't aware that nightware was contrasting stack parameters to register passing so in part the comment makes sense as long as the algo the regs are passed to can use the data in the regs that are passed without running out of regs to do the work. I have rarely ever found this to be the case and while it starts to make sense on a 64 bit machine with twice as many registers, it is problematic on a 32 bit machine with 8 gp registers where the stack pointer is rarely ever available and EBP is only available with no stack frame.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

Nice thread :lol
I optimized AMD algo and two P.Dixon's algos and achieved the best time again: :wink
P4 3.6 GHz, 2GB RAM,  Windows XP Pro with SP2 

123456789
123456789
123456789
123456789
123456789
123456789
153 cycles, bin2ascii-AMD
92 cycles, bin2ascii-lingo
61 cycles, bi2ascii-lingo 2
83 cycles, uDw2Ascii - NightWare
47 cycles, udword-p.dixon
37 cycles, udwordl-lingo 3
Press any key to exit...

Regards,
Lingo



[attachment deleted by admin]

MichaelW

P3:

72 cycles, bin2ascii-AMD
65 cycles, bin2ascii-lingo
58 cycles, bi2ascii-lingo 2
56 cycles, uDw2Ascii - NightWare
50 cycles, udword-p.dixon
45 cycles, udwordl-lingo 3
eschew obfuscation

hutch--

Result on my old PIV. 2.8g Northwood, 800 meg fsb, ddr400.


123456789
123456789
123456789
123456789
123456789
123456789
145 cycles, bin2ascii-AMD
122 cycles, bin2ascii-lingo
143 cycles, bi2ascii-lingo 2
128 cycles, uDw2Ascii - NightWare
77 cycles, udword-p.dixon
74 cycles, udwordl-lingo 3
Press any key to exit...
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

six_L

Quote123456789
123456789
123456789
123456789
123456789
123456789
72 cycles, bin2ascii-AMD
61 cycles, bin2ascii-lingo
49 cycles, bi2ascii-lingo 2
48 cycles, uDw2Ascii - NightWare
42 cycles, udword-p.dixon
41 cycles, udwordl-lingo 3
Press any key to exit...
regards

Jimg

Nice job Lingo :U
123456789
123456789
123456789
123456789
123456789
123456789
63 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
55 cycles, bi2ascii-lingo 2
54 cycles, uDw2Ascii - NightWare
39 cycles, udword-p.dixon
35 cycles, udwordl-lingo 3
Press any key to exit...

I have a question however.   This is the end of your proc-
movzx edx, word ptr chartab[edx*2] ; look up 2 digits
mov [esi], dh ; but only write the 1 we need, supress the leading zero
add esi, 1 ; update pointer by 1
jnz ZS2 ; continue with pairs of digits to the end
udwordl endp   


align 16
udword proc lpbuf:DWORD, nNumber:DWORD    ;unsigned DWORD to ASCII
    push edi                   ;save registers that need to be saved
    push esi

    mov  esi, [esp+12]      ; sptr
    mov  eax, [esp+16]      ; x


Clearly esi will never be zero.  Is jnz faster than a jmp? 

Ossa

AMD 64 Mobility Athlon 3400+
512MB RAM
Win XP Home (SP2) (32-bit)

51 cycles, bin2ascii-AMD
43 cycles, bin2ascii-lingo
31 cycles, bi2ascii-lingo 2
28 cycles, uDw2Ascii - NightWare
25 cycles, udword-p.dixon
23 cycles, udwordl-lingo 3


Very nice.

Ossa
Website (very old): ossa.the-wot.co.uk

Jimg

lingo-

I like your stack handling (now why didn't I think of that?).

Applying your stack handling to the original p.dixon routine cuts off 2 cycles with no other changes.

We're getting down into the noise level here, however.
AuthenticAMD  Family 7  Model 10  Stepping 0
AMD Name String: AMD Athlon(tm) XP 3000+
Features: FPU TSC CX8 CMOV FXSR MMX SSE

timing tests:
60 cycles, bin2ascii-AMD
55 cycles, bin2ascii-lingo
38 cycles, udword-p.dixon
35 cycles, udwordl-lingo3
36 cycles, udwordx2-p.dixon3

Press any key to exit...