Notification on the "udw2str" algo as unsound.

hutch-- · November 25, 2010, 09:27:03 AM

:bg

Use the later version posted in the Lab with this code then try string lengths from 1 to 10.



udw2str2 proc dwNumber:DWORD, pszString:DWORD

    push ebx
    push esi

    mov eax, [esp+4][8]       ; value
    mov esi, [esp+8][8]       ; szbuf

    cmp eax, 9999
    ja lb4

    cmp eax, 999999
    ja lb6

    cmp eax, 9
    ja lb1
    add esi, 1
    jmp nxt
  lb1:
    cmp eax, 99
    ja lb2
    add esi, 2
    jmp nxt
  lb2:
    cmp eax, 999
    ja lb3
    add esi, 3
    jmp nxt
  lb3:
    cmp eax, 9999
    ja lb4
    add esi, 4
    jmp nxt
  lb4:
    cmp eax, 99999
    ja lb5
    add esi, 5
    jmp nxt
  lb5:
    cmp eax, 999999
    ja lb6
    add esi, 6
    jmp nxt
  lb6:
    cmp eax, 9999999
    ja lb7
    add esi, 7
    jmp nxt
  lb7:
    cmp eax, 99999999
    ja lb8
    add esi, 8
    jmp nxt
  lb8:
    cmp eax, 999999999
    ja lb9
    add esi, 9
    jmp nxt
  lb9:
    add esi, 10

  nxt:

    mov ebx, 0CCCCCCCDh
    mov BYTE PTR [esi], 0
    sub esi, 1

  stlp:
    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    jbe lpqt

    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    ja stlp

  lpqt:
    pop esi
    pop ebx

    ret 8

udw2str2 endp

dedndave · November 25, 2010, 10:48:36 AM

wouldn't it make sense to test for longer values first ?

Code Select

frequency of occurance
32 bit integers converted to decimal strings

unsigned (number of characters = number of digits)

 1 digit  10
 2 digits 90
 3 digits 900
 4 digits 9000
 5 digits 90000
 6 digits 900000
 7 digits 9000000
 8 digits 90000000
 9 digits 900000000
10 digits 3294967296

hutch-- · November 25, 2010, 11:08:28 AM

:bg

Yes but it was not any faster. I think Lingo has the right idea, middle branching on a basic tree layout made the first run of the algo faster but it does not seem to effect the sustain rate.

dedndave · November 25, 2010, 11:15:40 AM

it would be faster overall !!!
if you wrote a test piece to take the time for each of the 4294967296 possible values,
added those times together, then divided by the same number, you'd see an improvement
(that's why i posted the frequency of occurance chart)

yes - spliting it into groups reduces the number of CMP's/JMP's that are executed, i think

Code Select

        cmp     eax,9999999
        ja      more_than_7_digits

        cmp     eax,9999
        ja      more_than_4_digits

;-------------------------------

        cmp     eax,999
        ja      have_4_digits

        cmp     eax,99
        ja      have_3_digits

        cmp     eax,9
        ja      have_2_digits

have_1_digit:
        add     esi,1
        jmp     nxt

have_2_digits:
        add     esi,2
        jmp     nxt

have_3_digits:
        add     esi,3
        jmp     nxt

have_4_digits:
        add     esi,4
        jmp     nxt

;-------------------------------

more_than_4_digits:
        cmp     eax,999999
        ja      have_7_digits

        cmp     eax,99999
        ja      have_6_digits

have_5_digit:
        add     esi,5
        jmp     nxt

have_6_digits:
        add     esi,6
        jmp     nxt

have_7_digits:
        add     esi,7
        jmp     nxt

;-------------------------------

more_than_7_digits:
        cmp     eax,999999999
        ja      have_10_digits

        cmp     eax,99999999
        ja      have_9_digits

have_8_digits:
        add     esi,8
        jmp     nxt

have_9_digits:
        add     esi,9
        jmp     nxt

have_10_digits:
        add     esi,10

;-------------------------------

nxt:

dedndave · November 25, 2010, 11:23:56 AM

in this case, i had 9 digits max
i needed to scale the value - not just adjust a pointer
it might be better if i used 32-bit instructions for EBX and ECX, though
i know - i know - i can unroll it, too :bg

Code Select

;determine the number of digits in the first dword

;ECX = 0 or 1
;ESI = active index pointer (offset of highest order accumulator dword)
;EBP = stack frame base pointer

Ascii0: mov     cl,9                    ;ECX = 9
        mov     eax,[esi]               ;load the first dword
        mov     bl,3                    ;BL = 3
        push    eax                     ;save the first dword

;scale the first dword by 1,000 to count digits
;we may make 0, 1, or 2 passes on this loop

Ascii1: cmp     eax,1000000             ;is it  >= 1,000,000 ?
        jae     Ascii2                  ;yes - determine mod 3 count

        sub     cl,bl                   ;no - subtract 3 from count
        imul    eax,eax,1000            ;scale it by 1,000
        cmp     cl,bl                   ;last 3 digits ?
        ja      Ascii1                  ;no - until it is

hutch-- · November 25, 2010, 01:18:39 PM

Dave,

here is a version that does the calculations top down with branch predictions backwards and its still no faster.

Code Select


;  ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

ALIGN 16
; u32toa by brethren
utoa2 PROC, value:DWORD, szbuf:DWORD

    push ebx
    push esi

    mov eax, [esp+4][8]       ; value
    mov esi, [esp+8][8]       ; szbuf

    cmp eax, 1000000000
    jb lb9
    add esi, 10

  nxt:
    mov ebx, 0CCCCCCCDh
    mov BYTE PTR [esi], 0
    sub esi, 1

  align 4
  stlp:
    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    jbe lpqt

    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    ja stlp

  lpqt:
    pop esi
    pop ebx

    ret 8

  align 4
  lb1:
    add esi, 1
    jmp nxt
  lb2:
    cmp eax, 10
    jb lb1
    add esi, 2
    jmp nxt
  lb3:
    cmp eax, 100
    jb lb2
    add esi, 3
    jmp nxt
  lb4:
    cmp eax, 1000
    jb lb3
    add esi, 4
    jmp nxt
  lb5:
    cmp eax, 10000
    jb lb4
    add esi, 5
    jmp nxt
  lb6:
    cmp eax, 100000
    jb lb5
    add esi, 6
    jmp nxt
  lb7:
    cmp eax, 1000000
    jb lb6
    add esi, 7
    jmp nxt
  lb8:
    cmp eax, 10000000
    jb lb7
    add esi, 8
    jmp nxt
  lb9:
    cmp eax, 100000000
    jb lb8
    add esi, 9
    jmp nxt

utoa2 ENDP

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;  ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

dedndave · November 25, 2010, 01:54:59 PM

well - it's no faster on your i7 :P
and - testing it with the same value over and over isn't going to reveal anything

i think the tree here is a good plan
http://www.masm32.com/board/index.php?topic=15427.msg126379#msg126379

i have to get my head into your test code - lol

lingo · November 25, 2010, 02:52:44 PM

"Use the later version posted in the Lab..."

I did it and my algo is still faster... :wink
results are:
69 cycles for udw2str2-Hutch
65 cycles for u32toalin-Lingo
Press any key to continue ...

brethren · November 25, 2010, 05:04:41 PM

lingos test4 on AMD Turion(tm) 64 X2 Mobile Technology TL-52

Code Select

71 cycles for udw2str2-Hutch
77 cycles for u32toalin-Lingo
Press any key to continue ...

test2

Code Select

80 cycles for udw2str2-Hutch
68 cycles for u32toalin-Lingo
Press any key to continue ...

suprisingly the only difference in the code in u32toalin between lingos test2 and test4(where's test3 :P) is:
test4

Code Select


lea		edx, [edx+edx*4-24]
lea		edx, [edx+edx]

test2

Code Select


lea		edx, [edx+edx*4]
lea		edx, [edx+edx-30h]

jj2007 · November 25, 2010, 06:21:10 PM

Quote from: brethren on November 25, 2010, 05:04:41 PM
suprisingly the only difference in the code between lingos test2 and test4(where's test3 :P) is:

Same size, same cycle count on the Celeron M. But Hutch's algo runs at 94 cycles instead of 104.

hutch-- · November 25, 2010, 10:38:22 PM

:bg

Dave,

I mainly use the Core2 Quad, the i7 Win64 is for watching TV while I am writing code on the other box. :P

It runs fine and is slightly faster here and there than the Core2 quad but I confess I don't like Win7 much.

The last algo I posted that does the lead calculation top down should have the lower overhead for 10 digit numbers as it just falls through into the conversion. I have my doubts with the benchmarks posted as they do not sweep the range down to small digit counts, that is why I am doing the timings with the dialog benchmark in the lab.

I played with middle range branching to the comparison list but it only seems to favour samples that are near that length and each leading branch slows the rest down. I could not get the algo faster when sweeping different length ranges by doing it so I gave up on the idea. I still think your logic of doing the calculation top down makes the most sense as the number counts are much higher for the large values.

dedndave · November 26, 2010, 12:09:35 AM

even the Core2 Quad is fast on so many things
you need to run it on your ShitBox :bg (closely emulates what the rest of us can afford - lol)

i still think the tree in reply #18 is the way to fly, Hutch

News:

Notification on the "udw2str" algo as unsound.