News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Notification on the "udw2str" algo as unsound.

Started by hutch--, November 24, 2010, 01:16:07 AM

Previous topic - Next topic

hutch--

 :bg

Use the later version posted in the Lab with this code then try string lengths from 1 to 10.



udw2str2 proc dwNumber:DWORD, pszString:DWORD

    push ebx
    push esi

    mov eax, [esp+4][8]       ; value
    mov esi, [esp+8][8]       ; szbuf

    cmp eax, 9999
    ja lb4

    cmp eax, 999999
    ja lb6

    cmp eax, 9
    ja lb1
    add esi, 1
    jmp nxt
  lb1:
    cmp eax, 99
    ja lb2
    add esi, 2
    jmp nxt
  lb2:
    cmp eax, 999
    ja lb3
    add esi, 3
    jmp nxt
  lb3:
    cmp eax, 9999
    ja lb4
    add esi, 4
    jmp nxt
  lb4:
    cmp eax, 99999
    ja lb5
    add esi, 5
    jmp nxt
  lb5:
    cmp eax, 999999
    ja lb6
    add esi, 6
    jmp nxt
  lb6:
    cmp eax, 9999999
    ja lb7
    add esi, 7
    jmp nxt
  lb7:
    cmp eax, 99999999
    ja lb8
    add esi, 8
    jmp nxt
  lb8:
    cmp eax, 999999999
    ja lb9
    add esi, 9
    jmp nxt
  lb9:
    add esi, 10

  nxt:

    mov ebx, 0CCCCCCCDh
    mov BYTE PTR [esi], 0
    sub esi, 1

  stlp:
    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    jbe lpqt

    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    ja stlp

  lpqt:
    pop esi
    pop ebx

    ret 8

udw2str2 endp
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

wouldn't it make sense to test for longer values first ?
frequency of occurance
32 bit integers converted to decimal strings

unsigned (number of characters = number of digits)

1 digit  10
2 digits 90
3 digits 900
4 digits 9000
5 digits 90000
6 digits 900000
7 digits 9000000
8 digits 90000000
9 digits 900000000
10 digits 3294967296

hutch--

 :bg

Yes but it was not any faster. I think Lingo has the right idea, middle branching on a basic tree layout made the first run of the algo faster but it does not seem to effect the sustain rate.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

it would be faster overall !!!
if you wrote a test piece to take the time for each of the 4294967296 possible values,
added those times together, then divided by the same number, you'd see an improvement
(that's why i posted the frequency of occurance chart)

yes - spliting it into groups reduces the number of CMP's/JMP's that are executed, i think
        cmp     eax,9999999
        ja      more_than_7_digits

        cmp     eax,9999
        ja      more_than_4_digits

;-------------------------------

        cmp     eax,999
        ja      have_4_digits

        cmp     eax,99
        ja      have_3_digits

        cmp     eax,9
        ja      have_2_digits

have_1_digit:
        add     esi,1
        jmp     nxt

have_2_digits:
        add     esi,2
        jmp     nxt

have_3_digits:
        add     esi,3
        jmp     nxt

have_4_digits:
        add     esi,4
        jmp     nxt

;-------------------------------

more_than_4_digits:
        cmp     eax,999999
        ja      have_7_digits

        cmp     eax,99999
        ja      have_6_digits

have_5_digit:
        add     esi,5
        jmp     nxt

have_6_digits:
        add     esi,6
        jmp     nxt

have_7_digits:
        add     esi,7
        jmp     nxt

;-------------------------------

more_than_7_digits:
        cmp     eax,999999999
        ja      have_10_digits

        cmp     eax,99999999
        ja      have_9_digits

have_8_digits:
        add     esi,8
        jmp     nxt

have_9_digits:
        add     esi,9
        jmp     nxt

have_10_digits:
        add     esi,10

;-------------------------------

nxt:

dedndave

in this case, i had 9 digits max
i needed to scale the value - not just adjust a pointer
it might be better if i used 32-bit instructions for EBX and ECX, though
i know - i know - i can unroll it, too   :bg
;determine the number of digits in the first dword

;ECX = 0 or 1
;ESI = active index pointer (offset of highest order accumulator dword)
;EBP = stack frame base pointer

Ascii0: mov     cl,9                    ;ECX = 9
        mov     eax,[esi]               ;load the first dword
        mov     bl,3                    ;BL = 3
        push    eax                     ;save the first dword

;scale the first dword by 1,000 to count digits
;we may make 0, 1, or 2 passes on this loop

Ascii1: cmp     eax,1000000             ;is it  >= 1,000,000 ?
        jae     Ascii2                  ;yes - determine mod 3 count

        sub     cl,bl                   ;no - subtract 3 from count
        imul    eax,eax,1000            ;scale it by 1,000
        cmp     cl,bl                   ;last 3 digits ?
        ja      Ascii1                  ;no - until it is

hutch--

Dave,

here is a version that does the calculations top down with branch predictions backwards and its still no faster.


;  ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

ALIGN 16
; u32toa by brethren
utoa2 PROC, value:DWORD, szbuf:DWORD

    push ebx
    push esi

    mov eax, [esp+4][8]       ; value
    mov esi, [esp+8][8]       ; szbuf

    cmp eax, 1000000000
    jb lb9
    add esi, 10

  nxt:
    mov ebx, 0CCCCCCCDh
    mov BYTE PTR [esi], 0
    sub esi, 1

  align 4
  stlp:
    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    jbe lpqt

    mov ecx, eax
    mul ebx
    shr edx, 3
    mov eax, edx
    lea edx, [edx+edx*4]
    lea edx, [edx+edx-48]
    sub ecx, edx
    mov [esi], cl
    sub esi, 1
    test eax, eax
    ja stlp

  lpqt:
    pop esi
    pop ebx

    ret 8

  align 4
  lb1:
    add esi, 1
    jmp nxt
  lb2:
    cmp eax, 10
    jb lb1
    add esi, 2
    jmp nxt
  lb3:
    cmp eax, 100
    jb lb2
    add esi, 3
    jmp nxt
  lb4:
    cmp eax, 1000
    jb lb3
    add esi, 4
    jmp nxt
  lb5:
    cmp eax, 10000
    jb lb4
    add esi, 5
    jmp nxt
  lb6:
    cmp eax, 100000
    jb lb5
    add esi, 6
    jmp nxt
  lb7:
    cmp eax, 1000000
    jb lb6
    add esi, 7
    jmp nxt
  lb8:
    cmp eax, 10000000
    jb lb7
    add esi, 8
    jmp nxt
  lb9:
    cmp eax, 100000000
    jb lb8
    add esi, 9
    jmp nxt

utoa2 ENDP

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

;  ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

well - it's no faster on your i7   :P
and - testing it with the same value over and over isn't going to reveal anything

i think the tree here is a good plan
http://www.masm32.com/board/index.php?topic=15427.msg126379#msg126379

i have to get my head into your test code - lol

lingo

"Use the later version posted in the Lab..."

I did it and my algo is still faster... :wink
results are:
69 cycles for udw2str2-Hutch
65 cycles for u32toalin-Lingo
Press any key to continue ...

brethren

#23
lingos test4 on AMD Turion(tm) 64 X2 Mobile Technology TL-52
71 cycles for udw2str2-Hutch
77 cycles for u32toalin-Lingo
Press any key to continue ...

test2
80 cycles for udw2str2-Hutch
68 cycles for u32toalin-Lingo
Press any key to continue ...

suprisingly the only difference in the code in u32toalin between lingos test2 and test4(where's test3 :P) is:
test4

lea edx, [edx+edx*4-24]
lea edx, [edx+edx]

test2

lea edx, [edx+edx*4]
lea edx, [edx+edx-30h]



jj2007

Quote from: brethren on November 25, 2010, 05:04:41 PM
suprisingly the only difference in the code between lingos test2 and test4(where's test3 :P) is:

Same size, same cycle count on the Celeron M. But Hutch's algo runs at 94 cycles instead of 104.

hutch--

 :bg

Dave,

I mainly use the Core2 Quad, the i7 Win64 is for watching TV while I am writing code on the other box.  :P

It runs fine and is slightly faster here and there than the Core2 quad but I confess I don't like Win7 much.

The last algo I posted that does the lead calculation top down should have the lower overhead for 10 digit numbers as it just falls through into the conversion. I have my doubts with the benchmarks posted as they do not sweep the range down to small digit counts, that is why I am doing the timings with the dialog benchmark in the lab.

I played with middle range branching to the comparison list but it only seems to favour samples that are near that length and each leading branch slows the rest down. I could not get the algo faster when sweeping different length ranges by doing it so I gave up on the idea. I still think your logic of doing the calculation top down makes the most sense as the number counts are much higher for the large values.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

even the Core2 Quad is fast on so many things
you need to run it on your ShitBox   :bg  (closely emulates what the rest of us can afford - lol)

i still think the tree in reply #18 is the way to fly, Hutch