News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

MASM for FUN - #0.2 [string reversing]

Started by frktons, May 20, 2010, 08:24:50 PM

Previous topic - Next topic

dedndave

i was looking at Hutch's proc
i noticed he had the capability to specify a seperate output buffer
but, combining the 2 capabilities isn't as simple as one might think   :P
probably best to just do in-place reversal
if they want a copy - let them make it before the call (or call it again - lol)

hutch--

This is what it looks like without the extra code for the alternate buffer. It is faster but it needs to have to Strlen algo unrolled directly into it to try aned pick up a bit more pace. Either aligning the label or unrollinng it makes it slower. I fixed the 1st character being chopped off, (removed 1 SUB) but it still does not test for a character length of 1 or null.


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

szRev1 proc src:DWORD

    push esi
    push edi

    mov esi, [esp+4][8]                 ; src
    mov edi, esi

    invoke StrLen,esi

    sub eax, 1
    add edi, eax
    shr eax, 1                          ; int divide length by 2
    neg eax                             ; invert sign
    sub esi, eax                        ; sub half len from ESI

  @@:
    movzx ecx, BYTE PTR [esi+eax]       ; load end pairs
    movzx edx, BYTE PTR [edi]
    mov [esi+eax], dl                   ; swap end pairs
    mov [edi], cl
    sub edi, 1
    add eax, 1
    jnz @B                              ; exit on half length

    pop edi
    pop esi

    ret 4

szRev1 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

i don't understand why it is better to use register-index addressing rather than using eax for the count
@@:
    movzx ecx, BYTE PTR [esi]           ; load end pairs
    movzx edx, BYTE PTR [edi]
    mov [edi], cl                       ; swap end pairs
    mov [esi], dl
    sub edi, 1
    add esi, 1
    sub eax, 1
    jnz @B                              ; exit on half length

sinsi


sinsi:
    push [esp+4]
    call StrLen
    mov ecx,[esp+4]
    lea edx,[ecx+eax]
@@: cmp edx,ecx
    jbe @f
    mov al,[ecx]
    mov ah,[edx-1]
    mov [ecx],ah
    mov [edx-1],al
    add ecx,1
    sub edx,1
    jmp @b
@@: ret 4

Any length, even 0, but chokes on a 1000 character string.
Same time as hutch's latest algo on my q6600.
Light travels faster than sound, that's why some people seem bright until you hear them.

dedndave

QuoteAny length, even 0, but chokes on a 1000 character string.

what do you mean by "chokes" ?????
try not to be so technical  :P  i don't get it

sinsi

chokes -> is a bit slower than the others
Light travels faster than sound, that's why some people seem bright until you hear them.

dedndave

ahhhh - ok
that is strange behaviour - but at least it keeps ticking
i was just looking at the StrLen procedure
trying to make sense of it
i don't know where Hutch came up with some of this stuff  :bg

i think, in the real world, this is more likely to be used for shorter strings

hutch--

Dave,

the calculations and negating EAX before the loop was designed to remove the extra ADD that you had to put in the loop. Somewhere around the PIII era older hardware did not like more complex addressing but the PIV generation and later don't seem to have a problem with it.

The real problem with an opposite end swapper is cache distance, over a certain source length you break the cache which would impinge on its speed very badly. If you had to reverse strings where the ends were larger than the cache, you would do it as buffer copy so you at least got linear read then linear write.

You cam blame Agner Fog for the StrLen algo, its a classic that still has the legs over most others and its 15 years old.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

well - that's good to know
i have tried to avoid such complex addressing, unless the situation requires it - like multidimensional arrays or something

as for StrLen - i am not saying anything bad about it - it is obviously fast
i am just trying to make it work in my head - lol

jj2007

Quote from: dedndave on May 23, 2010, 05:17:47 AM
as for StrLen - i am not saying anything bad about it - it is obviously fast

It is pretty fast for a non-SSE2 algo, although I vaguely remember we had a faster one already.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Testing string len for a 100 character string, aligned
132     cycles for Masm32 len
27      cycles for MasmBasic Len
102     cycles for StrLen

Testing string len for a 100 character string, unaligned
133     cycles for Masm32 len
27      cycles for MasmBasic Len
97      cycles for StrLen

hutch--

Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on May 23, 2010, 07:59:45 AM
JJ,

Can't buld it, missing file.

Hutch,

Either disable the MasmBasic stuff (line 2, 32, 60) or install the library from http://www.masm32.com/board/index.php?topic=12460

Anyway, StrLen is pretty fast, even faster than crt_strlen. Is there a good reason why you still use szLen for the len() macro? There is one risk in exchanging them: len() aka szLen does not alter ecx and edx, StrLen does. That might break some code, but 2 push/pop pairs cost about 2 cycles only.

hutch--

JJ,

With szLen() its a point of diminishing returns for where it is normally used as a macro. StrLen is the aligned "go fast" version, szLen is indifferent to alignment and easily fast enough in common usages. The type of usage is so common if I can't find the lib in a hurry I write a simple loop to do the job. If you have a task that is (1) aligned correctly and (2) big enough a genuinely fast sse version makes sense but the application for it would be very rare.

I will test the last algo with the lines removed, I cannot interfere with the setup adding extra stuff into it.

PS: It still won't build with the lines commented out. Too much of the example depends on code not present.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on May 23, 2010, 10:28:48 AM
If you have a task that is (1) aligned correctly and (2) big enough a genuinely fast sse version makes sense
Hutch,
SSE2 also works for small unaligned strings, see below. Typical string len for an asm source is around 60 bytes. The only valid argument not to use it is pre-SS2 hardware.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Testing string len for a 100 character string, aligned
138     cycles for Masm32 len
119     cycles for crt_strlen
32      cycles for MasmBasic Len
108     cycles for StrLen

Testing string len for a 100 character string, unaligned
136     cycles for Masm32 len
120     cycles for crt_strlen
27      cycles for MasmBasic Len
94      cycles for StrLen

Testing string len for a 30 character string, aligned
52      cycles for Masm32 len
41      cycles for crt_strlen
16      cycles for MasmBasic Len
35      cycles for StrLen

Testing string len for a 30 character string, unaligned
45      cycles for Masm32 len
41      cycles for crt_strlen
14      cycles for MasmBasic Len
31      cycles for StrLen



> PS: It still won't build with the lines commented out. Too much of the example depends on code not present.

See line 2: useMB = 0

frktons

I used the pgm JJ posted and in my core duo I get:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
Testing string len for a 100 character string, aligned
104     cycles for Masm32 len
80      cycles for crt_strlen
16      cycles for MasmBasic Len
82      cycles for StrLen

Testing string len for a 100 character string, unaligned
105     cycles for Masm32 len
80      cycles for crt_strlen
17      cycles for MasmBasic Len
67      cycles for StrLen

Testing string len for a 30 character string, aligned
34      cycles for Masm32 len
26      cycles for crt_strlen
9       cycles for MasmBasic Len
23      cycles for StrLen

Testing string len for a 30 character string, unaligned
34      cycles for Masm32 len
25      cycles for crt_strlen
10      cycles for MasmBasic Len
26      cycles for StrLen



I am just back from the week-end, no time to see anything so far,
but a lot of experimentation has been done, I guess. Very well done friends!!!  :P

Let's see if tomorrow I have time to print and see what has happened.
Almost sure I'll understand a small portion of the whole.  :lol

Thanks everybody for entering the game.  :U

Frank
Mind is like a parachute. You know what to do in order to use it :-)