News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Reverse string

Started by lostcauz, October 01, 2005, 05:19:45 PM

Previous topic - Next topic

Jimg


hutch--

Guys,

Can we keep this something like friendly as these extended discussions about algo design are often useful and produce some very good code.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo


Jimg

Sorry Hutch, I wasn't trying to be contentious in the slightest.  I guess my people skills are right up there with the average nerd. :'(

Ratch

Jimg,
     You have nothing to apologize about.  You tried to do the right thing with your timing program.  Just remember, no good deed goes unpunished.  Ratch

NightWare

return of the son of the algos...

a small speedup for AnotherStringReverse, very small (i expected more... because it avoid stalls in many case... but...)


ALIGN 16
AnotherStringReverse PROC  StrAddr:DWORD,StrSize:DWORD
push ebx
push esi
push edi

mov esi,StrAddr
mov eax,StrSize
lea edi,[esi+eax] ; if you use SIZEOF => lea edi,[esi+eax-1]
mov ebx,eax
and ebx,11111111111111111111111111111000b
jz Label2
and eax,00000000000000000000000000000111b

Label1: add edi,-4
mov ecx,DWORD PTR [esi]
mov edx,DWORD PTR [edi] ; big string algo
bswap ecx
bswap edx
mov DWORD PTR [edi],ecx
mov DWORD PTR [esi],edx
add esi,4
add ebx,-8
jnz Label1

Label2: shr eax,1 ; to eliminate impair values
jz Label4
cmp eax,1
jne Label3

mov cl,BYTE PTR [esi] ; very small string algo
mov ch,BYTE PTR [edi-1]
mov BYTE PTR [esi],ch
mov BYTE PTR [edi-1],cl
jmp Label4

Label3: mov ecx,DWORD PTR [esi] ; small string algo
mov edx,DWORD PTR [edi-4]
bswap ecx
bswap edx
mov DWORD PTR [edi-4],ecx
mov DWORD PTR [esi],edx
Label4:
pop edi
pop esi
pop ebx
ret
AnotherStringReverse ENDP


here the fixed FastReverseString, and it's a bit faster... this is the one i've chosen to use... a bit slower on small string, but faster on bigger :


ALIGN 16
FastReverseString PROC StrAddr:DWORD,StrSize:DWORD

push ebx
push esi
push edi

mov esi,StrAddr
mov eax,StrSize

lea edi,[esi+eax-4] ; ) put the middle of the string in edi
shr eax,1 ; )
sub edi,eax ; )
mov ebx,eax
and eax,00000000000000000000000000000011b ; ) add the shift
add edi,eax ; )
and ebx,11111111111111111111111111111100b
jz Label2

Label1: mov ecx,DWORD PTR [esi] ; big string algo
mov edx,DWORD PTR [edi+ebx]
bswap ecx
bswap edx
mov DWORD PTR [edi+ebx],ecx
mov DWORD PTR [esi],edx
add esi,4
add ebx,-4
jnz Label1

Label2: test eax,eax ; nothing to do ?
jz Label4
cmp eax,1 ; wich algo ?
jne Label3

mov cl,BYTE PTR [edi+3] ; very small string algo
mov ch,BYTE PTR [esi]
mov BYTE PTR [edi+3],ch
mov BYTE PTR [esi],cl
jmp Label4

Label3: mov ecx,DWORD PTR [esi] ; small string algo
mov edx,DWORD PTR [edi+ebx]
bswap ecx
bswap edx
mov DWORD PTR [edi+ebx],ecx
mov DWORD PTR [esi],edx
Label4:
pop edi
pop esi
pop ebx
ret
FastReverseString ENDP


now, i will not spend time anymore for any reverse procedure... it makes me sick now...


note : strange, it seams (on my celeron) it's a bit faster to use :
   mov ecx,DWORD PTR [esi]
   mov edx,DWORD PTR [edi]
   bswap ecx
   bswap edx
   mov DWORD PTR [edi],ecx
   mov DWORD PTR [esi],edx
instead of :
   mov edx,DWORD PTR [edi]
   mov ecx,DWORD PTR [esi]
   bswap ecx
   bswap edx
   mov DWORD PTR [esi],edx
   mov DWORD PTR [edi],ecx
here this is not a question of alignment, so it seams it's better to not use the same registers of the previous line (when it's possible)

Ratch

NightWare,

Quote
a small speedup for AnotherStringReverse, very small (i expected more... because it avoid stalls in many case... but...)

     Both your latest submissions of AnotherStringReverse and FastReverseString appear to work correctly.  If you really want to speed things up, do as Lingo does.  Make them write in second buffer distant from the source buffer, as I pointed out in a previous post to Jimg.

Quote
here this is not a question of alignment, so it seams it's better to not use the same registers of the previous line (when it's possible)

     Of course, haven't I been talking about write/read stalls?  Ratch

OceanJeff32

Reverse String? Can't you just PUSH everything on the stack, and then POP it off, and it's reversed?

later guys,

jeff c
:U
Any good programmer knows, every large and/or small job, is equally large, to the programmer!

Ratch

OceanJeff32,

Quote
Reverse String? Can't you just PUSH everything on the stack, and then POP it off, and it's reversed?

     Yes, but you first have to write the DWORD into a register for a BSWAP.  And PUSH/POP are not as fast as MOV.  And you need two loops, one for PUSHes and one for POPs.  And if your string is really long, you need to augment the stack. And you need to synchronize the string end if the string length is not divisible by 4.  You would ibe transferring the contents of one memory location (source string) into a register, then transferring the register to stack memory, then  transferring the contents of the stack to still another memory location (destination string).  Not very efficient.  Ratch

lingo

OceanJeff32, :lol

small and slow...

OPTION PROLOGUE:NONE                   ; turn it off
OPTION EPILOGUE:NONE                   ;
StrRev proc lpString:DWORD             ;
                                       ;
       pop  edx                        ; edx->return address
       pop  eax                        ; eax->lpString
       push edx                        ; edx->return address
       push esi                        ; save esi
       cld                             ; clears the Direction Flag
       xor  esi, esi                   ; esi = 0
       push edi                        ; save edi
       mov  edi, eax                   ; edi->lpString
       xchg eax, esi                   ; esi->lpString; eax = 0
@@:                                    ; saving the string in the stack
       push eax                        ;
       lodsb                           ; mov al, [esi] -> inc esi
       test eax, eax                   ; is it end of the string?
       jne  @b                         ;
@@:                                    ; restoring the string  from the stack
       pop  eax                        ;
       stosb                           ; mov [edi], al -> inc edi
       dec  eax                        ; is it end of the string from the stack?
       jns  @b                         ;
       pop  edi                        ; restore esi and edi
       pop  esi                        ;
       ret                             ; 25 bytes
StrRev endp                            ;
OPTION PROLOGUE:PROLOGUEDEF            ; turn back on the defaults
OPTION EPILOGUE:EPILOGUEDEF            ;

Regards,
Lingo

NightWare

ratch,

Quote from: Ratch on October 12, 2005, 01:47:28 AM
Of course, haven't I been talking about write/read stalls?  Ratch

i wasn't aware about that... for me, stalls only happened when you change the size of registers... as i said before, i'm just playing with coding technics... i'm not a good coder... (but i think i'm able to beat the previous algo posted by lingo,  :toothy  :toothy  :toothy )

jimg,

:dazzled: only 10 loops ?... you really think the result reflect the real speed ? coz the results i've obtained are totally different of your post... and also quite strange, here an example...


Str1 Size        1     9    18    27    28    29    36   100   400  1000
============ ===== ===== ===== ===== ===== ===== ===== ===== ===== =====

NightWare       13    20    57    62    51    56    57   122   418  1018


same speed for 18 bytes and 36 bytes, but :

18 = 8*2 + 2 =>  (2 main loop) + (1 byte exchange)
36 = 8*4 + 4 =>  (4 main loop) + (2 byte exchange)

and like ratch, i'm unable to assemble it, so i can't see the result with more loops...

Jimg

Hi NightWare-

The problem with assembling it is probably the path to your library and or include files.  I use WinAsm Studio and have my path as part of the WinAsm settings.  You will probably have to add your path specs up in the includes.  If this doesn't fix it, let me know what the error comment you are getting is.

Yes, I've tested it extensively, and I think the numbers are much more representative and repeatable (and much faster) than running it a million times.  You can change the code to average and loop more times if you want.  The important part of the process is calling from and running from the exact same spot in memory, and using the exact same input and output memory locations.  I'll have to step through your code to see what the problem is, but the 27, and 29 were specifically chosen to be three and one extra bytes from an even dword and so it takes longer to process.  Since 36 is an even dword number of bytes, it make sense that it would be fast compared to 18 which has an extra two bytes to take care of.  Likewise for 28 vs. 27 or 29.  Again, I'll have to step through your code to see what exactly is going on.  The beauty of Lingo is that it uses the same code for an odd number of bytes, it just processes the extras twice, using the main code, which is much faster than loading the code to do the exceptions.

Of course, your numbers will probably be quite different than mine because it is cpu dependent, and there is a really big difference between AMD and Intel chips.


Jimg

I can't seem to work through your code. As I read it, I keep thinking of other things to try which destroys my train of thought as far as stepping through yours.  But there is definately a big penalty somewhere when the input is not an even multiple of four.  Yours is faster on large strings of dwords, but look at this penalty for being off a byte-
Str1 Size        1     9    18    27    28    29    36   100   401  1000  1001
============ ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====

0 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
NightWare       11    19    43    57    37    54    42    93   562   647  1667
NightWare2      10    18    36    48    60    67    64    96   666   522  1659
Algo3            9    19    23    27    26    28    31    82   273   642   648

1 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
NightWare       11    19    47    53    66    66    81   192   687  1735  1735
NightWare2      10    18    44    46    76    76    87   202   695  1708  1708
Algo3            9    20    23    28    28    28    32    83   272   643   647

2 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
NightWare       11    19    39    55    66    66    81   192   687  1735  1735
NightWare2      10    18    36    48    76    76    87   202   695  1708  1708
Algo3            9    20    22    28    28    28    32    83   272   643   647

3 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
NightWare       11    16    45    55    67    52    81   192   667  1735  1697
NightWare2      10    14    37    48    76    69    87   202   669  1708  1765
Algo3            9    18    24    28    28    28    32    83   273   643   647

Press enter to exit...

(Algo3 is a mangle of Lingo's code I've been looking at, NIghtWare2 is your latest code posting)

Jimg

Well, so far I had just done timing tests and not really played with the algorithms other than mangling Lingo's code.  But I decided to start with a very simple loop as a basis to see what the effect of various schemes would be.  I used the first and last part of Lingo's code since I was really only interested in the speeding up the main loop and I couldn't see anything obvious or worth while to do to the first or last part.  So here's the code I was going to use as the base timing to see how the other schemes would speed it up-

Algo4 proc      ; first and last parts are modified Lingo, thank you Lingo, main part is all jimg.
    push esi
    push edi
    mov esi,Str1Offset  ; pickup left side
    mov edi,Str2Offset  ; store left side
    mov ecx,Str1Size    ; will be offset from right
    xor eax,eax         ; offset from left
    cmp ecx,4
    mov byte ptr [edi+ecx],al ; zero byte terminator
    jl  Bytes3
    push ebx
    sub eax,4


@@:                     ;  Main loop
    add eax,4
    add ecx,-4
    mov edx,[esi+eax]
    mov ebx,[esi+ecx]
    bswap edx
    bswap ebx
    mov [edi+eax],ebx
    mov [edi+ecx],edx
    cmp eax,ecx
    jl  @b
   
   
    pop ebx
    pop edi
    pop esi
    ret
align   16
Bytes3:                      ; possible input    a     ab     abc     
                             ;            ecx=   1      2      3
                             ; currrent output  .0..  ..0.   ...0
    movzx edx, byte  ptr [esi]  ;[str1]   edx=   a      a      a
    cmp ecx, 1
    mov byte ptr [edi+ecx-1],dl ;[str2+ecx-1]   a0..  .a0.   ..a0 
    jle End3                    ;               done       
    mov dl, byte  ptr [esi+1]   ;[str1+1] edx=          b      b       
    cmp ecx, 2
    mov byte ptr [edi+ecx-2],dl ;[str2+ecx-2]         ba0.   .ba0 
    jle End3                    ;                     done
    mov dl, byte  ptr [esi+2]   ;[str1+2] edx=                 c
    mov byte ptr [edi+ecx-3],dl ;[str2+ecx-3]                cba0
End3:
    pop edi
    pop esi
    ret
Algo4 endp       


To my amazement, here's the timing results-
Str1 Size        1     9    18    27    28    29    36   100   401  1000  1001
============ ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====

0 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
Ratch           14    18    39    55    32    66    37    90   597   649  1424
NightWare2      10    18    36    48    60    67    64    96   666   522  1659
Algo3            9    19    23    27    26    28    31    82   273   642   648
Algo4            9    18    22    24    23    29    26    67   223   477   530

1 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
Ratch           14    18    46    56    69    69    85   208   736  1806  1806
NightWare2      10    18    49    51    76    76    87   202   695  1708  1708
Algo3            9    20    23    28    28    28    32    83   272   643   647
Algo4            9    19    25    24    25    29    29    73   243   566   566

2 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
Ratch           14    18    40    58    69    69    85   208   736  1806  1806
NightWare2      10    18    36    56    76    76    87   202   695  1708  1708
Algo3            9    20    22    28    28    28    32    83   272   643   647
Algo4            9    19    19    25    25    29    29    73   243   566   566

3 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0     0
Ratch           14    18    47    57    69    70    78   208   695  1806  1797
NightWare2      10    14    37    56    76    69    87   202   669  1708  1765
Algo3            9    18    24    28    28    28    32    83   273   643   647
Algo4            9    16    21    25    25    28    29    73   222   566   552

Press enter to exit...


I guess "KISS" is the answer :P

Ratch

Jimg,

Quote
I guess "KISS" is the answer

     I don't think so.  Remember what I described before?  To be fair, all the programs to be compared must write to the same buffer so as to measure the cache clashes.  If you write to two separated buffers, you give an advantage to that program.  That is the one thing that can account for the 3 to 1 disparity of times.  Ratch