News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Reverse string

Started by lostcauz, October 01, 2005, 05:19:45 PM

Previous topic - Next topic

Ratch

lingo,
     I find your speed tests interesting, but suspicious.  Does it make sense that your program using the same loop algo, with 2 read/write stalls, and 2 instructions with double register indexing, would be almost twice as fast as NightWare's routine and mine?  Does that pass the smell test?  Not to belittle your efforts, but your routine does not reverse the same buffer either.  Ratch

lingo

If you don't understand  the usage of the attached testing program
will be better to ask the author: MichaelW
I like his program and just use it...
Unfortunately I have no enough free time to open your eyes..

Ratch

lingo,

Quote
If you don't understand  the usage of the attached testing program
will be better to ask the author: MichaelW

     I understand how to use the program.  That's not the issue.  I will even concede that the test is implemented correctly.  I just do not understand the results.  Our code is similar enough so as to report approximately equal times.

Quote
I like his program and just use it...

     That's irrelevant with respect to the significantly different results.

Quote
Unfortunately I have no enough free time to open your eyes..

     At  least concede that my point is valid, even if you do not know the reason for the descrepancy.  Ratch

NightWare

here the result on my old celeron 700 mghz :

Please terminate any high-priority tasks and press ENTER to begin.


Input string ->abcdefghijklmnopqrstuvwxyzabc

RevStr - lostcauz  : 55 clocks. Result: cbazyxwvutsrqponmlkjihgfedcba
RevStr - lingo     : 28 clocks. Result: cbazyxwvutsrqponmlkjihgfedcba
RevStr - Ratch     : 59 clocks. Result: cbazyxwvutsrqponmlkjihgfedcba
RevStr - NightWare : 46 clocks. Result: cbazyxwvutsrqponmlkjihgfedcba

Press ENTER to exit...


:eek cool, i pass from the last position to the second...  :P seriouly, i don't know you guys, but i already had some strange result with the speed test macros of MichealW, in my case, i'm not an ultrafast code generator, i'm just playing with coding technics... and like you i use these macros, coz there's nothing better (without downgrading the os to win98, and i don't want to do that)... but it's true there is strange results sometime... an example : if you change the order of the routines, the results change, sometimes with big differencies... and i don't know why (the only difference is the length of the jumps)
these problems appear both on my Celeron and my P4 (it worst on P4). i'd like to know the reason... anyone know why ?

note : don't use FastReverseString, coz there's a bug with the small string algo every 8 multiples +5...


Ratch

Nightware,
Quote
if you change the order of the routines, the results change, sometimes with big differencies... and i don't know why (the only difference is the length of the jumps)
these problems appear both on my Celeron and my P4 (it worst on P4). i'd like to know the reason... anyone know why ?

     Yes, after a flurry timing runs for STRLEN, I gave up on code timing, because I too got inconsistent and/or inexplainable results.  I have had substantial timing changes occur after rebooting, running the identical run on two different days, or just by adding an innocuous instruction somplace.  So now I just code by implementating the best algo the best way I can.  If I see large variations between runs using of the same basic code, I dismiss the results unless they can be explained.  See pages 10 and 11 of this link Ratch

http://www.masmforum.com/simple/index.php?topic=1807.135

hutch--

Ratch,

The method I use to get around timing variations is to use a reference algo that I am clocking others against. For what its worth I tend to get them running at very similar times from day to day but of course this can vary with workload on the box.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

NightWare

ratch, hutch,

thanks for your answers

ratch, it seams you're right, the comparison of my old code can't work correctly  :'( , apologies, but your correction don't works anyway... :green  it's strange i've obtained some result with it...

concerning lingo's algo, i think he has already prooved in different test he's good... so i'm not surpise his code is faster, i've made some tests with different length, and lingo's algo is always the winner, aproximatively 2 time faster, for the others algos, it's the dance  :dance:, every algo is 2nd, 3rd or 4th... it depends of the size of the string

concerning my algos, here a very small speedup (it should, coz there is 1+ instruction less, but i don't see the difference...  :toothy ):


AnotherStringReverse PROC StrAddr:DWORD,StrSize:DWORD
ALIGN 16
;db 0cch ???
push ebx
push esi
push edi

mov esi,StrAddr
mov eax,StrSize
lea edi,[esi+eax] ; if you use SIZEOF => lea edi,[esi+eax-1]
mov ebx,eax
and ebx,11111111111111111111111111111000b
; test ebx,ebx ; no dword ? (useless coz the previous and fix the zero flag)
jz Label2
and eax,00000000000000000000000000000111b

Label1: sub edi,4 ; big string algo
mov edx,DWORD PTR [edi]
bswap edx
mov ecx,DWORD PTR [esi]
bswap ecx
mov DWORD PTR [edi],ecx
mov DWORD PTR [esi],edx
add esi,4
sub ebx,8
jnz Label1

Label2: shr eax,1 ; to eliminate impair values
; test eax,eax ; no byte ? (useless coz the previous and fix the zero flag)
jz Label4

Label3: sub edi,1 ; small string algo
mov dh,BYTE PTR [edi]
mov dl,BYTE PTR [esi]
mov BYTE PTR [edi],dl
mov BYTE PTR [esi],dh
add esi,1
sub eax,1
jnz Label3
Label4:
pop edi
pop esi
pop ebx
ret
AnotherStringReverse ENDP


and to finish another approch, there is 1 instrution less in the loops (so it should be a bit faster (in theory  :boohoo: ) on very big strings) :


FastReverseString PROC StrAddr:DWORD,StrSize:DWORD
ALIGN 16
push ebx
push esi
push edi

mov esi,StrAddr
mov eax,StrSize
lea edi,[esi+eax-4] ; ) put the middle of the string in edi
shr eax,1 ; )
sub edi,eax ; )

mov ebx,eax
and ebx,11111111111111111111111111111100b
; test ebx,ebx ; no dword ? (useless coz the previous and fix the zero flag)
jz Label2

and eax,00000000000000000000000000000011b
add edi,eax ; necessary shift

Label1: mov edx,DWORD PTR [edi+ebx] ; big string algo
bswap edx
mov ecx,DWORD PTR [esi]
bswap ecx
mov DWORD PTR [edi+ebx],ecx
mov DWORD PTR [esi],edx
add esi,4
sub ebx,4
jnz Label1

test eax,eax
jz Label3
sub edi,eax ; cancel the shift

Label2: mov dh,BYTE PTR [edi+eax+3] ; small string algo
mov dl,BYTE PTR [esi]
mov BYTE PTR [edi+eax+3],dl
mov BYTE PTR [esi],dh
add esi,1
sub eax,1
jnz Label2
Label3:
pop edi
pop esi
pop ebx
ret
FastReverseString ENDP


i think i can't go beyond... of course it's possible to unroll the last loop, and win 1 to 2/3 instructions... but it takes to much memory for an insignifiant speedup...

lingo, there is just a small change to made with the implementation, just change the :
lenstr4 EQU $-str4-1 to lenstr4 EQU $-str4
lenstr4 EQU $-str5-1 to lenstr4 EQU $-str5
coz the result of REVBSTR and AnotherReverseString appear wrong if you change the length of the string

Ratch

Nightware,
Quote
ratch, it seams you're right, the comparison of my old code can't work correctly   , apologies, but your correction don't works anyway...   it's strange i've obtained some result with it...

     I beg to differ.  I tested my patch for your code, and it does work.

Quote
concerning lingo's algo, i think he has already prooved in different test he's good...so i'm not surpise his code is faster

     That's no criteria for faster code.  I consider myself a good coder also.  Doesn't mean my code is faster, does it?

Quote
so i'm not surpise his code is faster, i've made some tests with different length, and lingo's algo is always the winner, aproximatively 2 time faster, for the others algos, it's the dance  , every algo is 2nd, 3rd or 4th... it depends of the size of the string

     You SHOULD be surprised, since he uses the same basic algorithm as the rest of us, especially in the main loop.  I think you are saying that you get erratic results.  That should make you suspect your test is failing to measure what you want.  All erratic results need to be explained and validated in order to inspire confidence.

Quote
concerning my algos, here a very small speedup (it should, coz there is 1+ instruction less, but i don't see the difference...   

     It depends where the instruction is. Generally speaking, if it is in the main loop, that should speed things up.  If it is outside the loop, then it won't make much difference.  If you can't tell the difference in your test, be suspicious, be very suspicious of your test.

     Your latest AnotherStringReverse works correctly, but it has 3 read/write register stalls in the loop.  Your latest FastReverseString errors out if the string is 1 character long.  Ratch

   

NightWare

Quote from: Ratch on October 09, 2005, 02:54:26 AM
I beg to differ.  I tested my patch for your code, and it does work.

i'd made test with different size, and it didn't work... but to be honest we you, i have the bad tendancy to change codes (to optimize it) before testing... so maybe you're right again...  :red

Quote from: Ratch on October 09, 2005, 02:54:26 AM
That's no criteria for faster code.  I consider myself a good coder also.  Doesn't mean my code is faster, does it?

no, of course, but you must admit his routines are fast... even if there is strange result with speed macros sometimes, the results must be close to reality... i make my codes on my old celeron, and believe it or not, but i don't need clocks cycles to see the difference...

Quote from: Ratch on October 09, 2005, 02:54:26 AM
You SHOULD be surprised, since he uses the same basic algorithm as the rest of us, especially in the main loop.  I think you are saying that you get erratic results.  That should make you suspect your test is failing to measure what you want.  All erratic results need to be explained and validated in order to inspire confidence.

if you ask me, there is some reasons to explain why his code is faster... firstly he doesn't have to deal with the last bytes as we have to... so it's faster (outside the main loop) and he unroll the last loop (a small speed up again). secondly in the main loop he makes addition in every case, it's not the case for us, we alternate add and sub (it's the price we have to paid when we work directly on the string...) and if i remember well sub is a bit slower...

the thing i don't understand is why he uses :
cmp
mov
jcc

instead of the classic :
mov
cmp
jcc

Quote from: Ratch on October 09, 2005, 02:54:26 AM
Your latest FastReverseString errors out if the string is 1 character long.  Ratch

nice, i haven't saw it !

Ratch

#39
NightWare,

Quote
no, of course, but you must admit his routines are fast... even if there is strange result with speed macros sometimes, the results must be close to reality... i make my codes on my old celeron, and believe it or not, but i don't need clocks cycles to see the difference...

     All of our routines are fast.  There is fast and then there is faster.  I don't believe you can tell the difference unless you have a humongous large string.

Quote
if you ask me, there is some reasons to explain why his code is faster... firstly he doesn't have to deal with the last bytes as we have to... so it's faster (outside the main loop) and he unroll the last loop (a small speed up again).

     Last bytes are processed outside the main loop.  Any improvements made outside the main loop are insignificant.  My program has only one loop.

Quote
secondly in the main loop he makes addition in every case, it's not the case for us, we alternate add and sub (it's the price we have to paid when we work directly on the string...) and if i remember well sub is a bit slower...

     I think you remember wrong.  Why would subtraction be slower?  Can you document that?  His loop also has 1 register stall.

Quote
the thing i don't understand is why he uses :
cmp
mov
jcc

instead of the classic :
mov
cmp
jcc

     MOVs do not affect the status flags, so it does not matter which of the above code sequences you use.  Unless you need to use a particular sequence prevent a register stall.  Notice in his main loop, there are 3 MOVs and 2 BSWAPs between the ADD and the conditional jump. His ECX register indexes the DWORD and also determines the loop duration.

     I made some minor changes to my routine.  Here is the latest.  Ratch


;*******************************************************************************
; REVBSTR:   Reverses the bytes of an ASCII string.                            *
;                                                                              *
; Called by: PUSH <string length>                                              *
;            PUSH <address of string>                                          *
;            CALL REVBSTR                                                      *
;                                                                              *
; Returns:   Nothing                                                           *
;                                                                              *
; Notes:     This subroutine conforms with WIN32 conventions regarding         *
;            register preservation and stack balancing.                        *
;                                                                              *
; Coder:     Ratch                                                             *
;*******************************************************************************
REVBS$ STRUC
  RETURN    DWORD ?
RV1$ = $
  SADR      DWORD ?
  STRLEN    DWORD ?
RV2$ = $
REVBS$ ENDS

RVBS$ EQU ESP.REVBS$            ;save some typing

REVBSTR:                        ;it all begins here
PUSH ESI                       ;save register
PUSH EDI                       ;save register
MOV EAX,[RVBS$.STRLEN+2*DWORD] ;string length
MOV ESI,[RVBS$.SADR+2*DWORD]   ;string start address
SHRD EDX,EAX,3                 ;mod 8
PUSH EBX                       ;save register
LEA EDI,[ESI+EAX-DWORD]        ;string end address - DWORD
SHR EDX,32-3                   ;EDX = remainder
SHR EAX,3                      ;divide by 8

.IF EDX > 3
   INC EAX                      ;increase loop count
   XOR EDX,EDX                  ;remainder is set to zero
.ENDIF

.IF EAX
   .REPEAT                        ;8 bytes moved each time through loop
     MOV ECX,[ESI]                ;pick up 4 bytes
     MOV EBX,[EDI]                ;pick up 4 bytes
     BSWAP ECX
     BSWAP EBX
     MOV [EDI],ECX                ;put down 4 bytes
     MOV [ESI],EBX                ;put down 4 bytes
     SUB EDI,DWORD                ;move string right end pointer
     ADD ESI,DWORD                ;move string left end pointer
     DEC EAX                      ;loop counter
   .UNTIL ZERO?
.ENDIF

.IF EDX > 1                    ;if remainder is 2 or 3
   MOV CL,BYTE PTR [ESI]        ;pick up byte
   MOV BL,BYTE PTR [EDI+DWORD-BYTE] ;pick up byte
   MOV BYTE PTR [EDI+DWORD-BYTE],CL ;put down byte
   MOV BYTE PTR [ESI],BL        ;put down byte
.ENDIF

POP ESI                        ;restore register
POP EDI                        ;restore register
POP EBX                        ;restore register
RET RV2$-RV1$                  ;return home

Jimg

#40
I played around with these routines a bit.

NightWare  ( and others) -

To get consistant timing results, you have to run all the routines under test with identical conditions.  This means executing from the exact same location in memory, both the calling routine and the routine being tested.  We must also use the exact same input string, and output to the exact same output address.  Since some of the routines here worked in place, I could not accomplish the last of these restrictions, but we can come close enough.

To make the routines have the same call, I had to slightly modify a few.  I settled on making all of them with no proc parameters, and placing the offsets to the input and output strings and the length of the string in variables.

So the procedure is:

1.   copy the routine to a common location so all routines run from the same memory.
2.   create the input string before running each routine to be sure the input is correct.
3.   copy the output string to the input string for routines that work in place.
3.   call the rountine.
4.   verify the output.


By doing this, we can get good consistant timings with only 10 iterations.  There is no need to run a million times.  We simply select the smallest execution time from the 10 trys.  We can also test with various string lengths, and with different byte misalignments.

Here are my results:
Str1 Size        1     9    18    27    28    29    36   100   400  1000
============ ===== ===== ===== ===== ===== ===== ===== ===== ===== =====

0 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0
lostcauz32      20    18    26    34    26    33    32    89   314   764
Anonymous32      9    20    24    28    27    28    31    82   268   642
Ratch           14    18    39    55    32    66    37    90   274   649
NightWare       11    19    46    62    37    54    42    93   272   647

1 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0
lostcauz32      20    18    26    35    26    33    32    89   315   765
Anonymous32      9    20    24    27    33    33    36    88   282   693
Ratch           14    18    53    58    69    69    85   208   736  1806
NightWare       11    19    47    53    65    66    81   192   692  1735

2 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0
lostcauz32      20    18    26    35    28    34    34    91   318   768
Anonymous32      9    20    22    33    33    33    36    88   282   693
Ratch           14    18    40    54    69    69    85   208   736  1806
NightWare       11    19    39    55    65    66    81   192   685  1735

3 byte misalignment
OVH            467     0     0     0     0     0     0     0     0     0
lostcauz32      20    18    26    35    28    34    34    91   318   768
Anonymous32      9    17    24    33    33    27    36    88   282   693
Ratch           14    18    47    58    69    70    85   208   736  1806
NightWare       11    16    45    55    67    52    81   192   685  1735

Press enter to exit...


The OVH routine is an empty routine, just to get the overhead of setting up the input strings and calling the routines.  This overhead is subtracted from all other results.

Warning:  This was done on an Athlon, Pentiums will give different results.


[attachment deleted by admin]

Ratch

#41
Jimg,
     Quite a job you did.  I also tried some timing.  I made lingo's and my subroutine flip the alphabet 100,000,000 times.  It took about 6 seconds for my subroutine to do it and about 3 sec for lingo's subroutine.  That is hardly worth worrying about, but it makes a good argument.

Quote
We must also use the exact same input string, and output to the exact same output address.  Since some of the routines here worked in place, I could not accomplish the last of these restrictions, but we can come close enough.

     Well, when I set my EDI output pointer to a different buffer in a .DATA? segment as lingo does, then my times were comparable to his, as can be expected.  Therefore to be fair, you have to make lingo's subroutine write in the same source buffer, as the other programs do.  I think this causes a "cache clash", because when reads and writes are very close to each other in memory, the CPU has to stall until the write is complete before reading again.  My AMD Athlon has a moderate L1 cache size, but a 64 byte cache line.  I think that means it can isolate 64 bytes within its cache buffer.  If it has to read/write within that limit, a clash can be expected.  Now I realize that the output is going to be all mixed up because his routine is not written for a single buffer.  My output was hashed also when I changed my destination to a different buffer. However, both programs will do the same work as before.

     I tried to assemble your test program, but I failed because I was missing a library or macro.  Maybe you can try it again using my suggestions.  Ratch

lingo

Hutch,
I'm wondering why you tolerate such offences towards me
"To make the routines have the same call, I had to slightly modify a few "
and
"The worst modification was in Lingo's routine,"
and
"I used esi and edi for this, but had to do an extra add and subtract as some of the instructions"
and my proc from:
comment * ReverseStr by Lingo *
align 16
db  0A4h, 24h, 0, 0, 0, 0 , 8Dh, 9Bh,  0, 0, 0, 0,90
Lingo32 proc
mov ecx, sizeof str1-1
xor eax, eax
cmp ecx, 4
mov byte ptr [str2+ecx], al
jl Bytes3
push ebx
@@:
mov edx, dword ptr [str1+eax]
add eax, 4
add ecx, -8
mov ebx, dword ptr [str1+eax+ecx]
bswap   edx
mov dword ptr [str2+eax+ecx], edx
bswap   ebx
mov dword ptr [str2+eax-4], ebx
jg @b
pop ebx
ret

become garbage:
comment * ReverseStr by Lingo *   
align   16
db  13 dup (0)    ;0A4h, 24h, 0, 0, 0, 0 , 8Dh, 9Bh,  0, 0, 0, 0,90
Lingo32 proc
    push esi
    push edi
    mov esi,Str1Offset
    mov edi,Str2Offset
    ;mov  ecx, sizeof Str1-1
    mov   ecx, Str1Size
    xor   eax, eax
    cmp   ecx, 4
    mov   byte ptr [edi+ecx], al   ;[str2+ecx]
    jl    Bytes3
    push  ebx
@@:
    mov     edx, dword ptr [esi+eax] ;[str1+eax] 
    add     eax, 4
    add     ecx, -8
   
    add     ecx,eax ; temporarily compute eax+ecx

    mov    ebx, dword ptr [esi+ecx] ;[str1+eax+ecx]
    bswap  edx
    mov    dword ptr [edi+ecx], edx ;[str2+eax+ecx]

    sub     ecx,eax ; restore ecx
   
    bswap   ebx
    mov     dword ptr [edi+eax-4], ebx ;[str2+eax-4]
    jg      @b   ; from the add ecx,-8 instruction
    pop     ebx
    pop     edi
    pop     esi
    ret

and this garbage continue to have my nickname
I don't know what to do; to laugh or to cry....

Regards,
Lingo


Jimg

Lingo-

You have my sincerest appology.  I had no idea you would be so offended by this.  The basic algorythm is still yours, I only had to change a few instructions to fit the scheme to make the comparison.  And you STILL are the quickest.  Is it that you no longer want your name associated with the routine, or want some disclaimer with the code, or what.  Please let us know.

lingo

"Is it that you no longer want your name associated with the routine "

Exactly, pls fill free to change and use what you want but delete my nickname from YOUR job