News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

Previous topic - Next topic

Ratch

 kunt0r,
Quote
I ran it through Olly with a whole lot of different strings and it always returned the proper length.

     It should.  It uses the same algo as most of the other GPR routines.  For instance, its decimal SUB constant 16843009 decimal converts to 010101010H, and its AND constant 2155905152 decimal converts to 080808080H.  It is sensitive to string alignment; 7575 ticks aligned, vs. 8856 ticks with 3 byte misalignment for a 10003 byte long string.  It has no provision to prevent over reading its memory.  Its code to search the last word for a zero byte is clunky, but that does not affect its speed too much, because to only executes it once.  It uses two instructions instead of RET DWORD, and it leaves its return address vulnerable on the stack for a possible overwrite.  It speed is comparable to my GPR routine, except my routine does not slow up for string misalignment on long strings.  Not that it matters much, but my code is also shorter.   Ratch

Brett Kuntz

ah I didn't know about "ret 4" doing the same thing as those two lines of code, that sped it up slightly.

Ratch

 kunt0r,

Quote
ah I didn't know about "ret 4" doing the same thing as those two lines of code, that sped it up slightly.

     Can't be much of a difference because it only gets executed once.  The search loop is where the subprogram it spends most of its time.  As I mentioned before,  it leaves its return address vulnerable to a possible overwrite.  One would expect something better from Borland. Ratch

Jimg,

     Did you use my latest STRLEN?  What was the switch was set for, the 8-bit or 7-bit seach?  Ratch

http://www.masmforum.com/simple/index.php?topic=2442.0

Brett Kuntz

(edit-Ratch, Borland didn't use that kind of stack stuff, I did it myself to shave a few cycles. Borlands actually has more code which checks for misaligned data with a test al, 3 line which leads to more code ect..I basically used Borlands as a basis and tried to make it faster then they did)

I updated my borland one and got these results:


Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
borland      9   11   10   10   11   12   15   21   49   61  87  129  200 774
szLength     8    8    9    9   12   15   17   24   48   63   89  131  202  779
Ratch        9   11   12   15   14   14   20   31   64   78  100  143  231  855
Jens_fast   20   20   19   20   21   26   29   36   58   69   99  145  219  923
roticvSSE    4   28   29   28   28   32   32   35   39   53   79  112  160  576
lszLenSSE   25   25   25   25   25   28   28   32   38   47   84  116  166  591
Jens_mmx2    6   31   31   32   37   43   48   55   78   45   93   60  124  288
BiteRider   27   26   26   26   26   28   29   32   37   46   82  112  161  569



    .686p
    .model flat, stdcall
    option casemap :none

    include \masm32\include\windows.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\kernel32.lib
    include \masm32\include\user32.inc
    includelib \masm32\lib\user32.lib

    include \masm32\kinc\strlen.inc
    include \masm32\kinc\timer.inc

.data
    align 4
    teststr db 999 dup("a"), 0
    msgstr db "Cycles: %d", 0
.data?
    time dd ?
    buff db 64 dup(?)
.code
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    counter_begin 1000000, REALTIME_PRIORITY_CLASS
    invoke strlen, addr teststr
    counter_end

    invoke wsprintf, addr buff, addr msgstr, eax
    invoke MessageBox, 0, addr buff, 0, 0
    invoke ExitProcess, 0

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start



    strlen proto :dword
.code
;##########################################################################
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

strlen proc xstr:dword

    mov eax, dword ptr [esp+4]
d1: mov edx, dword ptr [eax]
    mov ecx, edx
    add eax, 4
    sub edx, 16843009
    and edx, 2155905152
    jz d1
    not ecx
    and edx, ecx
    jz d1
    test dl, dl
    jnz d2
    test dh, dh
    jnz d3
    test edx, 16711680
    jnz d4
    jmp d5
d2: dec eax
d3: dec eax
d4: dec eax
d5: dec eax
    sub eax, dword ptr [esp+4]
    ret 4

strlen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;##########################################################################

hutch--

If anyone has the time to test this out, it would be appreciated, I needed a DWORD type strlen algo that had auto aligning code at its beginning so I added a front end to Agner Fog's strlen algo ad it appears to be working OK at the moment. Some rough benchmarking with a mixed set of samples from a few bytes to 16k shows this hybrid to be about 2.5 times faster than a classic byte scanner.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

slen2 proc item:DWORD

    push edi
    push esi

    mov eax, [esp+12]
    mov ecx, eax                ; copy EAX to ECX
    add ecx, 3                  ; align up by 4
    and ecx, -4       
    sub ecx, eax                ; calculate any misalignment in ecx
    mov esi, ecx                ; store ECX in ESI
    jz proceed

    sub eax, 1
  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0       ; scan for terminator for
    je quit                     ; up to the 1st 3 bytes
    sub ecx, 1
    jns @B
    jmp proceed

  quit:
    sub eax, [esp+12]           ; calculate length if terminator
    jmp outa_here               ; is found in 1st 3 bytes

  ; ----------------

  proceed:                      ; proceed with the rest
    lea edx, [eax+3]            ; pointer+3 used in the end
  align 4
  @@:     
    mov edi, [eax]              ; read first 4 bytes
    add eax, 4                  ; increment pointer
    lea ecx, [edi-01010101h]    ; subtract 1 from each byte
    not edi                     ; invert all bytes
    and ecx, edi                ; and these two
    and ecx, 80808080h   
    jz @B                       ; no zero bytes, continue loop
    test ecx, 00008080h         ; test first two bytes
    jnz @F
    shr ecx, 16                 ; not in the first 2 bytes
    add eax, 2
  @@:
    shl cl, 1                   ; use carry flag to avoid branch
    sbb eax, edx                ; compute length
    add eax, esi

  outa_here:
    pop esi
    pop edi

    ret 4

slen2 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

Codewarp wrote: :P

" Re: Improved STRLEN
« Reply #1 on: August 08, 2005, 10:40:48 pm »

Ratch,
Isn't this a rehash of what we already covered this in the szLen thread, ad nausium?  Also back in that thread, I suggested a super optimization that you are almost doing in your routine above. 

The idea is this: search 7-bit ascii, since it's faster than 8-bit.  But when you find a "zero", check it for bit7=1: 1=>resume the search for 8-bit ascii, 0=>you are done.  In other words use 7-bit search as far as you can take it, then ride the rest of the way using 8-bit search as needed.  In most text, the 8-bit part will never be needed, but when required, it covers 8-bit as well--the best of both worlds..."


I agree with bolded text and you can test the result (see timelen1.asm)
I improved my algos LingoSSE2 and  LingoMMX too.. (see timelen.asm)
Here is the results on my P4 3.6 GHz Prescott:



A. Timelen.asm -> test

Test routines for correctness:
0 byte misalignment
Borland   0 1 2 3 5 8 13 22 39 55 89 144 239 999
szLength  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Ratch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Hutch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_fast 0 1 2 3 5 8 13 22 39 55 89 144 239 999
lszLenSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
roticvSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Biterider 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_mmx2 0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoMMX  0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoSSE2 0 1 2 3 5 8 13 22 39 55 89 144 239 999

1 byte misalignment
Borland   0 1 2 3 5 8 13 22 39 55 89 144 239 999
szLength  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Ratch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Hutch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_fast 0 1 2 3 5 8 13 22 39 55 89 144 239 999
lszLenSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
roticvSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Biterider 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_mmx2 0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoMMX  0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoSSE2 0 1 2 3 5 8 13 22 39 55 89 144 239 999

2 byte misalignment
Borland   0 1 2 3 5 8 13 22 39 55 89 144 239 999
szLength  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Ratch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Hutch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_fast 0 1 2 3 5 8 13 22 39 55 89 144 239 999
lszLenSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
roticvSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Biterider 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_mmx2 0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoMMX  0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoSSE2 0 1 2 3 5 8 13 22 39 55 89 144 239 999

3 byte misalignment
Borland   0 1 2 3 5 8 13 22 39 55 89 144 239 999
szLength  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Ratch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Hutch     0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_fast 0 1 2 3 5 8 13 22 39 55 89 144 239 999
lszLenSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
roticvSSE 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Biterider 0 1 2 3 5 8 13 22 39 55 89 144 239 999
Jens_mmx2 0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoMMX  0 1 2 3 5 8 13 22 39 55 89 144 239 999
LingoSSE2 0 1 2 3 5 8 13 22 39 55 89 144 239 999


Proc/Byte  0  1  2 3  5  8  13 22  39  55  89 144 239  999
===========================================================

0 byte misalignment
Borland   19 20 20 20 22 25 32 42  70 122 167 232 344 1988
szLength  15 15 16 16 19 22 24 31  52  72 128 174 248  866
Ratch     12 15 18 20 19 21 24 35  59  82 140 182 284  993
Hutch     21 21 23 23 24 30 33 40  56  81 149 187 279  989
Jens_fast 19 19 19 19 20 23 43 80  88 101 126 168 252  981
lszLenSSE 28 29 28 28 28 32 32 37  52  64 101 200 289  925
roticvSSE  8 13 19 18 25 36 36 41  59  73 109 159 275  924
Biterider 14 15 15 15 14 25 23 37  49  61 102 219 272  922
Jens_mmx2  8 44 45 45 50 53 57 67  78  63  99  90 155  466
LingoMMX  14 14 14 14 14 24 24 36  47  66 100 117 143  402
LingoSSE2 14 14 14 14 14 14 14 25  39  46  69  98 111  301

1 byte misalignment
Borland   19 20 20 20 22 25 31 42  69 165 189 281 437 2535
szLength  19 19 19 23 23 27 29 35  50  63 131 176 256  882
Ratch     12 15 18 20 19 20 25 34  59  92 170 247 397 1465
Hutch     20 21 24 23 38 40 42 45  60  83 154 187 298 1002
Jens_fast 20 20 19 20 23 32 27 72  85 131 162 215 321 1362
lszLenSSE 28 28 28 28 28 32 32 36  51  76 118 226 331 1084
roticvSSE 10 14 16 18 24 58 60 64  83  97 122 168 308  978
Biterider 14 14 14 14 14 24 24 37  55  72  97 165 283  932
Jens_mmx2  8 12 17 21 31 77 84 95 119 100 131 153 183  490
LingoMMX  14 14 14 14 14 24 24 36  51  67 101 121 145  403
LingoSSE2 14 14 14 14 14 14 14 25  39  49  69  98 117  304

2 byte misalignment
Borland   19 20 20 20 22 26 43 42  69 121 192 284 455 2418
szLength  19 18 21 21 21 24 27 37  51  63 131 176 253  872
Ratch     12 15 19 21 20 20 37 36  59  81 166 247 401 1472
Hutch     19 21 20 22 33 36 41 51  60  80 159 212 321 1006
Jens_fast 19 19 19 19 20 27 62 77  85  99 156 209 328 1366
lszLenSSE 29 29 28 28 28 32 44 37  52  64 124 253 343 1091
roticvSSE  8 12 16 18 25 55 54 64  76  90 117 164 297  956
Biterider 14 14 15 14 15 24 24 44  56  73  97 160 284  936
Jens_mmx2  8 14 17 20 33 76 77 92 106  89 123 144 177  497
LingoMMX  13 15 15 15 14 24 24 38  56  69  97 118 145  403
LingoSSE2 14 14 14 14 14 14 14 25  39  50  74 101 117  301

3 byte misalignment
Borland   20 23 23 23 23 26 31 45  83 150 196 328 443 2381
szLength  20 21 21 22 24 24 31 37  92 104 137 176 254  872
Ratch     13 15 17 20 20 20 24 35  72  95 167 282 405 1438
Hutch     19 19 22 28 32 33 40 65  70  94 158 264 367  998
Jens_fast 19 19 19 19 20 27 59 86 117 128 150 198 301 1377
lszLenSSE 30 28 29 28 28 32 34 40  59  76 111 245 354 1128
roticvSSE  8 12 16 19 24 51 55 60  72  89 115 161 293  962
Biterider 14 14 15 15 24 24 37 44  55  73  97 156 283  935
Jens_mmx2  8 12 17 20 30 66 74 82 101  82 118 142 180  490
LingoMMX  13 14 14 14 14 24 25 41  57  71  97 117 146  402
LingoSSE2 14 14 14 14 14 14 14 25  39  46  69  98 117  301

Press enter to exit...


B. Timelen1.asm -> test

Test routines for correctness:
0 byte misalignment
RatchN  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Lingo32 0 1 2 3 5 8 13 22 39 55 89 144 239 999

1 byte misalignment
RatchN  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Lingo32 0 1 2 3 5 8 13 22 39 55 89 144 239 999

2 byte misalignment
RatchN  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Lingo32 0 1 2 3 5 8 13 22 39 55 89 144 239 999

3 byte misalignment
RatchN  0 1 2 3 5 8 13 22 39 55 89 144 239 999
Lingo32 0 1 2 3 5 8 13 22 39 55 89 144 239 999


Proc/Byte 0  1  2  3  5  8 13 22 39 55  89 144 239  999

========================================================

0 byte misalignment
RatchN   12 18 22 23 21 20 26 36 65 80 135 178 258 1392
Lingo32  10 14 17 19 18 17 20 28 39 55 111 141 198 1047

1 byte misalignment
RatchN    9 15 21 32 38 38 44 57 75 101 146 192 270 1513
Lingo32   9 14 18 22 27 27 31 32 46  53 113 146 203 1054

2 byte misalignment
RatchN    9 15 25 29 32 32 37 48 70 93 144 189 268 1493
Lingo32  10 15 18 20 26 26 27 33 46 54 113 147 206 1052

3 byte misalignment
RatchN    9 22 22 23 23 31 30 42 64 87 139 184 257 1411
Lingo32   9 15 17 19 17 28 28 34 47 55 112 148 200 1053

Press enter to exit...



Regards,
Lingo



[attachment deleted by admin]

Ratch

Lingo,
     Ok, let's talk about string length subroutines.  I am going to limit my remarks to GPR routines only.  Well written MMX/SSE code is going to beat the pants off any GPR implementation.  If MMX/SSE instructions did not, they would not exist. 

     The most important part of the subroutine is the core loop where most of the execution time is spent, especially on a long string.  The code before and after this core loop is either executed once, or only on special conditions such as aligning on a dword, locating the zero byte within the word and testing for a legitimate zero byte.  Let's define the core loops.

The following loop I will call the Agner Fog loop.  It is widely attributed to Agner, but I have my doubts.  I keep seeing it elsewhere, and I suspect it is old and was well known to computer science academics before Agner.  It is a 7 line loop. http://www.cl.cam.ac.uk/~am/progtricks.html


  .REPEAT                   ;searching string ....
    MOV EDX,[EAX]           ;next 4 byte gulp (DWORD)
    ADD EAX,DWORD           ;EAX=character pointer
    LEA ECX,[EDX-01010101H] ;propagate if byte is zero
    NOT EDX                 ;set up test pattern
    AND EDX,ECX             ;leftmost bit of zero byte should now be set
    AND EDX,080808080H      ;sieve out zero bytes
  .UNTIL !ZERO?             ;check the next DWORD
endif


     This I will call the Lingo loop.  It is a 7 line loop.


  .REPEAT
    LEA EDX,[ECX+0FEFEFEFFH]
    NOT ECX
    AND ECX,080808080H
    ADD EAX,4
    AND EDX,ECX
    MOV ECX,[EAX]
  .UNTIL !ZERO?             ;check the next DWORD


     Below is the Ratch8 loop. It is a 6 line loop, and used for 8-bit extended ASCII.


  .REPEAT
    MOV EDX,[EAX]           ;next 4 byte gulp (DWORD)
    AND EDX,07F7F7F7FH      ;mask out bit 8
    ADD EAX,DWORD           ;EAX=character pointer
    SUB EDX,01010101H       ;make those zero bytes shine
    AND EDX,ECX             ;sieve out zero bytes
  .UNTIL !ZERO?             ;check the next DWORD


     And finally is the Ratch7 loop.  It is a 5 line loop used for 7-bit ASCII.


  .REPEAT
    MOV EDX,[EAX]           ;next 4 byte gulp (DWORD)
    ADD EAX,DWORD           ;EAX=character pointer
    SUB EDX,01010101H       ;make those zero bytes shine
    AND EDX,ECX             ;sieve out zero bytes
  .UNTIL !ZERO?             ;check the next DWORD


     Theoretically, the Ratch7 loop will execute the fastest, and according to my timings, it does.  If 8-bit is needed, my timings show the Ratch8, Lingo, and Agner loops in a dead heat.  I do not know why I do not get faster speeds on Ratch8 since it only has 6 lines in the loop vs. 7 for Lingo and Agner.  It appears that timing is a tricky thing and I do not pretend to understand it.  I executed your timelen1 routine on my 1 ghz AMD Athlon and as you can see, the results are different than yours were.  By the way, you did not test my 7-bit version, and you transcribed the MOV ECX 80808080H in my 8-bit version to the wrong spot in timelen1.  Also you used an old version of my STRLEN in timelen.


Test routines for correctness:
0 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
1 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
2 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
3 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
RatchN      10   11   14   17   14   15   19   30   61   73   95  136  220 1029
Lingo32      8    8    9   10   12   13   15   22   53   60   80  120  179 1066

1 byte misalignment
RatchN       4    7   12   18   26   26   33   42   62   76  108  152  227 1050
Lingo32      9   13   17   17   18   20   23   28   58   66   91  136  199 1061

2 byte misalignment
RatchN       4    8   14   18   26   25   32   31   63   75  109  148  222 1048
Lingo32      9   12   14   14   16   18   24   28   57   66   85  120  184 1062

3 byte misalignment
RatchN       5   12   15   18   14   24   20   30   63   76   96  150  219 1028
Lingo32      8   12   14   14   14   17   19   24   53   65   86  124  186 1055

Press enter to exit...


Quote
Ratch,
Isn't this a rehash of what we already covered this in the szLen thread, ad nausium?  Also back in that thread, I suggested a super optimization that you are almost doing in your routine above. 

The idea is this: search 7-bit ascii, since it's faster than 8-bit.  But when you find a "zero", check it for bit7=1: 1=>resume the search for 8-bit ascii, 0=>you are done.  In other words use 7-bit search as far as you can take it, then ride the rest of the way using 8-bit search as needed.  In most text, the 8-bit part will never be needed, but when required, it covers 8-bit as well--the best of both worlds..."

I agree with bolded text and you can test the result (see timelen1.asm)

     If it was discussed before, I missed it.  There are presently 10 pages to this thread.  That's a lot of material to plow through.  I guess my contribution to that idea is a 6 line implementation, if that is any achievement.  Ratch

lingo

#142
Ratch, :lol

"The following loop I will call the Agner Fog loop.  It is widely attributed to Agner, but I have my doubts.  I keep seeing it
elsewhere, and I suspect it is old and was well known to computer science academics before Agner.  It is a 7 line loop"


Who cares about the first "author"?
We just use it

If we talk about "doubts" just for example
what about your "new" strlen algo and the similar very old
algo of the Paul Hsieh here:
http://www.azillionmonkeys.com/qed/asmexample.html

Let's compare them:
A. "Update!

While discussing sprite data copying (see next example) I realized that there is a significant improvement for 32-bit x86's that have

slow branching (P-IIs and Athlon.) "
; by Paul Hsieh

    lea     ecx,[ebx-1]
l1: inc     ecx
    test    ecx,3
    jnz     l3
l2: mov     edx,[ecx]        ; U
    mov     eax,07F7F7F7Fh   ;   V
    and     eax,edx          ; U
    add     ecx,4            ;   V
    add     eax,07F7F7F7Fh   ; U
    or      eax,edx          ; U
    and     eax,080808080h   ; U
    cmp     eax,080808080h   ; U
    je      l2               ;   V +1brt
    sub     ecx,4
l3: cmp     byte ptr [ecx],0
    jne     l1
    sub     ecx,ebx

B.  final version of the improved STRLEN by Ratch:
http://www.masmforum.com/simple/index.php?topic=2442.0

"OK, here is my final version of STRLEN, unless someone finds a bug.  It can now detect the obscure 8-bit byte 080H.  It does

this by checking the byte for 080H at the end of the subroutine, and returning to the beginning of the subroutine if that value is

detected. If a lot of 080H bytes are present in the string, a performance penalty will be incurred.  Ratch"

   
004097D1 8B 44 24 04      mov     eax,dword ptr [esp+4]
004097D5 B9 80 80 80 80    mov     ecx,80808080h
Labe_0:
004097DA A8 03            test      al,3
004097DC 74 08            je         004097E6
004097DE F6 00 FF          test      byte ptr [eax],0FFh
004097E1 74 26            je         00409809
Labe_Begin:
004097E3 40                inc       eax 
004097E4 EB F4            jmp      004097DA
;Labe_1:
;mov         ecx, 80808080h -> From 1st version
Labe_2: 
004097E6 8B 10            mov      edx,dword ptr [eax]
004097E8 81E27F7F7F7F and      edx,7F7F7F7Fh
004097EE 83 C0 04          add      eax,4
004097F1 81EA01010101 sub      edx,1010101h
004097F7 23 D1            and      edx,ecx
004097F9 74 EB            je         004097E6  ; Labe_2 

004097FB 83 E8 05          sub      eax,5

Labe_3:
004097FE 40                inc       eax 
004097FF C1 CA 08          ror       edx,8
00409802 73 FA            jae       004097FE  ; Labe_3
00409804 F6 00 FF          test      byte ptr [eax],0FFh
00409807 78 DA            js         004097E3  ; Labe_Begin

Labe_4:
00409809 2B 44 24 04      sub      eax,dword ptr [esp+4]
0040980D C2 04 00          ret       4



"Theoretically, the Ratch7 loop will execute the fastest,
and according to my timings, it does"


Ratch7 is the buliaNaza's algo:
and some time ago I improved it:
http://board.win32asmcommunity.net/index.php?topic=8330.msg77056#msg77056

So "my 7-bit algo" is improved "buliaNaza's algo" and it is faster then Ratch7(buliaNaza)
Pls,try to change inc eax with add eax,1 in your algo too...


"...and according to my timings, it does"
Pls, attach your test files (like me)    :lol

"By the way, you did not test my 7-bit version..."
Because I CAN'T...
It is the most important point in "my algo"
and it is the main reason for me to create the timelen1.asm file
It is the Codewarp's point of view too and I created timelen1.asm file
just as an answer to him rather then to "offend" your algo:
"In other words use 7-bit search as far as you can take it,
then ride the rest of the way using 8-bit search as needed.
In most text, the 8-bit part will never be needed, but when required,
it covers 8-bit as well--the best of both worlds..." by Codewarp


"My algo" starts the job with 7-bit code search part and if it failed
"automatically" switch to 8-bit code search part and tiil to end...

"Your algo" works with 7-bit code search OR with 8-bit code
search but can't switch "automatically" if we have "mixed" string with
7 AND and 8-bit simbols in it


...and you transcribed the MOV ECX 80808080H
in my 8-bit version to the wrong spot in timelen1 "



A. your 1st variant

STRLEN:                     ;it all begins here
  MOV EAX,[ESP+DWORD]       ;address of string

  .WHILE TRUE               ;check DWORD alignment

   TEST AL,DWORD-1          ;is DWORD aligned
   .BREAK .IF ZERO?         ;yes, DWORD aligned

   TEST BYTE PTR [EAX],0FFH ;not aligned, check  for zero byte
   JZ @F                    ;jmp if end of string

   INC EAX                  ;prepare to check next byte
  .ENDW                     ;around the horn

  MOV ECX,080808080H        ;sieve mask

  .REPEAT
.......

B. your 2nd variant

STRLEN:                     ;it all begins here
  MOV EAX,[ESP+DWORD]       ;address of string
  MOV ECX,80808080h
......


Ok, the technical error  is mine but it is not so
important here because it isn't in the main loop...
It is an obvious example how the macros "hide"
the pure code..


"Also you used an old version of my STRLEN in timelen."
I downloaded timelen and added
Biterider, Hutch, LingoMMX  and  LingoSSE2 algos ONLY!
I didn't touch your algo (with macros) there


"I guess my contribution to that idea is a 6 line implementation..."
I agree with you that it is your contribution and I used it
as an idea in my new timelen1.asm file...  :U

I preferred Codewarp's point of view about 5&7 line implementation
rather then 6 line implementation and just tried to proof it...

Right now I prefer new couple 5&6 line implementation (see my new timelen1.asm)
rather then 5&7 line implementation   :lol
   

In conclusion please:
- feel free to edit the test files and algos in your way
  and post them for us...(like me)  :lol
- use the pure code rather then macros (if you can)
- answer the question who will uses my or your GPR algos with
  5&7-5&6 or 6 line implementations if we have similar faster MMX/SSE/SSE2 algos  :lol
- try to optimize my new lingo32 algo  (if you can) (see my new timelen1.asm)   :lol

Here are new results:

Test routines for correctness:
0 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
1 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
2 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
3 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
RatchN      17   20   20   22   20   22   28   45   61   84  149  186  276 1546
Lingo32     12   15   16   20   17   21   21   31   43   59  128  148  226 1197

1 byte misalignment
RatchN      21   16   23   36   39   40   66   61   86  109  157  216  301 1650
Lingo32      9    15   18   23   24   28   27   39   49   67  128  179  234 1204

2 byte misalignment
RatchN      10   15   28   29   41   33   46   55   86   96  168  203  284 1676
Lingo32     12   15   20   22   26   29   32   32   51   55  125  174  235 1187

3 byte misalignment
RatchN       9   21   32   25   23   32   36   47   79   87  149  196  291 1550
Lingo32     10   17   18   19   18   28   26   31  43   54   125  157  225 1215

Press enter to exit...



Regards,
Lingo

[attachment deleted by admin]

Ratch

Lingo,

Quote
If we talk about "doubts" just for example
what about your "new" strlen algo and the similar very old
algo of the Paul Hsieh here:
http://www.azillionmonkeys.com/qed/asmexample.html

     The code you present from Paul Hsieh's site is about sprites (whatever they are).  It is a spaghetti code of two or more internal loops that have no resemblance to my search loop.

Quote
Ratch7 is the buliaNaza's algo:
and some time ago I improved it:
http://board.win32asmcommunity.net/index.php?topic=8330.msg77056#msg77056

So "my 7-bit algo" is improved "buliaNaza's algo" and it is faster then Ratch7(buliaNaza)
Pls,try to change inc eax with add eax,1 in your algo too...

     I believe your code referred to by the link was written as a 8-bit algo, because it returns to the loop if a zero is not found.  And it uses a LEA instruction instead of a SUB like my 7-bit algo does, so it is not quite the same.  They are both 5 line core loops, so they should show equal times.  By the way, any byte over 081H will kick your code out of the core loop and slow it up greatly. 

Quote
"By the way, you did not test my 7-bit version..."
Because I CAN'T...
It is the most important point in "my algo"
and it is the main reason for me to create the timelen1.asm file
It is the Codewarp's point of view too and I created timelen1.asm file
just as an answer to him rather then to "offend" your algo:

     If you could test my 8-bit version, I don't see why you cannot test my 7-bit version also.  But no matter. 

Quote
"In other words use 7-bit search as far as you can take it,
then ride the rest of the way using 8-bit search as needed.
In most text, the 8-bit part will never be needed, but when required,
it covers 8-bit as well--the best of both worlds..." by Codewarp

"My algo" starts the job with 7-bit code search part and if it failed
"automatically" switch to 8-bit code search part and tiil to end...
"In other words use 7-bit search as far as you can take it,
then ride the rest of the way using 8-bit search as needed.
In most text, the 8-bit part will never be needed, but when required,
it covers 8-bit as well--the best of both worlds..." by Codewarp

"Your algo" works with 7-bit code search OR with 8-bit code
search but can't switch "automatically" if we have "mixed" string with
7 AND and 8-bit simbols in it

     If you are referring to your code in timelen1, that appears to be a 8-bit search code period.  My 7-bit code expects all characters to be 7-bit ASCII.  If they are not, an error in counting will occur.

Quote
"Your algo" works with 7-bit code search OR with 8-bit code
search but can't switch "automatically" if we have "mixed" string with
7 AND and 8-bit simbols in it

     That is absolutely wrong.  Eight bit code by definition can be from value 0 to 0FFH.  My program only goes out of the loop and returns when the value is 080H, which should be rare.  It can evaluate all the 8-bit values in any order.

Quote
Ok, the technical error  is mine but it is not so
important here because it isn't in the main loop...
It is an obvious example how the macros "hide"
the pure code..

     It becomes important only if there are a lot of 080H bytes in the string.

Quote
In conclusion please:
- feel free to edit the test files and algos in your way
  and post them for us...(like me) 
- use the pure code rather then macros (if you can)
- answer the question who will uses my or your GPR algos with
  5&7-5&6 or 6 line implementations if we have similar faster MMX/SSE/SSE2 algos 
- try to optimize my new lingo32 algo  (if you can) (see my new timelen1.asm)   

     Below are the results of the most recent timelin1.ece run on my machine.  I don't know why your algo shows a little better time than mine, because they both use a 6 line loop.  The difference is not as great as your machine, however.  As I said previously, there is something about timing that is mysterious.  You refer to my using macros.  If you mean .REPEAT, .WHILE, etc., those are not macros.  They are built in directives of MASM, but if they confuse you, I will translate them to jumps instead.  The GPR implementations of STRLENs are only good for old CPUs that do not have MMX/SSE instructions, or if MMX is not available for some reason.  Correct me if I am wrong, but doesn't MMX use the same registers as the FPU?  Ratch


Test routines for correctness:
0 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
1 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
2 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999
3 byte misalignment
RatchN       0    1    2    3    5    8   13   22   39   55   89  144  239  999
Lingo32      0    1    2    3    5    8   13   22   39   55   89  144  239  999

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
RatchN      10   11   13   18   15   16   18   29   61   81   95  136  221 1000
Lingo32      7    7   10   13   12   12   15   22   48   56   77  118  175  883

1 byte misalignment
RatchN       5   10   15   17   23   24   29   39   60   72  106  145  216 1025
Lingo32      9   12   17   17   19   19   23   30   57   68   86  124  187  896

2 byte misalignment
RatchN       5   10   14   17   24   23   30   31   60   74  105  147  218 1021
Lingo32      9   13   13   13   19   18   22   23   54   66   93  123  187  919

3 byte misalignment
RatchN       5   12   14   17   15   24   21   29   62   74   96  148  220  999
Lingo32      8   11   12   13   13   16   19   23   53   65   86  121  185  900

Press enter to exit...


hutch--

Guys,

Let me ask you this question, I know Lingo is developing on a very late PIV, what box are you using Ratch ? I ask the question because you bot appear not to agree on the benchmarking results and it just may be AMD/Intel differences.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

Ratch,

To be honest I can't understand what is your problem
and what you expect from me...

Hutch,
" it just may be AMD/Intel differences.

f0dder was so kind to test the same files
on his box with  Socket 939 AMD64 3500+,
TWINX1024-3200C2 DDR-400 (2x512MB),
MSI K8N-NEO4-Platinum nForce4 chipset.

http://board.win32asmcommunity.net/index.php?topic=21565.0


Regards,
Lingo

Ratch

Lingo,

Quote
Ratch,

To be honest I can't understand what is your problem
and what you expect from me...

     Why do you preceive there to be a problem between us?  You first made a comment about my reply to Codewarp, which I answered.  Then you answered back and so on.  I do not expect anything from you.

Hutch,

Quote
Guys,

Let me ask you this question, I know Lingo is developing on a very late PIV, what box are you using Ratch ? I ask the question because you bot appear not to agree on the benchmarking results and it just may be AMD/Intel differences.

     Both our algos use a 6 line core loop.  I was wondering why there was so much discrepancy between the results when run on his and my machines, especially on long strings.  This is true even when I run Lingo's timelen1.exe test on my machine.  I fear there is something about timing that none of us know about, or worse, nothing we can do even if we did know.  I use a 1 Ghz AMD Athlon which is a few years old now.  But no matter how fast or slow the machine, the same length of loop should be somewhat the same timing.  Any comments or speculation would be appreciated.  Ratch

Jimg

Ratch, Lingo-

I love this stuff myself, but it has become painfully obvious that athlons and pentiums do not work the same.  You can optimize for one or the other.  I have an athlon myself, and I can optimize all day to save a few cycles based on some carefully selected instructions or strategically placed nop's, and a pentium user would find that the new code runs slower than the old.  It's really no use to argue with each other if we're not using the same cpu, just present your best and let the rest of us pick whichever we want.

That being said, I do think it's really important to have general purpose routines.  There are very few cpus left which don't have mmx capability, but the number that can do sse2 is still a small percentage of the total.  SSE2 is fun and very appealing, but I certainly wouldn't use it without checking if the cpu was capable first.  And checking if the cpu is capable takes longer than any savings in the code itself on these little gp routines.

Also, I agree, I really think our timing assumptions are seriously flawed.  Unfortunately, it's all we have.  There needs to be some way to find the real world performance of a routine as it is normally used.  We seldom call any of these routines over and over a million times in a tight loop.  The real timing test is how long does your code take when it is called just once and is not in the cache.  Does it take longer to execute a routine the first time because it is long and complicated than the time it saves by being tricky?  How do we test such a thing?  I've been playing with calling each routine once in sequence, then repeating the sequence and averaging the results, but I can't get any consistency at all.  Anyone have any ideas?

But all together, most of us are enjoying the competition, just keep it civil and keep those great ideas coming!

Mark Jones

Ratch, if I may butt in with my two cents, I've heard it mentioned on numerous occasions that trying to get a quantitive, definite value on CPU timings is a sure-fire way to drive yourself insane. This is due to the fact that virtually all processor architectures are different - some use more "pipelines" for concurrently executing or staging (or optimizing) instructions, others have a bigger/smaller L2 cache, some physically handle ALU/MMX/SSE differently (or not at all...) yadda, yadda, yadda. It's like measuring harmonic cabin vibrations of various models of Chrysler vehicles... :bg

The easiest solution is as Michael had said, simply time the proc on as many hardware(s) as you can, and just accept the results. :)

Of course if you're looking to optimize the code for a specific processor, check out Agner Fog's and Michael's optimization guides. Maybe if you compared Intel datasheets with AMD datasheets you might find a minimum and maximum execution time for a specific instruction, but even so, cache and optimization(s) are going to skew the actual timings. Keep It Simple. :)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Ratch

Jimg

Quote
It's really no use to argue with each other if we're not using the same cpu, just present your best and let the rest of us pick whichever we want.....

Also, I agree, I really think our timing assumptions are seriously flawed.  Unfortunately, it's all we have....


    Thanks Jimg, your observations are the better than both Lingo's and mine put together.  Ratch