
szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM


roticv

I think it is wrong to call donkey's routine an MMX routine: although it uses MMX registers, pmovmskb is an SSE opcode, not part of the original MMX instruction set. Older processors without SSE would therefore be unable to use the routine.

Phil

roticv: Thanks for pointing out that it does require SSE. I hadn't realized that until I started reading about the instructions to figure out how the routine worked, and then I failed to mention it in the commentary. It would also have been nice if I had posted the info link, since it clearly identifies pmovmskb as an SSE instruction.

Codewarp

Now that the contest for the main loop seems to be decided in favor of
the aligned dword search implementation, how about a little different
spin on the code around it--namely aligning the search pointer, and
locating the zero in the last dword.

The idea (as always) is to do things in the biggest chunks possible
and to eliminate 8-bit operations completely from the code.  The cpu does
not like it when you switch between 8/32-bit modes in her registers, and
will take it out on your clock counts without telling you.

So, to align the search, we back up to the dword we are starting in,
load the dword, stuff 1's into the leading bytes to skip over, then drop
into the middle of the normal dword search and continue.  It is faster than
the byte-search method when three bytes must be skipped, though likely slower
for a 1-byte skip.

To locate the zero in the last dword, we handle the upper and lower halves
of the dword separately, computing the final lengths with straight 32-bit
operations directly.  This new method seems to have lower clocks than any
other method I have seen.

Incidentally, I should point out that strlen( ) is a special case of memchr( ),
which searches memory for any character, not just zero.  This work on strlen( )
can easily be extended into a fast memchr( ) implementation.
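The dword test these routines are built on, and its memchr extension, can be written out in portable C. A sketch for illustration (little-endian assumed for the index helper; these helpers are not anyone's posted code):

```c
#include <assert.h>
#include <stdint.h>

/* (v - 01010101h) & ~v & 80808080h: for each byte b, (b - 1) & ~b has
   bit 7 set exactly when b == 0.  Bytes above the first zero can produce
   stray bits from borrow propagation, but no bit is ever set when the
   dword contains no zero byte, and the lowest set bit always marks the
   first zero. */
static int has_zero_byte(uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* memchr reduces to the same scan: XOR with the wanted byte broadcast
   into all four lanes turns matching bytes into zero bytes. */
static int has_byte(uint32_t v, unsigned char c) {
    return has_zero_byte(v ^ (c * 0x01010101u));
}

/* Index (0..3) of the first zero byte, little-endian, given one exists. */
static int first_zero_index(uint32_t v) {
    uint32_t m = (v - 0x01010101u) & ~v & 0x80808080u;
    int i = 0;
    while (!(m & 0x80u)) { m >>= 8; i++; }
    return i;
}
```

The asm versions locate the byte with carry tricks (shr/shl plus sbb) instead of the loop in first_zero_index, but the mask they branch on is the same.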

Please forgive the C++ decoration, I use this routine as a direct
replacement for strlen( ) in my C++ work--enjoy...


// search to find length of a null-terminated string
// fast performance: strings of 1/10/100/1000 bytes require 7/13/103/776 cycles (Athlon64)
__declspec( naked ) int szLength (const void* src) {
    _asm {
            mov     edx, [esp + 4]          ; point edx to start of string
            test    edx, 3
            jz      lenscan                 ; branch ahead if already aligned
            mov     ecx, edx
            and     edx, ~3                 ; edx points to aligned addr
            and     ecx, 3                  ; ecx = bytes to skip
            shl     ecx, 3                  ; ecx = bits to skip
            mov     eax, 1
            shl     eax, cl                 ; put cl 1's into eax from the bottom up
            sub     eax, 1
            or      eax, [edx]              ; combine with first four bytes
            jmp     resume                  ; catch up with the aligned search...

            align   4
lenscan:    mov     eax, [edx]              ; load next four bytes into eax
resume:     add     edx, 4                  ; advance ptr
            lea     ecx, [eax - 1010101h]
            not     eax                     ; for each byte in ecx: (charval-1) & ~charval & 80h
            and     eax, ecx
            and     eax, 80808080h
            jz      lenscan                 ; repeat while no zeros found

            sub     edx, [esp + 4]          ; subtract the base address
            test    eax, 8080h              ; test first two bytes
            jz      upper2                  ; jmp if not found in the lower 2 bytes
            shr     eax, 8                  ; set carry from bit7 of 1st byte
            sbb     edx, 3                  ; edx = (edx-4) + (1-carry)
            mov     eax, edx                ; return as the result
            ret

upper2:     shl     eax, 9                  ; set carry from bit7 of 3rd byte
            sbb     edx, 1                  ; edx = (edx-2) + (1-carry)
            mov     eax, edx                ; return as the result
            ret
    }
}


Phil

Codewarp: I haven't taken the time yet to have a close look at your algorithm, but I stripped the headings and added it to the test set. I also removed the other tests that are not listed here. The FStrLen figures may not be a fair comparison because it only handles 7-bit ASCII ... Ratch and lszLenSSE both handle 8-bit extended ASCII. Here are the results for various string sizes.
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    16   16   16   16   17   19   20   22   28   48   63   86  116
FStrLen       7    7   10    9   10   13   16   37   47   59   86  128  194
Ratch        18   25   32   39   29   25   35   51   66   92  105  144  227
szLength     19   20   19   19   23   25   30   51   60   81  121  176  264
szLen         8   10   12   14   19   22   30   56   77  115  175  270  429


Tune it up a bit if you like and see if you can top Ratch! I'll have a closer look at your algorithm later. I used it to print the lengths that are displayed in the header so I know it works, at least with 7-bit ASCII.


[attachment deleted by admin]

Jimg

I think Codewarp's idea was to make a faster routine for unaligned strings.  I inserted 1-3 bytes before each test string, using for example

    %FOR len,<SIZES>
         align 16
         db 0
         str&len& db len dup ('X'),0
    ENDM

to see the effect, and got:

Aligned
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    25   25   25   25   25   28   28   32   38   47   82  117  169
FStrLen       6    8   10    9   11   12   16   21   32   61   89  132  199
Ratch         8   11   12   15   14   14   20   28   39   78  101  142  221
szLength     10   11   10   11   13   17   20   28   38   77  112  168  257
szLen         8    9   13   12   18   22   28   38   54   92  139  207  321

Misaligned by 1 byte
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    28   28   28   28   28   30   30   37   41   55   90  126  178
FStrLen       6    8   10    9   11   12   17   24   35   72  102  150  230
Ratch         8   11   12   15   18   15   23   30   40   85  109  154  241
szLength     14   13   13   16   17   20   22   32   44   85  116  174  262
szLen         8    9   13   13   18   22   28   39   57   93  139  207  323

Misaligned by 2 bytes
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    28   28   28   28   28   30   30   37   41   55   89  125  177
FStrLen       6    8   10    9   11   12   17   24   35   71  102  149  229
Ratch         8   11   12   15   18   16   23   30   40   85  109  155  240
szLength     14   13   16   16   17   20   22   32   64   84  116  172  262
szLen         8    8   13   13   18   22   28   39   54   93  139  208  328

Misaligned by 3 bytes
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    28   28   28   28   28   30   30   35   45   54   91  123  174
FStrLen       6    7   10    9   11   12   17   25   35   71  102  149  229
Ratch         8   11   12   15   18   15   23   30   41   85  109  154  240
szLength     13   16   16   17   21   21   29   38   64   83  121  172  266
szLen         8    9   13   13   18   22   28   38   54   93  139  207  321

So his code is less affected by misalignment than the others.

Codewarp

Jimg:  Nice work! I am unclear as to the nature of this Ratch algorithm.  But it doesn't
matter anyway, since further work on szLength( ) makes it the fastest.  You were
right--my algorithm is intended to enhance the performance of both misaligned strings
and short strings.  The new version below ratchets up the performance for long
strings with a little tuck-and-unroll.  I have also exchanged the roles of
eax and edx, to avoid one extra mov instruction at the end of every call.


; search to find length of a null-terminated string
; fast performance: strings of 1/10/100/1000 bytes require 7/13/101/653 cycles (Athlon64)
__declspec( naked ) int szLength (const void* src) {
    _asm {
            mov     eax, [esp + 4]          ; point eax to start of string
            test    eax, 3
            jnz     fixalign                ; jmp to fix misalignment
            align   4
lenscan:    mov     edx, [eax]              ; load next four bytes
resume:     lea     ecx, [edx - 1010101h]
            not     edx                     ; on each byte in ecx: (byte-1) & ~byte & 80h
            and     ecx, edx
            and     ecx, 80808080h
            jnz     found                   ; branch if found

            mov     edx, [eax + 4]          ; load next four bytes
            add     eax, 8                  ; advance ptr (twice)
            lea     ecx, [edx - 1010101h]
            not     edx                     ; on each byte in ecx: (byte-1) & ~byte & 80h
            and     ecx, edx
            and     ecx, 80808080h
            jz      lenscan                 ; repeat while no zeros found
            sub     eax, 4                  ; back off to last dword

found:      sub     eax, [esp + 4]          ; subtract the base address
            test    ecx, 8080h              ; test first two bytes
            jz      upper2                  ; jmp if not found in the lower 2 bytes
            shr     ecx, 8                  ; set carry from bit7 of 1st byte
            sbb     eax, -1                 ; return eax = eax + (1-carry)
            ret
upper2:     shl     ecx, 9                  ; set carry from bit7 of 3rd byte
            sbb     eax, -3                 ; return eax = (eax+2) + (1-carry)
            ret

fixalign:   mov     ecx, eax
            and     eax, ~3                 ; eax points to aligned addr
            and     ecx, 3                  ; ecx = bytes to skip
            shl     ecx, 3                  ; ecx = bits to skip
            mov     edx, 1
            shl     edx, cl                 ; put cl 1's into edx from the bottom up
            sub     edx, 1
            or      edx, [eax]              ; combine with first four bytes
            jmp     resume                  ; start up in the aligned search
    }
}


Jimg

Codewarp-

On first test, it seems much faster.  Unfortunately, it's crashing on misaligned strings.  I haven't had a chance to figure out why yet.

Later...

ok, it's the last fixalign instruction-
            or      edx, [eax]              ; combine with first four bytes

Phil

Jimg: I had expected that there would have been trouble in your earlier alignment tests with the lszLenSSE routines. I had thought that it required 16-byte alignment for the strings. Anyway, I saw that it ran okay for you and I haven't looked into it further. I just wanted to say something in case the crashes you are encountering are caused by it and not szLength.

Codewarp

Quote from: Jimg on June 21, 2005, 12:59:37 AM
Codewarp-

On first test, it seems much faster.  Unfortunately, it's crashing on misaligned strings.  I haven't had a chance to figure out why yet.

Later...

ok, it's the last fixalign instruction-
            or      edx, [eax]              ; combine with first four bytes


Jimg -

I have been unable to reproduce any crash, nor can I find any erroneous result.  That is not to say there isn't one,
but I can assure you that the or edx, [eax] is the essence of the fixalign, and not an oversight.  What it does is
fill the beginning ragged edge with 1's, just in case those bytes contained zeros.  Remember, on misalignments, I
back up the pointer to the starting dword, then I force 1-2-3 bytes to 1's, depending on the misalignment.

I have single-stepped through this process and found that it does exactly what it should (at the register level).
I am calling it from within larger software that passes a diversity of string lengths and alignments to it, and it all executes
without any apparent difficulty.

I am eager to repair this algorithm if it is indeed in error, but I need some evidence to take your claim seriously.
Please be sure that you have included all my lines.  Note that the VC++ compiler places all routines on 16-byte
boundaries--your tests should do the same.  I see the same unexpectedly large swings in clock counts with only
minor changes in code and entry-point alignments, reported in many postings.

Phil

Codewarp: Here's an updated zip including your recent modifications. The results weren't all that different from your previous version, but in these tests all strings and procedures are aligned on 16-byte boundaries.
Proc/Bytes    0    1    2    3    5    8   13   21   34   55   89  144  233
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenSSE    16   16   16   16   16   19   20   22   28   48   63   86  116
FStrLen       7    7   10    9   10   13   16   35   47   59   86  128  194
Ratch        18   25   32   39   29   25   35   52   66   92  105  144  227
szLength     19   20   19   19   23   25   30   51   60   81  121  176  265
szLength2    19   19   19   19   23   24   30   37   47   81  116  173  261
szLen         8   10   12   14   19   22   30   56   77  115  175  270  428


Jimg: Could you zip up and post your recent test source and results so we could use the same methods to test various alignments? I know you said that you just added 1-3 nops before each string in your runs and that sounds easy enough but the actual source would be helpful in case something unusual is happening.


[attachment deleted by admin]

Codewarp

Phil--

I am getting different clock counts than you are, which is likely due to different CPUs (mine = Athlon64, yours = ?) and
different benchmarking techniques.  What I am seeing with the different strlen( ) methods is this:

  (1)  The (DWORD - 1010101h) & ~DWORD method is the basis for all the highest-performing strlen( ) methods.  I
        use it in all of my implementations, as does Ratch (i.e. identical code).

  (2)  This method takes 15-25% longer on long misaligned strings, without realignment.  This loss occurs in all
        implementations of it, but does not seem to show up well in your tests.

  (3)  By unrolling the loop for 8 bytes per iteration, the method gains noticeable speed, using 15-25% less time.  This
       does not show up at all in your tests, but I was testing 1000-byte searches; yours were 233.

  (4)  Handling misalignment costs 2-3 cycles up front, on every call, misaligned or not.  On misaligned calls it
        costs an additional 4-5 clocks.  This 6-8 cycle overhead is paid by the first 3 or 4 dwords.  The overhead
        of byte-at-a-time realignment is excessive after the first byte because of the jmps involved.  Memory
        misalignment costs vary between processors, but typical "abc" quoted strings are not aligned.

  (5)  Cost of locating the byte in the last dword shows up in short strings of a few bytes.  Short strings are far more
        commonly passed to strlen( ) than long strings.  The fewest number of jmps to do this runs the fastest.

My benchmark practice is as follows:

  (1)  t1 = clocks  for 1000 empty, baseline iterations
  (2)  t2 = clocks for 1000 iterations of a target routine
  (3)  t3 = overhead reading the clock
  (4)  report (t2 - t1 - t3)/1000 as the time for the target routine.

I run this whole thing at least three times, and note the result stability, rerunning as needed to get a stable
lowest clock count benchmark.  This all assures that I am not simply measuring the efficiency of my memory
system (or of interrupt processing), but getting as close as possible to the algorithmic cost itself.  I have
validated this approach by benchmarking routines of known cost, and observing just that.

I believe it is wise to be suspicious of all benchmarks, questioning their basis and understanding their limitations.
A vigorous discussion of these topics will benefit everyone.






Phil

I certainly agree that discussion is key to understanding what's happening here. I plugged your unrolled code into the test as szLength2 and didn't see much of a difference, as indicated by the result. I'm testing on a 996 MHz P3 and the trials I see are very consistent on this machine.

I'm using MichaelW's timer macros, which just repeat an invoke szLength the number of times specified by LOOP_COUNT and then return the average number of CPU clocks per iteration as the result. I've added complexity by defining list-driven macros that, hopefully, make it easier to modify the tests and add procedures.

As far as I can tell there is no compensation in MichaelW's timer macros to discount the result based on the overhead cost of reading the clock. The math to compute the averages is obviously done outside the loop, after the final clock is read.

Are you able to assemble and run these tests directly on your machine? The only part of the source that I did not include are Michael's timer macros that I'll zip up and include here for your convenience. They are also posted in several other places and have been used and tested extensively. We do, however, see unusual results from time to time but I don't think it's associated with his macros.

I'd be glad to run your benchmark on my machine if you care to post the source and executable. I can see that unrolling the loop should indeed make a big difference in the timing as you have said.


[attachment deleted by admin]

MichaelW

Phil,

My macros compensate for the loop and timing overhead by timing an empty loop that is otherwise identical to the test loop, and subtracting the cycle count for the empty loop from the cycle count for the test loop.


hutch--

hmmmm,

Quote
My benchmark practice is as follows:

  (1)  t1 = clocks  for 1000 empty, baseline iterations
  (2)  t2 = clocks for 1000 iterations of a target routine
  (3)  t3 = overhead reading the clock
  (4)  report (t2 - t1 - t3)/1000 as the time for the target routine.

I run this whole thing at least three times, and note the result stability, rerunning as needed to get a stable
lowest clock count benchmark.  This all assures that I am not simply measuring the efficiency of my memory
system (or of interrupt processing), but getting as close as possible to the algorithmic cost itself.  I have
validated this approach by benchmarking routines of known cost, and observing just that.

This is in fact an interesting notion, but I am wary of what is left, as it will still depend on the opcode implementation from processor to processor, which differs substantially over time and between different manufacturers. Usually a reference to known code is more useful, but this also has its limitations, in that an algo that is fast on one machine can be slow on another if it is written to use a specific characteristic of one form of hardware.

Memory speed is of course a factor, but on the same box, testing two different routines, one known and the other developmental, there is no advantage or disadvantage to either.  What I am inclined to trust is algo comparison on a range of different boxes with different processors, to see which works better on what box -- which is the basics of writing mixed-model code that is general purpose.

Jimg

Codewarp-
Sorry, I misinterpreted one of your instructions-
     and     eax, ~3
I haven't seen the tilde used before, and thought it was just a spurious character and deleted it. Phil fixed it properly-
     and     eax, 0FFFFFFFCh ; ~3    ; eax points to aligned addr
Works perfectly so far, and it is the fastest (at least on my machine).


Phil-

   The code I used is identical to yours; I just ran it four separate times, inserting one more byte in the string definition macro each time, as I said. I didn't automate it at all-
    %FOR len,<SIZES>
         align 16
         db 0  ;,0,0
         str&len& db len dup ('X'),0
    ENDM