szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM


Codewarp

#120
Lingo,

Very cool solution to the strlen( ) implementation. :clap: Your method illustrates a different strategy for search alignment than any of the other algorithms seen in this thread.  It avoids alignment complexity by scanning the first 16 bytes without regard for alignment, and only aligning for subsequent reads if necessary.   I do have a few comments about it:

(1)  The Prefetches appear to be unnecessary and removing them reduces clock count by a few points
       (on Athlon64).  Prefetches are good for multiple data streams, and/or enhancing random access.  However,
       this is a single sequential stream, and normal cpu prefetch for that is as good as it gets.

(2)  Unless I am mistaken, this code requires SSE2, because of the movdqa instruction, making it less
       widely applicable.

(3)  Strings < 16 bytes can be sped up a few cycles, with simplification and avoiding jmps.

(4)  An additional 12% cycle reduction resulted when I doubled-up the loop to process 32 bytes at a time,
       but I left it out because of its potential 31-byte overreach, past the end of the string, and off the end of the last page (a sketch of that loop follows this list).

(5)  Your method and mine below can both access memory past the end of the string, and off the end of the last page,
       by as much as 15 bytes (causing a possible page-fault).  But it is a way too cool :8) algorithm to leave on the shelf...
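
For reference, here is a minimal sketch (not from the post) of the doubled-up inner loop mentioned in (4).  It assumes the same setup as the adaptation below (xmm0 zeroed, eax aligned down), requires SSE2 for movdqa, and keeps the 31-byte overreach that kept it out:


again32:    movdqa          xmm1, [eax + 16]    ; compare two 16-byte blocks per pass
            movdqa          xmm2, [eax + 32]
            add             eax, 32             ; eax = address of the second block
            pcmpeqb         xmm1, xmm0          ; xmm0 holds zeros
            pcmpeqb         xmm2, xmm0
            pmovmskb        edx, xmm1
            pmovmskb        ecx, xmm2
            shl             ecx, 16             ; merge both zero-byte masks into one 32-bit mask
            or              edx, ecx
            jz              again32
            sub             eax, [esp + 4]      ; subtract original base address
            bsf             edx, edx            ; bit index of the first zero byte (0..31)
            lea             eax, [eax + edx - 16]   ; the first of the two blocks starts 16 bytes below eax
            ret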

My adaptation of your method is as follows:


            mov             eax, [esp + 4]  ; eax = base address to start search
            movdqu          xmm1, [eax]     ; load first 16 bytes, aligned or not
            pxor            xmm0, xmm0      ; xmm0 = 0's
            and             eax, -16        ; align eax down to base paragraph
            pcmpeqb         xmm1, xmm0      ; check 16 bytes for zeros
            pmovmskb        edx, xmm1           
            test            edx, edx        ; test edx for zero
            jz              again           ; branch ahead if no zero bytes found
            bsf             eax, edx        ; return the byte position as the length
            ret

            align           8
again:      movdqa          xmm1, [eax + 16] 
            add             eax, 16         ; eax = address of 16-byte compare
            pcmpeqb         xmm1, xmm0      ; search the 16-bytes
            pmovmskb        edx, xmm1     
            test            edx, edx
            jz              again
            sub             eax, [esp + 4]    ; subtract original base address
            bsf             edx, edx          ; get position of 1st zero byte
            add             eax, edx          ; add to base address
            ret


lingo

#121
Codewarp, :bg

"2)  Unless I am mistaken, this code requires SSE2, because of the movdqa instruction, making it less
       widely applicable."


I have a new version with movups and movaps instead of movdqu and movdqa.
Pls, reload the zip file and test it again.
    
"(3)  Strings < 16 bytes can be sped up a few cycles, with simplification and avoiding jmps."
I agree

"5)  Your method and mine below can both access memory past the end of the string, and off the end of the last page,
   by as much as 15 bytes (causing a possible page-fault).  But it is a way too cool  algorithm to leave on the shelf..."


I disagree
What is "the end of the string"?
We are searching for the end of the string, so the phrase "the end of the string" is undefined until we find it.
Before using StrLen we need a buffer with ENOUGH memory for our string, hence if someone uses my algo
they need to allocate ENOUGH + 32 bytes of memory.

Example: The constant _MAX_PATH (maximum length of a full path) is 260 bytes.
               If we want the length of the string holding the current path with file name,
               we need to allocate 260 + 32 bytes for the buffer,
               i.e. ENOUGH + 32 comes to 292 bytes.
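
A minimal sketch of that padded-buffer idea (the buffer name, the GetModuleFileName call and the StrLen name are only illustrative; MAX_PATH equ 260 as in windows.inc):


.data?
PathBuffer  db MAX_PATH + 32 dup (?)            ; 260 + 32 = 292 bytes reserved

.code
    invoke  GetModuleFileName, NULL, addr PathBuffer, MAX_PATH
    invoke  StrLen, addr PathBuffer             ; any 16-byte over-read stays inside the padded buffer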

"; align eax down to base paragraph"
Thanks for comments  :bg

Regards,
Lingo



Codewarp

Lingo,

The bare SSE support is much appreciated. :thumbu

Sadly, the overreach is real.  Suppose that you have a valid 2-byte string (the 3rd byte is zero), and these are the last three bytes at the end of a 4k page, the last page in the block.  This algorithm will initially read 16 bytes from the misaligned address, trespassing 13 bytes into the non-existent next page.  The best way to handle this problem is to use only aligned 16-byte reads, then mask out the initial unwanted comparison bytes.  But that is hard to do as fast as the method you have here.
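
The mask-out step in outline (a sketch only, assuming eax holds the caller's pointer and xmm0 has been zeroed; this is essentially what the corrected routine posted further down does):


    mov      ecx, eax           ; remember the original pointer
    and      eax, not 15        ; align the read down to a 16-byte boundary
    and      ecx, 15            ; ecx = bytes read before the string actually starts
    movaps   xmm1, [eax]        ; aligned read: cannot cross into the next page
    pcmpeqb  xmm1, xmm0         ; compare 16 bytes against zero
    pmovmskb edx, xmm1
    shr      edx, cl            ; discard the comparison bits for the pre-read bytes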

Ratch

Codewarp,
     Why not just put an ALIGN 16 after the buffer?  Then you would be guaranteed to be within a 16-byte boundary.  Ratch

Codewarp

Putting an ALIGN 16 anywhere is irrelevant.  You are handed a pointer to 3 bytes, you read 16 bytes, you don't get to choose its alignment.  This will lead to overrun.  There are only three choices:  live with it, back up to read 16 aligned bytes from the beginning, or use GPRs instead of SSE.

lingo

Codewarp, :bg

"The best way to handle this problem is to use only 16/aligned reads,
then mask out the initial unwanted comparison bytes.
But that is hard to do as fast as the method you have here.


For me it is not a big deal (Pls, reload the zip file and see Lingo2)
but it is slower because we have additional code  :(

Regards,
Lingo

Ratch

#126
Codewarp,

Quote
You are handed a pointer to 3 bytes, you read 16 bytes, you don't get to choose its alignment.

     Now you're setting conditions that I did not know about.

Quote
...backup to read 16 bytes from the beginning

     Do you mean back it up to a 16-byte boundary?  It becomes even more complicated if there is a '0' lurking within the backed-up data area.  Ratch


Codewarp

#127
I am not setting any conditions--I am merely reporting a perfectly normal case where the algorithm in question fails to meet desired objectives, i.e. where it reads past the end of its data, and off the end of the 4k page.  By doing so, I have shown the assertion to be irrefutable. 

Ratch, are you really suggesting that strlen( ) work only for 16-byte aligned strings? :eek Strlen( ) is a routine defined by the c-runtime library--it has no preconditions about alignment.  Its users expect it to return a correct string length from any byte address, as long as its terminating zero byte follows within valid memory.  We, as programmers, are bound to implement that, even if we don't like it.  Now, if you want to have two versions, strlen( ) and strlen16( ), where everyone knows the limitations, that's fine.  Or you can switch internally to a different method, upon detecting misaligned strings.  However expecting strings to all be 16-byte aligned is unreasonable.

By backing up to the previous 16-byte boundary, comparing, then masking off the unwanted part of the comparison, you can completely avoid unaligned reads AND avoid the overreach.  But that is more work than lingo's algorithm, and consequently slower, but perhaps more correct.

Ratch

Codewarp,
Quote
Ratch, are you really suggesting that strlen( ) work only for 16-byte aligned strings?

     Nope, I am saying that backing up and reading might pose difficulties with respect to time and logic.  I think it's better to go forward to the next 16-byte boundary instead.  See below.  Ratch

http://www.masmforum.com/simple/index.php?topic=2442.0
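
A sketch of that go-forward idea (not from the linked topic; labels and layout are illustrative): scan single bytes until the pointer reaches a 16-byte boundary, then switch to aligned SSE reads, so no read ever crosses into the page past the terminator.


    mov      eax, [esp + 4]     ; eax = base address to start search
    pxor     xmm0, xmm0         ; xmm0 = 0's
bytewise:
    test     eax, 15            ; already on a 16-byte boundary?
    jz       blockwise
    cmp      byte ptr [eax], 0  ; plain byte test up to the boundary
    je       found
    inc      eax
    jmp      bytewise
blockwise:
    movaps   xmm1, [eax]        ; from here on, only aligned 16-byte reads
    pcmpeqb  xmm1, xmm0
    pmovmskb edx, xmm1
    test     edx, edx
    jnz      last
    add      eax, 16
    jmp      blockwise
last:
    bsf      edx, edx           ; position of the zero byte within the block
    add      eax, edx
found:
    sub      eax, [esp + 4]     ; length = terminator address - base address
    ret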

Codewarp

Ratch,
Your prediction of "logic and time" consequences doesn't hold up.  Here is a fully correct version that can never read past the end of a string or off its 4k page.  It required exactly three additional GPR instructions, plus it reads only aligned blocks, which can more than pay for the 3 instructions, for a net zero cost.  This is a practical SSE version of Lingo's method for any application, and one that outperforms all GPR implementations:


lensse:  mov            eax, [esp + 4]  ; eax = base address to start search
         mov            ecx, eax
         and            eax, not 15     ; pull down to aligned address
         and            ecx, 15         ; ecx = pre-read bytes to skip
         movaps         xmm1, [eax]     ; load first 16 bytes, aligned
         pxor           xmm0, xmm0      ; xmm0 = 0's
         pcmpeqb        xmm1, xmm0      ; check 16 bytes for zeros
         pmovmskb       edx, xmm1       ; edx holds a 1-bit for each zero byte (in the low 16 bits)
         shr            edx, cl         ; discard the pre-read
         test           edx, edx        ; test edx for zero
         jz             again           ; branch ahead if no zero bytes found
         bsf            eax, edx        ; return the bit position as the length
         ret

         align          8
again:   movaps         xmm1, [eax + 16] 
         add            eax, 16         ; eax = address of 16-byte compare
         pcmpeqb        xmm1, xmm0      ; search the 16-bytes
         pmovmskb       edx, xmm1     
         test           edx, edx
         jz             again
         bsf            edx, edx        ; get position of 1st zero byte
         sub            eax, [esp + 4]  ; subtract original base address
         add            eax, edx        ; add for base address
         ret


Biterider

#130
Hi Codewarp
I was working on a similar approach to yours, but you came first. Since I have some problems compiling the movdqa instruction, I reduced the routine to MMX registers and shuffled some instructions a little to obtain better performance. As expected, the performance is similar to Donkey's routine, but the real advantage comes out with misaligned strings.

StrLength proc pString:dword
.xmm
    mov eax, [esp + 4]    ; eax = base address to start search
    mov ecx, eax
    and eax, not 7        ; pull down to aligned address
    and ecx, 7            ; ecx = pre-read bytes to skip
    movq mm1, [eax]       ; load first 8 bytes, aligned
    pxor mm0, mm0         ; mm0 = 0's
    pcmpeqb mm1, mm0      ; check 8 bytes for zeros
    pmovmskb edx, mm1     ; edx holds a 1-bit for each zero byte (in the low 8 bits)
    shr edx, cl           ; discard the pre-read
    test edx, edx         ; test edx for zero
    jz more               ; branch ahead if no zero bytes found
    emms
    bsf eax, edx          ; return the bit position as the length
    ret 4
more:
    add eax, 8
again:   
    movq mm1, [eax] 
    pcmpeqb mm1, mm0      ; search the 8-bytes
    add eax, 8            ; eax = address of 8-byte compare
    pmovmskb edx, mm1     
    test edx, edx
    jz again
    bsf edx, edx          ; get position of 1st zero byte
    emms
    sub eax, [esp + 4]    ; subtract original base address
    lea eax, [eax + edx - 8]   ; add for base address
    ret 4
StrLength endp


Regards,

Biterider

Codewarp

Thank you, Biterider, your mmx adaptation will now replace the mmx version in my codebase (with some changes ::)).  I was particularly pleased with how effortlessly I could dispose of the pre-read.  You are quite correct about the misalignment performance, important now since the great majority of strings we call this on are not 16-byte aligned.

Ratch

Codewarp,

Quote
This is a practical sse version of Lingo's method for any application, and one that outperforms all GPR implementations:

     It does not perform at all on my machine (AMD Athlon running Windows ME).  Evidently my CPU chokes on the 128-bit register referenced by MOVAPS XMM1, [EAX].  Not all machines have the latest SSE/MMX hardware and OS support.  That's one advantage of writing GPR code; it works on the 386 and up.  Anyway, congratulations on finding a solution to the alignment problem.  Ratch
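
One way to keep both worlds is a runtime feature check before choosing the routine; a minimal sketch (the two target labels are placeholders, and a complete check would also verify OS support for saving XMM state):


    mov      eax, 1
    cpuid                       ; feature flags come back in edx
    test     edx, 1 shl 25      ; bit 25 = SSE, bit 26 = SSE2
    jz       use_gpr_strlen     ; no SSE: fall back to a 386-friendly GPR routine
    jmp      use_sse_strlen     ; SSE present: the movaps/xmm code can run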

Brett Kuntz

Hello, I was bored so I disasm'd Borland's version and tweaked it a bit. Could someone bench this against some of the other fast ones posted to see where it stands? I ran it through Olly with a whole lot of different strings and it always returned the proper length.


   strlen proto :dword
.code
;##########################################################################
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

strlen proc xstr:dword

   mov eax, dword ptr [esp+4]   ; eax = string address
d1: mov edx, dword ptr [eax]     ; read the next 4 bytes
   mov ecx, edx
   add eax, 4
   sub edx, 16843009            ; 16843009  = 01010101h
   and edx, 2155905152          ; 2155905152 = 80808080h: a set bit marks a byte that may have been zero
   jz d1
   not ecx
   and edx, ecx                 ; filter out false hits from bytes >= 80h
   jz d1
   test dl, dl                  ; find which of the 4 bytes is the zero
   jnz d2
   test dh, dh
   jnz d3
   test edx, 16711680           ; 16711680 = 00FF0000h (third byte)
   jnz d4
   jmp d5
d2: dec eax                     ; step eax back onto the terminating zero
d3: dec eax
d4: dec eax
d5: dec eax
   sub eax, dword ptr [esp+4]   ; length = terminator address - base address
   add esp, 8                   ; pop return address and argument,
   jmp dword ptr [esp-8]        ; then jump back (equivalent to ret 4)

strlen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
;##########################################################################

Jimg

That's the slowest one yet!


Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  239  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
szLength     8    8    9    9   12   15   17   24   48   63   89  131  202  779
Ratch        9   11   12   15   14   14   20   31   64   78  100  143  231  855
Jens_fast   20   20   19   20   21   26   29   36   58   69   99  145  219  923
roticvSSE    4   28   29   28   28   32   32   35   39   53   79  112  160  576
lszLenSSE   25   25   25   25   25   28   28   32   38   47   84  116  166  591
Jens_mmx2    6   31   31   32   37   43   48   55   78   45   93   60  124  288
BiteRider   27   26   26   26   26   28   29   32   37   46   82  112  161  569
borland      9   10   11    9   14   18   22   29   58   74  109  162  249 1121

1 byte misalignment
szLength    13   13   14   16   15   19   20   26   56   67   91  134  206  781
Ratch        8   11   12   15   18   16   23   32   69   86  110  156  254  959
Jens_fast   20   20   21   20   24   28   32   41   64   75  105  154  243 1001
roticvSSE    3    7   10   12   18   53   48   51   60   73   94  125  176  589
lszLenSSE   32   28   28   28   31   30   30   37   45   55   92  126  178  623
Jens_mmx2    6   10   12   16   26   73   59   65   86   61   97  123  142  296
BiteRider   25   26   26   26   27   29   29   31   48   57   84  113  169  576
borland      8    9   11   10   15   20   25   36   71   90  135  205  323 1198

2 byte misalignment
szLength    14   14   16   16   15   20   20   31   41   51   92  136  208  787
Ratch        8   11   12   15   18   17   24   32   69   86  110  156  254  950
Jens_fast   20   20   20   19   24   28   32   40   62   75  104  153  237  997
roticvSSE    4    7    8   12   18   45   46   51   59   67   93  125  172  590
lszLenSSE   28   28   28   30   28   30   31   35   41   54   94  126  178  621
Jens_mmx2    7    8   12   17   26   59   54   58   84   59   96  120  141  295
BiteRider   26   26   26   26   24   29   29   33   48   57   84  112  164  575
borland      9   10   11    9   15   17   25   36   72   90  135  206  325 1203

3 byte misalignment
szLength    15   16   16   15   19   20   25   30   41   51   94  137  209  784
Ratch        9   11   12   15   18   16   23   32   70   85  108  156  253  947
Jens_fast   20   20   20   20   24   28   32   40   65   76  104  153  237  996
roticvSSE    6    7   10   35   18   43   46   49   53   66   89  119  172  583
lszLenSSE   28   28   29   28   28   30   31   32   41   57   91  124  176  624
Jens_mmx2    7    9   12   16   26   54   62   70   86   63  115  125  144  290
BiteRider   26   26   26   26   29   29   31   33   47   56   82  111  164  577
borland      8   10   12    9   15   19   25   36   69   91  134  207  325 1197


These are only the ones that give the correct answers; the ones with movdqa don't work on my machine.