szLen optimize...

Maelstrom · June 11, 2005, 03:51:12 PM

Yeah your right, what a mess. So much for copying and pasting from older threads...
I went through the thread from the beginning and debugged it though. This one is corrected.

Code Select


FStrLen proc string:dword
        mov     esi, [esp+8]
        mov     ebx, 80808080h      ; we'll use register ebx rather than immediate 80808080h
        mov     eax, [esi]
        xor     edx, edx            ; edx = 0
    @@:
        lea     ecx, [eax-1010101h] ; sub 1 from each byte in eax
        inc     edx                 ; ready for next dword
        and     ecx, ebx            ; test  sign ; ebx= 80808080h     
        mov     eax, [esi+edx*4]    ; get next dword
        jz      @B                  ; if not loop again

        test    ecx, 000000FFh      ; is al zero?
        jnz     C_minus4            ;
        test    ecx, 0000FF00h      ; is ah zero?
        jnz     C_minus3            ;
        test    ecx, 00FF0000h      ; is zero?
        jnz     C_minus2            ;
    C_minus1:
        lea     eax, [edx*4-1]
        ret
    C_minus2:
        lea     eax, [edx*4-2]
        ret
    C_minus3:
        lea     eax, [edx*4-3]
        ret
    C_minus4:
        lea     eax, [edx*4-4]
        ret
FStrLen endp

Phil · June 12, 2005, 06:15:42 AM

Finally got FStrLen working. Looks like it's faster on my P3 for everything larger than 3 byte strings with the modifications that I made. Originally, It was about the same as StrLen1 when it had to preserve ebx and esi to be a 'fair' comparison with the other procedures. I re-wrote it but maintained the spirit of doing 4 bytes at a time. Here's the modified procedure:

Code Select

FStrLen proc string:dword
        mov     ecx, [esp+4]
        xor     eax, eax            ; clear initial size 
    @@:
        mov     edx, [ecx+4*eax]
        lea     edx, [edx-1010101h] ; sub 1 from each byte in eax
        add     eax,1               ; ready for next dword
        and     edx, 80808080h      ; check sign bit in all byte
        jz      @B                  ; more to do until we have a sign
        test    edx, 000000FFh      ; is byte-1 zero?
        jnz     C_minus4            ;
        test    edx, 0000FF00h      ; is byte-2 zero?
        jnz     C_minus3            ;
        test    edx, 00FF0000h      ; is byte-3 zero?
        jnz     C_minus2            ;
    C_minus1:
        lea     eax,[4*eax-1]
        ret     1*4
    C_minus2:
        lea     eax,[4*eax-2]
        ret     1*4
    C_minus3:
        lea     eax,[4*eax-3]
        ret     1*4
    C_minus4:
        lea     eax,[4*eax-4]
        ret     1*4

FStrLen endp

I should point out that this procedure isn't always safe. If the final null byte of the string is at the end of an allocated memory segment then the last dword mov could cause a general protection fault if the string wasn't aligned on a 4-byte boundry. To make it safe, more code would be needed up front to do things a byte at a time until 4-byte alignment was assured. It would probably still be faster on strings larger than 6 or 7 bytes but it would slow it down a bit. Finally, the timings on a 996 MHz P3:

Code Select

Bytes    0     1     2     3     4     5     15    16    17    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
FStrLen  6     9     9     9     9     15    16    18    21    209   785
StrLen   15    15    16    15    18    18    25    28    28    278   1046
StrLen1  12    12    12    12    16    16    23    25    25    276   1042
StrLen2  11    10    13    13    15    14    25    27    26    286   1058
szLen    8     10    12    14    17    19    32    35    36    465   1808
szLen1   5     7     9     11    13    15    28    30    31    374   1449
szLen2   5     8     11    11    13    15    29    30    32    403   1560
szLen3   8     7     9     10    14    15    28    34    31    412   1595
szLen4   5     7     10    10    13    14    29    27    29    353   1349
lstrlen  44    44    46    47    49    62    92    95    98    812   3120

The crossover point where it becomes consistently faster might be higher than 3 bytes on some machines but I think it's a general win for average and large string sizes. Thanks for posting the algorithm and also the szlen program that I also modified and renamed fszlen to avoid confusion.

[attachment deleted by admin]

Phil · June 12, 2005, 10:55:31 AM

I changed the lea adjustments to sub's because they should be faster on many machines.

Code Select

FStrLn2 proc string:dword
        mov     ecx, [esp+4]
        xor     eax, eax            ; clear initial size 
    @@:
        mov     edx, [ecx+eax]
        lea     edx, [edx-1010101h] ; sub 1 from each byte in eax
        add     eax, 4              ; ready for next dword
        and     edx, 80808080h      ; check sign bit in all byte
        jz      @B                  ; more to do until we have a sign
        test    edx, 000000FFh      ; is byte-1 zero?
        jnz     C_minus4
        test    edx, 0000FF00h      ; is byte-2 zero?
        jnz     C_minus3
        test    edx, 00FF0000h      ; is byte-3 zero?
        jnz     C_minus2
    C_minus1:
        sub     eax,1
        ret     1*4
    C_minus2:
        sub     eax,2
        ret     1*4
    C_minus3:
        sub     eax,3
        ret     1*4
    C_minus4:
        sub     eax,4
        ret     1*4

FStrLn2 endp

And the timing, again on a 996 MHz P3

Code Select

Bytes    0     1     2     3     4     5     15    16    17    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
FStrLen  6     9     10    9     9     15    16    18    21    209   785
FStrLn2  7     8     9     9     18    11    15    19    20    209   785
StrLen   15    15    16    15    18    18    25    28    28    278   1047
StrLen1  11    11    12    12    15    15    21    24    24    231   856
StrLen2  12    12    12    13    14    14    24    26    26    269   986
szLen    8     10    12    14    17    19    32    35    37    464   1808
szLen1   5     7     9     11    13    15    28    28    31    371   1443
szLen2   6     7     9     12    13    15    30    30    31    408   1580
szLen3   5     8     9     10    13    15    28    33    30    413   1589
szLen4   6     7     9     10    13    14    27    30    31    403   1553

Notice that StrLen1 and StrLen2 changed dramatically from the previous version. I carefully removed the new function FStrLn2 from the test and commented the procedure definition so it wouldn't be loaded into memory and the StrLen functions went back to their previous slower timings. It's unusual and I repeated the test by carefully editing the code to put my updated function in and they when back to the more likely speeds that we see here. I don't understand how the position in memory can affect the timing this dramatically.

[attachment deleted by admin]

hutch-- · June 12, 2005, 11:07:02 AM

Phil,

In the short term, try aligning each procedure entry at align 16 and if that does not help, try another dirty trick,

Code Select


REPEAT 1024
  nop
ENDM

After each ALIGN 16 but before the procedure. The idea is to isolate each procedure.

Jimg · June 12, 2005, 03:30:02 PM

QuoteI should point out that this procedure isn't always safe. If the final null byte of the string is at the end of an allocated memory segment then the last dword mov could cause a general protection fault if the string wasn't aligned on a 4-byte boundry.

It should also be noted that these routines do not work on ascii character > 128

For example, change test string 5 to

str5 db 'X',130,'XXX',0 ; was 'XXXXX',0

run the test and look at the length printed in the headers.

Phil · June 12, 2005, 05:16:42 PM

JimG: Thanks for pointing out that many of these functions only work with 7-bit ASCII. I hadn't noticed that until you mentioned it.

Hutch: Thanks for the suggestions on issolating the functions. I'll try the align 16 and modifiy this post later with the results.

donkey · June 12, 2005, 05:30:24 PM

I wrote this some time ago, it is for exceeding long strings (>128bytes) using MMX. Note that it is assumed the the entry of the proc is on a paragraph boundary, something that is gauranteed if it is in a static lib using GoAsm. Also the string must start on a 16 byte boundary.

Code Select

lszLenMMX FRAME pString

	mov eax,[pString]
	nop
	nop ; fill in stack frame+mov to 8 bytes

	pxor mm0,mm0
	nop ; fill pxor to 4 bytes
	pxor mm1,mm1
	nop ; fill pxor to 4 bytes

	: ; this is aligned to 16 bytes
	movq mm0,[eax]
	pcmpeqb mm0,mm1
	add eax,8
	pmovmskb ecx,mm0
	or ecx,ecx
	jz <

	sub eax,[pString]

	bsf ecx,ecx
	sub eax,8
	add eax,ecx

	emms

   RET

ENDF

Ratch · June 13, 2005, 04:14:07 PM

To the Ineffable All,
Perhaps all those interested should peruse this link. Ratch
http://board.win32asmcommunity.net/index.php?topic=16299.msg128369;topicseen#msg128369

Jimg · June 13, 2005, 06:04:00 PM

Very nice Ratch, the fastest one that works on all ascii characters so far.

I couldn't leave good enough alone, of course, so I played a little with your code. First my appologies for messing with your code, I don't get along well with loops and repeats and such, so I coverted it to brute force code to play with.

I'm getting some strange results. My first try only cut a few cycles off ( proc RatchX), but then I inserted a nop in preparation for some other tests, and on my screwy athlon, it made the code run much faster on the large string (proc RatchX2). The nop misaligned the main loop, so it should have made it run much slower. Here are my results:

Code Select

Bytes    0     1     2     3     4     5     15    16    31    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
StrLen2  11    11    11    11    13    14    22    25    36    283   927
Ratch    8     11    12    15    10    14    26    21    39    243   875
RatchX   8     10    13    13    13    14    22    22    35    234   864
RatchX2  8     11    14    14    14    14    20    21    35    217   800

Press enter to exit...

What's going on here.

If someone with an intel chip would try this out to see if there is a similar effect, I'd appreciate it.

[attachment deleted by admin]

Phil · June 13, 2005, 06:19:07 PM

Here are the results on a 996 MHz P3

Code Select

Bytes    0     1     2     3     4     5     15    16    31    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
StrLen2  9     11    12    13    13    15    24    42    56    301   1047
Ratch    18    25    32    39    22    29    49    31    72    258   893
RatchX   18    24    31    33    21    29    41    31    63    253   884
RatchX2  19    23    30    33    20    27    41    33    68    247   881

And a complete pass for all procedures included in this thread.

Code Select

Proc/Bytes    0    1    2    3    4    5   15   16   17   31 1023
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
FStrLen       7    7   10    8   18   10   17   19   19   43  786
FStrLn2       7    8    9    8   13   11   15   19   20   43  786
Ratch        18   25   32   39   22   29   49   31   38   72  893
RatchX       18   24   31   33   21   29   41   31   38   63  883
RatchX2      19   23   30   33   20   27   41   33   37   68  882
StrLen       15   15   16   16   18   18   25   28   28   53 1047
StrLen1      12   12   12   12   16   16   23   25   25   53 1042
StrLen2       9   11   11   13   13   15   24   39   27   56 1047
szLen         8   10   12   14   17   19   32   49   48   73 1808
szLen1        5    7    9   11   13   15   28   30   32   65 1566
szLen2        6    7    9   11   13   15   29   43   45   67 1580
szLen3        6    6   10   10   12   14   27   51   72   65 1571
szLen4        5    7    8   10   14   14   27   30   46   63 1552
lstrlenA     48   41   44   45   48   62   91   94   97  139 3117

Also, I defined a fld$ macro that produces a fixed length field and allows right alignment. The code is list driven so it is easier to modify. The complete source is included in the zip but here are the defining elements that show the idea:

Code Select

    include     \masm32\include\masm32rt.inc    ; defaults to .386

.586

    include     timers.asm      ; requires a pentium

    LOOPCOUNT equ 1000000

    PROCS TEXTEQU <FStrLen,FStrLn2, \
                   Ratch,RatchX,RatchX2, \
                   StrLen,StrLen1,StrLen2, \
                   szLen,szLen1,szLen2,szLen3,szLen4, \
                   lstrlen >

    SIZES TEXTEQU <0,1,2,3,4,5,15,16,17,31,1023>

    %FOR proc,<PROCS>
         proc proto :DWORD
    ENDM

    MAXWIDTH  equ 20
    HDRWIDTH  equ 10
    COLWIDTH  equ 5

.data

    %FOR len,<SIZES>
         align 16
         str&len& db len dup ('X'),0
    ENDM

fld$ MACRO DDpointer,DDwidth,DDalign:=<0> 
    LOCAL rvstring
    .data
        rvstring db MAXWIDTH+4 dup (0)
        align 16
    .code
    invoke sxFld,reparg(DDpointer),ADDR rvstring,DDwidth,DDalign
    EXITM <ADDR rvstring>
ENDM

.code
start:

    print fld$(chr$("Proc/Bytes"),HDRWIDTH)
    %FOR len,<SIZES>
        invoke StrLen,ADDR str&len&
        print fld$(ustr$(eax),COLWIDTH,1)
    ENDM
    print chr$(13,10)

    print fld$(chr$("=========="),HDRWIDTH)
    %FOR len,<SIZES>
        invoke StrLen,ADDR str&len&
        print fld$(" ====",COLWIDTH)
    ENDM
    print chr$(13,10)

    %FOR proc,<PROCS>
        print fld$(chr$("&proc&"),HDRWIDTH)
        %FOR len,<SIZES>
            counter_begin LOOPCOUNT, HIGH_PRIORITY_CLASS
                invoke proc,ADDR str&len&
            counter_end
            mov ebx,eax
            print fld$(ustr$(ebx),COLWIDTH,1)
        ENDM
        print chr$(13,10)
    ENDM
    mov   eax,input(13,10,"Press enter to exit...")
    exit

[attachment deleted by admin]

hutch-- · June 13, 2005, 11:38:55 PM

Here are the timings on my Prescott PIV.

Bytes 0 1 2 3 4 5 15 16 31 255 1023
===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
StrLen2 6 9 8 6 10 9 27 25 38 302 862
Ratch 1 1 13 4 4 6 19 13 43 260 910
RatchX -1 0 1 1 7 3 16 13 30 256 910
RatchX2 -1 2 1 1 3 4 13 13 30 248 882

Jimg · June 14, 2005, 12:04:59 AM

Hutch-
Strange. The misaligned code is faster on yours also. Not as dramatic, but suprising anyway. Thanks.

Phil · June 15, 2005, 05:59:03 AM

Ratch, RatchX, RatchX2, StrLen1, and StrLen2 all seem to be quite sensitive to alignment. For some reason FStrLen and FStrLen1 were not. The following table was produced by loading 13 different copies of each procedure into memory at various alignments. Before each procedure is:

Code Select


align 16

repeat PAD
  nop
endm

Ratch&PAD& proc arg:DWORD ; procedure definition and entry begins here.
.... full procedure definition
Ratch&PAD& endp

The &PAD& appends the pad value to the procedure name to avoid duplicate names as the same procedure is loaded with different padding. This is the last of PAD values from 4 to 16. The values are easily alterable in the source if you care to play with them. I was quite surprised because the 16-byte alignment doesn't always seem to be best. Also, the timings are for a 1023 dup('X') string that was used in many of the preceeding tests and the different timings at various alignments help explain wide differences as we would add or remove the 250 ms sleep which changed the alignment.

Code Select

Pad nops      4    5    6    7    8    9   10   11   12   13   14   15   16
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
Ratch      1067 1066 1332 1077 1335 1333 1333 1334 1078  889 1067  883  893
RatchX     1061 1061 1313 1059 1313 1313 1313 1313 1060  882 1059  872  883
RatchX2    1061 1313 1058 1312 1312 1312 1314 1059  885 1059  879  880  882
StrLen1    1298 1046 1048 1046 1043  870  873  874  859 1046 1046 1043 1043
StrLen2     987  978  977  992 1059 1059 1058 1059 1302 1302 1302 1302 1048

RatchX2 will be fastest when it is preceeded by 'align 16' followed by 14 or 15 'nops'.
StrLen2 is fastest when preceeded by 'align 16' followed by 2 nops, etc.

I don't understand but it is interesting!

[attachment deleted by admin]

Jimg · June 15, 2005, 03:03:23 PM

Nice technique Phil, I kept thinking I was gonna get around to something similar :U

Here are my results, except I changed 16 to zero to test against original configuration:

Code Select

Pad nops      4    5    6    7    8    9   10   11   12   13   14   15    0
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
Ratch      1058  904 1063 1060 1058 1059 1189 1059  933  892  893  809  873
RatchX     1056  894 1053 1052 1055 1054 1180 1054  929  888  884  801  864
RatchX2     887 1052 1052 1053 1057 1180 1056  926  881  884  798  800  800
StrLen1    1058  931  928  930  927  870  865  866  865 1053 1056 1054 1056
StrLen2     909  910  913  910 1007 1007 1008 1007 1122 1122 1122 1122  930

I would have thought 12 pads on the Ratch procs would have been the best since it would align the loop at 16 byte, but didn't work out that way. Still a mystery here :dazzled:
Also, I think looping a million times is probably not the best indication of how fast a routine can be used in a normal program. Perhaps calling each routine once in sequence, loop back, call each one again, etc., keeping a running average?

Phil · June 16, 2005, 12:03:46 AM

JimG: Doing what you suggest would well produce a better indication of how the code might perform in a typical program. However, I think doing it like this is probably better when comparing the raw speed of an algorithm. It produces values which can be reproduced and that is very important when we are trying to tune, or slightly de-tune, our code for greater performance! Like you said, we still have mysteries here. I still don't know how mis-aligning a loop by adding a nop can make it run faster ... but it certainly seems to be the case occasionally!

Donkey: Finally got your MMX strlen for long strings into the test procedure. Sorry it took me so long to get around to it. As you said earlier, the strings must begin at a 16-byte boundry and it gains it's speed using MMX to load and examine 8-bytes at a time. Great job! It certainly seems to fly! I've added some commentary at the end of this block ... please let me know if I've missed anything or mis-understood anything. Here's your routine in MASM format:

Code Select

align 16

lszLenMMX proc pString:DWORD    ; Donkey's MMX strlen for long strings
.mmx
.xmm
        mov eax,[esp+4]
	nop
        nop                     ; fill in stack frame+mov to 8 bytes

	pxor mm0,mm0
        nop                     ; fill pxor to 4 bytes
	pxor mm1,mm1
        nop                     ; fill pxor to 4 bytes

    @@:                         ; this is aligned to 16 bytes
	movq mm0,[eax]
	pcmpeqb mm0,mm1
	add eax,8
        pmovmskb ecx,mm0        ; edit: by Phil - is this Polish instruction?
        or ecx,ecx              ;       ... ah, here's one that i understand!
        jz @B

        sub eax,[esp+4]

	bsf ecx,ecx
	sub eax,8
	add eax,ecx

	emms

        ret 1*4

lszLenMMX endp

First, the pxor's clear both mm0 and mm1. Then 8-bytes are loaded into mm0 with movq. Next the index is updated and pcmpeqb compares all 8-bytes to the zeros stored in mm1. If any of the 8-bytes are equal to nul then the corresponding byte in mm0 is set to all ones. Otherwise, the corresponding byte in mm0 is set to all zeros. Finally, the pmovmskb (pmov mask byte) copies the most significant bit of each byte in mm0 into the destination register which is ecx. If any of the bits are set then the end of the zero terminated string has been found. The bsf instruction scans forward (right-to-left) thru the bits in ecx and returns the index of the first bit that was set. The index returned would be 0 if bit 0 was set, indicating that the terminator was in the first (least significant) byte of the 8-bytes being checked. Subtracting 8 and adding the index to the difference between the original and final pointers produces the string length that is returned in eax!

Here are the timings for all routines now included in timesz.asm using the longer strings. Again, many of them only support 7-bit ASCII. Donkey's lszLenMMX procedure nicely handles 8-bit extended ASCII and there is no danger of page boundry over-run at the end because the source string must be aligned on a 16-byte boundry. Thanks for sharing your MMX code with us Donkey!

Code Select

Proc/Bytes    0    1    2    3    4    5   15   16   17  127  255 1023 2047
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenMMX    16   16   16   16   16   16   20   22   22   76  122  411  795
FStrLen       7    7   10    8   18   10   17   19   19  113  210  785 1554
FStrLn2       7    8    9    9   18   11   15   19   20  113  210  785 1555
Ratch        18   25   32   39   22   29   49   31   38  148  259  895 1746
RatchX       18   24   31   33   21   29   41   31   38  142  253  884 1721
RatchX2      19   23   31   33   20   27   41   33   37  146  248  881 1722
StrLen       15   15   16   16   18   18   25   28   28  150  278 1047 2071
StrLen1      11   11   12   12   15   15   21   24   24  127  231  857 1701
StrLen2      13   13   12   14   15   15   24   27   27  142  264  978 1929
szLen         8   10   12   14   17   19   32   49   47  241  464 1809 3604
szLen1        5    7    9   11   13   15   28   30   43  210  406 1566 3113
szLen2        6    7    9   12   13   15   29   30   45  212  408 1580 3149
szLen3        6    6   10   10   12   15   27   55   61  213  409 1571 3148
szLen4        5    7    8   10   14   14   26   30   30  208  399 1551 3090
lstrlenA     48   41   44   45   48   62   91   94   97  427  812 3117 6192
scasx        45   48   53   57   59   63  102  105  109  548 1055 4109 8179

The slowest algorithm by far is the 'repnz scasb' cutie that I also added to this version. The files in the zip have the same names as the previous timesz.asm. Be sure to rename your older files if you'd like to save them before you extract from this archive.

[attachment deleted by admin]

News:

szLen optimize...

Maelstrom

Phil

Phil

hutch--

Jimg

Phil

donkey

Ratch

Jimg

Phil

hutch--

Jimg

Phil

Jimg

Phil