News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

Previous topic - Next topic

Maelstrom

Yeah your right, what a mess.  So much for copying and pasting from older threads...
I went through the thread from the beginning and debugged it though.  This one is corrected.


FStrLen proc string:dword
        mov     esi, [esp+8]
        mov     ebx, 80808080h      ; we'll use register ebx rather than immediate 80808080h
        mov     eax, [esi]
        xor     edx, edx            ; edx = 0
    @@:
        lea     ecx, [eax-1010101h] ; sub 1 from each byte in eax
        inc     edx                 ; ready for next dword
        and     ecx, ebx            ; test  sign ; ebx= 80808080h     
        mov     eax, [esi+edx*4]    ; get next dword
        jz      @B                  ; if not loop again

        test    ecx, 000000FFh      ; is al zero?
        jnz     C_minus4            ;
        test    ecx, 0000FF00h      ; is ah zero?
        jnz     C_minus3            ;
        test    ecx, 00FF0000h      ; is zero?
        jnz     C_minus2            ;
    C_minus1:
        lea     eax, [edx*4-1]
        ret
    C_minus2:
        lea     eax, [edx*4-2]
        ret
    C_minus3:
        lea     eax, [edx*4-3]
        ret
    C_minus4:
        lea     eax, [edx*4-4]
        ret
FStrLen endp

Phil

#16
Finally got FStrLen working. Looks like it's faster on my P3 for everything larger than 3 byte strings with the modifications that I made. Originally, It was about the same as StrLen1 when it had to preserve ebx and esi to be a 'fair' comparison with the other procedures. I re-wrote it but maintained the spirit of doing 4 bytes at a time. Here's the modified procedure:
FStrLen proc string:dword
        mov     ecx, [esp+4]
        xor     eax, eax            ; clear initial size
    @@:
        mov     edx, [ecx+4*eax]
        lea     edx, [edx-1010101h] ; sub 1 from each byte in eax
        add     eax,1               ; ready for next dword
        and     edx, 80808080h      ; check sign bit in all byte
        jz      @B                  ; more to do until we have a sign
        test    edx, 000000FFh      ; is byte-1 zero?
        jnz     C_minus4            ;
        test    edx, 0000FF00h      ; is byte-2 zero?
        jnz     C_minus3            ;
        test    edx, 00FF0000h      ; is byte-3 zero?
        jnz     C_minus2            ;
    C_minus1:
        lea     eax,[4*eax-1]
        ret     1*4
    C_minus2:
        lea     eax,[4*eax-2]
        ret     1*4
    C_minus3:
        lea     eax,[4*eax-3]
        ret     1*4
    C_minus4:
        lea     eax,[4*eax-4]
        ret     1*4

FStrLen endp

I should point out that this procedure isn't always safe. If the final null byte of the string is at the end of an allocated memory segment then the last dword mov could cause a general protection fault if the string wasn't aligned on a 4-byte boundry. To make it safe, more code would be needed up front to do things a byte at a time until 4-byte alignment was assured. It would probably still be faster on strings larger than 6 or 7 bytes but it would slow it down a bit. Finally, the timings on a 996 MHz P3:
Bytes    0     1     2     3     4     5     15    16    17    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
FStrLen  6     9     9     9     9     15    16    18    21    209   785
StrLen   15    15    16    15    18    18    25    28    28    278   1046
StrLen1  12    12    12    12    16    16    23    25    25    276   1042
StrLen2  11    10    13    13    15    14    25    27    26    286   1058
szLen    8     10    12    14    17    19    32    35    36    465   1808
szLen1   5     7     9     11    13    15    28    30    31    374   1449
szLen2   5     8     11    11    13    15    29    30    32    403   1560
szLen3   8     7     9     10    14    15    28    34    31    412   1595
szLen4   5     7     10    10    13    14    29    27    29    353   1349
lstrlen  44    44    46    47    49    62    92    95    98    812   3120


The crossover point where it becomes consistently faster might be higher than 3 bytes on some machines but I think it's a general win for average and large string sizes. Thanks for posting the algorithm and also the szlen program that I also modified and renamed fszlen to avoid confusion.


[attachment deleted by admin]

Phil

#17
I changed the lea adjustments to sub's because they should be faster on many machines.
FStrLn2 proc string:dword
        mov     ecx, [esp+4]
        xor     eax, eax            ; clear initial size
    @@:
        mov     edx, [ecx+eax]
        lea     edx, [edx-1010101h] ; sub 1 from each byte in eax
        add     eax, 4              ; ready for next dword
        and     edx, 80808080h      ; check sign bit in all byte
        jz      @B                  ; more to do until we have a sign
        test    edx, 000000FFh      ; is byte-1 zero?
        jnz     C_minus4
        test    edx, 0000FF00h      ; is byte-2 zero?
        jnz     C_minus3
        test    edx, 00FF0000h      ; is byte-3 zero?
        jnz     C_minus2
    C_minus1:
        sub     eax,1
        ret     1*4
    C_minus2:
        sub     eax,2
        ret     1*4
    C_minus3:
        sub     eax,3
        ret     1*4
    C_minus4:
        sub     eax,4
        ret     1*4

FStrLn2 endp


And the timing, again on a 996 MHz P3
Bytes    0     1     2     3     4     5     15    16    17    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
FStrLen  6     9     10    9     9     15    16    18    21    209   785
FStrLn2  7     8     9     9     18    11    15    19    20    209   785
StrLen   15    15    16    15    18    18    25    28    28    278   1047
StrLen1  11    11    12    12    15    15    21    24    24    231   856
StrLen2  12    12    12    13    14    14    24    26    26    269   986
szLen    8     10    12    14    17    19    32    35    37    464   1808
szLen1   5     7     9     11    13    15    28    28    31    371   1443
szLen2   6     7     9     12    13    15    30    30    31    408   1580
szLen3   5     8     9     10    13    15    28    33    30    413   1589
szLen4   6     7     9     10    13    14    27    30    31    403   1553


Notice that StrLen1 and StrLen2 changed dramatically from the previous version. I carefully removed the new function FStrLn2 from the test and commented the procedure definition so it wouldn't be loaded into memory and the StrLen functions went back to their previous slower timings. It's unusual and I repeated the test by carefully editing the code to put my updated function in and they when back to the more likely speeds that we see here. I don't understand how the position in memory can affect the timing this dramatically.


[attachment deleted by admin]

hutch--

Phil,

In the short term, try aligning each procedure entry at align 16 and if that does not help, try another dirty trick,


REPEAT 1024
  nop
ENDM


After each ALIGN 16 but before the procedure. The idea is to isolate each procedure.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

QuoteI should point out that this procedure isn't always safe. If the final null byte of the string is at the end of an allocated memory segment then the last dword mov could cause a general protection fault if the string wasn't aligned on a 4-byte boundry.

It should also be noted that these routines do not work on ascii character > 128

For example, change test string 5 to

    str5        db 'X',130,'XXX',0   ; was 'XXXXX',0

run the test and look at the length printed in the headers.

Phil

JimG: Thanks for pointing out that many of these functions only work with 7-bit ASCII. I hadn't noticed that until you mentioned it.

Hutch: Thanks for the suggestions on issolating the functions. I'll try the align 16 and modifiy this post later with the results.

donkey

I wrote this some time ago, it is for exceeding long strings (>128bytes) using MMX. Note that it is assumed the the entry of the proc is on a paragraph boundary, something that is gauranteed if it is in a static lib using GoAsm. Also the string must start on a 16 byte boundary.

lszLenMMX FRAME pString

mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1
nop ; fill pxor to 4 bytes

: ; this is aligned to 16 bytes
movq mm0,[eax]
pcmpeqb mm0,mm1
add eax,8
pmovmskb ecx,mm0
or ecx,ecx
jz <

sub eax,[pString]

bsf ecx,ecx
sub eax,8
add eax,ecx

emms

   RET

ENDF
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

Ratch

To the Ineffable All,
     Perhaps all those interested should peruse this link. Ratch 
http://board.win32asmcommunity.net/index.php?topic=16299.msg128369;topicseen#msg128369

Jimg

Very nice Ratch, the fastest one that works on all ascii characters so far.

I couldn't leave good enough alone, of course, so I played a little with your code.  First my appologies for messing with your code, I don't get along well with loops and repeats and such, so I coverted it to brute force code to play with.

I'm getting some strange results.  My first try only cut a few cycles off  ( proc RatchX), but then I inserted a nop in preparation for some other tests, and on my screwy athlon, it made the code run much faster on the large string (proc RatchX2).  The nop misaligned the main loop, so it should have made it run much slower.  Here are my results:
Bytes    0     1     2     3     4     5     15    16    31    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
StrLen2  11    11    11    11    13    14    22    25    36    283   927
Ratch    8     11    12    15    10    14    26    21    39    243   875
RatchX   8     10    13    13    13    14    22    22    35    234   864
RatchX2  8     11    14    14    14    14    20    21    35    217   800

Press enter to exit...

  What's going on here.

If someone with an intel chip would try this out to see if there is a similar effect, I'd appreciate it.

[attachment deleted by admin]

Phil

#24
Here are the results on a 996 MHz P3
Bytes    0     1     2     3     4     5     15    16    31    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
StrLen2  9     11    12    13    13    15    24    42    56    301   1047
Ratch    18    25    32    39    22    29    49    31    72    258   893
RatchX   18    24    31    33    21    29    41    31    63    253   884
RatchX2  19    23    30    33    20    27    41    33    68    247   881



And a complete pass for all procedures included in this thread.
Proc/Bytes    0    1    2    3    4    5   15   16   17   31 1023
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
FStrLen       7    7   10    8   18   10   17   19   19   43  786
FStrLn2       7    8    9    8   13   11   15   19   20   43  786
Ratch        18   25   32   39   22   29   49   31   38   72  893
RatchX       18   24   31   33   21   29   41   31   38   63  883
RatchX2      19   23   30   33   20   27   41   33   37   68  882
StrLen       15   15   16   16   18   18   25   28   28   53 1047
StrLen1      12   12   12   12   16   16   23   25   25   53 1042
StrLen2       9   11   11   13   13   15   24   39   27   56 1047
szLen         8   10   12   14   17   19   32   49   48   73 1808
szLen1        5    7    9   11   13   15   28   30   32   65 1566
szLen2        6    7    9   11   13   15   29   43   45   67 1580
szLen3        6    6   10   10   12   14   27   51   72   65 1571
szLen4        5    7    8   10   14   14   27   30   46   63 1552
lstrlenA     48   41   44   45   48   62   91   94   97  139 3117


Also, I defined a fld$ macro that produces a fixed length field and allows right alignment. The code is list driven so it is easier to modify. The complete source is included in the zip but here are the defining elements that show the idea:
    include     \masm32\include\masm32rt.inc    ; defaults to .386

.586

    include     timers.asm      ; requires a pentium

    LOOPCOUNT equ 1000000

    PROCS TEXTEQU <FStrLen,FStrLn2, \
                   Ratch,RatchX,RatchX2, \
                   StrLen,StrLen1,StrLen2, \
                   szLen,szLen1,szLen2,szLen3,szLen4, \
                   lstrlen >

    SIZES TEXTEQU <0,1,2,3,4,5,15,16,17,31,1023>

    %FOR proc,<PROCS>
         proc proto :DWORD
    ENDM

    MAXWIDTH  equ 20
    HDRWIDTH  equ 10
    COLWIDTH  equ 5

.data

    %FOR len,<SIZES>
         align 16
         str&len& db len dup ('X'),0
    ENDM

fld$ MACRO DDpointer,DDwidth,DDalign:=<0>
    LOCAL rvstring
    .data
        rvstring db MAXWIDTH+4 dup (0)
        align 16
    .code
    invoke sxFld,reparg(DDpointer),ADDR rvstring,DDwidth,DDalign
    EXITM <ADDR rvstring>
ENDM

.code
start:

    print fld$(chr$("Proc/Bytes"),HDRWIDTH)
    %FOR len,<SIZES>
        invoke StrLen,ADDR str&len&
        print fld$(ustr$(eax),COLWIDTH,1)
    ENDM
    print chr$(13,10)

    print fld$(chr$("=========="),HDRWIDTH)
    %FOR len,<SIZES>
        invoke StrLen,ADDR str&len&
        print fld$(" ====",COLWIDTH)
    ENDM
    print chr$(13,10)

    %FOR proc,<PROCS>
        print fld$(chr$("&proc&"),HDRWIDTH)
        %FOR len,<SIZES>
            counter_begin LOOPCOUNT, HIGH_PRIORITY_CLASS
                invoke proc,ADDR str&len&
            counter_end
            mov ebx,eax
            print fld$(ustr$(ebx),COLWIDTH,1)
        ENDM
        print chr$(13,10)
    ENDM
    mov   eax,input(13,10,"Press enter to exit...")
    exit

[attachment deleted by admin]

hutch--

Here are the timings on my Prescott PIV.


Bytes    0     1     2     3     4     5     15    16    31    255   1023
         ===== ===== ===== ===== ===== ===== ===== ===== ===== ===== =====
StrLen2  6     9     8     6     10    9     27    25    38    302   862
Ratch    1     1     13    4     4     6     19    13    43    260   910
RatchX   -1    0     1     1     7     3     16    13    30    256   910
RatchX2  -1    2     1     1     3     4     13    13    30    248   882
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Hutch-
Strange. The misaligned code is faster on yours also.  Not as dramatic, but suprising anyway.  Thanks.

Phil

Ratch, RatchX, RatchX2, StrLen1, and StrLen2 all seem to be quite sensitive to alignment. For some reason FStrLen and FStrLen1 were not. The following table was produced by loading 13 different copies of each procedure into memory at various alignments. Before each procedure is:

align 16

repeat PAD
  nop
endm

Ratch&PAD& proc arg:DWORD ; procedure definition and entry begins here.
.... full procedure definition
Ratch&PAD& endp

The &PAD& appends the pad value to the procedure name to avoid duplicate names as the same procedure is loaded with different padding. This is the last of PAD values from 4 to 16. The values are easily alterable in the source if you care to play with them. I was quite surprised because the 16-byte alignment doesn't always seem to be best. Also, the timings are for a 1023 dup('X') string that was used in many of the preceeding tests and the different timings at various alignments help explain wide differences as we would add or remove the 250 ms sleep which changed the alignment.
Pad nops      4    5    6    7    8    9   10   11   12   13   14   15   16
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
Ratch      1067 1066 1332 1077 1335 1333 1333 1334 1078  889 1067  883  893
RatchX     1061 1061 1313 1059 1313 1313 1313 1313 1060  882 1059  872  883
RatchX2    1061 1313 1058 1312 1312 1312 1314 1059  885 1059  879  880  882
StrLen1    1298 1046 1048 1046 1043  870  873  874  859 1046 1046 1043 1043
StrLen2     987  978  977  992 1059 1059 1058 1059 1302 1302 1302 1302 1048


RatchX2 will be fastest when it is preceeded by 'align 16' followed by 14 or 15 'nops'.
StrLen2 is fastest when preceeded by 'align 16' followed by 2 nops, etc.

I don't understand but it is interesting!


[attachment deleted by admin]

Jimg

#28
Nice technique Phil, I kept thinking I was gonna get around to something similar :U

Here are my results, except I changed 16 to zero to test against original configuration:

Pad nops      4    5    6    7    8    9   10   11   12   13   14   15    0
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
Ratch      1058  904 1063 1060 1058 1059 1189 1059  933  892  893  809  873
RatchX     1056  894 1053 1052 1055 1054 1180 1054  929  888  884  801  864
RatchX2     887 1052 1052 1053 1057 1180 1056  926  881  884  798  800  800
StrLen1    1058  931  928  930  927  870  865  866  865 1053 1056 1054 1056
StrLen2     909  910  913  910 1007 1007 1008 1007 1122 1122 1122 1122  930

I would have thought 12 pads on the Ratch procs would have been the best since it would align the loop at 16 byte, but didn't work out that way.  Still a mystery here :dazzled:
Also, I think looping a million times is probably not the best indication of how fast a routine can be used in a normal program.  Perhaps calling each routine once in sequence, loop back, call each one again, etc., keeping a running average?

Phil

#29
JimG: Doing what you suggest would well produce a better indication of how the code might perform in a typical program. However, I think doing it like this is probably better when comparing the raw speed of an algorithm. It produces values which can be reproduced and that is very important when we are trying to tune, or slightly de-tune, our code for greater performance! Like you said, we still have mysteries here. I still don't know how mis-aligning a loop by adding a nop can make it run faster ... but it certainly seems to be the case occasionally!

Donkey: Finally got your MMX strlen for long strings into the test procedure. Sorry it took me so long to get around to it. As you said earlier, the strings must begin at a 16-byte boundry and it gains it's speed using MMX to load and examine 8-bytes at a time. Great job! It certainly seems to fly! I've added some commentary at the end of this block ... please let me know if I've missed anything or mis-understood anything. Here's your routine in MASM format:
align 16

lszLenMMX proc pString:DWORD    ; Donkey's MMX strlen for long strings
.mmx
.xmm
        mov eax,[esp+4]
nop
        nop                     ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
        nop                     ; fill pxor to 4 bytes
pxor mm1,mm1
        nop                     ; fill pxor to 4 bytes

    @@:                         ; this is aligned to 16 bytes
movq mm0,[eax]
pcmpeqb mm0,mm1
add eax,8
        pmovmskb ecx,mm0        ; edit: by Phil - is this Polish instruction?
        or ecx,ecx              ;       ... ah, here's one that i understand!
        jz @B

        sub eax,[esp+4]

bsf ecx,ecx
sub eax,8
add eax,ecx

emms

        ret 1*4

lszLenMMX endp

First, the pxor's clear both mm0 and mm1. Then 8-bytes are loaded into mm0 with movq. Next the index is updated and pcmpeqb compares all 8-bytes to the zeros stored in mm1. If any of the 8-bytes are equal to nul then the corresponding byte in mm0 is set to all ones. Otherwise, the corresponding byte in mm0 is set to all zeros. Finally, the pmovmskb (pmov mask byte) copies the most significant bit of each byte in mm0 into the destination register which is ecx. If any of the bits are set then the end of the zero terminated string has been found. The bsf instruction scans forward (right-to-left) thru the bits in ecx and returns the index of the first bit that was set. The index returned would be 0 if bit 0 was set, indicating that the terminator was in the first (least significant) byte of the 8-bytes being checked. Subtracting 8 and adding the index to the difference between the original and final pointers produces the string length that is returned in eax!

Here are the timings for all routines now included in timesz.asm using the longer strings. Again, many of them only support 7-bit ASCII. Donkey's lszLenMMX procedure nicely handles 8-bit extended ASCII and there is no danger of page boundry over-run at the end because the source string must be aligned on a 16-byte boundry. Thanks for sharing your MMX code with us Donkey!
Proc/Bytes    0    1    2    3    4    5   15   16   17  127  255 1023 2047
========== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====
lszLenMMX    16   16   16   16   16   16   20   22   22   76  122  411  795
FStrLen       7    7   10    8   18   10   17   19   19  113  210  785 1554
FStrLn2       7    8    9    9   18   11   15   19   20  113  210  785 1555
Ratch        18   25   32   39   22   29   49   31   38  148  259  895 1746
RatchX       18   24   31   33   21   29   41   31   38  142  253  884 1721
RatchX2      19   23   31   33   20   27   41   33   37  146  248  881 1722
StrLen       15   15   16   16   18   18   25   28   28  150  278 1047 2071
StrLen1      11   11   12   12   15   15   21   24   24  127  231  857 1701
StrLen2      13   13   12   14   15   15   24   27   27  142  264  978 1929
szLen         8   10   12   14   17   19   32   49   47  241  464 1809 3604
szLen1        5    7    9   11   13   15   28   30   43  210  406 1566 3113
szLen2        6    7    9   12   13   15   29   30   45  212  408 1580 3149
szLen3        6    6   10   10   12   15   27   55   61  213  409 1571 3148
szLen4        5    7    8   10   14   14   26   30   30  208  399 1551 3090
lstrlenA     48   41   44   45   48   62   91   94   97  427  812 3117 6192
scasx        45   48   53   57   59   63  102  105  109  548 1055 4109 8179


The slowest algorithm by far is the 'repnz scasb' cutie that I also added to this version. The files in the zip have the same names as the previous timesz.asm. Be sure to rename your older files if you'd like to save them before you extract from this archive.


[attachment deleted by admin]