szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

Ratch

I optimized my version of STRLEN on a 32-bit Athlon. I refuse to chase a peculiar hardware speed with software. I try to code using rules that apply to almost every processor, and ignore the anomalies. You can't optimize everything all the time.  Ratch

Jimg

Codewarp--

I've found a small problem with your latest version.  Please check the lengths reported rather than the cycle times.  It has something to do with high ASCII and/or a string following the test string, not sure which.

[attachment deleted by admin]

Codewarp

Jimg--

No, it had to do with my a-little-too-quick transcription of the latest changes--sorry :red.  This one should be fixed now--but there's no telling what else I've broken if I can't type it straight... :wink.

[attachment deleted by admin]

Jimg

Codewarp-

Still a small problem on my machine:
Test routines for correctness:
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength     0    0    0    0    4    8   12   20   32   52   88  144  232  996
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength    -1   -1   -1    3    3    7   11   19   31   55   87  143  231  999
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength    -2   -2    2    2    2    6   10   18   34   54   86  142  230  998
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength    -3    1    1    1    5    5   13   21   33   53   89  141  233  997
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999

The previous version is working and still seems to be the fastest non-SSE routine, even on Hutch's Pentium :wink
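
A check like the one tabulated above can be sketched in C. This is only an illustration, not the timelen.asm test code: `count_length_mismatches` and `truncating_len` are hypothetical names, and the high-ASCII fill byte and the trailing string are assumptions chosen to cover the two suspicions raised earlier in the thread. The deliberately broken candidate rounds the length down to a multiple of 4, which reproduces the first szLength row shown above (0 0 0 0 4 8 12 20 32 52 88 144 232 996).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

/* Hypothetical broken candidate: rounds the true length down to a
   multiple of 4, like a dword scanner that never refines its result. */
static size_t truncating_len(const char *s)
{
    return strlen(s) & ~(size_t)3;
}

/* Runs `candidate` over every test length at misalignments 0..3 and
   returns how many (misalignment, length) cases report a wrong length. */
static int count_length_mismatches(strlen_fn candidate)
{
    static const size_t lengths[] = {0, 1, 2, 3, 5, 8, 13, 21, 34, 55,
                                     89, 144, 233, 999};
    static uint32_t storage[280];        /* dword-aligned backing store */
    char *buf = (char *)storage;
    int mismatches = 0;

    assert(candidate != NULL);
    for (size_t mis = 0; mis < 4; ++mis) {
        for (size_t i = 0; i < sizeof lengths / sizeof lengths[0]; ++i) {
            size_t len = lengths[i];
            memset(buf, 0, sizeof storage);
            memset(buf + mis, 0xE9, len);               /* high-ASCII payload */
            memcpy(buf + mis + len + 1, "trailing", 8); /* string after the terminator */
            if (candidate(buf + mis) != len)
                ++mismatches;
        }
    }
    return mismatches;
}
```

Passing the C library's `strlen` should report no mismatches, while `truncating_len` fails on every length that is not a multiple of 4. Checking lengths before cycle counts catches a routine that is fast only because it is wrong.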

Jimg

Hutch-

Are the number glitches (the ones printing  4294942949) repeatable on your machine?  They don't seem to be related to any routine?

hutch--

Jim,

The repeat number is 42949 and it is not consistent across different runs of the test piece. I downloaded Michael's timing code so I could run the test.

The machine is a 2.8 gig Prescott on an 800 meg FSB Intel board with 2 gig of DDR400 and it runs faultlessly, particularly when making timings. It may be worth getting someone else with a reasonably late Pentium to test it as well.
Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

Codewarp

Jimg --

I guess this is what I get for keeping separate versions of szLength( ) code for c++ and masm... This one is supposed to work ::).  It has another cycle knocked out of every JNZ FOUND (with yet another align 4), and another cycle evaporated by replacing:

    OR EDX, [EAX]

with:

    MOV ECX, [EAX]
    OR EDX, ECX

then hiding the MOV in the shadow of a non-dependent instruction.  Once again, if you wouldn't mind inserting the .exe for this new code and trying again...  :red :red

[attachment deleted by admin]

Jimg

Perfect now :bg

Here are my results, and a copy of the code with an exe for those wanting to try it without building the exe.


Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  233  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
lszLenSSE   25   25   25   25   25   28   28   32   38   47   82  119  166  587
Ratch        8   11   12   15   14   14   20   30   64   77  103  141  219  852
szLength     8    8    9   10   11   15   17   23   34   47   89  134  199  777
Jens_fast   20   20   20   20   21   26   29   36   57   69   99  145  217  925

1 byte misalignment
lszLenSSE   28   28   28   28   29   30   31   33   41   54   92  125  177  617
Ratch        7   10   12   15   18   17   23   32   69   85  108  154  240  952
szLength    13   14   14   16   15   20   19   26   56   67   92  135  201  782
Jens_fast   20   20   21   20   24   28   32   40   62   76  105  153  233  999

2 byte misalignment
lszLenSSE   28   28   28   28   28   30   30   31   42   55   92  124  176  621
Ratch        8   10   11   15   18   16   23   32   69   88  109  155  243  953
szLength    15   13   15   15   15   19   21   29   39   52   92  135  200  783
Jens_fast   19   19   19   21   24   28   32   41   61   77  105  155  235 1002

3 byte misalignment
lszLenSSE   27   27   28   29   28   31   29   35   43   56   91  124  175  626
Ratch        7   11   12   15   18   16   24   32   69   86  110  155  243  953
szLength    13   16   16   15   19   18   24   29   41   52   94  134  202  790
Jens_fast   19   19   19   20   24   28   32   40   61   75  104  153  230  995




[attachment deleted by admin]

hutch--

This looks a lot better, I just ran the EXE and there are no "funny" numbers.

PIV Prescott 2.8 gig, 800 meg FSB board with 2 gig of DDR400.


Test routines for correctness:
0 byte misalignment
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  233  999
szLength     0    1    2    3    5    8   13   22   39   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  233  999
1 byte misalignment
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  233  999
szLength     0    1    2    3    5    8   13   22   39   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  233  999
2 byte misalignment
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  233  999
szLength     0    1    2    3    5    8   13   22   39   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  233  999
3 byte misalignment
lszLenSSE    0    1    2    3    5    8   13   22   39   55   89  144  233  999
Ratch        0    1    2    3    5    8   13   22   39   55   89  144  233  999
szLength     0    1    2    3    5    8   13   22   39   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   22   39   55   89  144  233  999

Proc/Byte    0    1    2    3    5    8   13   22   39   55   89  144  233  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

0 byte misalignment
lszLenSSE   31   11   11   23   12   15   27   20   26   49   54  154  193  665
Ratch        2    1    5   -4   10    5   22   23   41   71  118  146  207  900
szLength   -10    1    1    2   17    9   15   31   42   66  102  145  237  895
Jens_fast    9   -1   -1    6    4    5   32   40   69   74   79  152  214  994

1 byte misalignment
lszLenSSE   13   25   16   13   23   16   15   29   24   59   80  174  237  779
Ratch        1   16    6    7    7   -6   13   63   48   79  135  196  310 1180
szLength    10   22    9   17   23   16   19   34   40   61  107  160  227  860
Jens_fast    4    0   -1   -1   -7    6   12   53   69   99  144  172  269 1173

2 byte misalignment
lszLenSSE   25   12   13   12   11   15   53   21   24   57   85  168  247  788
Ratch       12    2    6    4   22    6   35   64   42   67  136  204  310 1171
szLength    21   11   19   10   -1   19   18   16   37   71  107  149  226  885
Jens_fast    3   -1  -12   -1    3    7   62   40   69   85  126  193  282 1184

3 byte misalignment
lszLenSSE   27   11    0   12   12   27   19   20   48   43  126  200  281  813
Ratch        2    1    6   -5   11    5   26   22   86   59  136  219  291 1140
szLength    -2   17   10   21   15   14   31   31   77   86  132  149  231  837
Jens_fast  -10   -1    0   -1   14    5   34   51   94   92  116  182  252 1169

Phil

Looking great guys! Hutch, the funny numbers that you saw in the previous run, 42949's, were an attempt to display unsigned numbers in a 5 character field when print ustr$(ebx) is working correctly. The timelen.asm source files should be corrected so they use 'print sstr$(ebx)' instead of the unsigned version that I used incorrectly. The 42949 is actually the first 5 digits of -1 when it is displayed as unsigned. My mistake.
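
The effect is easy to reproduce outside MASM. The sketch below is C, not the timelen.asm code, and both function names are hypothetical; it assumes the column simply keeps the first 5 characters of the formatted number. A 32-bit -1 reinterpreted as unsigned is 4294967295, so a 5-character field shows 42949.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats `value` as an unsigned 32-bit decimal and keeps only the first
   `width` characters, mimicking a fixed-width column that truncates. */
static void format_unsigned_column(int32_t value, size_t width,
                                   char *out, size_t out_size)
{
    char digits[16];

    assert(out != NULL && out_size > 0);
    snprintf(digits, sizeof digits, "%" PRIu32, (uint32_t)value);

    size_t n = strlen(digits);
    if (n > width)
        n = width;
    if (n > out_size - 1)
        n = out_size - 1;
    memcpy(out, digits, n);
    out[n] = '\0';
}

/* Formats `value` as a signed 32-bit decimal, the suggested correction. */
static void format_signed_column(int32_t value, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    snprintf(out, out_size, "%" PRId32, value);
}
```

With a cycle count of -1, the unsigned formatter yields "42949" in a 5-character field, while the signed formatter yields "-1".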

JimG and Codewarp: Glad to see you are moving right along here. I'm sorry if I slowed the flow here trying to understand things that are probably still just a bit beyond my abilities at the moment. Thank you all for your patience and help. I'll just keep re-reading the posts and scratching my head occasionally until it makes a little more sense to me. I'm still working on understanding how to cure the register stalls that Hutch had pointed out in some other code that I'm working on ... And, I just understand things a lot better when I see results like these posted that generally agree with the words and symbols that I'm trying to fit into my mind. Thanks again.

Codewarp

Would it be easy/legal/appropriate to incorporate the IdCPU code into the developing standard benchmarking code being used here on the strlen( ) code?  That way, every report says what it is--it would also be cool...

hutch--

Thanks Phil, for a moment I thought my PIV had developed a maths bug.  :bg

roticv

I am quite surprised that the Jens_mmx version is not found in the test bed (too bad the graphs that used to be found there are gone).


strlen proc uses ebx lpString:DWORD ; cpuid overwrites ebx, which callers expect to be preserved
mov eax, 1 ; request CPU feature flags
cpuid ; CPUID instruction (writes eax, ebx, ecx, edx)

;- Pre-Scan to align the string-start ----
mov ecx, lpString
mov eax, ecx
cmp byte ptr [eax], 0
je done
and ecx, 0FFFFFFF8h
add ecx, 8
sub ecx, eax
cmp ecx, 8
je aligned
@@:
inc eax
cmp byte ptr [eax], 0
je done
dec ecx
jnz @B
aligned:
mov ecx, eax
;-----------------------------------------

test edx, 800000h ; test bit 23 to see if MMX available
jz no_mmx ; jump if no MMX is available
pxor mm0, mm0

@@:
movq mm1, qword ptr [ecx]
movq mm2, qword ptr [ecx + 8]
movq mm3, qword ptr [ecx + 16]
movq mm4, qword ptr [ecx + 24]
movq mm5, qword ptr [ecx + 32]
movq mm6, qword ptr [ecx + 40]

pcmpeqb mm1, mm0
pcmpeqb mm2, mm0
pcmpeqb mm3, mm0
pcmpeqb mm4, mm0
pcmpeqb mm5, mm0
pcmpeqb mm6, mm0

por mm1, mm2
por mm3, mm4
por mm5, mm6
por mm1, mm3
por mm1, mm5

add ecx, 48

packsswb mm1, mm1
movd eax, mm1
test eax, eax
jz @B

sub ecx, 48

emms ; Empty MMX state
no_mmx:

@@:
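mov eax, dword ptr [ecx] ; load the next 4 bytes
add ecx, 4

lea edx, [eax - 01010101h] ; subtract 1 from each byte (borrows propagate upward)
xor eax, edx ; bits changed by the subtraction
and eax, 80808080h ; keep only the top bit of each byte
and eax, edx ; ...where the subtracted value has that bit set
jz @B ; zero means none of the 4 bytes was 0

bsf edx, eax ; lowest set bit: 7, 15, 23 or 31

sub edx, 4
shr edx, 3 ; convert to a byte index 0..3

lea eax, [ecx + edx - 4] ; address of the terminating zero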

done:

sub eax, lpString

ret
strlen endp
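
For readers following along in C, the non-MMX tail loop above can be sketched as below. This is a portable illustration rather than a translation of Jens's routine: `zero_byte_mask` and `strlen_dword` are hypothetical names, and the `bsf` step is replaced by a short byte scan so the sketch does not depend on byte order. Like the assembly, it reads whole aligned dwords and may therefore read up to 3 bytes past the terminator, which strict C does not allow on an arbitrary buffer, so treat it as a demonstration of the bit trick only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if at least one of the 4 bytes in x is 0x00.
   Mirrors: lea edx,[eax-01010101h] / xor eax,edx / and eax,80808080h / and eax,edx */
static uint32_t zero_byte_mask(uint32_t x)
{
    uint32_t t = x - 0x01010101u;
    return (x ^ t) & 0x80808080u & t;
}

/* Length of s, scanning 4 bytes at a time once s is dword-aligned. */
static size_t strlen_dword(const char *s)
{
    const char *p = s;

    assert(s != NULL);

    /* Byte-wise pre-scan until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);        /* aligned 4-byte load */
        if (zero_byte_mask(w) != 0) {
            while (*p != '\0')          /* locate the zero within this dword */
                ++p;
            return (size_t)(p - s);
        }
        p += 4;
    }
}
```

Bytes of 80h and above do not trigger the mask on their own, because the final AND requires the byte's top bit to be clear before the subtraction and set after it, which only happens for 00h or for a byte sitting above one that borrowed.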


Bitrake's strlen for AMD Athlon and small strings could not be found either.


StrLen MACRO lpString:REQ
LOCAL _0,_1
mov ecx,lpString
pxor MM0,MM0
pxor MM1,MM1

mov ebx,16
ALIGN 16
_0: pcmpeqb MM1,[ecx+8]
pcmpeqb MM0,[ecx]
nop

add ecx,ebx
packsswb MM1,MM1
packsswb MM0,MM0

movd edx,MM1
movd eax,MM0
or edx,eax

je _0
bsf eax,eax
jne _1
add ecx,8
bsf eax,edx
_1: sub ecx,lpString
shr eax,2

lea eax,[ecx+eax-16]
ENDM

His footnotes say:
"- Instructions packaged/aligned to 8 bytes offer highest decode bandwidth.
- Branch targets aligned to 16 bytes boundaries
- Use when average string is >32 bytes"

Mark Jones

Codewarp, I think MichaelW is working on that. :)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

MichaelW

Quote from: Codewarp on June 24, 2005, 03:03:21 AM
Would it be easy/legal/appropriate to incorporate the IdCPU code into the developing standard benchmarking code being used here on the strlen( ) code?  That way, every report says what it is--it would also be cool...

I am working on it, but I have a problem with obtaining a brand identification string for recent Intel processors that return a brand index of zero. Unlike the AMD processors, where CPUID functions 80000002h-80000004h return a 48-byte processor name string (starting with the K5 Model 1), the Intel processors return a brand string that encodes the rated FSB frequency and the multiplier. The name string is not absolutely necessary, but I would like to provide a nice "friendly" name for all of the recent processors, and I would like to use a method that would not require constant updating. If Intel would just follow in AMD's footsteps for a change :bg
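
For reference, reading the 48-byte name string from CPUID functions 80000002h-80000004h might look like the C sketch below. It is not the IdCPU code: `cpu_brand_string` is a hypothetical helper, and it assumes a GCC or Clang toolchain, whose `<cpuid.h>` header provides `__get_cpuid` and `__get_cpuid_max`. On a non-x86 target, or on a processor that does not implement those functions, it simply reports failure.

```c
#include <assert.h>
#include <string.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Copies the processor name string into out (48 characters plus a
   terminator). Returns 1 on success, 0 if the string is unavailable. */
static int cpu_brand_string(char out[49])
{
    assert(out != NULL);
    memset(out, 0, 49);
#if defined(__i386__) || defined(__x86_64__)
    unsigned int regs[4];

    /* The highest supported extended function must reach 80000004h. */
    if (__get_cpuid_max(0x80000000u, NULL) < 0x80000004u)
        return 0;

    for (unsigned int i = 0; i < 3; ++i) {
        if (!__get_cpuid(0x80000002u + i,
                         &regs[0], &regs[1], &regs[2], &regs[3]))
            return 0;
        /* Each function returns 16 characters in eax, ebx, ecx, edx order. */
        memcpy(out + 16 * i, regs, sizeof regs);
    }
    out[48] = '\0';
    return 1;
#else
    return 0;
#endif
}
```

A benchmark report could print this string in its header when the call succeeds and fall back to family/model numbers otherwise, which avoids maintaining a table of friendly names.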
eschew obfuscation