News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

Previous topic - Next topic

ic2

QuoteI not taking credit for anything other than finding the link.  Here is where I founded Jens Duttke code.  It's the biggest discussion ever when it come to stlen in asm.  Make sure you have a big pot of coffee ready.  You got to read before you start the new thread.

http://board.win32asmcommunity.net/index.php?PHPSESSID=abcf67ef9a161ce95dd0c8f181663739&topic=4058.0

Jimg

Hmmm.  So all the big guns are just sitting back and grinning at us because they went through all this two years ago???

hutch--

 :bg

I would not lose sleep over treading the same ground, if no-one did it you would never get improvements.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

roticv

 :toothy I remember the thread very clearly, but it was nice to see another discussion on it.  :green2

ic2

You can't expect hutch-- to direct you to threads.  I know for sure he wanted to, i could feel it in his first few posts.

Rule one, search the world first own your own.  My teacher did not give me the test before i studied.  Scientist work from ground up.  If it an new discussion going on do you think it should stop just because of an old dead one.  Life goes on with new and old members from around the world.

I learned more from this thread than the link i founded, read and posted. 

It's the little things that count...  it's all about improvements so i hope you will continue or i will never post any thread ever again to help people who seek improvement.  I know where everything is but this don't mean i know the meaning of it all.  I love searching but i love The Laboratory more.


hutch--

The laboratory is basically a place for bashing algos to death to get them faster or smaller or smarter or whatever else can be done with them. While not everyone has the time to track the discussions in real detail, battering around the edges of algos is the way they have been made faster over time so for those who have the time and the interest, its a worthwhile passtime as it is basically research that is being shared.

I am much easier to please with string length algos, I prefer a classic byte scanner for general purpose work and when I have the luxury of working with aligned data with a buffer that is safely larger then the source in it, I use the Agner Fog versions as it is a good average performer on most hardware.

One thing that is worth stressing with algos pointed at general purpose work is to try them out across different hardware and you learn all of the joys of writing mixed model code that has to perform reasonably on most hardware.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Maelstrom

Dont wanna burst anyones bubble because, of course, Jens code rocked but there was another thread after that...
Jens came in with the original thread.  This routine was improved upon by buliaNaza.  After that Lingo12 rocked the boat with the last one posted to any thread ive seen.

Should be noted that FStrLen is, basically, the exact same routine as Lingo12's...

The continuance to Jens thread was here for reference:
http://board.win32asmcommunity.net/index.php?topic=8330.msg60805#msg60805

Phil

ic2: Wow .. what a link! The thread that just won't die! Looks like FStrLen in this thread came from buliaNaza in 2002 and then modified by Lingo! Small world we live in!


roticv

buliaNaza is the same person as lingo if I am not wrong. Oh well.

ic2

What got me intersted was timelen.
if so im glad he stuck around.  To me buliaNaza is one of THE greatest.
Hope we can see the new results on Intel and AMD and improvements if possible.  Old news was igood news in this case.

Good luck

Codewarp

Phil,

Let me take another crack at how it is that Jens is faster on p3 while szLength is faster on Athlon 32/64.  The code fragments below are from the heart of the algo for both methods (with identical register for this discussion).  The only difference is the NOT ECX vs XOR ECX,EDX in szLength and Jens, respectively.


    ; Jens method
    lea  edx, [ecx-01010101h]
    xor  ecx, edx
    and  ecx, 80808080h
    and  ecx, edx

   ; szLength method
    lea  edx, [ecx-01010101h]
    not  ecx
    and  ecx, 80808080h
    and  ecx, edx


The problem for the Athlon is the LEA/XOR pair has a register dependency, so LEA has to finish before the XOR can start.  The NOT ECX can start at the same time as the LEA, causing its 1 cycle to disappear on every iteration (in Athlons).

The p3 may not be able to take advantage of this opportunity, so its clock cycle mix is determined by other things, like instruction times, pipelining, etc.  This is why it is next to impossible to develop single-routines that execute "the fastest" on a broad range of CPUs.

If you are a p3 fan, then Jens is for you.  If you are fan of multiple execution units and parallel computation, then methods like szLength are for you. You can't make a judgement on this one, without choosing sides and without being biased.  I freely admit to being biased toward using as much of recent advances in CPU architecture as reasonable compatibility will permit, and if there is a better place to draw the line, I would like to hear about it.

Jimg

Ok then...  would someone run the dang thing on a P5??

Codewarp

Jimg -

Here is the timelen for my Athlon64, with the shocking details.  Also, I have attached a new timelen3.zip with a new version of szLength that may do better on the p3.  I am not currently able to rebuild timelen, so if someone could add the new .exe back to this file, I would be grateful.



Running on Athlon64, family.model.step = 15.4.10

Test routines for correctness:
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
FStrLen      0    1    2    3    5    8   13   21   34   55   89  144  233  128
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength     0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLen        0    1    2    3    5    8   13   21   34   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999

Proc/Byte    0    1    2    3    5    8   13   21   34   55   89  144  233  999
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

Misaligned by 0 bytes:
lszLenSSE   19   19   19   21   19   22   22   25   30   35   69   89  123  408
FStrLen      4    6    8    7    9   10   15   21   48   63   89  129  199  117
Ratch        4    7   14   16   10   10   16   22   34   70   93  132  201  779
szLength     6    6    6    6    9   12   13   18   27   59   83  126  182  708
szLen        3    7    8    9   16   20   49   63   88  129  197  309  494 2019
Jens_fast   17   17   17   17   18   19   21   26   33   65   92  134  200  858

Misaligned by 1 bytes:
lszLenSSE   19   20   19   19   19   25   25   28   33   38   71   91  125  411
FStrLen      4    6    8    7    9   10   15   21   52   68   92  131  199  120
Ratch        4    7   14   16   10   11   16   22   34   79   93  133  201  784
szLength    12   12   12   12   12   17   17   22   34   64   84  119  174  655
szLen        3    7    8    9   16   20   48   63   88  129  197  309  485 2020
Jens_fast   17   17   17   17   21   22   23   32   37   71   95  140  203  884

Misaligned by 2 bytes:
lszLenSSE   19   19   19   20   19   25   25   28   33   38   71   91  125  409
FStrLen      4    6    8    7    9   10   15   21   52   68   92  131  199  120
Ratch        4    7   14   16   10   11   16   22   34   76   93  133  201  784
szLength    12   12   12   12   12   17   17   22   52   64   84  119  174  655
szLen        3    7    8    9   16   20   47   63   88  129  197  309  485 2019
Jens_fast   17   17   17   17   21   22   23   32   37   71   95  140  203  883

Misaligned by 3 bytes:
lszLenSSE   19   20   19   19   19   25   25   29   33   38   71   92  125  409
FStrLen      4    6    8    7    9   10   17   21   52   68   92  131  199  120
Ratch        4    7   14   16   10   11   16   22   34   76   93  133  201  784
szLength    12   12   12   12   17   17   22   27   52   64   87  119  177  655
szLen        3    7    8    9   16   20   49   63   88  129  197  309  486 2019
Jens_fast   17   17   17   17   21   22   23   32   37   71   95  140  203  885




[attachment deleted by admin]

Jimg

Thanks.  What I meant was you and I both have athlons, and Phil has the P3.  We need someone with a later pentium to run the timings.

You new routine runs about 20 cycles faster for the 999 string on my Athlon.  Good job!

hutch--

2.8 gig Prescott.


Test routines for correctness:
lszLenSSE    0    1    2    3    5    8   13   21   34   55   89  144  233  999
FStrLen      0    1    2    3    5    8   13   21   34   55   89  144  233  128
Ratch        0    1    2    3    5    8   13   21   34   55   89  144  233  999
szLength 4294942949    2    3    5    8   13   21   34   55   89  144  233  998
szLen        0    1    2    3    5    8   13   21   34   55   89  144  233  999
Jens_fast    0    1    2    3    5    8   13   21   34   55   89  144  233  999

Proc/Byte    0    1    2    3    5    8   13   21   34   55   89  144  233  996
========= ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ====

Misaligned by 0 bytes:
lszLenSSE   1642949   11   11   25   15   16   31   28   47   53  142  204  663
FStrLen      2    3    5   15    6    5    9   30   33   65  109  153  234  150
Ratch        1    1   18    8    8    5    1   17   37   74  123  156  218  944
szLength     1    1    142949   12    9   25   24   29   63  107  144  221  841
szLen    42949    3    5   27   11   17   39   69  110  180  216  353  527 2186
Jens_fast42949    142949   11    4   33   26   42   59   73   79  154  210  967

Misaligned by 1 bytes:
lszLenSSE   13   11    0   11   11   43   15   31   38   56   80  178  235  794
FStrLen      3   12    7    3    5   18   18   23   43   64  127  194  264  173
Ratch    42949    1    5    742949   46   25   18   34   73  143  206  309 1156
szLength     942949   10   15   22   16   20   64   31   62  106  149  227  870
szLen    42949    5    842949   11   18   18   81   95  151  217  326  538 2196
Jens_fast    3    04294942949    4    5   37   42   92  140  184  179  300 1233

Misaligned by 2 bytes:
lszLenSSE   14   11   24   11   11   28   17   33   41   37   85  174  261  787
FStrLen      3    3   16    2    5    5   21  112   28   66  138  170  291  208
Ratch        1    1   16    7    7    5   21   23   35   49  149  183  339 1163
szLength     9   10   15    1   10   13   27   52   59   50  106  148  204  891
szLen       11    3    7    6   29   18   27   83  107  171  205  348  529 2280
Jens_fast    3   114294942949    3    5   24   71   50   71  127  208  281 1170

Misaligned by 3 bytes:
lszLenSSE   53   24   13   11   23   49   18   30   27   54  119  192  236  782
FStrLen      3    4    5    242949    4   11   32   35   66  176  192  313  167
Ratch       13    2    5    7   19    7   12   17   16   78  155  194  320 1170
szLength     9   12   22   10   15   25   22   25   57   52  109  153  221  861
szLen    42949    5    7   20   11   18   39   68  106  182  216  348  563 2324
Jens_fast    342949    442949    3    6   41   40   60  109  139  172  264 1176
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php