szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

tetsu-jp

Thanks, on occasion, i allow people simply to call me "Alex".

I'm honestly interested in such a benchmark, because I wrote such a program myself in 1997.

Unfortunately, due to my life circumstances, I have lost a lot of source code.

this is what i wrote 5 years ago to get string length: http://www.masm32.com/board/index.php?topic=1807.msg82540#msg82540

And i am thinking to write a benchmark (again), for strlen, memcopy and the like,

including 64bit!

I'm not an assembly professional, let's say intermediate; the largest source I've ever produced was about 300K.

my purpose in visiting the forum is to improve my skills, and to have some fun!

so I could really write a benchmark using MASM, if people ask me to do it.

simply cheating the cache, always accessing the same string, is not serious testing.

there was an IBM service program that tested upwards, downwards, in certain steps, backwards, randomly, and with twenty other options!
I don't think it just accessed one fixed location.
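To make the point about access patterns concrete, here is a minimal sketch in C (not MASM, so it compiles anywhere). It is not the IBM program and not the testbed from this thread. All names (my_strlen, make_pool, walk, time_walk) and the sizes are made up for illustration, and the 8 MB pool is chosen on the assumption that it is larger than the cache of the CPU under test.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SLOT    128u    /* bytes reserved per test string */
#define COUNT   65536u  /* number of strings; must be a power of two */
#define STR_LEN 100u    /* length of every test string */

/* Stand-in for the routine under test (hypothetical name). */
size_t my_strlen(const char *s) {
    const char *p = s;
    while (*p) p++;
    return (size_t)(p - s);
}

/* Start of string number idx: misaligned by 0..15 bytes inside its slot. */
const char *str_at(const char *pool, size_t idx) {
    return pool + idx * SLOT + (idx & 15u);
}

/* Build an 8 MB pool holding COUNT strings of STR_LEN bytes each. */
char *make_pool(void) {
    char *pool = malloc((size_t)SLOT * COUNT);
    if (pool == NULL) return NULL;
    memset(pool, 'x', (size_t)SLOT * COUNT);
    for (size_t idx = 0; idx < COUNT; idx++)
        pool[idx * SLOT + (idx & 15u) + STR_LEN] = '\0';
    return pool;
}

/* Visit every string once. step = 1 walks upwards, step = COUNT - 1 walks
   downwards, and any other odd step gives a scattered order. */
size_t walk(const char *pool, size_t step) {
    size_t total = 0, idx = 0;
    for (size_t i = 0; i < COUNT; i++) {
        total += my_strlen(str_at(pool, idx));
        idx = (idx + step) % COUNT;
    }
    return total;
}

/* Time one walk in seconds of CPU time. The sum of lengths is handed back
   through *total so the work has a visible result. */
double time_walk(const char *pool, size_t step, size_t *total) {
    clock_t start = clock();
    *total = walk(pool, step);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_walk with steps 1, COUNT - 1 and some other odd number (12345, say) on a pool from make_pool() gives upward, downward and scattered timings that can be compared with a loop hammering one fixed string.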

so all the features i've listed above must be implemented!
I can do this... but I am not a pro, so it is uncertain when this is going to happen.
for instance, i do not use the "pro" string length algorithms introduced in this thread (some of them would make sense for certain applications).

It would be a research project to document REP SCASB (SCAS) performance across all CPUs over the years. I've read that it degraded a little on the Pentium, but recently there might have been new implementations (on AMD CPUs).

I can't do it, since I do not have many different computers. someone here might be able to create such software,
with 100s of options, and donate it to the community!

what i think is that alignment is not so relevant anymore (though it can cause some trade-off).

NightWare

Quote from: BeeOnRope on April 02, 2009, 12:31:54 AM
Even if you don't believe it, it doesn't answer the point about interaction with legacy or proprietary APIs that you cannot change.
what am i supposed to answer? laws are what they are (i haven't defined them). all i can say is: life is made of choices, nothing else, and you must accept the results of those choices... so IF a work doesn't follow YOUR SPECIFICATIONS, the work is supposed to be refused; IF NOT, the work has been done correctly!
IF, later, you want modifications, then ask the developers, and pay for them... it's the normal PRICE to pay when you don't code your apps yourself...

FORTRANS

Hi,

Quote: but maybe out of your head

; - - - String length routine.  - - -
; Use SCASB to find a C style string's length,
; 3 April 2009, SRN
StrLenS:
        CLD                     ; Search forward.
        MOV     EDI,OFFSET Test_Str ; Point destination index to string buffer.
        MOV     ECX,Limit       ; Maximum string length.
        MOV     AL,0            ; Character to search for.
  REPNE SCASB                   ; Scan for character.
        MOV     EAX,Limit
        SUB     EAX,ECX         ; Return length in EAX (includes the zero).

        RET


   Or some such.

Cheers,

Steve N.
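For readers who do not speak assembly, here is a minimal C sketch of what the routine above computes, as I read it: the number of bytes scanned up to and including the zero, capped at the limit. The names scan_count and bounded_len are made up for illustration and do not come from the thread.

```c
#include <stddef.h>
#include <string.h>

/* Bytes scanned, as StrLenS returns it: the terminator is counted.
   If no zero byte lies within the first `limit` bytes, `limit` comes back. */
size_t scan_count(const char *s, size_t limit) {
    const char *z = memchr(s, 0, limit);
    return z ? (size_t)(z - s) + 1 : limit;
}

/* Conventional strlen-style result: the terminator is not counted.
   *found is set to 0 when the scan ran into the limit first. */
size_t bounded_len(const char *s, size_t limit, int *found) {
    const char *z = memchr(s, 0, limit);
    *found = (z != NULL);
    return z ? (size_t)(z - s) : limit;
}
```

Note that scan_count returns the same value for "zero found in the last allowed byte" and "no zero found at all". In the assembly version the zero flag after REPNE SCASB tells the two cases apart; in the sketch that job is done by the found flag.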

jj2007

Quote from: FORTRANS on April 03, 2009, 03:53:23 PM
Hi,

Quote: but maybe out of your head

; - - - String length routine.  - - -
; Use SCASB to find a C style string's length,
; 3 April 2009, SRN
StrLenS:
        CLD                     ; Search forward.
        MOV     EDI,OFFSET Test_Str ; Point destination index to string buffer.
        MOV     ECX,Limit       ; Maximum string length.
        MOV     AL,0            ; Character to search for.
  REPNE SCASB                   ; Scan for character.
        MOV     EAX,Limit
        SUB     EAX,ECX         ; Return length in EAX (includes the zero).

        RET


   Or some such.

Cheers,

Steve N.

Thanksalot, Steve :bg

Quote from: tetsu-jp on April 02, 2009, 02:43:04 PM

and I've read the new AMD manuals, the REP SCAS is explicitly recommended for small strings!

so, would you include it, and show the result for REP SCASB as well?

tetsu-san,

following your request, I have added Steve's code to the testbed, see attachment and timings below. Now we are of course curious how your code will perform on your AMD, and how you will optimise it.


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
codesizes: strlen32s=132, strlen64B=84, NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16434 bytes
StrLenS (FORTRANS)   68312 cycles
strlen32s            3019 cycles
strlen64LingoB       3037 cycles
NWStrLen             3061 cycles
_strlen (Agner Fog)  4444 cycles

-- test 4k, misaligned 11, 4096 bytes
StrLenS (FORTRANS)   17029 cycles
strlen32s            768 cycles
strlen64LingoB       770 cycles
NWStrLen             789 cycles
_strlen (Agner Fog)  1142 cycles

-- test 1k, misaligned 15, 1024 bytes
  Masm32 lib szLen   1362 cycles
  crt strlen         1012 cycles
StrLenS (FORTRANS)   4302 cycles
strlen32s            206 cycles
strlen64LingoB       199 cycles
NWStrLen             215 cycles
_strlen (Agner Fog)  284 cycles

-- test 0, misaligned 0, 100 bytes
  Masm32 lib szLen   136 cycles
  crt strlen         114 cycles
StrLenS (FORTRANS)   471 cycles
strlen32s            30 cycles
strlen64LingoB       25 cycles
NWStrLen             34 cycles
_strlen (Agner Fog)  37 cycles

-- test 1, misaligned 1, 100 bytes
  Masm32 lib szLen   138 cycles
  crt strlen         127 cycles
StrLenS (FORTRANS)   473 cycles
strlen32s            28 cycles
strlen64LingoB       27 cycles
NWStrLen             34 cycles
_strlen (Agner Fog)  35 cycles

-- test 5, misaligned 5, 15 bytes
  Masm32 lib szLen   26 cycles
  crt strlen         29 cycles
StrLenS (FORTRANS)   125 cycles
strlen32s            6 cycles
strlen64LingoB       5 cycles
NWStrLen             17 cycles
_strlen (Agner Fog)  14 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   27 cycles
  crt strlen         26 cycles
StrLenS (FORTRANS)   124 cycles
strlen32s            7 cycles
strlen64LingoB       3 cycles
NWStrLen             15 cycles
_strlen (Agner Fog)  14 cycles



[attachment deleted by admin]

herge

 Hi jj2007:

The results from here:

Friday, April 03, 2009 3:32 PM
Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32s=132, strlen64B=84, NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16434 bytes
strlen32s            1522 cycles
strlen64LingoB       1231 cycles
NWStrLen             1334 cycles
_strlen (Agner Fog)  2844 cycles

-- test 4k, misaligned 11, 4096 bytes
strlen32s            395 cycles
strlen64LingoB       322 cycles
NWStrLen             348 cycles
_strlen (Agner Fog)  735 cycles

-- test 1k, misaligned 15, 1024 bytes
  Masm32 lib szLen   1071 cycles
  crt strlen         629 cycles
strlen32s            111 cycles
strlen64LingoB       85 cycles
NWStrLen             111 cycles
_strlen (Agner Fog)  182 cycles

-- test 0, misaligned 0, 100 bytes
  Masm32 lib szLen   107 cycles
  crt strlen         69 cycles
strlen32s            17 cycles
strlen64LingoB       11 cycles
NWStrLen             18 cycles
_strlen (Agner Fog)  21 cycles

-- test 1, misaligned 1, 100 bytes
  Masm32 lib szLen   105 cycles
  crt strlen         100 cycles
strlen32s            17 cycles
strlen64LingoB       11 cycles
NWStrLen             18 cycles
_strlen (Agner Fog)  21 cycles

-- test 5, misaligned 5, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         17 cycles
strlen32s            5 cycles
strlen64LingoB       1 cycles
NWStrLen             8 cycles
_strlen (Agner Fog)  7 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   19 cycles
  crt strlen         16 cycles
strlen32s            4 cycles
strlen64LingoB       2 cycles
NWStrLen             9 cycles
_strlen (Agner Fog)  7 cycles
-- Hit X Key --


Regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

tetsu-jp



how can the exe file be produced? i tried with the include file from the previous attachment,
but all i get is a blank line and then the command prompt (and the exe file is locked).

i have copied ML.EXE from the VC directory (it is version 9.0); assembling works,
and linking works too.

but the program does not work correctly!

what am i doing wrong???
I've just started with MASM32!
any idea why it does nothing?

and yes, REP SCAS is slower...

if i can get the source working, I'll try SCASW, SCASD, and SCASQ (should be faster).

I have tried both linkers, the original MASM32 one and the one from the VC directory: the exe file is 43520 bytes
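One thing to keep in mind with the wider scans: REPNE SCASD with EAX = 0 stops at a dword whose four bytes are all zero, so it only finds a C string's terminator if the string is padded with zeros up to a dword boundary. A minimal C sketch of the difference follows, together with the well-known bit trick for testing whether a 32-bit word contains any zero byte. The function names are made up for illustration, and strlen_by_dword assumes the buffer is readable for up to three bytes past the terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* What a dword compare against 0 tests: all four bytes are zero. */
int dword_is_zero(uint32_t v) {
    return v == 0;
}

/* Nonzero if at least one of the four bytes of v is zero. */
int has_zero_byte(uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* strlen that examines four bytes per step.
   Assumption: up to three bytes past the terminator may be read. */
size_t strlen_by_dword(const char *s) {
    size_t i = 0;
    for (;;) {
        uint32_t v;
        memcpy(&v, s + i, sizeof v);    /* load that is safe when unaligned */
        if (has_zero_byte(v)) break;
        i += 4;
    }
    while (s[i] != '\0') i++;           /* locate the exact byte */
    return i;
}
```

For the dword 0x61006161 (one zero byte in the middle), dword_is_zero says no while has_zero_byte says yes, which is exactly the case a plain SCASD scan would run past.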

jj2007

> how can the exe file be produced?

Did you choose CONSOLE assembly? I use RichMasm, which autodetects console/windows, but in other IDEs you might need to specify that explicitly.

tetsu-jp

I can assemble the supplied MASM32 examples, both via IDE, and via CLI:

-using the supplied .BAT file
-typing the command directly ~(ARGHH ..... this can work via copying binaries into the work directory.

so all this works, but the .EXE does not do anything. something is not set up right.
I've removed the .EXE, and it is freshly generated, so assembler and linker work.

EDIT: it works now! as i guessed, the options had not been set up correctly; MASM32 just performs a plain call.

well, i had some fun with AZTEC C in a similar manner (it requires a small file from a commercial SDK, and one disk is slightly defective, so people who don't know that can try forever).

2 hours or 3 hours (I did other things as well).

by the way, the thumbnail is 70Kbyte, and the fullsize PNG just 16K



so please, include information about how to build this project. not everyone can read your mind.

tetsu-jp

there are many other threads.

so i think your code (strlen) will be gently skipped (by me).
the problem was there is no makefile.

i wanted the timing for the SCAS, and that's the point.
someone already added it.

by the way i can understand most of the strlen sources, thanks.

it's really a waste of time to write you a reply but here you go.

tetsu-jp

1. You are new in assembly ->"I've just started with MASM32!"



you don't read carefully. i used MASM32 before, and wrote other assembly programs as well.

i just...had a break of 5 years.

i hope..you will not experience "the beans" in your life.
some people...just experience it, you know.

PS: it works now, see screenshot.

so the correct spelling is: I've just started with MASM32 (again) on a new machine...after a break of 5 years (not using assembly language).

ecube

tetsu-jp,
while lingo's personality is strong and he can be very direct, he is one of, if not the most gifted assembly programmer on this forum/anywhere. Rarely can anyone write faster code than him, which signifies that he has a deep underlying understanding of the system, so keeping that in mind, and what he said to you, i'd listen to him. The Gensis project is aimed at helping people get a quick start with MASM, and i'm sure they help with assembly questions in general.

NightWare

hmm, the laboratory is certainly not the appropriate place, yes.
however, his comment concerning SCASB isn't totally wrong. it is slower, yes, but it uses a hack to avoid branch misprediction (similar to movcc), so it WAS faster for small strings... unfortunately, simd instructions were introduced later, and of course the speed difference has changed... yep, things must always be put back in their context...


tetsu-jp

so he's an assembler wiz.

I've run a few tests, and notice differences each time the program runs.
large differences, up to 30 percent (with no modification).

also i have modified the code for REP SCASD (within cache), and now the difference is only 4x.

it is OK that you have been the pros for years, if not decades, but who can deal with you?

i am willing to do it, and i can understand all of the source code, no worry.

don't understand your trouble.

i do not have general assembly questions, and just to work with the examples supplied,
there would be no need to deal with the forum.
it is just for fun, i do not use assembly for commercial projects.

so i have added SCASW and SCASD, and found a 30% difference each time the program is started.
so the numbers are not very reliable: performance can depend on many factors (usually there are more programs running at the same time, occupying the cache and all that).
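One common way to tame that kind of run-to-run noise is to repeat the measurement several times and keep the fastest run, on the assumption that other programs can only make a run slower, never faster. A minimal sketch in C follows. The names time_once and best_of are made up, and clock() is only a portable stand-in here; I do not know what timer the testbed in this thread uses.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Time `iters` back-to-back calls of fn(s), in seconds of CPU time. */
double time_once(size_t (*fn)(const char *), const char *s, long iters) {
    volatile size_t sink = 0;   /* keeps the results from being discarded */
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        sink += fn(s);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Repeat the measurement `runs` times and keep the fastest one. */
double best_of(size_t (*fn)(const char *), const char *s, long iters, int runs) {
    double best = time_once(fn, s, iters);
    for (int r = 1; r < runs; r++) {
        double t = time_once(fn, s, iters);
        if (t < best) best = t;
    }
    return best;
}
```

For example, best_of(strlen, text, 1000000, 10) times the C library strlen ten times and reports the quickest run. An optimising compiler may still simplify calls to functions it knows well, so results for library routines should be read with care.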

but yes, you are the small group of pros, and you know that SCAS is ten times slower, 15 times slower.

i guess, SCASQ is just two times slower, in some contexts.

but saves people from artistic code (which also can be good).



this is the result for REP SCASD, cache=0

tetsu-jp

Quote from: E^cube on April 03, 2009, 10:34:36 PM
tetsu-jp,
while lingo's personality is strong and he can be very direct, he is one of, if not the most gifted assembly programmer on this forum/anywhere. Rarely can anyone write faster code than him, which signifies that he has a deep underlying understanding of the system, so keeping that in mind, and what he said to you, i'd listen to him. The Gensis project is aimed at helping people get a quick start with MASM, and i'm sure they help with assembly questions in general.

I'm not gifted...you can clearly see that, i need two hours to get the source working!
but makefile is no shame, you could call a makefile:
providing "deep underlying system understanding" for people who don't have it for some reason
(for instance, they can not read the brains of a small insider group).

there are people who h&te assembler, or refuse to deal with it completely. well i like it, but I understand why.

because of the assembler wiz, who simply does not know that his world is just a special case, and not the real world.

someone wrote "the strlen algorithms are not used in commercial applications"?
i think i've read this 3 pages ago.
so the wiz status is just to show off, in reality, the numbers are different.

yes, i like this SSE2 stuff, and will read all your code.

NightWare

Quote from: tetsu-jp on April 03, 2009, 11:26:44 PM
someone wrote "the strlen algorithms are not used in commercial applications"?
i think i've read this 3 pages ago.
so the wiz status is just to show off, in reality, the numbers are different.
no, i've said it's never used in SERIOUS apps, and that's not exactly what commercial applications are... by this you "should" have understood: national app/database systems for administrations, the army, etc... you've just missed another chance to keep some credibility...  :(