szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM


BeeOnRope

Quote from: hutch-- on April 01, 2009, 01:37:24 AM
Bee,

I agree that string functions generally need to be fast, particularly when you are doing complex parsing, but I would hold to my original comment that almost all string length requirements are more than adequately handled by the simplest byte scanner using one register. It is very rare to use long strings (> 1 MB), and where you do have an unusual case that has to repeatedly scan strings for their length, you write a different algo. Agner Fog's 1995 DWORD algo is still a very good performer here, but if your task requires it you write a dedicated string length algo that is faster.

This is my favourite type of string length algo.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

slen proc pstr:DWORD

    mov ecx, [esp+4]            ; pstr (no prologue, so the arg sits at esp+4)
    mov eax, ecx
    sub eax, 1                  ; pre-decrement so the loop can pre-increment

  @@:
    add eax, 1
    cmp BYTE PTR [eax], 0       ; scan a byte at a time for the terminator
    jne @B

    sub eax, ecx                ; end minus start = length

    ret 4

slen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
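Calling it is the usual stdcall affair; a tiny usage sketch with a hypothetical buffer name, just to show the stack-balanced interface (`ret 4` cleans the one argument):


.data
buffer db "hello", 0

.code
invoke slen, ADDR buffer    ; length returned in eax (5 here)
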


A technique I regularly use when tokenising a large string is to make a copy of the string if preserving the original matters, then do a single in-place tokenising pass over the data, overwriting each line terminator with a zero and writing the start of each line to an array. This leaves me with an array of unaligned members, but the tokenising method is faster than any copy-to-array method by a considerable amount.

If I then need to get the length of any or all of the tokenised strings, I use the very small one above because in most instances its takeoff time makes it faster than the bigger clunkier ones that are very fast on single long strings but hopeless on variable length unaligned short strings.
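A minimal sketch of that in-place tokenising pass, assuming CR/LF line endings and a caller-supplied pointer array (the proc name and interface are my assumptions, not the actual routine):


; sketch: one-pass in-place tokenise of a zero-terminated text buffer
; pbuf = address of the text, parr = DWORD array for line start pointers
; returns the line count in eax
tokenise proc pbuf:DWORD, parr:DWORD
    push esi
    push edi
    push ebx
    mov esi, pbuf               ; current scan position
    mov edi, parr               ; next free slot in the pointer array
    xor eax, eax                ; line count
    mov [edi], esi              ; record the start of the first line
    add edi, 4
    inc eax
  scan:
    mov bl, [esi]
    test bl, bl                 ; end of buffer?
    jz done
    cmp bl, 13                  ; CR?
    jne next
    mov BYTE PTR [esi], 0       ; overwrite the terminator with a zero
    cmp BYTE PTR [esi+1], 10    ; swallow a following LF as well
    jne newline
    mov BYTE PTR [esi+1], 0
    inc esi
  newline:
    inc esi
    mov [edi], esi              ; record the start of the next line
    add edi, 4
    inc eax
    jmp scan
  next:
    inc esi
    jmp scan
  done:
    pop ebx
    pop edi
    pop esi
    ret
tokenise endp


Because the terminators are overwritten in place, each recorded pointer addresses a zero-terminated string, which is what makes a small strlen attractive afterwards.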

I agree more or less - it is rare enough in most applications to take the length of long strings repeatedly, but it definitely does happen.  This code could definitely be rewritten to make it faster in many cases, but a string algo that is 10x faster in the first place is an automatic huge win in these places, without the need to thread a length through tens or hundreds of functions.  Arguably I'm preaching to the wrong crowd here - programs written entirely in assembly perhaps aren't likely to reach the scale (in terms of lines of code) where this becomes a consideration, but in the kind of software I'm interested in (HLL + selected routines in assembly) it counts.

One thing you are missing from this thread is that the "larger/clunkier" routines developed by Lingo, NW and jj are blazing fast on short and unaligned strings - much, much faster than Agner's routine, the CRT routine, etc.  For example, on some 15-byte misaligned strings, Lingo's routine takes as little as 1 cycle.  That's 19x faster than the existing routine, and 7x faster than Agner's routine.

With such speeds it is arguably faster, for short strings, not to bother passing around the length but to call strlen when needed (especially since the length field may need to be 4 bytes or more if you can accommodate larger strings, even if they are usually short in practice).

jj2007

Quote from: BeeOnRope on April 01, 2009, 06:32:25 PM
One thing you are missing from this thread is that the "larger/clunkier" routines developed by Lingo, NW and jj are blazing fast on short and unaligned strings - much, much faster than Agner's routine, the CRT routine, etc.  For example, on some 15-byte misaligned strings, Lingo's routine takes as little as 1 cycle.  That's 19x faster than the existing routine, and 7x faster than Agner's routine.

That seems a bit too optimistic, BeeOnRope. Lingo's algo is indeed blazing fast at 4 cycles, but first, the timings may not be so accurate at that scale, and second, it will hardly matter for such short strings - there will be plenty of slower code before and after.


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
code sizes: strlen32s=132, strlen64B=84, NWStrLen=118, _strlen=66 bytes

-- test 16k, misaligned 0, 16434 bytes
  Masm32 lib szLen   20648 cycles
  crt strlen         15255 cycles
strlen32s            2894 cycles
strlen64LingoB       2919 cycles
NWStrLen             2935 cycles
_strlen (Agner Fog)  4264 cycles

-- test 15, misaligned 15, 15 bytes
  Masm32 lib szLen   25 cycles
  crt strlen         25 cycles
strlen32s            6 cycles
strlen64LingoB       4 cycles
NWStrLen             15 cycles
_strlen (Agner Fog)  14 cycles


Quote
With such speeds it is arguably faster, for short strings, not to bother passing around the length but to call strlen when needed (especially since the length field may need to be 4 bytes or more if you can accommodate larger strings, even if they are usually short in practice).

Laziness is indeed a valid argument - codesize maybe not. A harmless mov eax, len(offset MyString) costs 12 bytes...

lingo

Hutch,
don't use ecx in strlen because jj will improve your code
and  preserve his lovely "count" register ecx. :wink
After that he'll post a message that "his" code is faster... :lol


jj2007

Quote from: lingo on April 01, 2009, 10:01:46 PM
Hutch,
don't use ecx in strlen because jj will improve your code
and  preserve his lovely "count" register ecx. :wink
After that he'll post a message that "his" code is faster... :lol


... and you will reply with a 1% faster SSE8 version, hehe :green2

Lingo, I have changed philosophy completely. No more preserving of count registers! No, just the opposite: how can one set all registers to zero with fewer than 12 bytes in less than 5 cycles? I swear it's more exciting than solving crosswords. All the gurus here have already contributed a solution. We are waiting for you :thumbu

tetsu-jp


memory_search proc address:dword,data:dword,limit:dword
;returns zero-based position of char in eax.
;returns limit if char was not found.
;the limit can be set to avoid buffer overrun
push ecx
push edi
mov ecx,limit
mov eax,data            ; scasb only compares al
mov edi,address
repne scasb
jne notfound            ; ZF clear: count ran out without a match
mov eax,edi
sub eax,address         ; edi stopped one byte past the match...
dec eax                 ; ...so back up to the zero-based position
jmp done
notfound:
mov eax,limit
done:
pop edi
pop ecx
ret
memory_search endp


the point was to prevent the program from locking up on errant strings.

i know the REP prefix has long been deprecated... how slow is it actually compared to the other implementations?

would it improve matters to pad with a sufficient number of extra zeros and use SCASD?

real-world code needs to prevent lock-up/overrun by all means, so to be fair you'd also have to add such checks to the other string functions.
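For reference, the classic strlen idiom built on repne scasb - a sketch only, with a hypothetical proc name, so it could be timed against the other routines - looks like this:


; sketch: strlen via repne scasb, for timing comparisons
szlen_scasb proc pstr:DWORD
    push edi
    mov edi, pstr
    xor eax, eax                ; al = 0, the byte to search for
    mov ecx, -1                 ; effectively no limit
    repne scasb                 ; edi advances until the zero byte
    not ecx                     ; ecx counted down from -1: NOT gives bytes scanned
    lea eax, [ecx-1]            ; minus the terminator = string length
    pop edi
    ret
szlen_scasb endp


Unlike the memory_search version above, this one has no bound, so it inherits the same overrun risk as any other unbounded strlen.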

http://www.azillionmonkeys.com/qed/asmexample.html
http://www.visionx.com/markl/optimization_tips.htm
http://forums.appleinsider.com/archive/index.php/t-27572.html

BeeOnRope

Quote from: jj2007 on April 01, 2009, 07:07:20 PM
Quote from: BeeOnRope on April 01, 2009, 06:32:25 PM
One thing you are missing from this thread is that the "larger/clunkier" routines developed by Lingo, NW and jj are blazing fast on short and unaligned strings - much, much faster than than Agner's routine, the CRT routine etc.  For example, on some 15 byte misaligned strings, Lingo's routine takes as little as 1 cycle.  That's 19x faster than the existing routine, and 7x faster than Agner's routine.

That seems a bit too optimistic, BeeOnRope. Lingo's algo is indeed blazing fast at 4 cycles, but first, the timings may not be so accurate at that scale, and second, it will hardly matter for such short strings - there will be plenty of slower code before and after.


Agreed - 1 cycle is not realistic and likely reflects overlapping of loops by the CPU, which won't likely occur in real code.  Still, the point remains: the "clunky" routines test as significantly faster for short or misaligned strings than the simpler routines, contradicting Hutch's assertion that this quest is misguided because these routines only work well for giant, aligned strings.


Quote
Laziness is indeed a valid argument - codesize maybe not. A harmless mov eax, len(offset MyString) costs 12 bytes...

Actually I was referring to data size, not code size in this case.  Imagine 1 million zero-length strings - you might use 4 MB with length-prefixed strings, compared to 1 MB with null-terminated strings.  Don't get me wrong, I'm nearly always in favor of an explicit length for strings, but the termination technique can be appealing for many very short strings.

NightWare

Quote from: hutch-- on April 01, 2009, 07:33:44 AM
I would be interested to see if it has become faster again on a core 2 duo or quad.
concerning lea, i've never seen a difference between my P3-500/Celeron-700/P4-2GHz/Core2-2GHz
(my P4 was a Northwood i think... maybe there is a difference with a Prescott)

Quote from: BeeOnRope on April 01, 2009, 06:20:35 PM
That's an interesting statement.  You just said that string algos are *never* used in a serious app, yet I have been involved in developing several "serious" apps, and I've seen string functions, including strlen, be the bottleneck for interesting workflows in many of them.  It's very tough to assert that something never happens when I'm saying plainly, and without any particular secret motivation, that I have seen exactly this in "serious" apps.

I didn't write the functions in question, rather noted the bottleneck in software developed by teams of hundreds of people - I already mentioned that in some cases it is possible to return a length (or to use a class that remembers it), but if you are interoperating with other code you may not have a choice because (a) you don't have the source (b) cannot legally modify the source (c) do not have the time to modify the source, etc.
here the problem comes from the approach: most coders "think" in terms of tasks, when on the contrary they should take a global approach. i don't know of a case where updating a counter from time to time is slower than processing an entire area, especially a large one.
(hundreds/thousands/millions of people doing the same thing does not mean they're right, especially when they've been paid to quickly produce a "serious" app. it's a well known fact, you only get the product you have paid for. and asking coders to respect time limits certainly does not encourage them to "think their code".)

Quote from: jj2007 on April 01, 2009, 07:07:20 PM
A harmless mov eax, len(offset MyString) costs 12 bytes...
correct, the code size isn't free. it's ok for a fast algo, but only if you can be sure it stays in the cache (and the larger the code is, the more often the cache will be updated).

BeeOnRope

Quote from: NightWare on April 01, 2009, 10:35:18 PM

here the problem comes from the approach: most coders "think" in terms of tasks, when on the contrary they should take a global approach. i don't know of a case where updating a counter from time to time is slower than processing an entire area, especially a large one.
(hundreds/thousands/millions of people doing the same thing does not mean they're right, especially when they've been paid to quickly produce a "serious" app. it's a well known fact, you only get the product you have paid for. and asking coders to respect time limits certainly does not encourage them to "think their code".)


Sure, that approach is fine for a monolithic application written by one or a few people (most assembly-only programs will fall into this category).  In practice, with components being written by teams around the globe with differing delivery schedules, coding styles, etc, it is important to think in terms of self-contained components, tasks, APIs, whatever.  It simply isn't possible for any one person to have the whole end-to-end workflow or code-flow in his mind at once.  If you don't believe this, you have never worked on a large, distributed software project.

Even if you don't believe it, it doesn't answer the point about interaction with legacy or proprietary APIs that you cannot change.

hutch--

The problem as I see it is the "one size fits all" approach. Having written libraries commercially for many years back in the 90s, I am as a matter of fact familiar with distributed projects, but I am also familiar with their defects: the risk of being a headless monster screwed together from a multitude of compromises to fit the deviations of opinion, technique and disposition of the sum total of its contributors - a situation something like the corporate decision making process, but with a random factor added.

I don't see any problem at all with keeping a dozen different routines that do similar tasks and simply dialling up the one that best fits your need, if in fact any of them fits it; the alternative is to write another one that does do exactly what you need.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

ecube

I put together a library of all the fastest procedures I've found on here a while ago; under stress testing, and even in general use, a lot of them messed up on me (a great deal of the procedures being loco's code, since he's won most of the speed contests). I've come to love the slower yet reliable code :) old trusty, heh.

and on to more pressing business: Hutch, I don't know if you've noticed, but you have the devil sign in your post count and your name's in red in the user list... :'(

jj2007

Quote from: E^cube on April 02, 2009, 06:14:01 AM
I put together a library of all the fastest procedures i've found on here awhile ago, under stress testing and even general use a lot of them messed up on me a lot

Interesting. Post some example code, please.

hutch--

Cube,

In 3 posts time I will have four (4) sixes as my post count, I wonder if that encapsulates both old Nick AND his sidekick ?  :bg

donkey

Quote from: hutch-- on April 02, 2009, 10:36:40 AM
AND his sidekick ?  :bg

Not sure what George W Bush has to do with posting here  :bg
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

tetsu-jp

add up Bill Gates' name in ASCII and you get 666, as do MS DOS 6.22 and Windows 3.11!

so the name, the DOS version and the Windows version have been adjusted to sum to 666. it has been known for many years.

for the topic, I made such a test program in 1997 (wouldn't you know, it is long lost).

different methods for memory copy! it was an 80386 SLC, some obscure variant.

now, alignment was absolutely irrelevant to performance.

and testing only 16 bytes, then 16K bytes, is not good testing either.

you must test 256 bytes, 1K, 4K (common cache size/page size).

and then you must differentiate:

-linear access within a page
-linear access within L1/L2 cache
-always accessing the same location (cheating the cache)
-random access within a page
-random access within L1/L2 caches
-long range random access

if you don't cover all of these, your test methodology is more than questionable.
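A minimal rdtsc timing skeleton for one such scenario might look like the following (a sketch only; the test string, the routine being timed and the one-shot measurement are my assumptions - a real harness would loop, subtract the empty-measurement overhead and take the minimum):


.data
teststring db "a short test string", 0

.data?
t1 dd ?

.code
timing_sketch proc
    push ebx                    ; cpuid clobbers ebx
    xor eax, eax
    cpuid                       ; serialise before the first counter read
    rdtsc
    mov t1, eax                 ; keep the low dword of the start count
    invoke szLen, ADDR teststring
    xor eax, eax
    cpuid                       ; serialise so the timed code has retired
    rdtsc
    sub eax, t1                 ; elapsed cycles in eax (low dword only)
    pop ebx
    ret
timing_sketch endp
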

and I've read the new AMD manuals: REP SCAS is explicitly recommended for small strings!

so, would you include it, and show the result for REP SCASB as well?

if you have extra time, implement all data sizes: 8 bit to 64 bit, and alignment as well.

jj2007

Quote from: tetsu-jp on April 02, 2009, 02:43:04 PM

and I've read the new AMD manuals: REP SCAS is explicitly recommended for small strings!

so, would you include it, and show the result for REP SCASB as well?


Mr tetsu,

I am afraid my knowledge of assembler is not sufficient to implement a strlen algo based on rep scasb. But we have all read your posts with great interest, and are eager to see how you would do it. Could you post a snippet, please? We know that all your sources got lost, but maybe out of your head, or with the help of the AMD manual?

Thank you so much.