szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

lingo

"why do you want to use xmm2 ?"
Thanks NightWare, it was from other similar algos...
IMO we may need several strlen algos to use in the application.
For example: strlenA for bigger strings and strlenB  for short strings...
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
codesizes: strlen32=92, strlen32b=114, strlen64A=112, strlen64B=87, _strlen=66

-- test 16k           return values LingoA,LingoB, jj, Agner: 16384, 16384, 16384, 16384
strlen32      :       1577 cycles
strlen32b     :       1585 cycles
strlen64LingoA :      1553 cycles
strlen64LingoB :      1604 cycles
_strlen (Agner Fog):  2793 cycles

-- test 4k            return values LingoA,LingoB, jj, Agner: 4096, 4096, 4096, 4096
crt_strlen    :       2727 cycles
strlen32      :       420 cycles
strlen32b     :       421 cycles
strlen64LingoA :      405 cycles
strlen64LingoB :      412 cycles
_strlen (Agner Fog):  716 cycles

-- test 0             return values LingoA,LingoB, jj, Agner: 95, 95, 95, 95
crt_strlen    :       77 cycles
strlen32      :       17 cycles
strlen32b     :       15 cycles
strlen64LingoA :      11 cycles
strlen64LingoB :      13 cycles
_strlen (Agner Fog):  19 cycles

-- test 1             return values LingoA,LingoB, jj, Agner: 95, 95, 95, 95
crt_strlen    :       79 cycles
strlen32      :       17 cycles
strlen32b     :       19 cycles
strlen64LingoA :      28 cycles
strlen64LingoB :      25 cycles
_strlen (Agner Fog):  20 cycles

-- test 3             return values LingoA,LingoB, jj, Agner: 14, 14, 14, 14
crt_strlen    :       17 cycles
strlen32      :       10 cycles
strlen32b     :       8 cycles
strlen64LingoA :      6 cycles
strlen64LingoB :      4 cycles
_strlen (Agner Fog):  7 cycles

-- test 15            return values LingoA,LingoB, jj, Agner: 14, 14, 14, 14
crt_strlen    :       16 cycles
strlen32      :       10 cycles
strlen32b     :       8 cycles
strlen64LingoA :      6 cycles
strlen64LingoB :      3 cycles
_strlen (Agner Fog):  7 cycles

Press any key to exit...





[attachment deleted by admin]

askm

I imagine this because there are lots of timings posted in this and other topics.

Wouldn't it be really nice to be able to write code and have it profiled as you're writing it... you'd get timings instantly!

Timings that would be identical to what you'd get when you do timings now, manually.

Or even written and profiled simultaneously, as if you're on a different processor altogether. Clusters? Parallel?

Code would be profiled by speed, security, or memory...

Just daydreaming. I know this kind of editor would have to be partially if not fully written in assembler, and not in my lifetime? Open source?

IT PROBABLY IS NOT AS DIFFICULT AS IT SEEMS, ON SOME LEVELS.

I know you think I am going in 'the super optimizing compiler' direction.

More like 'the supervising optimizing compiler'.

NightWare

Quote from: jj2007 on March 11, 2009, 07:05:49 AM
I have wondered myself whether clearing is not needed in some places, but was not sure. In fact, I took one out tonight, see below, ; pxor xmm0, xmm0. Could you please indicate where you consider it not needed?
pxor xmm1,xmm1 just after is also useless coz you have jumped to fdr1 if it's not equal to 0.  :wink

herge

 Hi lingo:

Results from my computer.

Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32=92, strlen32b=114, strlen64A=112, strlen64B=87, _strlen=66

-- test 16k       return values LingoA,LingoB, jj, Agner: 16384, 16384, 16384, 16384
strlen32      :       1491 cycles
strlen32b     :       1521 cycles
strlen64LingoA :      1140 cycles
strlen64LingoB :      1297 cycles
_strlen (Agner Fog):  2862 cycles

-- test 4k       return values LingoA,LingoB, jj, Agner: 4096, 4096, 4096, 4096
crt_strlen    :       2443 cycles
strlen32      :       401 cycles
strlen32b     :       410 cycles
strlen64LingoA :      353 cycles
strlen64LingoB :      325 cycles
_strlen (Agner Fog):  730 cycles

-- test 0       return values LingoA,LingoB, jj, Agner: 95, 95, 95, 95
crt_strlen    :       66 cycles
strlen32      :       17 cycles
strlen32b     :       14 cycles
strlen64LingoA :      12 cycles
strlen64LingoB :      14 cycles
_strlen (Agner Fog):  23 cycles

-- test 1       return values LingoA,LingoB, jj, Agner: 95, 95, 95, 95
crt_strlen    :       62 cycles
strlen32      :       18 cycles
strlen32b     :       18 cycles
strlen64LingoA :      31 cycles
strlen64LingoB :      25 cycles
_strlen (Agner Fog):  21 cycles

-- test 3       return values LingoA,LingoB, jj, Agner: 14, 14, 14, 14
crt_strlen    :       15 cycles
strlen32      :       11 cycles
strlen32b     :       10 cycles
strlen64LingoA :      6 cycles
strlen64LingoB :      2 cycles
_strlen (Agner Fog):  7 cycles

-- test 15       return values LingoA,LingoB, jj, Agner: 14, 14, 14, 14
crt_strlen    :       14 cycles
strlen32      :       10 cycles
strlen32b     :       8 cycles
strlen64LingoA :      6 cycles
strlen64LingoB :      3 cycles
_strlen (Agner Fog):  7 cycles

Press any key to exit...


Regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

jj2007

Quote from: NightWare on March 11, 2009, 10:43:53 PM
Quote from: jj2007 on March 11, 2009, 07:05:49 AM
I have wondered myself whether clearing is not needed in some places, but was not sure. In fact, I took one out tonight, see below, ; pxor xmm0, xmm0. Could you please indicate where you consider it not needed?
pxor xmm1,xmm1 just after is also useless coz you have jumped to fdr1 if it's not equal to 0.  :wink

I thought so, too. But the shr edx, cl (shift out false bits) trick has one nasty side effect: You might have an FF somewhere in xmm1 because there was a zero byte before your misaligned string:
      align 16
      db 15 dup (0)
      szTest_Fail db "my other brother darryl my other brother darryl"
      db 255, 255, 255, 0

Now one might argue that no sane person has a string with FF/255 bytes. But it fails exactly for this case (I tested it) :wink
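
The mechanism described here can be modelled in plain C. This is only a sketch under stated assumptions: `pcmpeqb` and `pmovmskb` are hand-written stand-ins for the SSE2 instructions, `next_block_mask` is a hypothetical helper, and the layout is simplified so that the 255 byte sits in the very next block (the original strlen32 code is not shown in this part of the thread, so the exact failing layout may differ). The point is that shifting the mask in edx does not clean xmm1, and a stale FF in the register then matches a 255 byte in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PCMPEQB reg, mem: each byte of reg becomes 0xFF where it
   equals the corresponding byte of mem, and 0x00 elsewhere. */
static void pcmpeqb(uint8_t reg[16], const uint8_t mem[16])
{
    for (int i = 0; i < 16; i++)
        reg[i] = (reg[i] == mem[i]) ? 0xFF : 0x00;
}

/* Stand-in for PMOVMSKB: bit i of the result is the top bit of byte i. */
static unsigned pmovmskb(const uint8_t reg[16])
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(reg[i] >> 7) << i;
    return mask;
}

/* Hypothetical helper: compares a zeroed register against `first`, then
   reuses the same register for `next`. With clear_between == 0 the FFs left
   by zero bytes in `first` are still in the register for the second compare. */
static unsigned next_block_mask(const uint8_t first[16],
                                const uint8_t next[16],
                                int clear_between)
{
    assert(first != 0 && next != 0);
    uint8_t xmm1[16] = {0};              /* pxor xmm1, xmm1 */
    pcmpeqb(xmm1, first);                /* FF wherever first[] holds a zero */
    /* shr/shl edx, cl would fix the mask here, but xmm1 keeps its FFs */
    if (clear_between)
        memset(xmm1, 0, sizeof xmm1);    /* a second pxor xmm1, xmm1 */
    pcmpeqb(xmm1, next);
    return pmovmskb(xmm1);
}
```

With 15 zero bytes ahead of the string and a 255 byte at index 3 of the next block, the uncleared register reports bit 3 as a "zero byte" although that block contains none; clearing the register in between gives a mask of 0.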

NightWare

Quote from: jj2007 on March 11, 2009, 11:09:04 PM
I thought so, too. But the shr edx, cl (shift out false bits) trick has one nasty side effect: You might have an FF somewhere in xmm1 because there was a zero byte before your misaligned string:

hmm, for example you could use (in your strlen32 algo) :

pxor xmm0,xmm0
movdqu xmm1,[eax]
pcmpeqb xmm1,xmm0 ; <- here you will have the same result as pxor xmm1,xmm1 if there is no 0
and eax,0FFFFFFF0h
pmovmskb edx,xmm1
...


and no need for shr/shl edx,cl
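
A small C model of this suggestion, as a sketch only: `probe_first16` and `is_all_zero` are hypothetical names, and the loops stand in for movdqu/pcmpeqb/pmovmskb. Because the load starts exactly at the string, no byte in front of it can show up in the mask, and when the mask is 0 the compare result itself is all zero, so it can be reused without a pxor.

```c
#include <assert.h>
#include <stdint.h>

/* Models: pxor xmm0,xmm0 / movdqu xmm1,[eax] / pcmpeqb xmm1,xmm0 /
   pmovmskb edx,xmm1. Reads exactly 16 bytes starting at s, so the caller
   must make sure those 16 bytes are readable. */
static unsigned probe_first16(const uint8_t *s, uint8_t xmm1[16])
{
    assert(s != 0 && xmm1 != 0);
    unsigned edx = 0;
    for (int i = 0; i < 16; i++) {
        xmm1[i] = (s[i] == 0) ? 0xFF : 0x00;   /* pcmpeqb xmm1, xmm0 */
        edx |= (unsigned)(xmm1[i] >> 7) << i;  /* pmovmskb edx, xmm1 */
    }
    return edx;
}

/* Returns 1 when every byte of the modelled register is zero. */
static int is_all_zero(const uint8_t reg[16])
{
    for (int i = 0; i < 16; i++)
        if (reg[i] != 0)
            return 0;
    return 1;
}
```

For a string that starts 15 bytes into a zero-filled buffer and is at least 16 bytes long, the probe returns 0 and leaves the modelled xmm1 all zero, regardless of the zero bytes in front of it.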

jj2007

Thanks, NightWare. In the meantime, I had found a different way to overcome this, a repeated pcmpeqb xmm0, [eax]:

strlen32s proc src:DWORD ; jj 12 March 2009, 89 bytes; 0.176 cycles/byte at 16k
mov ecx, [esp+4] ; get pointer to string: -- this part taken from Agner Fog ----
pxor xmm0, xmm0 ; zero for comparison
movups xmm1, [ecx] ; move 16 bytes into xmm1, unaligned (adapted from Lingo)
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
pmovmskb edx, xmm1 ; set byte mask in edx
bsf eax, edx ; bit scan forward
jne Le16 ; return bsf index if a bit was set
mov eax, ecx ; copy pointer
and eax, -16 ; align pointer by 16
pxor xmm1, xmm1 ; zero for comparison
and ecx, 15 ; lower 4 bits indicate misalignment
je @F ; jumping is a few cycles faster
pcmpeqb xmm0, [eax] ; force FF's into false positives (the SSE2 equivalent to Agner's shr/shl trick)

@@: pcmpeqb xmm0, [eax] ; ------ this part taken from Lingo, with adaptions ------
pcmpeqb xmm1, [eax+16] ; eax is the pointer to the initial string, aligned down to 16 bytes
por xmm1, xmm0
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
pmovmskb edx, xmm1
test edx, edx
jz @B

pmovmskb ecx, xmm0
shl edx, 16 ; bswap works, too, but one cycle slower
or edx, ecx
bsf edx, edx
lea eax, [eax+edx-32] ; add scan index, subtract initial bytes
sub eax, [esp+4]
Le16: ret 4
strlen32s endp
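
The repeated `pcmpeqb xmm0, [eax]` can be checked with a plain C model. This is a sketch, not the poster's code: `pcmpeqb`, `pmovmskb`, `compare_once_mask` and `compare_twice_mask` are hypothetical stand-ins. It suggests that comparing a zeroed register against the same 16 bytes twice always ends with the register at zero again, whatever the bytes are, which is how the false positives in front of a misaligned string are discarded.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for PCMPEQB reg, mem: 0xFF where the bytes are equal, else 0x00. */
static void pcmpeqb(uint8_t reg[16], const uint8_t mem[16])
{
    for (int i = 0; i < 16; i++)
        reg[i] = (reg[i] == mem[i]) ? 0xFF : 0x00;
}

/* Stand-in for PMOVMSKB: bit i of the result is the top bit of byte i. */
static unsigned pmovmskb(const uint8_t reg[16])
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(reg[i] >> 7) << i;
    return mask;
}

/* One compare against a zeroed register: flags every zero byte in block. */
static unsigned compare_once_mask(const uint8_t block[16])
{
    assert(block != 0);
    uint8_t xmm0[16] = {0};      /* pxor xmm0, xmm0 */
    pcmpeqb(xmm0, block);        /* pcmpeqb xmm0, [eax] */
    return pmovmskb(xmm0);
}

/* The same compare issued twice, as in the misaligned path above. */
static unsigned compare_twice_mask(const uint8_t block[16])
{
    assert(block != 0);
    uint8_t xmm0[16] = {0};      /* pxor xmm0, xmm0 */
    pcmpeqb(xmm0, block);        /* FF where block holds a zero byte */
    pcmpeqb(xmm0, block);        /* FF never equals 0, 0 never equals non-zero */
    return pmovmskb(xmm0);
}
```

Under this model the doubled compare effectively skips the first aligned block. That looks safe here because the movups probe has already covered every string byte that lies in that block.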


New Timings:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
codesizes: strlen32s=89, strlen64B=87, _strlen=66

-- test 16k           return values Lingo, jj, Agner: 16384, 16384, 16384
crt_strlen    :       15288 cycles
strlen32s     :       2890 cycles
strlen64LingoB :      2904 cycles
_strlen (Agner Fog):  4253 cycles

-- test 1k            return values Lingo, jj, Agner: 1024, 1024, 1024
crt_strlen    :       977 cycles
strlen32s     :       199 cycles
strlen64LingoB :      193 cycles
_strlen (Agner Fog):  272 cycles

-- test 0             return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       101 cycles
strlen32s     :       29 cycles
strlen64LingoB :      28 cycles
_strlen (Agner Fog):  30 cycles

-- test 1             return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       112 cycles
strlen32s     :       40 cycles
strlen64LingoB :      33 cycles
_strlen (Agner Fog):  34 cycles

-- test 3             return values Lingo, jj, Agner: 15, 15, 15
crt_strlen    :       25 cycles
strlen32s     :       5 cycles
strlen64LingoB :      6 cycles
_strlen (Agner Fog):  14 cycles

-- test 15            return values Lingo, jj, Agner: 15, 15, 15
crt_strlen    :       24 cycles
strlen32s     :       5 cycles
strlen64LingoB :      6 cycles
_strlen (Agner Fog):  14 cycles


The new version also includes a correctness test for all algos. My new favourite is strlen32s: For long strings, it is 14 cycles faster than No. 2, strlen64LingoB (2890 vs. 2904 cycles at 16k), while for very short strings (5 vs. 6 cycles) it is a whopping 16% faster than the latter. Lingo, you have a challenge!

[attachment deleted by admin]

herge

 Hi jj2007:

Even More Results from herge.


Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32s=89, strlen64B=87, _strlen=66

-- test 16k       return values Lingo, jj, Agner: 16384, 16384, 16384
crt_strlen    :       9628 cycles
strlen32s     :       1489 cycles
strlen64LingoB :      1185 cycles
_strlen (Agner Fog):  2854 cycles

-- test 1k       return values Lingo, jj, Agner: 1024, 1024, 1024
crt_strlen    :       649 cycles
strlen32s     :       101 cycles
strlen64LingoB :      99 cycles
_strlen (Agner Fog):  193 cycles

-- test 0       return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       64 cycles
strlen32s     :       15 cycles
strlen64LingoB :      14 cycles
_strlen (Agner Fog):  19 cycles

-- test 1       return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       91 cycles
strlen32s     :       31 cycles
strlen64LingoB :      25 cycles
_strlen (Agner Fog):  20 cycles

-- test 3       return values Lingo, jj, Agner: 15, 15, 15
crt_strlen    :       17 cycles
strlen32s     :       3 cycles
strlen64LingoB :      3 cycles
_strlen (Agner Fog):  7 cycles

-- test 15       return values Lingo, jj, Agner: 15, 15, 15
crt_strlen    :       15 cycles
strlen32s     :       2 cycles
strlen64LingoB :      3 cycles
_strlen (Agner Fog):  7 cycles

Press any key to exit...


Regards herge


NightWare

?
   mov eax, ecx         ; copy pointer why ?
   and eax, -16         ; align pointer by 16
   pxor xmm1, xmm1         ; zero for comparison why ?
You don't need the following lines anymore... with movups, the possible 0 before the string can't exist...
   and ecx, 15         ; lower 4 bits indicate misalignment
   je @F            ; jumping is a few cycles faster
   pcmpeqb xmm0, [eax]      ; force FF's into false positives (the SSE2 equivalent to Agner's shr/shl trick)

you just need to modify the end of the algo to obtain the correct result...

EDIT :
Quote from: jj2007 on March 12, 2009, 12:49:21 AM
for very short strings it is a whopping 16% faster than the latter. Lingo, you have a challenge!
:bg, but I remind you there is a jump, so (certainly) a branch misprediction, and
Quote: The cost of a branch misprediction ranges from 12 to more than 50 clock cycles, depending on the length of the pipeline and other details of the microarchitecture.
(taken from Agner Fog's latest optimization PDF). So 50 cycles... it could be 1000% slower...  :bg
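
The arithmetic behind the two percentages can be written out as a short C sketch. The inputs are assumptions read off the Celeron M timings above (5 vs. 6 cycles for the short strings) and the 12 to 50 cycle penalty range from the quote; the helper names are hypothetical.

```c
#include <assert.h>

/* Integer percentage by which `fast` beats `slow`, relative to `slow`. */
static unsigned faster_percent(unsigned fast_cycles, unsigned slow_cycles)
{
    assert(slow_cycles > 0 && fast_cycles <= slow_cycles);
    return 100u * (slow_cycles - fast_cycles) / slow_cycles;
}

/* Integer percentage by which a fixed penalty slows down a base timing. */
static unsigned slower_percent(unsigned base_cycles, unsigned penalty_cycles)
{
    assert(base_cycles > 0);
    return 100u * penalty_cycles / base_cycles;
}
```

With these inputs, `faster_percent(5, 6)` gives 16, while `slower_percent(5, 12)` gives 240 and `slower_percent(5, 50)` gives 1000, so a single mispredicted branch would outweigh the measured gain many times over.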

lingo

jj,
Let's see what you "have":  :wink
1. strlen32 - the 1st half of the code is from A.Fog and the rest is from Lingo - just the name strlen32 is from you
Proof:
"Now I took the best of two worlds, i.e. Lingo's speed and Agner's brilliant alignment scheme, and threw them together. The result (shown as strlen32) is, ehm, how to put it: just about good enough for my own private library: "

2. strlen32s - it is your top of the ice cream... :lol
It is code with nothing from A.Fog and 100 % from Lingo's strlenLingoB code... What happened to "Agner's brilliant alignment scheme"?  :lol
Of course the new name - strlen32s - and the test program are from you again.
Proof: "My new favorite is strlen32s: bla,blah,bla..."  and Lingo's code inside  :lol

3. "Lingo, you have a challenge!"

Actually you don't have your own code or ideas to "compete" here and I am not interested in fighting with myself... so there is no challenge for me in trying to continue against my own code and ideas.
Proof:  I have a new, faster strlen algo based on the new Nehalem string instructions, but that is another story and challenge.
Hence, don't hurry; read and think about NightWare's notes carefully, because I don't want to publish it yet...   :lol


jj2007

Quote from: lingo on March 12, 2009, 03:44:43 AM
Proof:
"Now I took the best of two worlds, i.e. Lingo's speed and Agner's brilliant alignment scheme, and threw them together. The result (shown as strlen32) is, ehm, how to put it: just about good enough for my own private library: "


Lingo, you don't have to prove something that is openly stated. This code has evolved over time, and you, NightWare and myself, we have produced the two fastest algos ever, despite certain trolls pretending that a fast len algo is a waste of time (while arguing endlessly elsewhere about bad practices wasting cycles and damaging registers etc.). We are here because assembler can produce lean and mean code, and because it's fun testing the limits. You are excellent at testing these limits, and therefore your name appears twice in the 30 lines of my current favourite called strlen32s. And if I find the time today, NightWare's corrections will also be tested, and his name will be added somewhere. Take it easy :U

jj2007

Quote from: NightWare on March 12, 2009, 02:37:16 AM
Quote from: jj2007 on March 12, 2009, 12:49:21 AM
for very short strings it is a whopping 16% faster than the latter. Lingo, you have a challenge!
:bg, but I remind you there is a jump, so (certainly) a branch misprediction, and
Quote: The cost of a branch misprediction ranges from 12 to more than 50 clock cycles, depending on the length of the pipeline and other details of the microarchitecture.
(taken from Agner Fog's latest optimization PDF). So 50 cycles... it could be 1000% slower...  :bg

:bg Thanks for your hints, it's now shorter and a bit faster. But Lingo's algo is equally good. New testbed attached below.

align 16 ; jj2007, 12 March 2009, 85 bytes; 0.176 cycles/byte at 16k on Celeron M (0.3 on P4)
strlen32s proc src:DWORD ; with lots of inspiration from Lingo, NightWare and Agner Fog
mov eax, [esp+4] ; get pointer to string
movups xmm1, [eax] ; move 16 bytes into xmm1, unaligned (adapted from Lingo/NightWare)
pxor xmm0, xmm0 ; zero for comparison (no longer needed for xmm1 - thanks, NightWare)
pcmpeqb xmm1, xmm0 ; set bytes in xmm1 to FF if nullbytes found in xmm1
pmovmskb eax, xmm1 ; set byte mask in eax
bsf eax, eax ; bit scan forward
jne Lt16 ; less than 16 bytes, we can return the index in eax

push ecx ; all registers preserved, except eax = return value
push edx ; eax will be pointer to initial string, 16-byte aligned
mov ecx, [esp+12] ; get pointer to string
and ecx, -16 ; align initial pointer to 16-byte boundary
lea eax, [ecx+16] ; aligned pointer + 16 (first 0..15 dealt with by movups above)

@@: pcmpeqb xmm0, [eax] ; ---- inner loop inspired by Lingo, with adaptions -----
pcmpeqb xmm1, [eax+16] ; compare packed bytes in [m128] and xmm1 for equality
por xmm1, xmm0 ; or them: one of the mem locations may contain a nullbyte
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jz @B

pmovmskb ecx, xmm0 ; set byte mask in ecx (has to be repeated, sorry)
shl edx, 16 ; create space for the ecx bytes
or edx, ecx ; combine xmm0 and xmm1 results
bsf edx, edx ; bit scan for the index
lea eax, [eax+edx-32] ; add scan index, subtract initial bytes
pop edx
sub eax, [esp+8]
pop ecx
Lt16: ret 4
strlen32s endp
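
For readers who want to follow the control flow without an assembler, here is a plain C model of the routine above. It is a sketch under assumptions, not a port: `zero_mask16`, `bsf` and `strlen32s_model` are hypothetical names, the masks are recomputed per block instead of relying on xmm0/xmm1 staying zero, and the caller must guarantee that up to 31 bytes past the terminator are readable (the 32-byte inner loop reads that far).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models PCMPEQB against zero plus PMOVMSKB: bit i is set when p[i] == 0. */
static unsigned zero_mask16(const uint8_t *p)
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(p[i] == 0) << i;
    return mask;
}

/* Models BSF: index of the lowest set bit; mask must be non-zero. */
static unsigned bsf(unsigned mask)
{
    assert(mask != 0);
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* C model of the strlen32s flow: unaligned probe of the first 16 bytes,
   then 32 bytes per iteration from the next 16-byte boundary. */
static size_t strlen32s_model(const char *src)
{
    const uint8_t *s = (const uint8_t *)src;

    unsigned first = zero_mask16(s);         /* movups / pcmpeqb / pmovmskb */
    if (first != 0)
        return bsf(first);                   /* Lt16: terminator in bytes 0..15 */

    /* and ecx, -16 / lea eax, [ecx+16]: the first aligned block after s */
    const uint8_t *p =
        (const uint8_t *)(((uintptr_t)s & ~(uintptr_t)15) + 16);

    for (;;) {
        unsigned lo = zero_mask16(p);        /* pcmpeqb xmm0, [eax]    */
        unsigned hi = zero_mask16(p + 16);   /* pcmpeqb xmm1, [eax+16] */
        if ((lo | hi) != 0) {
            unsigned both = (hi << 16) | lo; /* shl edx, 16 / or edx, ecx */
            return (size_t)(p - s) + bsf(both);
        }
        p += 32;                             /* lea eax, [eax+32] */
    }
}
```

The aligned loop may start inside the 16 bytes the probe already covered; that overlap is harmless because those bytes are known to be non-zero. A caller would place the string in a buffer with at least 31 spare bytes after the terminator and call `strlen32s_model(buf + offset)`.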


Timings:


              Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
ERROR in strlen64A at ebx=11: 16 bytes instead of 11

codesizes: strlen32s=77, strlen64A=120, strlen64B=87, _strlen=66

-- test 16k, misaligned 0, 16384 bytes
strlen32s     :       4634 cycles
strlen64LingoB :      4978 cycles
_strlen (Agner Fog):  10152 cycles

-- test 4k, misaligned 11, 4096 bytes
crt_strlen    :       3955 cycles
strlen32s     :       1130 cycles
strlen64LingoB :      1126 cycles
_strlen (Agner Fog):  2235 cycles

-- test 1k, misaligned 0, 1024 bytes
strlen32s     :       345 cycles
strlen64LingoB :      349 cycles
_strlen (Agner Fog):  636 cycles

-- test 0, misaligned 0, 95 bytes
crt_strlen    :       231 cycles
strlen32s     :       80 cycles
strlen64LingoB :      80 cycles
_strlen (Agner Fog):  100 cycles

-- test 1, misaligned 1, 95 bytes
crt_strlen    :       203 cycles
strlen32s     :       91 cycles
strlen64LingoB :      58 cycles
_strlen (Agner Fog):  64 cycles

-- test 3, misaligned 3, 15 bytes
crt_strlen    :       35 cycles
strlen32s     :       11 cycles
strlen64LingoB :      13 cycles
_strlen (Agner Fog):  23 cycles

-- test 15, misaligned 15, 15 bytes
crt_strlen    :       32 cycles
strlen32s     :       12 cycles
strlen64LingoB :      14 cycles
_strlen (Agner Fog):  23 cycles


EDIT: Attached new version with minor modifications.

[attachment deleted by admin]

herge

 Hi JJ2007:

The latest results from herge.


Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
ERROR in strlen64A at ebx=11: 16 bytes instead of 11

codesizes: strlen32s=77, strlen64A=120, strlen64B=87, _strlen=66

-- test 16k, misaligned 0, 16384 bytes
strlen32s     :       1457 cycles
strlen64LingoB :      1260 cycles
_strlen (Agner Fog):  2797 cycles

-- test 4k, misaligned 11, 4096 bytes
crt_strlen    :       2401 cycles
strlen32s     :       387 cycles
strlen64LingoB :      340 cycles
_strlen (Agner Fog):  731 cycles

-- test 1k, misaligned 0, 1024 bytes
strlen32s     :       97 cycles
strlen64LingoB :      95 cycles
_strlen (Agner Fog):  178 cycles

-- test 0, misaligned 0, 95 bytes
crt_strlen    :       60 cycles
strlen32s     :       20 cycles
strlen64LingoB :      14 cycles
_strlen (Agner Fog):  18 cycles

-- test 1, misaligned 1, 95 bytes
crt_strlen    :       63 cycles
strlen32s     :       32 cycles
strlen64LingoB :      25 cycles
_strlen (Agner Fog):  20 cycles

-- test 3, misaligned 3, 15 bytes
crt_strlen    :       15 cycles
strlen32s     :       4 cycles
strlen64LingoB :      3 cycles
_strlen (Agner Fog):  7 cycles

-- test 15, misaligned 15, 15 bytes
crt_strlen    :       15 cycles
strlen32s     :       4 cycles
strlen64LingoB :      3 cycles
_strlen (Agner Fog):  7 cycles

Press any key to exit...



Regards herge

Mark Jones

Here's my compulsory submission for the latest evolution.


AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
ERROR in strlen64A at ebx=11: 16 bytes instead of 11

codesizes: strlen32s=85, strlen64A=120, strlen64B=87, _strlen=66

-- test 16k, misaligned 0, 16384 bytes
crt_strlen    :       12338 cycles
strlen32s     :       3135 cycles
strlen64LingoB :      3120 cycles
_strlen (Agner Fog):  13916 cycles

-- test 4k, misaligned 11, 4096 bytes
crt_strlen    :       3229 cycles
strlen32s     :       828 cycles
strlen64LingoB :      814 cycles
_strlen (Agner Fog):  3496 cycles

-- test 1k, misaligned 15, 1024 bytes
crt_strlen    :       826 cycles
strlen32s     :       252 cycles
strlen64LingoB :      237 cycles
_strlen (Agner Fog):  900 cycles

-- test 0, misaligned 0, 95 bytes
crt_strlen    :       93 cycles
strlen32s     :       57 cycles
strlen64LingoB :      40 cycles
_strlen (Agner Fog):  122 cycles

-- test 1, misaligned 1, 95 bytes
crt_strlen    :       102 cycles
strlen32s     :       59 cycles
strlen64LingoB :      43 cycles
_strlen (Agner Fog):  101 cycles

-- test 3, misaligned 3, 15 bytes
crt_strlen    :       20 cycles
strlen32s     :       20 cycles
strlen64LingoB :      20 cycles
_strlen (Agner Fog):  34 cycles

-- test 15, misaligned 15, 15 bytes
crt_strlen    :       20 cycles
strlen32s     :       20 cycles
strlen64LingoB :      20 cycles
_strlen (Agner Fog):  34 cycles


Ya know, tools such as these should also show the OS version and bit-width. It could be assumed erroneously that this box is running 64-bit XP when in fact it is running 32-bit XP. (Wasteful, perhaps, but I cannot afford to upgrade in the foreseeable future.)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on March 12, 2009, 03:39:16 PM
Here's my compulsory submission for the latest evolution.


AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

-- test 0, misaligned 0, 95 bytes
crt_strlen    :       93 cycles
strlen32s     :       57 cycles
strlen64LingoB :      40 cycles
_strlen (Agner Fog):  122 cycles

:bg Thanxalot. It seems Lingo has a little edge here. Interesting that Agner's algo gets beaten by crt_strlen, though.

Quote
Ya know, tools such as these should also show the OS version and bit-width. It could be assumed erroneously that this box is running 64-bit XP when in fact it is running 32-bit XP. (Wasteful, perhaps, but I cannot afford to upgrade in the foreseeable future.)

Good idea in principle, but showing the OS with GetVersionEx is so hilariously clumsy that I get an allergy when I even think of it :red