News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

szLen optimize...

Started by denise_amiga, May 31, 2005, 07:42:44 PM

Previous topic - Next topic

jj2007

I was not satisfied with the performance of my algo for short strings, so I fumbled together a variant, strlen32b. It is now almost on par with Lingo's algo for short strings, and about 2% faster for very long strings. Of course, AMD, Core2 and P4 might look different again - the Celeron M is "Core" but not "Core Duo" ::)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
codesizes: strlen32=92, strlen32b=114, strlen64A=117, _strlen=66

ERROR in strlen64A at ct 16: 14 bytes instead of 15
-- test 16k           return values Lingo, jj, Agner: 16384, 16384, 16384
strlen32      :       2881 cycles
strlen32b     :       2936 cycles
strlen64LingoA :      3024 cycles
_strlen (Agner Fog):  4250 cycles

-- test 4k            return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       3800 cycles
strlen32      :       743 cycles
strlen32b     :       744 cycles
strlen64LingoA :      774 cycles
_strlen (Agner Fog):  1103 cycles

-- test 0             return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       101 cycles
strlen32      :       31 cycles
strlen32b     :       29 cycles
strlen64LingoA :      25 cycles
_strlen (Agner Fog):  30 cycles

-- test 1             return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       111 cycles
strlen32      :       31 cycles
strlen32b     :       35 cycles
strlen64LingoA :      37 cycles
_strlen (Agner Fog):  34 cycles

-- test 3             return values Lingo, jj, Agner: 11, 14, 14
crt_strlen    :       23 cycles
strlen32      :       20 cycles
strlen32b     :       16 cycles
strlen64LingoA :      14 cycles
_strlen (Agner Fog):  14 cycles

-- test 15            return values Lingo, jj, Agner: -1, 14, 14
crt_strlen    :       23 cycles
strlen32      :       20 cycles
strlen32b     :       16 cycles
strlen64LingoA :      14 cycles
_strlen (Agner Fog):  14 cycles

[attachment deleted by admin]

PBrennick

JJ,

hmmm? No kidding? Misalignment of a 'one byte aligned' string... I do not know whether to laugh or cry. I do not think you 'got' the point Hutch was trying to make.

Paul
The GeneSys Project is available from:
The Repository or My crappy website

NightWare

hi lingo,
por xmm1, xmm0
pxor xmm2, xmm2 ; why ? why do you want to use xmm2 ?
pmovmskb edx, xmm1
pxor xmm1, xmm1 ; why ? if the mask in edx = 0 then both xmm0 and xmm1 = 0
test edx, edx
>>jnz<< Ex_1


hi jj, same thing for you in your algo, why do you want to clean the simd registers when it's not needed ?
plus, if you really want to take care of the speed for small strings, you should avoid the jump (it's a potential misprediction).

herge

 Hi jj2007:

Here is my latest results:


Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32=92, strlen32b=114, strlen64A=117, _strlen=66

ERROR in TestAlgo at ct 16: 14 bytes instead of 15
-- test 16k       return values Lingo, jj, Agner: 16384, 16384, 16384
strlen32      :       1503 cycles
strlen32b     :       1512 cycles
strlen64LingoA :      1138 cycles
_strlen (Agner Fog):  2814 cycles

-- test 4k       return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       2425 cycles
strlen32      :       410 cycles
strlen32b     :       403 cycles
strlen64LingoA :      325 cycles
_strlen (Agner Fog):  716 cycles

-- test 0       return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       61 cycles
strlen32      :       16 cycles
strlen32b     :       15 cycles
strlen64LingoA :      14 cycles
_strlen (Agner Fog):  19 cycles

-- test 1       return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       61 cycles
strlen32      :       15 cycles
strlen32b     :       18 cycles
strlen64LingoA :      26 cycles
_strlen (Agner Fog):  20 cycles

-- test 3       return values Lingo, jj, Agner: 11, 14, 14
crt_strlen    :       14 cycles
strlen32      :       10 cycles
strlen32b     :       8 cycles
strlen64LingoA :      6 cycles
_strlen (Agner Fog):  7 cycles

-- test 15       return values Lingo, jj, Agner: -1, 14, 14
crt_strlen    :       14 cycles
strlen32      :       10 cycles
strlen32b     :       8 cycles
strlen64LingoA :      6 cycles
_strlen (Agner Fog):  7 cycles

Press any key to exit...



Regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

jj2007

Quote from: PBrennick on March 11, 2009, 12:48:59 AM
JJ,

hmmm? No kidding? Misalignment of a 'one byte aligned' string... I do not know whether to laugh or cry. I do not think you 'got' the point Hutch was trying to make.

Paul


Explain, please. I don't get your point.

sinsi

Well, jj, put it this way...

ALIGN 1

Unless we can ALIGN 0.5, byte align is...everything...'get' it?
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: NightWare on March 11, 2009, 01:25:59 AM
hi jj, same thing for you in your algo, why do you want to clean the simd registers when it's not needed ?
plus, if you really want to take care of the speed for small strings, you should avoid the jump (it's a potential misprediction).

Hi NightWare, thanks a lot for reading this thoroughly :thumbu
I have wondered myself whether clearing is not needed in some places, but was not sure. In fact, I took one out tonight, see below, ; pxor xmm0, xmm0. Could you please indicate where you consider it not needed?

As to the nullptr jump, I had to introduce it because it failed for null strings (and I admit I ws too tired to analyse the reason; plus, my Olly version here does not display xmm registers :dazzled:).

strlen32b proc src:DWORD ; jj 9 March 2007, 92 (down from 103) bytes; 0.176 cycles/byte at 16k
mov ecx, [esp+4] ; get pointer to string: -- this part taken from Agner Fog ----
mov al, [ecx] ; test for Null$
test al, al
je nullptr
pxor xmm0, xmm0 ; set to zero for comparison
mov eax, ecx ; copy pointer
pxor xmm1, xmm1
and eax, -16 ; align pointer by 16
and ecx, 15 ; lower 4 bits indicate misalignment
pcmpeqb xmm1, [eax] ; read 16 from nearest preceding boundary and compare with zero
; lea eax, [eax+16]
add eax, 16
pcmpeqb xmm0, [eax] ; ecx is pointer to initial string, 16-byte aligned
por xmm1, xmm0
pmovmskb edx, xmm1 ; get one bit for each byte result; - OK, isnull, stays null, Z=1, Bad: notnull, willbenull, z=0
shr edx, cl ; shift out false bits ** compliments to Agner, **
shl edx, cl ; shift back again ** this is called genius ;-) **
test edx, edx
jnz fdr1
add eax, 16 ; correct aligned pointer for bytes already treated above (lea exactly same cycles)
; pxor xmm0, xmm0 (must be 0) ; reset to zero for comparisons below
pxor xmm1, xmm1 ; align 16 no good, costs about one cycle extra

@@: pcmpeqb xmm0, [eax] ; ------ this part taken from Lingo, with adaptions ------
pcmpeqb xmm1, [eax+16] ; ecx is pointer to initial string, 16-byte aligned
por xmm1, xmm0
; add eax, 32 ; is marginally slower than lea
lea eax, [eax+32] ; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
pmovmskb edx, xmm1
test edx, edx
jz @B
sub eax, 32 ; subtract initial bytes

fdr1: pmovmskb ecx, xmm0
shl edx, 16 ; bswap works, too, but one cycle slower
or edx, ecx
bsf edx, edx
add eax, edx
sub eax, [esp+4]
nullptr: ret 4
strlen32b endp

jj2007

Quote from: sinsi on March 11, 2009, 07:03:23 AM
Well, jj, put it this way...

ALIGN 1

Unless we can ALIGN 0.5, byte align is...everything...'get' it?

Yeah, of course :boohoo: I was desperately trying to find align 1 in the code I posted, but now I realise Paul means Hutch's statement. But again, this is deliberately trying to misunderstand him: What he meant (apparently - I also risk to misinterpret him) is that strings are in general not aligned on a paragraph or even dword border, and that algos for aligned strings are therefore pretty useless. Lingo's first version had that problem, but he fixed it (but still has a minor problem for very short strings). So both "fast" algos are general purpose, if you assume that the user has a modern CPU

sinsi

I think that strings > MAX_PATH are rare - most strings (that you need to get the length of) are short (filenames etc.) so there is no need for a 'long' string scanner.
Mind you, for a long string scanner, you have possibly opened up a small niche, and the replies re optimisation can often apply to other bits of code.

Thanks to you, I've been looking at SSEx instructions and broadening my asm horizons (at least I think 'thanks, but' since they're a bit hard atm  :bdg)
Light travels faster than sound, that's why some people seem bright until you hear them.

herge

 Hi jj2007:

Use windbg from Microsoft. It does display the xmm registers.

Regards herge
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

jj2007

Quote from: sinsi on March 11, 2009, 07:26:47 AM
I think that strings > MAX_PATH are rare - most strings (that you need to get the length of) are short (filenames etc.) so there is no need for a 'long' string scanner.
Mind you, for a long string scanner, you have possibly opened up a small niche, and the replies re optimisation can often apply to other bits of code.

Thanks to you, I've been looking at SSEx instructions and broadening my asm horizons (at least I think 'thanks, but' since they're a bit hard atm  :bdg)

Long strings are rare, that's correct. But the new algos are a factor 3 faster than len(My$):

-- test 0             return values Lingo, jj, Agner: 62, 62, 62
Masm32 lib szLen :    88 cycles
crt_strlen    :       81 cycles
strlen32b     :       23 cycles
strlen64LingoA :      24 cycles
_strlen (Agner Fog):  26 cycles

-- test 1             return values Lingo, jj, Agner: 62, 62, 62
Masm32 lib szLen :    89 cycles
crt_strlen    :       81 cycles
strlen32b     :       23 cycles
strlen64LingoA :      23 cycles
_strlen (Agner Fog):  22 cycles


D:\masm32\examples\exampl10\timer_demos\unroll\unroll_test.exe has 62 bytes ;-)

hutch--

 :bg

> Shame you don't read the posts in your own forum. Both 'winner' algos have no problem with misalignment.

Years of reading posts leave you with a reasonably good idea of the value of an "atom cracking" string length algo. Let me think, "As useful as a hip pocket in a singlet", wjhat about the world's fastest "MessageBoxA" algo ? How about a hobbling horse in the Kentucky Derby ?  :P
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: herge on March 11, 2009, 07:29:29 AM
Hi jj2007:

Use windbg from Microsoft. It does display the xmm registers.

Regards herge

Thanks, herge. I will look into it. Olly has the same capacity, but it is somewhat hidden in the options.

herge


Hi jj2007:

See attachment a picture of windbg in action.

Regards herge

[attachment deleted by admin]
// Herge born  Brussels, Belgium May 22, 1907
// Died March 3, 1983
// Cartoonist of Tintin and Snowy

PBrennick

Hutch,

... llike a screen door on a submarine.

Paul
The GeneSys Project is available from:
The Repository or My crappy website