szLen optimize...

jj2007 · March 10, 2009, 11:33:31 PM

I was not satisfied with the performance of my algo for short strings, so I fumbled together a variant, strlen32b. It is now almost on par with Lingo's algo for short strings, and about 2% faster for very long strings. Of course, AMD, Core2 and P4 might look different again - the Celeron M is "Core" but not "Core Duo" ::)

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
codesizes: strlen32=92, strlen32b=114, strlen64A=117, _strlen=66

ERROR in strlen64A at ct 16: 14 bytes instead of 15
-- test 16k           return values Lingo, jj, Agner: 16384, 16384, 16384
strlen32      :       2881 cycles
strlen32b     :       2936 cycles
strlen64LingoA :      3024 cycles
_strlen (Agner Fog):  4250 cycles

-- test 4k            return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       3800 cycles
strlen32      :       743 cycles
strlen32b     :       744 cycles
strlen64LingoA :      774 cycles
_strlen (Agner Fog):  1103 cycles

-- test 0             return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       101 cycles
strlen32      :       31 cycles
strlen32b     :       29 cycles
strlen64LingoA :      25 cycles
_strlen (Agner Fog):  30 cycles

-- test 1             return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       111 cycles
strlen32      :       31 cycles
strlen32b     :       35 cycles
strlen64LingoA :      37 cycles
_strlen (Agner Fog):  34 cycles

-- test 3             return values Lingo, jj, Agner: 11, 14, 14
crt_strlen    :       23 cycles
strlen32      :       20 cycles
strlen32b     :       16 cycles
strlen64LingoA :      14 cycles
_strlen (Agner Fog):  14 cycles

-- test 15            return values Lingo, jj, Agner: -1, 14, 14
crt_strlen    :       23 cycles
strlen32      :       20 cycles
strlen32b     :       16 cycles
strlen64LingoA :      14 cycles
_strlen (Agner Fog):  14 cycles

[attachment deleted by admin]

PBrennick · March 11, 2009, 12:48:59 AM

JJ,

hmmm? No kidding? Misalignment of a 'one byte aligned' string... I do not know whether to laugh or cry. I do not think you 'got' the point Hutch was trying to make.

Paul

NightWare · March 11, 2009, 01:25:59 AM

hi lingo,

Code Select

	por xmm1, xmm0
	pxor xmm2, xmm2	; why ? why do you want to use xmm2 ?
	pmovmskb edx, xmm1
	pxor xmm1, xmm1	; why ? if the mask in edx = 0 then both xmm0 and xmm1 = 0
	test edx, edx
	>>jnz<< Ex_1

hi jj, same thing for you in your algo, why do you want to clean the simd registers when it's not needed ?
plus, if you really want to take care of the speed for small strings, you should avoid the jump (it's a potential misprediction).

herge · March 11, 2009, 02:35:32 AM

Hi jj2007:

Here is my latest results:

Code Select


Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz (SSE4)
codesizes: strlen32=92, strlen32b=114, strlen64A=117, _strlen=66

ERROR in TestAlgo at ct 16: 14 bytes instead of 15
-- test 16k	      return values Lingo, jj, Agner: 16384, 16384, 16384
strlen32      :       1503 cycles
strlen32b     :       1512 cycles
strlen64LingoA :      1138 cycles
_strlen (Agner Fog):  2814 cycles

-- test 4k	      return values Lingo, jj, Agner: 4096, 4096, 4096
crt_strlen    :       2425 cycles
strlen32      :       410 cycles
strlen32b     :       403 cycles
strlen64LingoA :      325 cycles
_strlen (Agner Fog):  716 cycles

-- test 0	      return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       61 cycles
strlen32      :       16 cycles
strlen32b     :       15 cycles
strlen64LingoA :      14 cycles
_strlen (Agner Fog):  19 cycles

-- test 1	      return values Lingo, jj, Agner: 95, 95, 95
crt_strlen    :       61 cycles
strlen32      :       15 cycles
strlen32b     :       18 cycles
strlen64LingoA :      26 cycles
_strlen (Agner Fog):  20 cycles

-- test 3	      return values Lingo, jj, Agner: 11, 14, 14
crt_strlen    :       14 cycles
strlen32      :       10 cycles
strlen32b     :       8 cycles
strlen64LingoA :      6 cycles
_strlen (Agner Fog):  7 cycles

-- test 15	      return values Lingo, jj, Agner: -1, 14, 14
crt_strlen    :       14 cycles
strlen32      :       10 cycles
strlen32b     :       8 cycles
strlen64LingoA :      6 cycles
_strlen (Agner Fog):  7 cycles

Press any key to exit...

Regards herge

jj2007 · March 11, 2009, 06:54:55 AM

Quote from: PBrennick on March 11, 2009, 12:48:59 AM
JJ,

hmmm? No kidding? Misalignment of a 'one byte aligned' string... I do not know whether to laugh or cry. I do not think you 'got' the point Hutch was trying to make.

Paul

Explain, please. I don't get your point.

sinsi · March 11, 2009, 07:03:23 AM

Well, jj, put it this way...

Code Select


ALIGN 1

Unless we can ALIGN 0.5, byte align is...everything...'get' it?

jj2007 · March 11, 2009, 07:05:49 AM

Quote from: NightWare on March 11, 2009, 01:25:59 AM
hi jj, same thing for you in your algo, why do you want to clean the simd registers when it's not needed ?
plus, if you really want to take care of the speed for small strings, you should avoid the jump (it's a potential misprediction).

Hi NightWare, thanks a lot for reading this thoroughly :thumbu
I have wondered myself whether clearing is not needed in some places, but was not sure. In fact, I took one out tonight, see below, ; pxor xmm0, xmm0. Could you please indicate where you consider it not needed?

As to the nullptr jump, I had to introduce it because it failed for null strings (and I admit I ws too tired to analyse the reason; plus, my Olly version here does not display xmm registers :dazzled:).

Code Select

strlen32b proc src:DWORD			; jj 9 March 2007, 92 (down from 103) bytes; 0.176 cycles/byte at 16k
	mov ecx, [esp+4]			; get pointer to string: -- this part taken from Agner Fog ----
	mov al, [ecx]		; test for Null$
	test al, al
	je nullptr
	pxor xmm0, xmm0			; set to zero for comparison
	mov eax, ecx			; copy pointer
	pxor xmm1, xmm1
	and eax, -16			; align pointer by 16
	and ecx, 15			; lower 4 bits indicate misalignment
	pcmpeqb xmm1, [eax]		; read 16 from nearest preceding boundary and compare with zero
	; lea eax, [eax+16]
	add eax, 16
	pcmpeqb xmm0, [eax]		; ecx is pointer to initial string, 16-byte aligned
	por xmm1, xmm0
	pmovmskb edx, xmm1		; get one bit for each byte result; - OK, isnull, stays null, Z=1, Bad: notnull, willbenull, z=0
	shr edx, cl			; shift out false bits	** compliments to Agner, **
	shl edx, cl			; shift back again	** this is called genius ;-) **
	test edx, edx
	jnz fdr1
	add eax, 16			; correct aligned pointer for bytes already treated above (lea exactly same cycles)
	; pxor xmm0, xmm0	(must be 0)	; reset to zero for comparisons below
	pxor xmm1, xmm1			; align 16 no good, costs about one cycle extra

@@:	pcmpeqb xmm0, [eax]		; ------ this part taken from Lingo, with adaptions ------
	pcmpeqb xmm1, [eax+16]		; ecx is pointer to initial string, 16-byte aligned
	por xmm1, xmm0
	; add eax, 32 ; is marginally slower than lea
	lea eax, [eax+32]			; len counter (moving up lea or add costs 3 cycles for the 191 byte string)
	pmovmskb edx, xmm1
	test edx, edx
	jz @B
	sub eax, 32			; subtract initial bytes

fdr1:	pmovmskb ecx, xmm0
	shl edx, 16			; bswap works, too, but one cycle slower
	or edx, ecx
	bsf edx, edx
	add eax, edx
	sub eax, [esp+4]
nullptr:	ret 4
strlen32b endp

jj2007 · March 11, 2009, 07:13:19 AM

Quote from: sinsi on March 11, 2009, 07:03:23 AM
Well, jj, put it this way...
Code Select Expand
ALIGN 1
Unless we can ALIGN 0.5, byte align is...everything...'get' it?

Yeah, of course :boohoo: I was desperately trying to find align 1 in the code I posted, but now I realise Paul means Hutch's statement. But again, this is deliberately trying to misunderstand him: What he meant (apparently - I also risk to misinterpret him) is that strings are in general not aligned on a paragraph or even dword border, and that algos for aligned strings are therefore pretty useless. Lingo's first version had that problem, but he fixed it (but still has a minor problem for very short strings). So both "fast" algos are general purpose, if you assume that the user has a modern CPU

sinsi · March 11, 2009, 07:26:47 AM

I think that strings > MAX_PATH are rare - most strings (that you need to get the length of) are short (filenames etc.) so there is no need for a 'long' string scanner.
Mind you, for a long string scanner, you have possibly opened up a small niche, and the replies re optimisation can often apply to other bits of code.

Thanks to you, I've been looking at SSEx instructions and broadening my asm horizons (at least I think 'thanks, but' since they're a bit hard atm :bdg)

herge · March 11, 2009, 07:29:29 AM

Hi jj2007:

Use windbg from Microsoft. It does display the xmm registers.

Regards herge

jj2007 · March 11, 2009, 08:13:58 AM

Quote from: sinsi on March 11, 2009, 07:26:47 AM
I think that strings > MAX_PATH are rare - most strings (that you need to get the length of) are short (filenames etc.) so there is no need for a 'long' string scanner.
Mind you, for a long string scanner, you have possibly opened up a small niche, and the replies re optimisation can often apply to other bits of code.

Thanks to you, I've been looking at SSEx instructions and broadening my asm horizons (at least I think 'thanks, but' since they're a bit hard atm :bdg)

Long strings are rare, that's correct. But the new algos are a factor 3 faster than len(My$):

Code Select

-- test 0             return values Lingo, jj, Agner: 62, 62, 62
Masm32 lib szLen :    88 cycles
crt_strlen    :       81 cycles
strlen32b     :       23 cycles
strlen64LingoA :      24 cycles
_strlen (Agner Fog):  26 cycles

-- test 1             return values Lingo, jj, Agner: 62, 62, 62
Masm32 lib szLen :    89 cycles
crt_strlen    :       81 cycles
strlen32b     :       23 cycles
strlen64LingoA :      23 cycles
_strlen (Agner Fog):  22 cycles

D:\masm32\examples\exampl10\timer_demos\unroll\unroll_test.exe has 62 bytes ;-)

hutch-- · March 11, 2009, 10:21:19 AM

:bg

> Shame you don't read the posts in your own forum. Both 'winner' algos have no problem with misalignment.

Years of reading posts leave you with a reasonably good idea of the value of an "atom cracking" string length algo. Let me think, "As useful as a hip pocket in a singlet", wjhat about the world's fastest "MessageBoxA" algo ? How about a hobbling horse in the Kentucky Derby ? :P

jj2007 · March 11, 2009, 10:39:44 AM

Quote from: herge on March 11, 2009, 07:29:29 AM
Hi jj2007:

Use windbg from Microsoft. It does display the xmm registers.

Regards herge

Thanks, herge. I will look into it. Olly has the same capacity, but it is somewhat hidden in the options.

herge · March 11, 2009, 12:06:37 PM

Hi jj2007:

See attachment a picture of windbg in action.

Regards herge

[attachment deleted by admin]

PBrennick · March 11, 2009, 12:14:23 PM

Hutch,

... llike a screen door on a submarine.

Paul

News:

szLen optimize...