strstr rewrite / core 2 6400 issue

Xan · April 30, 2008, 07:23:20 AM

Hi Everyone,
I posted this topic in a different section ( here ) and Hutch suggested I post my implementation on this board. The just of the issue is I've rewritten the standard strstr implementation found in Visual Studio to search for a substring in a carriage return terminated string ( that is also always 2*n bytes in lenght ). My code executes equal to or faster than the standard strstr function on my P4 at work, however, its slower on my Core 2 Duo 6400 at home ( 125 ticks vs 109 on a 40 MB string to search )! I've tried various rewrites and I can't get my version to operate as fast as the standard version. Its really puzzling me as I'm trying to understand what it is about the standard version that grants it faster processing time than mine on the 6400?

Here's my code:

Code Select

strnstr proc C string:dword, tofind:dword
	mov  esi, string
	mov  edi, tofind
	mov  ecx, esi
	mov   dx, [edi]

strs_start:
	mov  esi, ecx
	
strs_loop:
	mov  ax, [esi]
	add  esi, 2
	
	cmp  al, dl
	je   strs_f1
	
	xor  al, 0Dh
	jz   strs_eos

	cmp  ah, dl
	je   strs_fprep
	
	xor  ah, 0Dh
	jz   strs_eos

	jmp  strs_loop

strs_f1:
	cmp  ah, dh
	jne  strs_loop
	
	xor  ah, 0Dh
	jz   strs_eos
	
strs_fprep:
	mov  ebx, edi
	mov  ecx, esi

strs_floop:
	mov  ah, [ebx+2]
	test ah, ah
	jz   strs_gotit

	mov  al, [esi]
	cmp  al, ah
	jne  strs_start

	mov  ah, [ebx+3]
	test ah, ah
	jz   strs_gotit

	mov  al, [esi+1]
	cmp  al, ah
	jne  strs_start

	mov  ah, [ebx+4]
	test ah, ah
	jz   strs_gotit

	mov  al, [esi+2]
	cmp  al, ah
	jne  strs_start

	mov  ah, [ebx+5]
	test ah, ah
	jz   strs_gotit

	mov  al, [esi+3]
	cmp  al, ah
	jne  strs_start

	add  ebx, 4
	add  esi, 4

	jmp  strs_floop

strs_gotit:
	mov  eax, ecx
	sub  eax, 2
	ret

strs_eos:
	xor  eax, eax
	ret

strnstr endp

The version from Visual Studio is attached. Any ideas?

Thanks!
Xan

[attachment deleted by admin]

hutch-- · May 01, 2008, 09:39:42 AM

Xan,

Sorry to be a bit slow but I have had a lot to do lately. Just a quick look at your code but there are a couple of things I can see. First unless the register preservation scheme is peculiar to VC, it should preserve EBX ESI and EDI which has not been done. RE the speed of the algo, on most late hardware it is not a good thing to split the 2 low bytes of a register (AL AH) and especially if you are doing comparisons with them as it involves reading the same memory address and masking the results at a processor level.

Xan · May 06, 2008, 05:28:41 AM

Quote from: hutch-- on May 01, 2008, 09:39:42 AM
Xan,

Sorry to be a bit slow but I have had a lot to do lately. Just a quick look at your code but there are a couple of things I can see. First unless the register preservation scheme is peculiar to VC, it should preserve EBX ESI and EDI which has not been done. RE the speed of the algo, on most late hardware it is not a good thing to split the 2 low bytes of a register (AL AH) and especially if you are doing comparisons with them as it involves reading the same memory address and masking the results at a processor level.

Thanks for the reply Hutch! Yeah, I tried to just 'slim this down' and didn't worry about saving register values and whatnot, just to try to get it faster. Also, I agree about using AL with AH, but the real kicker is if you look at the strstr.asm file in the zip I attached, that's exactly what the 'standard' implementation does. Also, that said implementation only reads 1 byte at a time versus my code reading 2 at a time ( I tried altering mine to only read 1 byte and got even worse results ). Its really perplexing me something fierce! ;)

hutch-- · May 06, 2008, 05:53:34 AM

Xan,

The old VC algo was never that fast and there are a couple of approaches that are worth playing with. The traditional search algos use a byte scanner which properly optimised are reasonably fast and not subject to read/write alignment problems that WORD and DWORD scanners suffer. The other is a different type of search algo and here I have a Boyer Moore algo in mind which on plain text are generally much faster than scanner type search algos of any size.

With a reasonable scanner you get it optimised as best as possible then try unrolling it to see if it gets faster. have a look at al algo in the masm32 library called bsearch.asm as it uses a scanner technique that is unrolled to get its pace up. The other but no joy to code is the 3 Boyer Moore variants in the lirary which truly have the legs on search string over about 6 or 7 bytes long.

Apart from the BM type algos, all scanners tend to be limited by memory access speed but if you can get up to that speed you will have come close to the best hardware can do at the moment.

Mark_Larson · May 19, 2008, 01:41:19 AM

Xan can you post what you used for str2 and str1 in your algo?

I tried both your algo you posted here, and the one in the .zip. The one posted here had an error finding my string. So I used the one in the .zip. I also have a core 2 duo. I am not sure how you are timing your code, but in my case your code came out a few cycles faster.

Code Select


	str1	db	'http',0
	str2    db	'asf asdf asdf asfd werygjtqlka;fdjva;l alsdjv aljdskvj alijdfcv acdv http',0

Code Select


    LOOP_COUNT EQU 10000000
	print		chr$("Xan strstr: ")
    counter_begin LOOP_COUNT, REALTIME_PRIORITY_CLASS
    lea			eax,str1
    push		eax
    lea			eax,STR2
    push		eax
	call		strstr
	add			esp,8
    counter_end

Code Select

I got 176 CPU cycles on your algo, and 180 on Microsoft's. I'll show you h'ow to speed it up tomorrow. There is a lot you can do

NightWare · May 22, 2008, 10:24:57 PM

xan,
for optimization, you must not think only in terms of cpu cycles, but also of work made by the cpu...so replace :

Code Select

 xor al,0Dh	; <- work on al + flags fix
 jz  strs_eos

by :

Code Select

 cmp al,0Dh	; <- only flags fix...
 je  strs_eos]

jj2007 · May 24, 2008, 07:56:40 PM

Quote from: NightWare on May 22, 2008, 10:24:57 PM
for optimization, you must not think only in terms of cpu cycles, but also of work made by the cpu...so replace :

You made me curious: what are the implications if you choose an alternative opcode that scores an equal number of cycles but makes the CPU work harder?

NightWare · May 25, 2008, 01:43:50 AM

Quote from: jj2007 on May 24, 2008, 07:56:40 PM
what are the implications if you choose an alternative opcode that scores an equal number of cycles but makes the CPU work harder?

in fact, no serious implications with the clock cycles system (made to avoid to count the bit changes, by just counting cycles where max possible operations were defined). but, more work mean slowdown, here it's simply a physic logic (but i must admit all the speed tests i've made, were made on my old celeron 700, some time ago...). more work also mean more heat... (it was the case for old processors with clock cycles system, a bit less now...).

now we speak about that, maybe we could ask to intel why they never changed their system, and give us the possibility to use a fast set of instructions (without the sporadically needed flags fix...). i mean something like qor, qand, qxor, ... (after all, those bolean operators are the base of coding...). the automatic flags fix are clearly an amount of useless work made by the cpu... (and here i speak only of the bolean operators but it's also the case for a lot of instructions...). by doing this they also don't give us the possibility to explore/create our own fast algorithms...

News:

strstr rewrite / core 2 6400 issue

Xan

hutch--

Xan

hutch--

Mark_Larson

NightWare

jj2007

NightWare