Hi Everyone,
I posted this topic in a different section ( here (http://www.masm32.com/board/index.php?topic=9155.0) ) and Hutch suggested I post my implementation on this board. The gist of the issue is that I've rewritten the standard strstr implementation found in Visual Studio to search for a substring in a carriage-return-terminated string ( one that is also always 2*n bytes in length ). My code executes as fast as or faster than the standard strstr function on my P4 at work; however, it's slower on my Core 2 Duo 6400 at home ( 125 ticks vs. 109 on a 40 MB string to search )! I've tried various rewrites and I can't get my version to run as fast as the standard version. It's really puzzling me, as I'm trying to understand what it is about the standard version that lets it run faster than mine on the 6400.
Here's my code:
strnstr proc C string:dword, tofind:dword
    mov esi, string            ; esi = string to search
    mov edi, tofind            ; edi = substring to find
    mov ecx, esi               ; ecx = current scan position
    mov dx, [edi]              ; dl/dh = first two bytes of the substring
strs_start:
    mov esi, ecx               ; resume scanning from the saved position
strs_loop:
    mov ax, [esi]              ; read two bytes of the string at a time
    add esi, 2
    cmp al, dl                 ; low byte matches the first substring byte?
    je strs_f1
    xor al, 0Dh                ; low byte is the CR terminator?
    jz strs_eos
    cmp ah, dl                 ; high byte matches the first substring byte?
    je strs_fprep
    xor ah, 0Dh                ; high byte is the CR terminator?
    jz strs_eos
    jmp strs_loop
strs_f1:
    cmp ah, dh                 ; next string byte matches the second substring byte?
    jne strs_loop
    xor ah, 0Dh                ; stop if that byte is the CR terminator
    jz strs_eos
strs_fprep:
    mov ebx, edi               ; ebx walks the substring
    mov ecx, esi               ; remember where to resume on a mismatch
strs_floop:                    ; compare the remaining bytes, unrolled four per pass
    mov ah, [ebx+2]            ; next substring byte (third onward)
    test ah, ah                ; zero terminator reached = whole substring matched
    jz strs_gotit
    mov al, [esi]
    cmp al, ah
    jne strs_start
    mov ah, [ebx+3]
    test ah, ah
    jz strs_gotit
    mov al, [esi+1]
    cmp al, ah
    jne strs_start
    mov ah, [ebx+4]
    test ah, ah
    jz strs_gotit
    mov al, [esi+2]
    cmp al, ah
    jne strs_start
    mov ah, [ebx+5]
    test ah, ah
    jz strs_gotit
    mov al, [esi+3]
    cmp al, ah
    jne strs_start
    add ebx, 4
    add esi, 4
    jmp strs_floop
strs_gotit:
    mov eax, ecx
    sub eax, 2                 ; return the saved position minus 2 (start of the match)
    ret
strs_eos:
    xor eax, eax               ; terminator hit: substring not found, return 0
    ret
strnstr endp
The version from Visual Studio is attached. Any ideas?
Thanks!
Xan
[attachment deleted by admin]
Xan,
Sorry to be a bit slow but I have had a lot to do lately. Just a quick look at your code, but there are a couple of things I can see. First, unless the register preservation scheme is peculiar to VC, it should preserve EBX, ESI and EDI, which has not been done. RE the speed of the algo, on most late hardware it is not a good thing to split the 2 low bytes of a register (AL, AH), especially if you are doing comparisons with them, as it involves reading the same memory address and masking the results at a processor level.
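For illustration only, preserving those registers with MASM's uses clause might look like this (a minimal sketch; the body of the procedure stays as posted):
strnstr proc C uses ebx esi edi string:dword, tofind:dword
    ; ... body as posted; MASM emits the push/pop of ebx, esi and edi ...
strnstr endp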
Quote from: hutch-- on May 01, 2008, 09:39:42 AM
Xan,
Sorry to be a bit slow but I have had a lot to do lately. Just a quick look at your code, but there are a couple of things I can see. First, unless the register preservation scheme is peculiar to VC, it should preserve EBX, ESI and EDI, which has not been done. RE the speed of the algo, on most late hardware it is not a good thing to split the 2 low bytes of a register (AL, AH), especially if you are doing comparisons with them, as it involves reading the same memory address and masking the results at a processor level.
Thanks for the reply, Hutch! Yeah, I tried to just 'slim this down' and didn't worry about saving register values and whatnot, just to try to get it faster. Also, I agree about using AL with AH, but the real kicker is that if you look at the strstr.asm file in the zip I attached, that's exactly what the 'standard' implementation does. And that implementation only reads 1 byte at a time, versus my code reading 2 at a time ( I tried altering mine to only read 1 byte and got even worse results ). It's really perplexing me something fierce! ;)
Xan,
The old VC algo was never that fast and there are a couple of approaches that are worth playing with. The traditional search algos use a byte scanner which, properly optimised, is reasonably fast and not subject to the read/write alignment problems that WORD and DWORD scanners suffer. The other is a different type of search algo, and here I have a Boyer Moore algo in mind, which on plain text is generally much faster than scanner-type search algos of any size.
With a reasonable scanner you get it optimised as best as possible, then try unrolling it to see if it gets faster. Have a look at an algo in the masm32 library called bsearch.asm, as it uses a scanner technique that is unrolled to get its pace up. The other option, though no joy to code, is the 3 Boyer Moore variants in the library, which truly have the legs on search strings over about 6 or 7 bytes long.
Apart from the BM type algos, all scanners tend to be limited by memory access speed, but if you can get up to that speed you will have come close to the best the hardware can do at the moment.
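To make the unrolling idea concrete, here is a rough sketch of the inner loop of a byte scanner unrolled by four, in the spirit of bsearch.asm but not the library code itself. It assumes esi points at the text and dl holds the byte being scanned for; the end-of-buffer/terminator test is omitted for brevity and the labels are hypothetical:
scan_loop:
    mov al, [esi]              ; test four bytes per pass to cut loop overhead
    cmp al, dl
    je scan_hit
    mov al, [esi+1]
    cmp al, dl
    je scan_hit1
    mov al, [esi+2]
    cmp al, dl
    je scan_hit2
    mov al, [esi+3]
    cmp al, dl
    je scan_hit3
    add esi, 4
    jmp scan_loop
scan_hit3:
    inc esi                    ; fall through: adjust esi by the hit offset
scan_hit2:
    inc esi
scan_hit1:
    inc esi
scan_hit:
    mov eax, esi               ; eax = address of the matching byte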
Xan, can you post what you used for str2 and str1 in your algo?
I tried both the algo you posted here and the one in the .zip. The one posted here had an error finding my string, so I used the one in the .zip. I also have a Core 2 Duo. I am not sure how you are timing your code, but in my case your code came out a few cycles faster.
str1 db 'http',0
str2 db 'asf asdf asdf asfd werygjtqlka;fdjva;l alsdjv aljdskvj alijdfcv acdv http',0
LOOP_COUNT EQU 10000000
print chr$("Xan strstr: ")
counter_begin LOOP_COUNT, REALTIME_PRIORITY_CLASS
    lea eax, str1              ; substring to find (pushed first = second argument)
    push eax
    lea eax, str2              ; string to search (pushed last = first argument)
    push eax
    call strstr
    add esp, 8                 ; C calling convention: caller clears the stack
counter_end
I got 176 CPU cycles on your algo, and 180 on Microsoft's. I'll show you how to speed it up tomorrow. There is a lot you can do.
xan,
for optimization, you must not think only in terms of cpu cycles, but also of the work done by the cpu... so replace:
    xor al, 0Dh    ; <- modifies al + sets the flags
    jz strs_eos
with:
    cmp al, 0Dh    ; <- only sets the flags...
    je strs_eos
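Applied to the inner loop of the posted routine, that substitution would look roughly like this (a sketch of the change only; the rest of the procedure is untouched):
strs_loop:
    mov ax, [esi]
    add esi, 2
    cmp al, dl
    je strs_f1
    cmp al, 0Dh        ; check for the CR terminator without modifying al
    je strs_eos
    cmp ah, dl
    je strs_fprep
    cmp ah, 0Dh        ; likewise for the high byte
    je strs_eos
    jmp strs_loop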
Quote from: NightWare on May 22, 2008, 10:24:57 PM
for optimization, you must not think only in terms of cpu cycles, but also of the work done by the cpu... so replace:
You made me curious: what are the implications if you choose an alternative opcode that scores an equal number of cycles but makes the CPU work harder?
Quote from: jj2007 on May 24, 2008, 07:56:40 PM
what are the implications if you choose an alternative opcode that scores an equal number of cycles but makes the CPU work harder?
In fact, no serious implications under the clock-cycle system (which was made to avoid counting the individual bit changes, by just counting cycles in which a maximum number of possible operations is defined). But more work means slowdown; here it's simply physics (though I must admit all the speed tests I've made were made on my old Celeron 700, some time ago...). More work also means more heat... (that was the case for old processors under the clock-cycle system, a bit less so now...).
Now that we're speaking about it, maybe we could ask Intel why they never changed their system and gave us the possibility of using a fast set of instructions (without the only sporadically needed flags update...). I mean something like qor, qand, qxor, ... (after all, those Boolean operators are the base of coding...). The automatic flags update is clearly an amount of useless work done by the cpu... (and here I speak only of the Boolean operators, but it's also the case for a lot of instructions...). By doing this they also don't give us the possibility to explore/create our own fast algorithms...