News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

strstr rewrite / core 2 6400 issue

Started by Xan, April 30, 2008, 07:23:20 AM

Previous topic - Next topic

Xan

Hi Everyone,
I posted this topic in a different section ( here ) and Hutch suggested I post my implementation on this board. The just of the issue is I've rewritten the standard strstr implementation found in Visual Studio to search for a substring in a carriage return terminated string ( that is also always 2*n bytes in lenght ). My code executes equal to or faster than the standard strstr function on my P4 at work, however, its slower on my Core 2 Duo 6400 at home ( 125 ticks vs 109 on a 40 MB string to search )! I've tried various rewrites and I can't get my version to operate as fast as the standard version. Its really puzzling me as I'm trying to understand what it is about the standard version that grants it faster processing time than mine on the 6400?

Here's my code:
strnstr proc C string:dword, tofind:dword
mov  esi, string
mov  edi, tofind
mov  ecx, esi
mov   dx, [edi]

strs_start:
mov  esi, ecx

strs_loop:
mov  ax, [esi]
add  esi, 2

cmp  al, dl
je   strs_f1

xor  al, 0Dh
jz   strs_eos

cmp  ah, dl
je   strs_fprep

xor  ah, 0Dh
jz   strs_eos

jmp  strs_loop

strs_f1:
cmp  ah, dh
jne  strs_loop

xor  ah, 0Dh
jz   strs_eos

strs_fprep:
mov  ebx, edi
mov  ecx, esi

strs_floop:
mov  ah, [ebx+2]
test ah, ah
jz   strs_gotit

mov  al, [esi]
cmp  al, ah
jne  strs_start

mov  ah, [ebx+3]
test ah, ah
jz   strs_gotit

mov  al, [esi+1]
cmp  al, ah
jne  strs_start

mov  ah, [ebx+4]
test ah, ah
jz   strs_gotit

mov  al, [esi+2]
cmp  al, ah
jne  strs_start

mov  ah, [ebx+5]
test ah, ah
jz   strs_gotit

mov  al, [esi+3]
cmp  al, ah
jne  strs_start

add  ebx, 4
add  esi, 4

jmp  strs_floop

strs_gotit:
mov  eax, ecx
sub  eax, 2
ret

strs_eos:
xor  eax, eax
ret

strnstr endp


The version from Visual Studio is attached. Any ideas?

Thanks!
Xan

[attachment deleted by admin]

hutch--

Xan,

Sorry to be a bit slow but I have had a lot to do lately. Just a quick look at your code but there are a couple of things I can see. First unless the register preservation scheme is peculiar to VC, it should preserve EBX ESI and EDI which has not been done. RE the speed of the algo, on most late hardware it is not a good thing to split the 2 low bytes of a register (AL AH) and especially if you are doing comparisons with them as it involves reading the same memory address and masking the results at a processor level.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Xan

Quote from: hutch-- on May 01, 2008, 09:39:42 AM
Xan,

Sorry to be a bit slow but I have had a lot to do lately. Just a quick look at your code but there are a couple of things I can see. First unless the register preservation scheme is peculiar to VC, it should preserve EBX ESI and EDI which has not been done. RE the speed of the algo, on most late hardware it is not a good thing to split the 2 low bytes of a register (AL AH) and especially if you are doing comparisons with them as it involves reading the same memory address and masking the results at a processor level.
Thanks for the reply Hutch! Yeah, I tried to just 'slim this down' and didn't worry about saving register values and whatnot, just to try to get it faster. Also, I agree about using AL with AH, but the real kicker is if you look at the strstr.asm file in the zip I attached, that's exactly what the 'standard' implementation does. Also, that said implementation only reads 1 byte at a time versus my code reading 2 at a time ( I tried altering mine to only read 1 byte and got even worse results ). Its really perplexing me something fierce! ;)

hutch--

Xan,

The old VC algo was never that fast and there are a couple of approaches that are worth playing with. The traditional search algos use a byte scanner which properly optimised are reasonably fast and not subject to read/write alignment problems that WORD and DWORD scanners suffer. The other is a different type of search algo and here I have a Boyer Moore algo in mind which on plain text are generally much faster than scanner type search algos of any size.

With a reasonable scanner you get it optimised as best as possible then try unrolling it to see if it gets faster. have a look at al algo in the masm32 library called bsearch.asm as it uses a scanner technique that is unrolled to get its pace up. The other but no joy to code is the 3 Boyer Moore variants in the lirary which truly have the legs on search string over about 6 or 7 bytes long.

Apart from the BM type algos, all scanners tend to be limited by memory access speed but if you can get up to that speed you will have come close to the best hardware can do at the moment.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mark_Larson

Xan can you post what you used for str2 and str1 in your algo?

I tried both your algo you posted here, and the one in the .zip.  The one posted here had an error finding my string.  So I used the one in the .zip.  I also have a core 2 duo.  I am not sure how you are timing your code, but in my case your code came out a few cycles faster.


str1 db 'http',0
str2    db 'asf asdf asdf asfd werygjtqlka;fdjva;l alsdjv aljdskvj alijdfcv acdv http',0



    LOOP_COUNT EQU 10000000
print chr$("Xan strstr: ")
    counter_begin LOOP_COUNT, REALTIME_PRIORITY_CLASS
    lea eax,str1
    push eax
    lea eax,STR2
    push eax
call strstr
add esp,8
    counter_end


  I got 176 CPU cycles on your algo, and 180 on Microsoft's.  I'll show you h'ow to speed it up tomorrow.  There is a lot you can do
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

NightWare

xan,
for optimization, you must not think only in terms of cpu cycles, but also of work made by the cpu...so replace :
xor al,0Dh ; <- work on al + flags fix
jz  strs_eos

by :
cmp al,0Dh ; <- only flags fix...
je  strs_eos]


jj2007

Quote from: NightWare on May 22, 2008, 10:24:57 PM
for optimization, you must not think only in terms of cpu cycles, but also of work made by the cpu...so replace :
You made me curious: what are the implications if you choose an alternative opcode that scores an equal number of cycles but makes the CPU work harder?

NightWare

Quote from: jj2007 on May 24, 2008, 07:56:40 PM
what are the implications if you choose an alternative opcode that scores an equal number of cycles but makes the CPU work harder?
in fact, no serious implications with the clock cycles system (made to avoid to count the bit changes, by just counting cycles where max possible operations were defined). but, more work mean slowdown, here it's simply a physic logic (but i must admit all the speed tests i've made, were made on my old celeron 700, some time ago...). more work also mean more heat... (it was the case for old processors with clock cycles system, a bit less now...).

now we speak about that, maybe we could ask to intel why they never changed their system, and give us the possibility to use a fast set of instructions (without the sporadically needed flags fix...). i mean something like qor, qand, qxor, ... (after all, those bolean operators are the base of coding...). the automatic flags fix are clearly an amount of useless work made by the cpu... (and here i speak only of the bolean operators but it's also the case for a lot of instructions...). by doing this they also don't give us the possibility to explore/create our own fast algorithms...