News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Checking whether a string contains another string

Started by uosunsetter, April 18, 2012, 11:30:01 PM

Previous topic - Next topic

MichaelW

One of the reasons for the slowness of the API functions may be built-in buffer-overrun protections that the CRT functions lack. See the Security Note here:

http://msdn.microsoft.com/en-us/library/z9da80kz(v=vs.71).aspx

But this does not explain why StrStrI is so much slower than StrStr. I was able to throw together a replacement for StrStrI:

        push esi
        push edi
        invoke crt__strdup, ADDR str1
        mov esi, eax
        invoke crt__strdup, ADDR str2
        mov edi, eax
        invoke crt__strlwr, esi
        invoke crt__strlwr, edi
        invoke StrStr, esi, edi
        invoke crt_free, esi
        invoke crt_free, edi
        pop esi
        pop edi


That running on my P3 is some 7 times faster than StrStrI.
eschew obfuscation

SteveAsm

Quote from: MichaelW on April 19, 2012, 06:49:50 PM
...See the Security Note here:

Yeh, scary stuff.

Quote
But this does not explain why StrStrI is so much slower than StrStr. I was able to throw together a replacement for StrStrI: ...
That running on my P3 is some 7 times faster than StrStrI.

7x is a significant improvement.

jj2007

It's probably bad design. Even the Unicode version of MasmBasic Instr, wInstr, runs in 1600 cycles instead of 35000:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
852     cycles for MyTestCode (UoSunSetter)
430     cycles for InString, result=100
391     cycles for Instr_ (MasmBasic), result=100
379     cycles for Instr_, case-insensitive, result=100
1117    cycles for wInstr, case-sensitive, result=100
1564    cycles for wInstr, case-insensitive, result=100
383     cycles for C strstr, result=100
937     cycles for StrStr, result=100
35480   cycles for StrStrI, result=100

... and that one works fine with case-insensitive Russian Unicode strings, for example. Why this would need a check for buffer overrun is a mystery - we are talking zero-delimited strings here ::)

uosunsetter

OMG! How about this? :D I hope I didn't make any mistake on benchmark... I put an attachment which includes benchmark. I set MasmBasic flag to 0 'cause couldn't wait to see the results :D

Intel(R) Core(TM) i7 CPU       Q 720  @ 1.60GHz (SSE4)
56   cycles for MyTestCode (UoSunSetter)
560   cycles for InString, result=100
524   cycles for C strstr, result=100
1163   cycles for StrStr, result=100
44536   cycles for StrStrI, result=100

56   cycles for MyTestCode (UoSunSetter)
571   cycles for InString, result=100
468   cycles for C strstr, result=100
1176   cycles for StrStr, result=100
43617   cycles for StrStrI, result=100

.386
.model flat,stdcall
option casemap:none

include     \masm32\include\windows.inc
include     \masm32\include\kernel32.inc
include     \masm32\include\masm32.inc
include     \masm32\include\msvcrt.inc

includelib  \masm32\lib\kernel32.lib
includelib  \masm32\lib\msvcrt.lib
includelib  \masm32\lib\masm32.lib



.data
msgyes db 'yes ',0
msgno db 'no ',0

.data?
str1 db 100 dup(?)
str2 db 100 dup(?)

.code

start:

;Get 2 strings
invoke StdIn,ADDR str1,100
invoke StdIn,ADDR str2,100

;Get the length of str1 and save it to bl
xor bl, bl ;Zero bl
lea esi, str1

repeatstr1: ;repeat until the end of str1
mov al, [esi]
inc esi
inc bh
cmp al, 0
jne repeatstr1



;Get the length of str2 and save it to bh
xor bh, bh ;Zero bh
lea esi, str2

repeatstr2: ;repeat until the end of str2
mov al, [esi]
inc esi
inc bh
cmp al, 0
jne repeatstr2

;Decrease bl and bh because there will be no need to compare '0'
dec bl
dec bh

;Set cl to know how many times to check(in my case bl will be always greater then bh)
mov cl, bl
sub cl, bh
inc cl

;Set initials
lea esi, str1
lea edi, str2
lea edx, str1 ;will need a copy
mov ch, bh; another copy for len of str2
jmp MatchFirstLetters ;start by finding first match


PrepareForNextMatchingCheck:     ;Prepare registers for finding next first letter match also test if it is the last check
inc edx
mov esi, edx
dec cl
cmp cl, 0
je NotFound



MatchFirstLetters:
mov al, [esi]
mov ah, [edi]
cmp al, ah
jne PrepareForNextMatchingCheck ;no luck? try next

lea edi, str2 ;need to reset edi since it will manipulated by CompareTheRest
inc edi ;first letter already checked so start from second one
inc esi ;same as above
mov bh, ch ;reset the iterator used in CompareTheRest


CompareTheRest: ;Match the rest of the chars one by one
mov al, [esi]
mov ah, [edi]
cmp al, ah
jne PrepareForNextMatchingCheck
dec bh
cmp bh, 1
je Found ;if comparision is done str2's length-1 times and there is no mismatch then found
inc esi
inc edi
jmp CompareTheRest




NotFound:
mov eax, 0
jmp finish
Found:
mov eax, 1
jmp finish

;end
finish:
invoke ExitProcess,0

END start

uosunsetter

Forgot to delete driver prefix from include lines in asm file included in benchmark.zip :(

Here is the attachment with fix...

SteveAsm

In every test I ran this, the C strstr ran the fastest by about 8-9%.


Intel(R) Celeron(R) M processor         1.70GHz (SSE2)
856         cycles for MyTestCode (UoSunSetter)
430         cycles for InString, result=100
399         cycles for Instr_ (MasmBasic), result=100
404         cycles for Instr_, case-insensitive, result=100
1124        cycles for wInstr, case-sensitive, result=100
1606        cycles for wInstr, case-insensitive, result=100
369         cycles for C strstr, result=100
984         cycles for StrStr, result=100
35407       cycles for StrStrI, result=100

849         cycles for MyTestCode (UoSunSetter)
430         cycles for InString, result=100
403         cycles for Instr_ (MasmBasic), result=100
405         cycles for Instr_, case-insensitive, result=100
1262        cycles for wInstr, case-sensitive, result=100
1428        cycles for wInstr, case-insensitive, result=100
369         cycles for C strstr, result=100
930         cycles for StrStr, result=100
35429       cycles for StrStrI, result=100



MasmBasic was always a close second.

uosunsetter

Can it be the old codes I've written? Did you download the latest attachment I posted?

SteveAsm

Quote from: uosunsetter on April 19, 2012, 11:51:59 PM
Can it be the old codes I've written?

Yes, it is the old code.
Not a mistake, I was trying to figure why the API functions are so slow.

Quote
Did you download the latest attachment I posted?

Yes, I did download the new code.
I haven't had the chance to look at the source code yet.
But, the result is quite impressive, if it's right.

SteveAsm

Quote from: uosunsetter on April 19, 2012, 10:39:49 PM
OMG! How about this? :D

Okay, here is your latest benchmark:

Intel(R) Celeron(R) M processor         1.70GHz (SSE2)
51          cycles for MyTestCode (UoSunSetter)
427         cycles for InString, result=100
369         cycles for C strstr, result=100
938         cycles for StrStr, result=100
35567       cycles for StrStrI, result=100

49          cycles for MyTestCode (UoSunSetter)
431         cycles for InString, result=100
393         cycles for C strstr, result=100
944         cycles for StrStr, result=100
35595       cycles for StrStrI, result=100



My concern, is that since MyTestCode does not output a value, we don't really know if it's working correctly.
I think it must be short circuiting somewhere.

SteveAsm

Okay,
looking at your code for MyTestCode, I discovered a flaw in the execution.

When the benchmark executes, it does not wait for any input.
All the other programs use a static string to test against, MyTestCode does not.
As the code is written, it wants the user to enter str1 and str2. This never happens.

So, it proceeds to compare two empty strings and it fails.
For varification, I made two modifications to the code to test this.

Additionally, when I assemble and test the code, it works, but, fails the test as well.
Here are the two changes I made:

.data
    msgyes         db  'yes ',0
    msgno          db  'no ',0
    msgFailed      db  'procedure failed',0
    msgSuccess     db  'procedure suceeded',0
   


NotFound:
    mov eax, 0
    printstr msgFailed    ; my own print routine
    printstr lf

    jmp finish
Found:
    mov eax, 1
    printstr msgSuccess    ; my own print routine
    printstr lf

    jmp finish

;end
finish:
    call _getch          ; pause

    invoke ExitProcess,0
END start



The output message is: "procedure failed"

EDIT:
I just compared the code you posted above with the code in "benchmark.asm" and I see a difference.
I was testing the code above.
The "benchmark.asm" code does use static strings for comparison, but, I believe it is still failing.

MichaelW

I would expect the CRT strstr function to be fairly fast on short strings. But my previous tests against code that utilizes a Boyer-Moore search (the FreeBASIC Instr function), and the tests below, suggest that strstr is nothing fancy, just simple compiler-optimized code.

;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
.data
    str1 db "The three algorithms are variations of a Boyer Moore design.",0
    size1 equ $-str1
    str2 db "design.",0
    size2 equ $-str2
    str3 db "The three algorithms are variations of a Boyer Moore design."
         db "The original design BMBinSearch is the most flexible across "
         db "different search conditions. The two variations BMHBinsearch "
         db "and SBMBinSearch are respectively a Horspool variation and a "
         db "simplified variation which perform well depending on the "
         db "hardware and the patterns being searched."
         db "The minimum pattern length is 2 bytes but these algorithms are "
         db "slower on very short paterns than a classic byte scanner."
         db "The algorithms have about a 300 character penalty at startup "
         db "as they must construct a table so for either very short "
         db "patterns or short data it is more efficient to use a byte "
         db "scanner.",0
    size3 equ $-str3
    str4 db "it is more efficient to use a byte scanner.",0
    size4 equ $-str4
.code
;==============================================================================
start:

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    invoke crt_strstr, ADDR str1, ADDR str2
    sub eax, OFFSET str1
    printf("%d\n",eax)
    invoke BMBinSearch, 0, ADDR str1, size1, ADDR str2, size2
    printf("%d\n\n",eax)

    invoke crt_strstr, ADDR str3, ADDR str4
    sub eax, OFFSET str3
    printf("%d\n",eax)
    invoke BMBinSearch, 0, ADDR str3, size3, ADDR str4, size4
    printf("%d\n\n",eax)

    invoke Sleep, 3000

    counter_begin 1000000, HIGH_PRIORITY_CLASS
        invoke crt_strstr, ADDR str1, ADDR str2
    counter_end
    printf("%d cycles, strstr\n", eax)

    counter_begin 1000000, HIGH_PRIORITY_CLASS
        invoke BMBinSearch, 0, ADDR str1, size1, ADDR str2, size2
    counter_end
    printf("%d cycles, BMBinSearch\n\n", eax)

    counter_begin 1000000, HIGH_PRIORITY_CLASS
        invoke crt_strstr, ADDR str3, ADDR str4
    counter_end
    printf("%d cycles, strstr\n", eax)

    counter_begin 1000000, HIGH_PRIORITY_CLASS
        invoke BMBinSearch, 0, ADDR str3, size3, ADDR str4, size4
    counter_end
    printf("%d cycles, BMBinSearch\n\n", eax)

    inkey
    exit
;==============================================================================
end start


Running on a P3:

53
53

600
600

211 cycles, strstr
812 cycles, BMBinSearch

2579 cycles, strstr
1396 cycles, BMBinSearch


Oops, I'm off by one on my sizes, but it's not worth fixing.
eschew obfuscation

dedndave

QuoteThe output message is: "procedure failed"

good time on that, though   :P
i am actually a little surprised

jj2007

Quote from: SteveAsm on April 19, 2012, 11:43:05 PM
In every test I ran this, the C strstr ran the fastest by about 8-9%.
...
MasmBasic was always a close second.

Timings may change by some % depending on the location of the code (see here), and of course they differ by CPU:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
754     cycles for MyTestCode (UoSunSetter)
615     cycles for InString, result=100
412     cycles for Instr_ (MasmBasic), result=100
399     cycles for Instr_, case-insensitive, result=100
554     cycles for C strstr, result=100
86779   cycles for StrStrI, result=100


The MasmBasic implementation is pretty fast especially for longer strings, but it has a bit of overhead since it handles also case-insensitive and Unicode strings. The C functions are generally very competitive, though. The only surprise here is the StrStrI result. By the way, Lingo wrote SSE4 code that beats everything here, but beware of GPFs :wink

uosunsetter

From the beginning I didn't noticed one of the fixes that jj2007 (about getting strings' pointer) made to my codes. SteveAsm was also right :) I just paid too much attention to algorithm that I missed other points.

Also, after spending 2-3 more hours, I accepted some facts: Whoever wrote the InString function is far more experienced than me. Also the time he/she spent to write the codes is probably more than me. So understanding what it is written and manipulating it according to the cases that I will face seems quite beneficial.

But before I go, I got a question: Is there anyway to check if the CPU supports SSE4 instruction set so that I can take the advantage of the link below(thanks jj2007 for pointing and lingo for codes)???? Also if the CPU doesn't support I should be able to call standart(or edited) InString function. And of course it should be checked on execution.
http://www.masm32.com/board/index.php?topic=9370.msg132644#msg132644

dedndave

if you enable pentium instructions...
        .586
you can use CPUID
with all it's caveats, it is pretty easy to get the level of SSE support
most everyone has SSE2 or SSE3 - at least (not all, though)
more recent processors support SSE4 and better

http://www.intel.com/Assets/PDF/appnote/241618.pdf
http://support.amd.com/us/Embedded_TechDocs/25481.pdf