The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Ian_B on January 05, 2006, 06:32:36 PM

Title: String searching
Post by: Ian_B on January 05, 2006, 06:32:36 PM
I'll probably get my knuckles rapped that this has been done to death before...  ::)  but I have a need for much string searching in my app and I've been playing with search algorithms to see if I can wring some more speed out of the various approaches. My needs come in two forms, one is moving quickly through (potentially) large files to find header strings of embedded files and packetted data, and then when I have the header/packet location doing short sub-searches to find data that I know should be very close by. Testing the various search algorithms has suggested I'd be better off with the Boyer-Moore approach for the first brute-force search, and simpler byte-scanner type of searching for the second, that has less setup overhead for short search distances.

So I put together a speed comparison of these approaches, using some varied test data that I thought would be challenging. I took the simple BinSearch and the BMBin/BMHBinsearch Boyer-Moore algorithms and pitted them against one I rolled myself. That pulls in whole DWORDs and tests not for single bytes but for the first two test-string chars in all even/odd locations in the DWORD, to try and reduce the false-positives testing only for the single first char will give, while testing for data alignment for faster memory access. I also hot-rodded the MASM Lib BMBinsearch with my own optimisations to remove use of local variables and better register use, which proved well-worthwhile...  :toothy

So here's the test code, if anyone is interested. The most important thing to notice is that the "Horspool" Boyer-Moore variation as given in the latest MASM Lib, which I cut-and-pasted without alteration, fails some of these tests by not finding the strings at all, so it needs to be bug-fixed to guarantee correct output. I haven't checked that yet, or attempted an optimisation of it, as I don't know the algorithm it uses.

Results on my Northwood P4 suggest that my optimised BMBinSearch is up to 25% faster than the original for some data. It is especially improved for short searches where it cut a third off the clocks needed, removing much of the setup disadvantage compared to the byte/word scanners. My home-grown search is around a third faster than the simple BinSearch much of the time, and for non-manufactured data (ie. "normal" non-repeating text) it pleasingly approaches the speed of the original Boyer-Moore, only 20% slower or so. I'd be interested to see how both optimisations hold up on other platforms, thanks.

IanB


[attachment deleted by admin]
Title: Re: String searching
Post by: Human on January 05, 2006, 11:10:49 PM
well i dont think that single string searching speed is important on big files or even small, due time depends on hdd speed,is hdd defragmented etc, only place where it really matters is compression. and there alignment check just slowdowns all, due most of string arent aligned:( max dword compare because most strings are 2,3,4 bytes long,indexing of 2 byte and 3 offsets. 3 byte indexing gave me even 1000% speedup.
http://sourceforge.net/projects/humantools
check my hlz compressor, all ideas are mine, i coded all, and even i cant optimize it more, well maybe someday,from time to time i have small ideas. most important there is hash indexing and string searching procedure based on hashed indexes. but 3 byte indexes take 96mb for 8mb window. i dont think anything for compression is faster than that.
Title: Re: String searching
Post by: Ian_B on January 06, 2006, 12:04:03 AM
Quote from: Human on January 05, 2006, 11:10:49 PM
well i dont think that single string searching speed is important on big files or even small, due time depends on hdd speed,is hdd defragmented etc

That really depends on your application and whether you are using a brute-force approach or working smarter. If you are trying to read a huge multi-megabyte file in and then search it or process it, then of course your string searching will be small fry compared to the overhead of reading the file in the first place. But using overlapped IO with an optimum buffer size means I can read good amounts of data in and do string searching and whatever processing I need based on what I parse VERY usefully while the next chunk of the file is coming into the next allocated buffer. Smaller/smarter works much more efficiently overall, so it's still useful for me to improve search performance to leave more time for other processing while I'm waiting. YMMV for YOUR application.

Having seen the significant improvement of the Boyer-Moore approach over byte scanning, I am now wondering about pre-building shift tables for my app as I basically perform the same four or five searches repeatedly, which should make this usable for shorter searches too.

EDIT: But not by feeding in a pointer for the pre-built table and holding it in a register, as this definitely slows up the routine, perhaps because indexing [EDX+ECX*4] is slower than [ECX*4+tableaddress], plus there's now another memory read in the code where I'd lose that register. I think I'll have to test whether simply copying the 256 DWORDS from another location into the general ShiftTable buffer is slower than building the table from scratch, or I simply have 5 super-optimised routines handling specific string searches that avoid a lot of general overhead where I don't need to save input values...

IanB
Title: Re: String searching
Post by: lingo on January 06, 2006, 02:01:11 AM
 :lol
http://www.masmforum.com/simple/index.php?topic=2286.msg18039#msg18039
Title: Re: String searching
Post by: Ian_B on January 06, 2006, 04:26:23 AM
Yeah, OK, thanks for the link. It's interesting, but I hate using code I can't understand or rewrite myself, and I'm avoiding anything MMX/SSE as it restricts the processors my code can work with.

It's also noticeable that of the three BM code versions by Hutch, you chose to compare the one that my own little test shows is not only slower than the one Hutch flags clearly as the most flexible, BMBinsearch, but we now know doesn't even work correctly. And while hyper-optimising your own code, you chose not to bother trying to fix the obvious little things in Hutch's which would improve speed in that, like replacing INC and DEC with ADD 1 and SUB 1. So it does seem like a deliberately rigged comparison to me.

It also can't work well as a general search tool for comparing strings in files, as it relies on the string you're searching for being at the end of the buffer you're searching. Your code has no parameter input to allow setting the end of the area to search. If you are reading files into a buffer using normal or overlapped IO, it's kind of inconvenient to then have to copy all that data again to a second buffer just so that you can solve that wrinkle, and would probably over-ride any speed gains you made in the searching. Particularly for overlapped IO, you'd want consecutive buffers of consecutive data to actually be... consecutive, not with extra string search bytes between them, that really throws things when you're doing CRCs and MD5s on the data...  :eek

I'd have plugged it into my paltry test to see how it fared against a better optimised version of Hutch's BM, but I can't compile this code:

    punpcklqdq XMM0, XMM0 ;// XMM0=00000009000000090000000900000009
error A2008: syntax error : xmm
    movdqa [eax+edx-1*16], XMM0 ;// XMM0= lenSrchPattern + 1
error A2008: syntax error : [

etc. etc...

IanB
Title: Re: String searching
Post by: hutch-- on January 06, 2006, 06:19:41 AM
Ian,

I wrote those about 5 years ago on a PIII and it did not seem at the time to be able to be made any faster on that hardware. I don't know of a final proof for a BM style algo, I ran all of the three for hundreds of millions of tests and they performed them all but that does not garrantee reliable results in all situations. I know Lingo had a faster one and he also referenced another one by a Russian author that was faster than the PIII code I wrote.

The things that impnge of the final speed of a BM is memory access speed and unpredictable branching and it is difficult to find a way around this due to the fundamental algorithm design. ADD and SUB will be faster on later Intel hardware but not necessarily on late AMD.

I still have this page up which was the technical assumptions I used with the 3 algos. http://www.movsd.com/bm.htm

I imagine the best way to work on the design is to code them from scratch again and see if you can find a proof of accuracy.
Title: Re: String searching
Post by: Ian_B on January 06, 2006, 05:01:04 PM
Quote from: hutch-- on January 06, 2006, 06:19:41 AM
The things that impnge of the final speed of a BM is memory access speed and unpredictable branching and it is difficult to find a way around this due to the fundamental algorithm design.
That's why I re-optimised the algorithm originally to remove the "cval" local variable, which remained unchanged throughout and was merely the value of one of the unchanging registers (the checkstring length) minus 1, which was easy to recode for. That removed one memory access at least per iteration. And as I noted in my edit and comment in the code, moving the shift table from a local variable which required indexing using [REG+ECX*4+offset], where the reg happened to be ESP, to a known .DATA? location where this could be simplified to [ECX*4+offset] made most of the difference in speed, possibly because that complex indexing form is slower to compute.

There are other optimisations that can be made - for instance, if you know that the string length will never be more than 256 characters, which is pretty likely for most user applications, you can optimise the shift table to use only bytes rather than DWORDs and that indexing will simplify even further to [ECX+offset] which will compute even faster. I am currently rewriting my critical searches using this technique with prebuilt 256-byte shift tables, which will make them very lean and mean indeed.

QuoteADD and SUB will be faster on later Intel hardware but not necessarily on late AMD.
Agreed, but if optimising for the largest mass of user hardware out there, what are you gonna choose?

QuoteI still have this page up which was the technical assumptions I used with the 3 algos. http://www.movsd.com/bm.htm
Thanks for that, it'll make analysing the algorithm much easier.

QuoteI imagine the best way to work on the design is to code them from scratch again and see if you can find a proof of accuracy.
I suspect that there's a wrinkle right at the end of the Horspool code, as it appears from my timings to be doing a full iteration and flunking only at the very end. I haven't tried tracing it in a debugger yet, but from my own varied test data the Horspool variation doesn't seem to perform anywhere near as well as the "original", so I'm not too worried about that. However, it is important that the health warning is given in the MASM library code or that it's fixed, so my foray into searching hasn't been a complete waste of time...  ::)

IanB
Title: Re: String searching
Post by: Ian_B on January 06, 2006, 07:51:37 PM
OK, now THIS is a timetest!

I've done a revised optimisation of Hutch's BM using the byte-length shift table approach. It is significantly faster, and for short searches is now only half as slow as the fast word-scanner I wrote, and that's including building the table! For other searches it ranges from 75% to 25% faster, the closer speeds on more "average" text data. I've also included as a comparison the "Cresta" BM from Lingo's comparison test.

Rather than just comparing speed through vast quantities of random data, which may or may not be the sort of data being searched, my tests have quite varied synthetic repetitions which test the ability of each type of search to discard mismatches early. The Cresta BM shows poorly on most of these synthetic tests, and especially on the shorter length test, but does pull into the front on the last "average text" test. The strengths and weaknesses of these searches are clearly highlighted, but my optimisation of Hutch's BM does seem to have an advantage with many types of synthetic data, which might or might not match the sort of data being tested. For instance, test 3 where the search is for data very different to the "background".

At a quick glance, there might be optimisations possible to Cresta's code, so I'll see if that can be improved further. But knowing the type of data you are searching and matching it to the search tool does appear to have significant benefits.

EDIT: I've added to this timetest my optimised version of Cresta's code. This gets a speed boost of almost 40% in one test, but still doesn't like some of the artificial data, where my optimised version of Hutch's code chews it up. It's way ahead now in the "average" text test, though, which is possibly the most important one here, and is also even more competitive in the short search league. So it's looking like an overall winner...

IanB

Please terminate any high-priority tasks and press ENTER to begin...

String search tests:


Search Test 1 - value expected 37
BinSearch:        37 ; clocks: 194
BInStrWithLen:    37 ; clocks: 124
BMBinSearch:      37 ; clocks: 684
BMBinSearch2:     37 ; clocks: 327
FJT by Cresta:    37 ; clocks: 759
FJT2 Cresta/IanB: 37 ; clocks: 299

Search Test 2 - value expected 1007
BinSearch:        1007 ; clocks: 26382
BInStrWithLen:    1007 ; clocks: 24840
BMBinSearch:      1007 ; clocks: 32817
BMBinSearch2:     1007 ; clocks: 30997
FJT by Cresta:    1007 ; clocks: 11053
FJT2 Cresta/IanB: 1007 ; clocks: 11474

Search Test 3 - value expected 1008
BinSearch:        1008 ; clocks: 2869
BInStrWithLen:    1008 ; clocks: 2061
BMBinSearch:      1008 ; clocks: 1112
BMBinSearch2:     1008 ; clocks: 718
FJT by Cresta:    1008 ; clocks: 1095
FJT2 Cresta/IanB: 1008 ; clocks: 780

Search Test 4 - value expected 1008
BinSearch:        1008 ; clocks: 4743
BInStrWithLen:    1008 ; clocks: 4681
BMBinSearch:      1008 ; clocks: 1237
BMBinSearch2:     1008 ; clocks: 752
FJT by Cresta:    1008 ; clocks: 2067
FJT2 Cresta/IanB: 1008 ; clocks: 1562

Search Test 5 - value expected 1008
BinSearch:        1008 ; clocks: 4748
BInStrWithLen:    1008 ; clocks: 4074
BMBinSearch:      1008 ; clocks: 1237
BMBinSearch2:     1008 ; clocks: 752
FJT by Cresta:    1008 ; clocks: 2243
FJT2 Cresta/IanB: 1008 ; clocks: 1842

Search Test 6 - value expected 1008
BinSearch:        1008 ; clocks: 2863
BInStrWithLen:    1008 ; clocks: 2057
BMBinSearch:      1008 ; clocks: 4012
BMBinSearch2:     1008 ; clocks: 3583
FJT by Cresta:    1008 ; clocks: 1253
FJT2 Cresta/IanB: 1008 ; clocks: 835

Search Test 7 - value expected 1009
BinSearch:        1009 ; clocks: 5130
BInStrWithLen:    1009 ; clocks: 2773
BMBinSearch:      1009 ; clocks: 2117
BMBinSearch2:     1009 ; clocks: 1692
FJT by Cresta:    1009 ; clocks: 1341
FJT2 Cresta/IanB: 1009 ; clocks: 1120



[attachment deleted by admin]
Title: Re: String searching
Post by: hutch-- on January 07, 2006, 01:03:27 AM
Ian,

I would be careful with this assumption.

Quote
There are other optimisations that can be made - for instance, if you know that the string length will never be more than 256 characters, which is pretty likely for most user applications, you can optimise the shift table to use only bytes rather than DWORDs and that indexing will simplify even further to [ECX+offset]

Strings of 256 byte and shorter are into the "who cares" catagory and the real strength of a BM is in far larger strings where you may need to search for a couple of hundred virus signatures on every file on a hard drive where you will see very large timing differences to a classic byte scanner.

Using an aligned data section table is a good idea if you make use of the opcode that allows an offset as the base address as it saves you a register but it may not be any faster when you only use an offset without the scale *4. A DWORD table is usually faster than a BYTE table so there may be other factors involved.
Title: Re: String searching
Post by: Ian_B on January 07, 2006, 02:06:55 AM
Point taken. But in testing the code, the biggest difference I found was in replacing the [ESP+ECX*4+offset] style of addressing into the shift table with an [ECX*4+offset], which kinda went against the usual perceived wisdom of local variables, which is that the stack is more likely to be in the cache and will therefore produce less cache misses. Having the shift table in the stack wiped out almost all the performance gains I made with other optimisations, which tells me that at least on my P4 this complex addressing is slower. Logically, therefore, simplifying even further should yield better performance.

If you allowed WORD-size tables for string matches of up to 64K, which is possibly large enough for your virus signature needs, you'd still be able to get significantly better performance by replacing [ECX*2+offset] with [ECX+ECX+offset], so I think it's a valid experiment. It's easy for people, knowing this, to optimise the code themselves for the level of string-length cover they will need in their application.

And this also bears on another discussion somewhere in the forums about Intel processors being more CISC-like, happy with complex instructions, and AMD ones being more RISC-like, preferring simpler ones. You can't attack my optimisation for being too Intel-specific when this complex addressing is being removed in favour of simpler forms...  :eek

Quote from: Strings of 256 byte and shorter are into the "who cares" catagory
I can assure you that none of the packet/encoding header strings I am searching for in my mainstream real-world application are longer than 14 characters, and I'm sure there are a lot of other people out there who use the MASM library stuff for "trivial" searches... I certainly care. :wink

Here's some results where the only code changes are the use of BYTE/WORD/DWORD-sized tables and the indexing into them is respectively ShiftTable[ECX], ShiftTable[ECX+ECX] and ShiftTable[ECX*4]...


String search tests:


Search Test 1 - value expected 37
FJT2 Cresta/IanB, byte-length shifts:  37 ; clocks: 303
FJT3 Cresta/IanB, word-length shifts:  37 ; clocks: 356
FJT4 Cresta/IanB, dword-length shifts: 37 ; clocks: 456

Search Test 2 - value expected 1007
FJT2 Cresta/IanB, byte-length shifts:  1007 ; clocks: 11413
FJT3 Cresta/IanB, word-length shifts:  1007 ; clocks: 11420
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 11555

Search Test 3 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 718
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 785
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 905

Search Test 4 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1568
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1642
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1758

Search Test 5 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1890
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1966
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2083

Search Test 6 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 747
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 818
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 984

Search Test 7 - value expected 1009
FJT2 Cresta/IanB, byte-length shifts:  1009 ; clocks: 1078
FJT3 Cresta/IanB, word-length shifts:  1009 ; clocks: 1149
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 1241

Press ENTER to exit...


Convinces me...   :bg

IanB


[attachment deleted by admin]
Title: Re: String searching
Post by: Ian_B on January 09, 2006, 05:49:47 PM
As an addendum to the subject of table-indexing speed, I've empirically tested the following on long CRC loops.

Unequivocally, on P4 using the [EBX*4] style of addressing like so:


    MOV EAX, Table[EBX*4]    ; Table is DWORD aligned
    MOV ECX, Table[EDX*4]


is slower, even if only by 1% or so, than replacing with the following longer code:


    ADD EBX, EBX
    ADD EDX, EDX
    MOV EAX, Table[EBX+EBX]
    MOV ECX, Table[EDX+EDX]


which might be useful for long code loops.

IanB
Title: Re: String searching
Post by: lingo on April 11, 2009, 11:01:00 PM
"but I hate using code I can't understand or rewrite myself, and I'm avoiding anything MMX/SSE as it restricts the processors my code can work with.
..And while hyper-optimising your own code, you chose not to bother trying to fix the obvious little things in Hutch's which would improve speed in that, like replacing INC and DEC with ADD 1 and SUB 1. So it does seem like a deliberately rigged comparison to me. by Ian_B"


Some years later I included my algo with Ian_B's testbad: :wink 
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Please terminate any high-priority tasks and press ENTER to begin...


String search tests:


Search Test 1 - value expected 37
FJT2 Cresta/IanB, byte-length shifts:  37 ; clocks: 275
FJT3 Cresta/IanB, word-length shifts:  37 ; clocks: 216
FJT4 Cresta/IanB, dword-length shifts: 37 ; clocks: 275
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 141

Search Test 2 - value expected 1007
FJT2 Cresta/IanB, byte-length shifts:  1007 ; clocks: 9884
FJT3 Cresta/IanB, word-length shifts:  1007 ; clocks: 9900
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 9850
Boyer-Moore Lingo,dword-length shifts: 1007 ; clocks: 8540

Search Test 3 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 653
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 602
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 663
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 531

Search Test 4 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1866
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1306
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1916
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1178

Search Test 5 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1951
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1299
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1925
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1165

Search Test 6 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 770
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 716
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 775
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 652

Search Test 7 - value expected 1009
FJT2 Cresta/IanB, byte-length shifts:  1009 ; clocks: 961
FJT3 Cresta/IanB, word-length shifts:  1009 ; clocks: 914
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 970
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 842

Press ENTER to exit...

[attachment deleted by admin]
Title: Re: String searching
Post by: jj2007 on April 17, 2009, 10:38:05 AM
Hi Lingo,
I promised to look at your code. I worked out something last night, but it is still buggy - some of the tests simply don't work, one (Test 7) is far too slow.
If I find time over the weekend, I'll be back...

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Search Test 1 - value expected 37
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 404
Plain byte scanner jj2007            : 37 ; clocks: 97

Search Test 3 - value expected 1008
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1053
Plain byte scanner jj2007            : 1008 ; clocks: 884

Search Test 6 - value expected 1008
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1131
Plain byte scanner jj2007            : 1008 ; clocks: 887

Search Test 7 - value expected 1009
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 1275
Plain byte scanner jj2007            : 1009 ; clocks: 2344
Title: Re: String searching
Post by: lingo on April 17, 2009, 01:27:23 PM
The practical question here is :
If lenSrchPattern is > N bytes then use BM, else use other scanner
For now N= 8 bytes but due to new SSE2,3,4 instructions, value of N should be much bigger  :wink

Title: Re: String searching
Post by: jj2007 on April 17, 2009, 02:20:12 PM
Thanks, Lingo. Your code works fine and is very fast, but I seem not to be able to get this running:

.code
FindInWinInc db "Duplicat", 0

Start:
Invoke Main
Invoke ExitProcess, 0

Main    proc

invoke ShowCPU, 1
print   chr$(13,10)

invoke read_disk_file, chr$("\masm32\include\windows.inc"), offset fbuffer, offset flen
.if eax
print "Finding "
print offset FindInWinInc, " in windows.inc:"

(... other algos)

print chr$(13, 10, "BMLin12: ", 9)
counter_begin 1000, HIGH_PRIORITY_CLASS
invoke BMLin12, fbuffer, offset FindInWinInc, sizeof FindInWinInc-1
mov edi, eax
counter_end
mov ebx, eax
print ustr$(edi)
print chr$(" ; clocks: ")
print ustr$(ebx)
invoke Sleep, 250


Output:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Finding Duplicat in windows.inc:
Lib InString:   849668          ; clocks: 4976550
FJT3:           849667          ; clocks: 1357491
BMLin12:        0               ; clocks: 1331421
InstrSse2:      849667          ; clocks: 2093956

Code sizes:
MasmLib         156
FJT3            249
BMLin12         745
InstrSse2       275


Any idea why it returns zero? The cycle count looks plausible...
I wanted a real life test...

By the way: SSE2 is not very competitive, as you can see :wink
Title: Re: String searching
Post by: lingo on April 17, 2009, 03:34:37 PM
FindInWinInc should be after last byte of the fbuffer! :wink
Title: Re: String searching
Post by: lingo on April 18, 2009, 06:14:35 PM
If I'm not wrong FJT3 is  "FJT3 Cresta/IanB,  word-length shifts" algo
and BMLin12 is "Boyer-Moore Lingo,dword-length shifts:" algo.
They are diferent algos and it is unreal to compare them... :wink
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Please terminate any high-priority tasks and press ENTER to begin...


String search tests:


Search Test 1 - value expected 37
FJT2 Cresta/IanB,  byte-length shifts: 37 ; clocks: 339
FJT3 Cresta/IanB,  word-length shifts: 37 ; clocks: 278
FJT4 Cresta/IanB, dword-length shifts: 37 ; clocks: 331
Boyer-Moore Lingo, byte-length shifts: 37 ; clocks: 88
Boyer-Moore Lingo, word-length shifts: 37 ; clocks: 107
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 137

Search Test 2 - value expected 1007
FJT2 Cresta/IanB,  byte-length shifts: 1007 ; clocks: 19015
FJT3 Cresta/IanB,  word-length shifts: 1007 ; clocks: 18973
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 19046
Boyer-Moore Lingo, byte-length shifts: 1007 ; clocks: 7862
Boyer-Moore Lingo, word-length shifts: 1007 ; clocks: 7867
Boyer-Moore Lingo,dword-length shifts: 1007 ; clocks: 7848

Search Test 3 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 652
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 601
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 763
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 474
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 498
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 599

Search Test 4 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 1853
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 1303
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1863
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1096
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1120
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1156

Search Test 5 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 2066
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 1299
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2056
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1092
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1114
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1145

Search Test 6 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 767
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 711
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 776
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 583
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 608
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 638

Search Test 7 - value expected 1009
FJT2 Cresta/IanB,  byte-length shifts: 1009 ; clocks: 959
FJT3 Cresta/IanB,  word-length shifts: 1009 ; clocks: 905
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 962
Boyer-Moore Lingo, byte-length shifts: 1009 ; clocks: 768
Boyer-Moore Lingo, word-length shifts: 1009 ; clocks: 792
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 826

Press ENTER to exit...




[attachment deleted by admin]
Title: Re: String searching
Post by: jj2007 on April 18, 2009, 08:14:15 PM
Hi Lingo,
I can give you some timings, but I won't have the time to add my algo to your testbed.

By the way, not everybody has Masm32 installed on C:  :wink
    include \masm32\include\windows.inc
instead of
    include C:\masm32\include\windows.inc
would be helpful for others who want to assemble your code.


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

String search tests:


Search Test 1 - value expected 37
FJT2 Cresta/IanB,  byte-length shifts: 37 ; clocks: 424
FJT3 Cresta/IanB,  word-length shifts: 37 ; clocks: 461
FJT4 Cresta/IanB, dword-length shifts: 37 ; clocks: 506
Boyer-Moore Lingo, byte-length shifts: 37 ; clocks: 169
Boyer-Moore Lingo, word-length shifts: 37 ; clocks: 190
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 241

Search Test 2 - value expected 1007
FJT2 Cresta/IanB,  byte-length shifts: 1007 ; clocks: 17932
FJT3 Cresta/IanB,  word-length shifts: 1007 ; clocks: 17944
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 18007
Boyer-Moore Lingo, byte-length shifts: 1007 ; clocks: 10085
Boyer-Moore Lingo, word-length shifts: 1007 ; clocks: 8589
Boyer-Moore Lingo,dword-length shifts: 1007 ; clocks: 9015

Search Test 3 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 778
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 818
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 865
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 550
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 581
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 655

Search Test 4 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 2269
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 2393
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2367
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1767
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1754
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1322

Search Test 5 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 1759
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 2332
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1867
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 1181
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 1784
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1632

Search Test 6 - value expected 1008
FJT2 Cresta/IanB,  byte-length shifts: 1008 ; clocks: 878
FJT3 Cresta/IanB,  word-length shifts: 1008 ; clocks: 910
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 951
Boyer-Moore Lingo, byte-length shifts: 1008 ; clocks: 668
Boyer-Moore Lingo, word-length shifts: 1008 ; clocks: 685
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 756

Search Test 7 - value expected 1009
FJT2 Cresta/IanB,  byte-length shifts: 1009 ; clocks: 1069
FJT3 Cresta/IanB,  word-length shifts: 1009 ; clocks: 1098
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 1143
Boyer-Moore Lingo, byte-length shifts: 1009 ; clocks: 849
Boyer-Moore Lingo, word-length shifts: 1009 ; clocks: 875
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 933
Title: Re: String searching
Post by: jj2007 on April 20, 2009, 12:27:51 PM
Quote from: lingo on April 17, 2009, 03:34:37 PM
FindInWinInc should be after last byte of the fbuffer! :wink

Hi Lingo,

To be honest, I don't get the message: Why must the pattern be after the buffer? That is not a realistic constraint imho.

I have incorporated the Windows.inc test into the new version of the testbed, and all your three algos seem to search (as the cycle count demonstrates) but return zero. For comparison, FJT3 and all others return the correct number of bytes.

Another little problem is that if I replace Duplicate inc with Duplicate inx, your algos throw an exception ::)

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Finding 'Duplicate inc' at the end of \masm32\include\windows.inc:

Lib InString:   849668          ; clocks: 4863499
FJT3:           849667          ; clocks: 1372417
BMLinDB:        0               ; clocks: 1393513
BMLinDW:        0               ; clocks: 1395150
BMLinDD:        0               ; clocks: 1352871
InstrJJ:        849668          ; clocks: 1063916

Code sizes:
FJT3            249
BMLinDB         395
InstrJJ         282


EDIT: Managed to squeeze out a few percent, see timings above and new attachment.

[attachment deleted by admin]
Title: Re: String searching
Post by: lingo on April 20, 2009, 06:06:33 PM
BM is exact pattern matching algorithm rather than substring (only) searching  algorithm (as InString).
So,  "Usage would include tasks like recursively searching files for virus patterns, searching databases for keys or data, text and word processing and any other task that requires handling large amounts of data at very high speed" by Hutch from here (http://www.movsd.com/bm.htm)
Example: With BM we can search for bytes->  SrcPattern  DB 12,0,5,99,0,0,27,0,3
with InString we can't...  :wink

If you want to use my BM algo I have really a constraint:
-your SrcPattern should be more than 8 bytes long
-the SrcPattern must be after the buffer

"That is not a realistic constraint imho."
I don't need such info...

Title: Re: String searching
Post by: jj2007 on April 20, 2009, 07:25:08 PM
Quote from: lingo on April 20, 2009, 06:06:33 PM
BM is exact pattern matching algorithm rather than substring (only) searching  algorithm (as InString).
Well, the title says "string searching"  :bg

Quote
So,  "Usage would include tasks like recursively searching files for virus patterns, searching databases for keys or data, text and word processing and any other task that requires handling large amounts of data at very high speed" by Hutch from here (http://www.movsd.com/bm.htm)
Example: With BM we can search for bytes->  SrcPattern  DB 12,0,5,99,0,0,27,0,3
with InString we can't...  :wink
If you mean the zero delimiters: I could adjust my algo accordingly. But I won't because I am mainly interested in a fast InString replacement. Your algo is extremely fast on long search patterns, which makes it excellent for virus detection. My version will often be faster for simple text searches.

Quote
If you want to use my BM algo I have really a constraint:
-your SrcPattern should be more than 8 bytes long
-the SrcPattern must be after the buffer
Strange that the other BM algos don't have these constraints, and don't throw an exception if there is no match... ::)

Quote
"That is not a realistic constraint imho."
I don't need such info...

Lucky you :bg