
String searching

Started by Ian_B, January 05, 2006, 06:32:36 PM


Ian_B

I'll probably get my knuckles rapped because this has been done to death before...  ::)  but my app does a lot of string searching, and I've been playing with search algorithms to see if I can wring some more speed out of the various approaches. My needs come in two forms: first, moving quickly through (potentially) large files to find the header strings of embedded files and packeted data; second, once I have the header/packet location, doing short sub-searches for data that I know should be very close by. Testing the various search algorithms suggested I'd be better off with the Boyer-Moore approach for the first, brute-force search, and a simpler byte-scanner type of search for the second, since it has less setup overhead over short search distances.

So I put together a speed comparison of these approaches, using some varied test data that I thought would be challenging. I took the simple BinSearch and the BMBinsearch/BMHBinsearch Boyer-Moore algorithms and pitted them against one I rolled myself. Mine pulls in whole DWORDs and tests not for single bytes but for the first two test-string chars at all even/odd positions in the DWORD, to reduce the false positives that testing only the single first char gives, while keeping memory accesses aligned for speed. I also hot-rodded the MASM32 library BMBinsearch with my own optimisations, removing the local variables and making better use of registers, which proved well worthwhile...  :toothy
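The core idea of that home-rolled scanner, qualifying a position on the first two pattern bytes before paying for a full compare, can be sketched at a high level like this. This is a Python sketch of the concept only, not the attached assembly; the DWORD-at-a-time loads and even/odd alignment tests don't translate directly, and the function name is mine:

```python
def two_char_scan(haystack: bytes, needle: bytes) -> int:
    """Qualify candidate positions on the needle's first TWO bytes,
    then do a full compare. Far fewer false positives than scanning
    for a single byte, at almost no extra cost per position."""
    if len(needle) < 2:
        return haystack.find(needle)
    first_two = needle[:2]
    i = 0
    while True:
        i = haystack.find(first_two, i)  # cheap two-byte pre-filter
        if i < 0:
            return -1
        if haystack[i:i + len(needle)] == needle:  # full compare only here
            return i
        i += 1
```

The assembly version gets the same effect in bulk by loading a DWORD and checking the two-byte prefix at both its even and odd offsets per iteration.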

So here's the test code, if anyone is interested. The most important thing to notice is that the "Horspool" Boyer-Moore variation in the latest MASM32 library, which I cut and pasted without alteration, fails some of these tests by not finding the strings at all, so it needs to be bug-fixed before it can guarantee correct output. I haven't investigated that yet, or attempted an optimisation of it, as I don't know the algorithm it uses.
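For anyone wanting to cross-check the library routine, the textbook Boyer-Moore-Horspool variant is short enough to sketch. This is the standard published algorithm in Python, not Hutch's code, so it can serve as a reference oracle when hunting the bug:

```python
def horspool_search(haystack: bytes, needle: bytes) -> int:
    """Textbook Boyer-Moore-Horspool: one 256-entry bad-character
    table, indexed by the haystack byte aligned with the needle's
    LAST byte, giving skips of up to len(needle) per mismatch."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    shift = [m] * 256            # default: skip the whole needle length
    for i in range(m - 1):       # bytes inside the needle shift less;
        shift[needle[i]] = m - 1 - i  # the last byte keeps the default
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift[haystack[pos + m - 1]]
    return -1
```

Feeding both routines identical pseudo-random buffers and comparing returned offsets against a naive scanner is the quickest way to pin down where the library version diverges.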

Results on my Northwood P4 suggest that my optimised BMBinSearch is up to 25% faster than the original for some data. It is especially improved for short searches, where it cuts a third off the clocks needed, removing much of the setup disadvantage compared to the byte/word scanners. My home-grown search is around a third faster than the simple BinSearch much of the time, and on non-manufactured data (i.e. "normal" non-repeating text) it pleasingly approaches the speed of the original Boyer-Moore, only 20% or so slower. I'd be interested to see how both optimisations hold up on other platforms, thanks.

IanB


[attachment deleted by admin]

Human

Well, I don't think single-string search speed is important on big files, or even small ones, since the time depends on HDD speed, whether the HDD is defragmented, etc. The only place it really matters is compression, and there an alignment check just slows everything down, because most strings aren't aligned :( and you get at most a DWORD compare, since most strings are 2, 3 or 4 bytes long. Indexing can use 2-byte or 3-byte offsets; 3-byte indexing gave me as much as a 1000% speedup.
http://sourceforge.net/projects/humantools
Check out my hlz compressor: all the ideas are mine, I coded it all, and even I can't optimise it any further, though maybe someday, since from time to time I get small ideas. The most important parts are the hash indexing and the string-search procedure based on the hashed indexes. The downside is that 3-byte indexes take 96MB for an 8MB window. I don't think anything faster than that exists for compression.
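As a rough illustration of what such hash indexing buys a match-finder, here is a hypothetical Python sketch of the general idea only; the actual hlz compressor uses flat preallocated tables (hence the 96MB figure), not a Python dict, and these names are mine:

```python
from collections import defaultdict

def build_3byte_index(window: bytes) -> dict:
    """Record every position in the window under its 3-byte prefix,
    so match-finding jumps straight to candidates instead of scanning.
    Sketch only: a real compressor uses a flat table of head pointers
    plus chains, which is where the large fixed memory cost comes from."""
    index = defaultdict(list)
    for i in range(len(window) - 2):
        index[window[i:i + 3]].append(i)
    return index

def find_candidates(index: dict, pattern: bytes) -> list:
    """All window positions whose first three bytes match the pattern's."""
    return index.get(pattern[:3], [])
```

With 3-byte keys nearly every candidate already agrees on three characters, which is why it beats 2-byte indexing so heavily on typical data.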

Ian_B

#2
Quote from: Human on January 05, 2006, 11:10:49 PM
Well, I don't think single-string search speed is important on big files, or even small ones, since the time depends on HDD speed, whether the HDD is defragmented, etc.

That really depends on your application and whether you are using a brute-force approach or working smarter. If you are trying to read a huge multi-megabyte file in and then search or process it, then of course the string searching will be small fry compared to the overhead of reading the file in the first place. But using overlapped IO with an optimum buffer size means I can read a good amount of data in and do the string searching, plus whatever processing follows from what I parse, while the next chunk of the file is arriving in the next allocated buffer. Smaller and smarter works much more efficiently overall, so it's still useful for me to improve search performance and leave more time for other processing while I'm waiting. YMMV for YOUR application.

Having seen the significant improvement of the Boyer-Moore approach over byte scanning, I am now wondering about pre-building shift tables for my app as I basically perform the same four or five searches repeatedly, which should make this usable for shorter searches too.

EDIT: But not by feeding in a pointer to the pre-built table and holding it in a register; that definitely slows the routine down, perhaps because indexing [EDX+ECX*4] is slower than [ECX*4+tableaddress], and it adds another memory read to the code where I lose that register. I'll have to test whether simply copying the 256 DWORDs from another location into the general ShiftTable buffer is faster than building the table from scratch, or else keep 5 super-optimised routines handling my specific string searches, avoiding a lot of general overhead where I don't need to preserve input values...
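The pre-built-table idea amounts to hoisting the setup out of the search call so it is paid once per pattern rather than once per search. Sketched at a high level (Python with illustrative names of my own, assuming a Horspool-style shift table):

```python
class PrecomputedSearch:
    """Build the shift table once per pattern and reuse it across many
    searches, amortising the setup cost that makes BM-style searches
    lose to plain scanners over short distances."""

    def __init__(self, needle: bytes):
        self.needle = needle
        m = len(needle)
        self.shift = [m] * 256          # built once, here
        for i in range(m - 1):
            self.shift[needle[i]] = m - 1 - i

    def find(self, haystack: bytes) -> int:
        """Horspool scan with zero per-call setup."""
        needle, shift, m = self.needle, self.shift, len(self.needle)
        pos = 0
        while pos <= len(haystack) - m:
            if haystack[pos:pos + m] == needle:
                return pos
            pos += shift[haystack[pos + m - 1]]
        return -1
```

For a fixed set of four or five header strings, one such object per header reproduces the "5 super-optimised routines" plan in spirit: all setup happens before the hot loop.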

IanB


Ian_B

Yeah, OK, thanks for the link. It's interesting, but I hate using code I can't understand or rewrite myself, and I'm avoiding anything MMX/SSE as it restricts the processors my code can work with.

It's also noticeable that of the three BM code versions by Hutch, you chose to compare against the one that my own little test shows is not only slower than BMBinsearch, the one Hutch clearly flags as the most flexible, but which we now know doesn't even work correctly. And while hyper-optimising your own code, you chose not to fix the obvious little things in Hutch's that would improve its speed, like replacing INC and DEC with ADD 1 and SUB 1. So it does look like a deliberately rigged comparison to me.

It also can't work well as a general search tool for comparing strings in files, as it relies on the string you're searching for being appended at the end of the buffer being searched, and your code has no parameter for setting the end of the search area. If you are reading files into a buffer using normal or overlapped IO, it's inconvenient to have to copy all that data to a second buffer just to solve that wrinkle, and the copy would probably override any speed gains made in the searching. With overlapped IO in particular, you'd want consecutive buffers of consecutive data to actually be... consecutive, not with extra search-string bytes between them; that really throws things off when you're running CRCs and MD5s on the data...  :eek
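A search routine that takes an explicit end bound avoids the sentinel requirement entirely, so IO buffers can be searched in place. A Horspool-style sketch (Python, my own function name, not Lingo's or Hutch's code):

```python
def bounded_horspool(buf: bytes, needle: bytes, start: int, end: int) -> int:
    """Horspool search restricted to buf[start:end]. The explicit end
    bound replaces the sentinel trick (appending the needle to the
    buffer), so no data needs copying and buffers stay byte-exact
    for checksumming."""
    m = len(needle)
    if m == 0 or end - start < m:
        return -1 if m else start
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = start
    while pos <= end - m:            # a match must fit entirely before end
        if buf[pos:pos + m] == needle:
            return pos
        pos += shift[buf[pos + m - 1]]
    return -1
```

The cost of the bound is one extra comparison already present in the loop condition, versus a full buffer copy for the sentinel approach.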

I'd have plugged it into my paltry test to see how it fared against a better optimised version of Hutch's BM, but I can't compile this code:

    punpcklqdq XMM0, XMM0 ;// XMM0=00000009000000090000000900000009
error A2008: syntax error : xmm
    movdqa [eax+edx-1*16], XMM0 ;// XMM0= lenSrchPattern + 1
error A2008: syntax error : [

etc. etc...

IanB

hutch--

Ian,

I wrote those about 5 years ago on a PIII, and at the time it did not seem possible to make them any faster on that hardware. I don't know of a formal correctness proof for a BM-style algo; I ran all three through hundreds of millions of tests and they passed them all, but that does not guarantee reliable results in all situations. I know Lingo had a faster one, and he also referenced another by a Russian author that was faster than the PIII code I wrote.

The things that impinge on the final speed of a BM are memory access speed and unpredictable branching, and it is difficult to find a way around these due to the fundamental algorithm design. ADD and SUB will be faster on later Intel hardware but not necessarily on late AMD.

I still have this page up which was the technical assumptions I used with the 3 algos. http://www.movsd.com/bm.htm

I imagine the best way to work on the design is to code them from scratch again and see if you can find a proof of accuracy.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Ian_B

#6
Quote from: hutch-- on January 06, 2006, 06:19:41 AM
The things that impinge on the final speed of a BM are memory access speed and unpredictable branching, and it is difficult to find a way around these due to the fundamental algorithm design.
That's why I originally re-optimised the algorithm to remove the "cval" local variable, which stayed constant throughout and was merely the value of one of the unchanging registers (the search-string length) minus 1, which was easy to recode for. That removed at least one memory access per iteration. And as I noted in my edit and the comment in the code, moving the shift table from a local variable, which required indexing as [REG+ECX*4+offset] (with the reg happening to be ESP), to a known .DATA? location where it simplifies to [ECX*4+offset], made most of the difference in speed, possibly because the more complex indexing form is slower to compute.

There are other optimisations to be made. For instance, if you know the string length will never exceed 255 characters (so that every shift value fits in a byte), which is pretty likely for most user applications, you can shrink the shift table from DWORDs to bytes, and the indexing simplifies even further to [ECX+offset], which computes faster still. I am currently rewriting my critical searches using this technique with prebuilt 256-byte shift tables, which will make them very lean and mean indeed.
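The byte-table trick only requires that every shift value fits in a byte. In sketch form (Python standing in for the assembly, with a bytearray playing the role of the 256-byte .DATA? table):

```python
def byte_shift_table(needle: bytes) -> bytearray:
    """Horspool shift table packed into single bytes. Valid whenever
    the pattern is at most 255 bytes, since the largest shift stored
    is len(needle) itself; quarters the table's size versus DWORDs
    and (in assembly) simplifies [ECX*4+offset] to [ECX+offset]."""
    m = len(needle)
    assert 0 < m <= 255, "byte-sized shifts only hold values up to 255"
    table = bytearray([m] * 256)     # default shift: whole pattern length
    for i in range(m - 1):           # last pattern byte keeps the default
        table[needle[i]] = m - 1 - i
    return table
```

A smaller table is also friendlier to the cache: 256 bytes fits in a handful of cache lines versus 1KB for the DWORD version, which plausibly contributes to the timing differences below.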

Quote from: hutch--
ADD and SUB will be faster on later Intel hardware but not necessarily on late AMD.
Agreed, but if optimising for the largest mass of user hardware out there, what are you gonna choose?

Quote from: hutch--
I still have this page up which was the technical assumptions I used with the 3 algos. http://www.movsd.com/bm.htm
Thanks for that, it'll make analysing the algorithm much easier.

Quote from: hutch--
I imagine the best way to work on the design is to code them from scratch again and see if you can find a proof of accuracy.
I suspect there's a wrinkle right at the end of the Horspool code, as my timings suggest it does a full iteration and flunks only at the very last step. I haven't traced it in a debugger yet, but on my own varied test data the Horspool variation doesn't perform anywhere near as well as the "original", so I'm not too worried about losing it. However, it is important that a health warning goes into the MASM32 library code, or that it gets fixed, so my foray into searching hasn't been a complete waste of time...  ::)

IanB

Ian_B

#7
OK, now THIS is a timetest!

I've done a revised optimisation of Hutch's BM using the byte-length shift-table approach. It is significantly faster: for short searches it now takes only about twice the clocks of the fast word-scanner I wrote, and that includes building the table! For other searches it ranges from 75% to 25% faster, with the closer results on more "average" text data. I've also included for comparison the "Cresta" BM from Lingo's comparison test.

Rather than just racing through vast quantities of random data, which may or may not resemble the data actually being searched, my tests use quite varied synthetic repetitions that test each search's ability to discard mismatches early. The Cresta BM shows poorly on most of these synthetic tests, especially on the short-length test, but does pull to the front on the last "average text" test. The strengths and weaknesses of each search are clearly highlighted, and my optimisation of Hutch's BM does seem to have the advantage on many types of synthetic data, for instance test 3, where the search string is very different from the "background".

At a quick glance, there may be optimisations possible in Cresta's code, so I'll see if it can be improved further. Either way, knowing the type of data you are searching and matching the search tool to it clearly has significant benefits.

EDIT: I've added my optimised version of Cresta's code to this timetest. It gets a speed boost of almost 40% in one test, but still doesn't like some of the artificial data, where my optimised version of Hutch's code chews it up. It's way ahead now in the "average" text test, though, which is possibly the most important one here, and it's also even more competitive in the short-search league. So it's looking like the overall winner...

IanB

Please terminate any high-priority tasks and press ENTER to begin...

String search tests:


Search Test 1 - value expected 37
BinSearch:        37 ; clocks: 194
BInStrWithLen:    37 ; clocks: 124
BMBinSearch:      37 ; clocks: 684
BMBinSearch2:     37 ; clocks: 327
FJT by Cresta:    37 ; clocks: 759
FJT2 Cresta/IanB: 37 ; clocks: 299

Search Test 2 - value expected 1007
BinSearch:        1007 ; clocks: 26382
BInStrWithLen:    1007 ; clocks: 24840
BMBinSearch:      1007 ; clocks: 32817
BMBinSearch2:     1007 ; clocks: 30997
FJT by Cresta:    1007 ; clocks: 11053
FJT2 Cresta/IanB: 1007 ; clocks: 11474

Search Test 3 - value expected 1008
BinSearch:        1008 ; clocks: 2869
BInStrWithLen:    1008 ; clocks: 2061
BMBinSearch:      1008 ; clocks: 1112
BMBinSearch2:     1008 ; clocks: 718
FJT by Cresta:    1008 ; clocks: 1095
FJT2 Cresta/IanB: 1008 ; clocks: 780

Search Test 4 - value expected 1008
BinSearch:        1008 ; clocks: 4743
BInStrWithLen:    1008 ; clocks: 4681
BMBinSearch:      1008 ; clocks: 1237
BMBinSearch2:     1008 ; clocks: 752
FJT by Cresta:    1008 ; clocks: 2067
FJT2 Cresta/IanB: 1008 ; clocks: 1562

Search Test 5 - value expected 1008
BinSearch:        1008 ; clocks: 4748
BInStrWithLen:    1008 ; clocks: 4074
BMBinSearch:      1008 ; clocks: 1237
BMBinSearch2:     1008 ; clocks: 752
FJT by Cresta:    1008 ; clocks: 2243
FJT2 Cresta/IanB: 1008 ; clocks: 1842

Search Test 6 - value expected 1008
BinSearch:        1008 ; clocks: 2863
BInStrWithLen:    1008 ; clocks: 2057
BMBinSearch:      1008 ; clocks: 4012
BMBinSearch2:     1008 ; clocks: 3583
FJT by Cresta:    1008 ; clocks: 1253
FJT2 Cresta/IanB: 1008 ; clocks: 835

Search Test 7 - value expected 1009
BinSearch:        1009 ; clocks: 5130
BInStrWithLen:    1009 ; clocks: 2773
BMBinSearch:      1009 ; clocks: 2117
BMBinSearch2:     1009 ; clocks: 1692
FJT by Cresta:    1009 ; clocks: 1341
FJT2 Cresta/IanB: 1009 ; clocks: 1120



[attachment deleted by admin]

hutch--

Ian,

I would be careful with this assumption.

Quote
There are other optimisations that can be made - for instance, if you know that the string length will never be more than 256 characters, which is pretty likely for most user applications, you can optimise the shift table to use only bytes rather than DWORDs and that indexing will simplify even further to [ECX+offset]

Strings of 256 bytes and shorter are in the "who cares" category. The real strength of a BM is with far larger strings, where you may need to search for a couple of hundred virus signatures in every file on a hard drive; there you will see very large timing differences against a classic byte scanner.

Using an aligned data-section table is a good idea if you use the opcode form that allows an offset as the base address, as it saves you a register, but it may not be any faster when you use only an offset without the *4 scale. A DWORD table is usually faster than a BYTE table, so there may be other factors involved.

Ian_B

#9
Point taken. But in testing the code, the biggest difference I found came from replacing the [ESP+ECX*4+offset] style of addressing into the shift table with [ECX*4+offset], which goes against the usual received wisdom about local variables, namely that the stack is more likely to be in the cache and will therefore produce fewer cache misses. Keeping the shift table on the stack wiped out almost all the gains from my other optimisations, which tells me that, at least on my P4, the complex addressing form is slower. Logically, then, simplifying it even further should yield better performance.

If you allowed WORD-size tables for string matches of up to 64K, which is possibly large enough for your virus-signature needs, you could still get significantly better performance by replacing [ECX*2+offset] with [ECX+ECX+offset], so I think it's a valid experiment. Knowing this, it's easy for people to tune the code themselves for the string lengths their application needs to cover.

And this also bears on another discussion somewhere in the forums about Intel processors being more CISC-like, happy with complex instructions, and AMD ones being more RISC-like, preferring simpler ones. You can't attack my optimisation for being too Intel-specific when it removes complex addressing in favour of simpler forms...  :eek

Quote from: hutch--
Strings of 256 bytes and shorter are in the "who cares" category
I can assure you that none of the packet/encoding header strings I am searching for in my mainstream real-world application are longer than 14 characters, and I'm sure there are a lot of other people out there who use the MASM library stuff for "trivial" searches... I certainly care. :wink

Here are some results where the only code changes are the use of BYTE-, WORD- and DWORD-sized tables, indexed respectively as ShiftTable[ECX], ShiftTable[ECX+ECX] and ShiftTable[ECX*4]...


String search tests:


Search Test 1 - value expected 37
FJT2 Cresta/IanB, byte-length shifts:  37 ; clocks: 303
FJT3 Cresta/IanB, word-length shifts:  37 ; clocks: 356
FJT4 Cresta/IanB, dword-length shifts: 37 ; clocks: 456

Search Test 2 - value expected 1007
FJT2 Cresta/IanB, byte-length shifts:  1007 ; clocks: 11413
FJT3 Cresta/IanB, word-length shifts:  1007 ; clocks: 11420
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 11555

Search Test 3 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 718
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 785
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 905

Search Test 4 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1568
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1642
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1758

Search Test 5 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1890
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1966
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 2083

Search Test 6 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 747
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 818
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 984

Search Test 7 - value expected 1009
FJT2 Cresta/IanB, byte-length shifts:  1009 ; clocks: 1078
FJT3 Cresta/IanB, word-length shifts:  1009 ; clocks: 1149
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 1241

Press ENTER to exit...


Convinces me...   :bg

IanB


[attachment deleted by admin]

Ian_B

As an addendum to the subject of table-indexing speed, I've empirically tested the following on long CRC loops.

Unequivocally, on P4 using the [EBX*4] style of addressing like so:


    MOV EAX, Table[EBX*4]    ; Table is DWORD aligned
    MOV ECX, Table[EDX*4]


is slower, if only by 1% or so, than the following longer code:


    ADD EBX, EBX                ; pre-double each index (clobbers it)...
    ADD EDX, EDX
    MOV EAX, Table[EBX+EBX]     ; ...base+index doubles again: original *4
    MOV ECX, Table[EDX+EDX]


which might be useful for long code loops.

IanB

lingo

#11
Quote from: Ian_B
but I hate using code I can't understand or rewrite myself, and I'm avoiding anything MMX/SSE as it restricts the processors my code can work with.
...And while hyper-optimising your own code, you chose not to bother trying to fix the obvious little things in Hutch's which would improve speed in that, like replacing INC and DEC with ADD 1 and SUB 1. So it does seem like a deliberately rigged comparison to me.


Some years later, I included my algo in Ian_B's testbed: :wink
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Please terminate any high-priority tasks and press ENTER to begin...


String search tests:


Search Test 1 - value expected 37
FJT2 Cresta/IanB, byte-length shifts:  37 ; clocks: 275
FJT3 Cresta/IanB, word-length shifts:  37 ; clocks: 216
FJT4 Cresta/IanB, dword-length shifts: 37 ; clocks: 275
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 141

Search Test 2 - value expected 1007
FJT2 Cresta/IanB, byte-length shifts:  1007 ; clocks: 9884
FJT3 Cresta/IanB, word-length shifts:  1007 ; clocks: 9900
FJT4 Cresta/IanB, dword-length shifts: 1007 ; clocks: 9850
Boyer-Moore Lingo,dword-length shifts: 1007 ; clocks: 8540

Search Test 3 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 653
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 602
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 663
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 531

Search Test 4 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1866
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1306
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1916
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1178

Search Test 5 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 1951
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 1299
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 1925
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1165

Search Test 6 - value expected 1008
FJT2 Cresta/IanB, byte-length shifts:  1008 ; clocks: 770
FJT3 Cresta/IanB, word-length shifts:  1008 ; clocks: 716
FJT4 Cresta/IanB, dword-length shifts: 1008 ; clocks: 775
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 652

Search Test 7 - value expected 1009
FJT2 Cresta/IanB, byte-length shifts:  1009 ; clocks: 961
FJT3 Cresta/IanB, word-length shifts:  1009 ; clocks: 914
FJT4 Cresta/IanB, dword-length shifts: 1009 ; clocks: 970
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 842

Press ENTER to exit...

[attachment deleted by admin]

jj2007

Hi Lingo,
I promised to look at your code. I worked out something last night, but it is still buggy: some of the tests simply don't work, and one (Test 7) is far too slow.
If I find time over the weekend, I'll be back...

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Search Test 1 - value expected 37
Boyer-Moore Lingo,dword-length shifts: 37 ; clocks: 404
Plain byte scanner jj2007            : 37 ; clocks: 97

Search Test 3 - value expected 1008
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1053
Plain byte scanner jj2007            : 1008 ; clocks: 884

Search Test 6 - value expected 1008
Boyer-Moore Lingo,dword-length shifts: 1008 ; clocks: 1131
Plain byte scanner jj2007            : 1008 ; clocks: 887

Search Test 7 - value expected 1009
Boyer-Moore Lingo,dword-length shifts: 1009 ; clocks: 1275
Plain byte scanner jj2007            : 1009 ; clocks: 2344

lingo

The practical question here is: if lenSrchPattern is > N bytes, use BM; otherwise use another scanner.
For now N = 8 bytes, but thanks to the new SSE2/3/4 instructions the value of N should be much bigger  :wink
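That dispatch policy is simple to express. A Python sketch with a compact Horspool inlined for the long-pattern branch; the threshold of 8 is Lingo's suggested starting value and would be tuned per machine:

```python
def hybrid_find(haystack: bytes, needle: bytes, threshold: int = 8) -> int:
    """Dispatch on pattern length: below the threshold a plain scanner
    wins because it pays no table-setup cost; above it, BM-style
    skipping amortises the setup and pulls ahead."""
    m = len(needle)
    if m == 0:
        return 0
    if m <= threshold:
        return haystack.find(needle)   # cheap scanner stand-in, no setup
    shift = [m] * 256                  # Horspool bad-character table
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= len(haystack) - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift[haystack[pos + m - 1]]
    return -1
```

With SSE-accelerated scanners the crossover point N moves up, exactly as suggested, because the "cheap" branch gets cheaper while the table setup cost stays fixed.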


jj2007

Thanks, Lingo. Your code works fine and is very fast, but I can't seem to get this running:

.code
FindInWinInc db "Duplicat", 0

Start:
    invoke Main
    invoke ExitProcess, 0

Main proc

    invoke ShowCPU, 1
    print chr$(13, 10)

    invoke read_disk_file, chr$("\masm32\include\windows.inc"), offset fbuffer, offset flen
    .if eax
    print "Finding "
    print offset FindInWinInc, " in windows.inc:"

    (... other algos)

    print chr$(13, 10, "BMLin12: ", 9)
    counter_begin 1000, HIGH_PRIORITY_CLASS
    invoke BMLin12, fbuffer, offset FindInWinInc, sizeof FindInWinInc - 1
    mov edi, eax
    counter_end
    mov ebx, eax
    print ustr$(edi)
    print chr$(" ; clocks: ")
    print ustr$(ebx)
    invoke Sleep, 250

Output:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Finding Duplicat in windows.inc:
Lib InString:   849668          ; clocks: 4976550
FJT3:           849667          ; clocks: 1357491
BMLin12:        0               ; clocks: 1331421
InstrSse2:      849667          ; clocks: 2093956

Code sizes:
MasmLib         156
FJT3            249
BMLin12         745
InstrSse2       275


Any idea why it returns zero? The cycle count looks plausible...
I wanted a real-life test...

By the way: SSE2 is not very competitive, as you can see :wink