BoyerMoore routines possibly inefficient if invoked repeatedly?

l_d_allan · February 21, 2006, 07:14:22 AM

I was looking at the Boyer-Moore search routines for finding a substring within a string, and wanted to check on something.

The three available BM variants seem to work, but are perhaps ill-suited for some (many?) usages when the same substring is being searched for over and over .... which would seem like a common usage.

A scenario: suppose you have a 4 meg memory buffer that is the result of reading in 30,000 lines from a file and you want to determine which lines and how many of these lines contain the substring "Is this long substring in the line?" (... a separate data structure has the lengths of each CrLf delimited line)

For the Horspool (and Simplified?) implementation, the code seems to require that the BadChar shift-table be reloaded each time. For the original Boyer-Moore algorithm, both the BadChar and GoodChar shift-tables are reloaded for each call.

If my understanding is more or less correct, perhaps a preferred implementation would involve two separate steps .... load the shift table(s) ... and a separate call to find the next match .... or perhaps something like Bmh_FindFirst and Bmh_FindNext for the Horspool variant.

For the scenario described above, the Bmh_FindFirst function would perform the LoadBadCharShiftTable and would be called once. The Bmh_FindNext would reuse this BadCharShiftTable and be called 30,000 times.

Or perhaps my understanding is incorrect ... I don't suppose there is a way to bypass the BadCharShiftTable loading .... unless the StartPos parameter controls this .... but my less than informed look at SBM.asm and BMH.asm would suggest that startPos doesn't seem to be used that way.

If you think it might be appropriate to implement a two-step approach (something like Bmh_FindFirst and Bmh_FindNext), my x86 skills are far too rusty to be of any direct help, but I could try to code and test C routines .... I don't like to suggest something and not offer to help.

hutch-- · February 21, 2006, 07:40:02 AM

I coded those about 5 years ago for a PIII and the usage in mind was searching very long strings but with the application you have in mind, you would use a remote table and pass its address to the BM style routine. This would save you the table build time for each call when you are repeatedly searching the same text in different files or similar.

Lingo has posted some more modern ones that are faster again but note that a BM style algo is not universally useful as they are not well suited for words under about 6 characters and they perform badly with highly repetitious source data to search through. I still prefer a classic byte scanner as a general purpose search algo but where you can use a BM to its full advantage, they are clearly faster by a long way.

l_d_allan · February 21, 2006, 02:22:29 PM

QuoteBM style algo is not universally useful as they are not well suited for words under about 6 characters

My experience with reusing the BadCharShiftTable is that with reasonably normal text data like a book, BM style algos work about as fast as a classic byte scanner (or strstr) with a three letter substring, and typicallly are faster with four letters and longer.

I was surprised the Horspool algo did as well as it did with short search strings .... (but with the caveat that this is very "well behaved" data). There doesn't seem to be much benefit from shifting algos for short search patterns. However, I can agree that the situation might be VERY different with binary data.

The table below uses Windows QueryPerformanceCounter from five repetitions, and has the elapsed microseconds to find all occurrences in a 45kb memory buffer.

Pentium-II (microseconds)
Pattern Matches StrStr Bmh
----------------------------------------------
of 453 473 616
for 167 405 499
the 1036 563 682
four 11 400 378
earth 90 550 397
ground 26 379 286

Pentium-III (microseconds)
Pattern Matches StrStr Bmh
----------------------------------------------
of 453 211 275
for 167 179 211
the 1036 255 298
four 11 176 161
earth 90 248 169
ground 26 165 116

AMD-64 (microseconds)
Pattern Matches StrStr Bmh
----------------------------------------------
of 453 89 128
for 167 74 92
the 1036 105 123
four 11 73 68
earth 90 112 69
ground 26 69 50

l_d_allan · February 21, 2006, 02:33:56 PM

BTW ... there is a minor inconsistency with the naming of the three BM algos ...

BMBinSearch <--- upper case S in "Search"
BMHBinsearch <--- lower case s in "search"
SBMBinSearch

Just a nit ....

News:

BoyerMoore routines possibly inefficient if invoked repeatedly?

l_d_allan

hutch--

l_d_allan

l_d_allan