BoyerMoore routines possibly inefficient if invoked repeatedly?

Started by l_d_allan, February 21, 2006, 07:14:22 AM

Previous topic - Next topic

l_d_allan

I was looking at the Boyer-Moore search routines for finding a substring within a string, and wanted to check on something.

The three available BM variants seem to work, but are perhaps ill-suited for some (many?) usages when the same substring is being searched for over and over .... which would seem like a common usage.

A scenario: suppose you have a 4 meg memory buffer that is the result of reading in 30,000 lines from a file and you want to determine which lines and how many of these lines contain the substring "Is this long substring in the line?" (... a separate data structure has the lengths of each CrLf delimited line)

For the Horspool (and Simplified?) implementation, the code seems to require that the BadChar shift-table be reloaded each time. For the original Boyer-Moore algorithm, both the BadChar and GoodChar shift-tables are reloaded for each call.

If my understanding is more or less correct, perhaps a preferred implementation would involve two separate steps .... load the shift table(s) ... and a separate call to find the next match .... or perhaps something like Bmh_FindFirst and Bmh_FindNext for the Horspool variant.

For the scenario described above, the Bmh_FindFirst function would perform the LoadBadCharShiftTable and would be called once. The Bmh_FindNext would reuse this BadCharShiftTable and be called 30,000 times.

Or perhaps my understanding is incorrect ... I don't suppose there is a way to bypass the BadCharShiftTable loading .... unless the StartPos parameter controls this .... but my less than informed look at SBM.asm and BMH.asm would suggest that startPos doesn't seem to be used that way.

If you think it might be appropriate to implement a two-step approach (something like Bmh_FindFirst and Bmh_FindNext), my x86 skills are far too rusty to be of any direct help, but I could try to code and test C routines .... I don't like to suggest something and not offer to help.


hutch--

I coded those about 5 years ago for a PIII and the usage in mind was searching very long strings but with the application you have in mind, you would use a remote table and pass its address to the BM style routine. This would save you the table build time for each call when you are repeatedly searching the same text in different files or similar.

Lingo has posted some more modern ones that are faster again but note that a BM style algo is not universally useful as they are not well suited for words under about 6 characters and they perform badly with highly repetitious source data to search through. I still prefer a classic byte scanner as a general purpose search algo but where you can use a BM to its full advantage, they are clearly faster by a long way.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

l_d_allan

QuoteBM style algo is not universally useful as they are not well suited for words under about 6 characters
My experience with reusing the BadCharShiftTable is that with reasonably normal text data like a book, BM style algos work about as fast as a classic byte scanner (or strstr) with a three letter substring, and typicallly are faster with four letters and longer.

I was surprised the Horspool algo did as well as it did with short search strings .... (but with the caveat that this is very "well behaved" data). There doesn't seem to be much benefit from shifting algos for short search patterns. However, I can agree that the situation might be VERY different with binary data.

The table below uses Windows QueryPerformanceCounter from five repetitions, and has the elapsed microseconds to find all occurrences in a 45kb memory buffer. 

Pentium-II (microseconds)
Pattern     Matches   StrStr   Bmh
----------------------------------------------
of            453        473     616
for           167        405     499
the          1036        563     682
four           11        400     378
earth          90        550     397
ground         26        379     286

Pentium-III (microseconds)
Pattern     Matches   StrStr   Bmh
----------------------------------------------
of            453        211     275
for           167        179     211
the          1036        255     298
four           11        176     161
earth          90        248     169
ground         26        165     116

AMD-64 (microseconds)
Pattern     Matches   StrStr   Bmh
----------------------------------------------
of            453         89     128
for           167         74      92
the          1036        105     123
four           11         73      68
earth          90        112      69
ground         26         69      50

l_d_allan

BTW ... there is a minor inconsistency with the naming of the three BM algos ...

BMBinSearch   <--- upper case S in "Search"
BMHBinsearch <--- lower case s in "search"
SBMBinSearch

Just a nit ....