Faster BM algo

lingo · July 19, 2005, 12:16:48 AM

:bg
I played with buliaNaza's and Hutch's BM algos and also
with cresta's BM algo (link here):
www.wasm.ru/forum/index.php?action=vthread&forum=3&topic=9972&page=3

I implemented some new ideas in faster BM algo so please test it
and indicate your CPU too

A. If we work without End of Buffer Control:
- it will be faster
- we will free one register
- and 1 function parameter less

If we move the second buffer with SearchPattern immediately after the big Buffer, our proc will find ALWAYS the SearchPattern,
hence we don't need the End of Buffer Control code.

For example we decide to read external file in the big Buffer and we allocate 4096 bytes for it and 255 bytes long SearchPattern
buffer and 1 byte more for my program too:

Code Select



invoke	GlobalAlloc, GMEM_ZEROINIT, 4096+255+1
	mov  hmem, eax
	....
and later
	mov  ecx, hmem	     ; ecx->lpBuffer1
	lea  edx, [ecx+4096] ; edx->address of the  SearchPattern buffer
        invoke  BMLin12, ecx, edx, 34	 ; where  34 < 255 -> length of  the SearchPattern

Our code will be faster too:

B. If we work with dword comparing code only (without byte by byte comparing code)

C. If we try to implement a mechanism (without additional code) to handle the occurrence of most repeated sequences of 4 bytes (first and last searching dwords at the beginning and the end) of the Search Pattern

Code Select


Example: If our SearchPattern is 1234567890 and if we have in the big Buffer:  ....12347756340,....12346857380, .... 1234567890 and:
	cmp	[edi], ebp     ; ebp->first dword of the SrchPattern = 1234
	jne	@@1_1
we will have THREE comparisons in our case if we don’t switch to other code:
	cmp	[edi+edx],ebx  ; ebx->last dword of the SrchPattern = 7890
	jne	@@2_1
if we switch to it after the first comparison (because the last four bytes are not 7890) we will have total just TWO comparisons

Here is the result with my P4 3.6 GHz Prescott

Code Select


Timing routines by MichaelW  - www.masmforum.com
Please terminate any high-priority tasks and press ENTER to begin.

Boyer - Moore Tests:

BMBinS(BM Hutch -> Begining of the Buffer): 1731 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 1453 clocks; Return value: 111
BMLin (BM Lingo -> Begining of the Buffer): 987 clocks; Return value: 111
BMBinS(BM Hutch -> Middle of the Buffer): 48214 clocks; Return value: 19384
BMFJT (BM Cresta-> Middle of the Buffer): 36940 clocks; Return value: 19384
BMLin (BM Lingo -> Middle of the Buffer): 21571 clocks; Return value: 19384
BMBinS(BM Hutch -> End of the Buffer): 67942 clocks; Return value: 29621
BMFJT (BM Cresta-> End of the Buffer): 43276 clocks; Return value: 29621
BMLin (BM Lingo -> End of the Buffer): 27869 clocks; Return value: 29621
BMBinS(BM Hutch -> Not found): 94338 clocks; Return value: -1
BMFJT (BM Cresta-> Not found): 62079 clocks; Return value: -1
BMLin (BM Lingo -> Not found): 35697 clocks; Return value: -1

Press ENTER to exit...

Regards,
Lingo

[attachment deleted by admin]

Jimg · July 19, 2005, 01:53:00 AM

Hi Lingo-

Won't run for me,
Is this more of that sse2 stuff I can't run on my lowly Athlon XP 3100+ ?

hutch-- · July 19, 2005, 02:50:21 AM

Hi Lingo,

Thanks for setting up the tests. I was growing old waiting for the 10 million iterations so I changed it to 1 million. here are the results.

BMBinS(BM Hutch -> Begining of the Buffer): 1462 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 1331 clocks; Return value: 111
BMLin (BM Lingo -> Begining of the Buffer): 734 clocks; Return value: 111
BMBinS(BM Hutch -> Middle of the Buffer): 37004 clocks; Return value: 19384
BMFJT (BM Cresta-> Middle of the Buffer): 38232 clocks; Return value: 19384
BMLin (BM Lingo -> Middle of the Buffer): 21527 clocks; Return value: 19384
BMBinS(BM Hutch -> End of the Buffer): 51599 clocks; Return value: 29621
BMFJT (BM Cresta-> End of the Buffer): 45006 clocks; Return value: 29621
BMLin (BM Lingo -> End of the Buffer): 28496 clocks; Return value: 29621
BMBinS(BM Hutch -> Not found): 75695 clocks; Return value: 4294967295
BMFJT (BM Cresta-> Not found): 63013 clocks; Return value: 4294967295
BMLin (BM Lingo -> Not found): 34594 clocks; Return value: 4294967295

The results look good. I would be inclined to set up a test with a far longer sample, somewhere in the order of 10 meg or greater. Just test for a file name and if its not present, create a file using a standard piece of text and duplicate it enough times to get the file size, then overwrite the 3 samples you want to test on at the beginning, middle and end of file. The longer sample is far better suited to testing a BM algo and it should demonstrate the additional speed you new algo has properly.

It is probably worth the effort to use different pattern lengths as well as a BM get a lot faster as the pattern length gets longer.

One request, keep the iteration count down so that the test is not too long. When I ran the original, I thought the box had locked up. :bg

PS : The tests are on my 2.8 gig Prescott PIV. DDR400 memory, 800 meg FSB Intel board.

lingo · July 19, 2005, 04:40:15 PM

Thanks Jimg, :toothy

Unfortunately I can't help right now :'(

Regards,
Lingo

lingo · July 19, 2005, 04:41:32 PM

Thanks Hutch, :toothy

"...When I ran the original, I thought the box had locked up."
It is cofee time for me :bdg

Regards,
Lingo

Human · January 06, 2006, 01:36:34 PM

looks on xp1700+ with 768mb sdr133 with kt133 board
they are faster than p4 3.6GHZ

BMBinS(BM Hutch -> Begining of the Buffer): 1267 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 946 clocks; Return value: 111
BMBinS(BM Hutch -> Middle of the Buffer): 28526 clocks; Return value: 19384
BMFJT (BM Cresta-> Middle of the Buffer): 15564 clocks; Return value: 19384
BMBinS(BM Hutch -> End of the Buffer): 41916 clocks; Return value: 29621
BMFJT (BM Cresta-> End of the Buffer): 17992 clocks; Return value: 29621
BMBinS(BM Hutch -> Not found): 66088 clocks; Return value: 4294967295
BMFJT (BM Cresta-> Not found): 26358 clocks; Return value: 4294967295

deleted all sse2 by lingo due no sse2 in my cpu
seems amd wins with p4 due no screwed simple risc instructions like inc and etc, also bigger cache 64/64KB L1 and 256KB L2
also moded to 1M runs because it was too long

2 first for 10M runs
BMBinS(BM Hutch -> Begining of the Buffer): 1347 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 946 clocks; Return value: 111
also faster

[attachment deleted by admin]

lingo · January 07, 2006, 07:50:31 PM

Human, :lol

On my Aspire 5002 lapi with AMD Turion64 ML-30 CPU (1 MB L2 cache, 1.6 Ghz)
and 512M DDR:

Code Select



Windows XP Pro SP2:
BMBinS(BM Hutch -> Begining of the Buffer): 1225 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 941 clocks; Return value: 111
BMLin (BM Lingo -> Begining of the Buffer): 445 clocks; Return value: 111
BMBinS(BM Hutch -> Middle of the Buffer): 29369 clocks; Return value: 19384
BMFJT (BM Cresta-> Middle of the Buffer): 15754 clocks; Return value: 19384
BMLin (BM Lingo -> Middle of the Buffer): 10901 clocks; Return value: 19384
BMBinS(BM Hutch -> End of the Buffer): 43623 clocks; Return value: 29621
BMFJT (BM Cresta-> End of the Buffer): 18199 clocks; Return value: 29621
BMLin (BM Lingo -> End of the Buffer): 12592 clocks; Return value: 29621
BMBinS(BM Hutch -> Not found): 68114 clocks; Return value: -1
BMFJT (BM Cresta-> Not found): 26795 clocks; Return value: -1
BMLin (BM Lingo -> Not found): 18761 clocks; Return value: -1


Windows XP Pro 64bit edition SP1:
BMBinS(BM Hutch -> Begining of the Buffer): 1215 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 931 clocks; Return value: 111
BMLin (BM Lingo -> Begining of the Buffer): 440 clocks; Return value: 111
BMBinS(BM Hutch -> Middle of the Buffer): 29010 clocks; Return value: 19384
BMFJT (BM Cresta-> Middle of the Buffer): 15583 clocks; Return value: 19384
BMLin (BM Lingo -> Middle of the Buffer): 10779 clocks; Return value: 19384
BMBinS(BM Hutch -> End of the Buffer): 43107 clocks; Return value: 29621
BMFJT (BM Cresta-> End of the Buffer): 17980 clocks; Return value: 29621
BMLin (BM Lingo -> End of the Buffer): 12451 clocks; Return value: 29621
BMBinS(BM Hutch -> Not found): 67247 clocks; Return value: -1
BMFJT (BM Cresta-> Not found): 26347 clocks; Return value: -1
BMLin (BM Lingo -> Not found): 18420 clocks; Return value: -1

So, the speed depends on OS too...

Regards,
Lingo

zooba · January 08, 2006, 05:18:46 AM

Quote from: lingo on January 07, 2006, 07:50:31 PM
So, the speed depends on OS too...

I wouldn't have thought so. The speed differences are so small as to be negligible. I'm pretty sure the 64-bit AMDs run 32-bit stuff in a second core regardless of the operating system so for pure memory crunching I doubt the OS will make any difference at all (assuming that it doesn't make it crash :U)

EduardoS · January 08, 2006, 01:34:37 PM

On an A64

Code Select


BMBinS(BM Hutch -> Begining of the Buffer): 1212 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 930 clocks; Return value: 111
BMLin (BM Lingo -> Begining of the Buffer): 440 clocks; Return value: 111
BMBinS(BM Hutch -> Middle of the Buffer): 28969 clocks; Return value: 19384
BMFJT (BM Cresta-> Middle of the Buffer): 15561 clocks; Return value: 19384
BMLin (BM Lingo -> Middle of the Buffer): 10768 clocks; Return value: 19384
BMBinS(BM Hutch -> End of the Buffer): 43054 clocks; Return value: 29621
BMFJT (BM Cresta-> End of the Buffer): 17970 clocks; Return value: 29621
BMLin (BM Lingo -> End of the Buffer): 12437 clocks; Return value: 29621
BMBinS(BM Hutch -> Not found): 67204 clocks; Return value: -1
BMFJT (BM Cresta-> Not found): 26336 clocks; Return value: -1
BMLin (BM Lingo -> Not found): 18422 clocks; Return value: -1

I change the loop count to 100 000 cause i hate wait more than 1 sec for an algo

Ian_B · January 08, 2006, 05:33:56 PM

I'd just point out that Cresta's code being tested here is VERY inefficient, and the version I optimised in the "String searching" thread is running about 25% to 40% faster than his original, depending on the type of data being searched. So for anyone who can't compile Lingo's SSE2 code, or doesn't want to have to target only processors that can run it, this is an alternative that is almost as fast, as these timings are all showing the Cresta code lagging by about 30% speedwise.

http://www.masmforum.com/simple/index.php?topic=3601.msg27022#msg27022

IanB

News:

Faster BM algo

lingo

Jimg

hutch--

lingo

lingo

Human

lingo

zooba

EduardoS

Ian_B