InstrCi - case-insensitive string search

Started by jj2007, June 15, 2008, 10:09:46 PM


jj2007

> InstrCi is already a little bit faster than finstr

To my embarrassment, this is not true for my other machine, a P4 2.4 GHz. But here InstrCi has an edge over the Boyer-Moore algo, at least for short patterns. Any more benchmarks?

----------------------
9508 average crt_strstr res=4233787
7381 average finstr       res=2619
5539 average InStringL res=4233787
7499 average InstrCi 1s res=4233787
7462 average InstrCi 2s res=4233787
7691 average InstrCi 1L res=4233787
7683 average InstrCi 2L res=4233787
7583 * average InstrCi
-202 diff finstr-InstrCi (neg = finstr is faster)
798 diff BM s - InstrCi (neg = BM is faster)
8381 avg Boyer-Moore s  res=2619
7409 avg Boyer-Moore L  res=2616
11857 Hll istring       res=2620
1945881 average StrStrI res=4233787

Sizes:
596 finstr
305 InStringL
187 InstrCi
199 Boyer-Moore
----------------------

[attachment deleted by admin]

Ian_B

Hutch, I'm honoured you remember my optimisation experiments.  :red

Here is the original test piece I put together. I'm sure other people (Lingo) have better versions. My starting point was the existing MASM library, especially Hutch's Boyer-Moore, where I thought I saw scope for improvement. There was also a different Boyer-Moore design by Cresta floating around that was very poorly optimised but looked as if it had potential, especially once an existing bug had been removed.

I also wanted to confirm that the speed of the Boyer-Moore designs varies with the size of the shift table used. With a 255-byte table you can search for patterns up to 254 chars in length, with a 65535-entry table for patterns up to 65534 chars, and so on. Obviously the most efficient searching takes place with smaller tables: they take less time to set up, and because the algorithm has to do some memory operations using [ECX*n] modifiers, avoiding the *4 modifier needed for the 4 GB pattern-length table (256 DWORDs in the table) means less latency on those ops. For maximum efficiency you just select the version that handles patterns at least as long as the one you want to search for.

I wanted to see how the variations compared not just on classic random text, their prime use for most people I guess, but also on some synthetic data. The reason was that Hutch's original complaint about my reducing the table to 255 chars, IIRC, was that the BMs were good for viral pattern matching with extremely large search lengths, which obviously isn't the same as plain text searching. That's why the search strings are a little odd. There is a random-text search at the end of the test suite, though it really should be run on a much larger text sample for reliable results.

There are some comments in the file about the results on my P4 Northwood. The results I got indicate that the Cresta rewrites are definitely the best overall, due to their efficient DWORD test approach, and they might be a good basis for further optimisation if anyone is interested. They are also easily altered into compact dedicated search routines for specific text where you don't need a general algo: you can add the search table as pre-calculated data to avoid the setup overhead, which is useful if you need the same few searches over and over in an app.

Regarding case-insensitive searching: if I were using this code to create such a search, I'd add a setup section to the algo in which both the search string and the text to search are case-folded before running the search, using your OR 20202020h method. That way you do the OR only once for each byte, rather than speculatively, over and over, inside the algo.

Note my preference for more efficient functions with params passed in registers rather than stack params and frames; it also allows easier use of EBP as a spare reg. Sorry if that annoys people. You can recode to your preference if you like.  :P


[attachment deleted by admin]

jj2007

#17
Update on Instring:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Average cycle count:
5035     InstrJJ (jj2007)
4097     BMLinDB (Lingo)
16176    MasmLib InstringL
InstrJJ : InstringL = 31 %




Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

Average cycle counts:
2476     InstrJJ (jj2007)
2990     BMLinDB (Lingo)
6773     BM (Boyer-Moore Hutch)
10158    InString (Masm32 Lib)
InstrJJ : InString = 24 %

Code sizes:
InstrJJ      = 388
BM           = 199
BMLinDB      = 395
InStringL    = 305

Requires SSE2. I would be interested to see some AMD and Dual Core timings :bg

EDIT: Shaved off 7 bytes and >50 cycles.
EDIT(2): Minor bugfix in the one-byte pattern section, 2 bytes less (they caused the bug  :red).
EDIT(3): Attachment removed - will be back when certain oddities are fixed.
EDIT(4): Oddities are fixed, attachment is back, but code is longer now :bg
EDIT(5): Could not resist adding the find "Duplicate inc" in Windows.inc test :wink

I have tried to test quite a number of odd mainstring/pattern combinations, and it seems to be foolproof now, but there might still be surprises (also for the other included algos). Use at your own risk... and please give me feedback if it bounces.

[attachment deleted by admin]

lingo

#18
Latest result:
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJ:                         38 ; clocks: 104
InString - Lingo:                      37 ; clocks: 39

Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJ:                         1008 ; clocks: 23455
InString - Lingo:                      1007 ; clocks: 6395

Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 686
InString - Lingo:                      1008 ; clocks: 502

Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 4326
InString - Lingo:                      1008 ; clocks: 1417

Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 3367
InString - Lingo:                      1008 ; clocks: 1308

Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 641
InString - Lingo:                      1008 ; clocks: 498

Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJ:                         1010 ; clocks: 621
InString - Lingo:                      1009 ; clocks: 502

Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJ:                         1002 ; clocks: 324
InString - Lingo:                      1001 ; clocks: 102

Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJ:                         1002 ; clocks: 629
InString - Lingo:                      1001 ; clocks: 470

Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJ:                         1002 ; clocks: 638
InString - Lingo:                      1001 ; clocks: 448

Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJ:                         1002 ; clocks: 643
InString - Lingo:                      1001 ; clocks: 504

Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJ:                         1002 ; clocks: 787
InString - Lingo:                      1001 ; clocks: 639

Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJ:                         1127625 ; clocks: 679641
InString - Lingo:                      1127624 ; clocks: 542808

Press ENTER to exit...

dedndave

what makes it faster ?

(where's the beef?)

Mark Jones

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
...
Average cycle count:
2955     InstrJJ (jj2007)
3891     BMLinDB (Lingo)
7274     Boyer-Moore
13815    MasmLib InstringL
InstrJJ : InstringL = 21 %
Code size InstrJJ=366
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: dedndave on May 07, 2009, 10:56:07 PM
what makes it faster ?

(where's the beef?)

Here is the inner loop (simplified):

L1: movdqa xmm1, [esi] ; load 16 bytes from current aligned address
movdqu xmm4, [esi+1] ; load another 16 bytes UNALIGNED - must be in the middle
movdqa xmm2, [esi] ; save 16 bytes for testing the zero delimiter - faster than movdqa xmm2, xmm1
Lu: lea esi, [esi+16] ; position in mainstring (moving up/down lea or add costs cycles)
pcmpeqw xmm1, xmm3 ; compare packed words in xmm1 and xmm3 for equality
pcmpeqw xmm4, xmm3 ; compare packed words in xmm4 and xmm3 for equality
pcmpeqb xmm2, xmm1 ; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
pmovmskb eax, xmm2 ; set byte mask in eax for zero delimiter
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern word
pmovmskb ecx, xmm4 ; set byte mask in ecx for search pattern word
shl ecx, 1 ; adjust for esi+1 (add ecx, ecx is a lot slower)
test eax, eax ; zero byte found?
jnz @F ; check/clear ebp, then ChkNull
or edx, ecx ; one of them needs to have the word
je L1 ; 0=no pattern byte found, go back


The secret is in pcmpeqw:
Compare packed words in xmm2/m128 and xmm1 for equality

The SSE2 loop is very fast, much faster than all the rest, because it reads 16 bytes at a time. The art is not to leave it too often. Traditional Instring implementations use a byte scanner as the inner loop: take the first character of your search pattern, e.g. d for dedndave, and loop until you find it; then check the rest - if it's edndave, great, otherwise back to the inner loop. You can imagine that for frequent first characters like e, n or s you have to leave the inner loop very often, and that the algo gets slowed down drastically, in spite of the fast inner loop.

Now the trick here is to use a word scanner: de is a lot less frequent than a single d - statistically speaking, 256 times less frequent. The inner loop is a little slower, because you need some acrobatics (and one slow unaligned movdqu) to implement a word scanner, but it pays off in general. The only algo that comes close is Lingo's Boyer-Moore implementation (and it is even faster on a P4).

dedndave

very cool JJ - I have never used any of the SSE instructions
I am finding it hard to remember all the regular ones - lol
Not as young as I used to be, but I will get there
For my brain to remember, I have to use an instruction someplace at least once

Mark Jones

Whatever happened with measuring code in "LAMPs"? :bg

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Cycle counts:
3224    BMLinDB, addr Mainstr, addr TestSubEx, (SIZEOF TestSubEx)-1
4408    BMLinDB, addr Mainstr, addr TestSubDx, (SIZEOF TestSubDx)-1
4386    BMLinDB, addr Mainstr, addr TestSubXx, (SIZEOF TestSubXx)-1

11922   InString, 1, addr Mainstr, addr TestSubE
13401   InString, 1, addr Mainstr, addr TestSubD
16522   InString, 1, addr Mainstr, addr TestSubX

1774    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1770    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1771    InstrJJ, 1, addr Mainstr, addr TestSubE, 0

1834    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1823    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1835    InstrJJ, 1, addr Mainstr, addr TestSubD, 0

5656    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
5609    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
5633    InstrJJ, 1, addr Mainstr, addr TestSubX, 0

Average cycle count:
3078     InstrJJ (jj2007)
4006     BMLinDB (Lingo)
13948    MasmLib InstringL
InstrJJ : InstringL = 22 %
Code sizes:
InstrJJ      = 359
BM           = 199
BMLinDB      = 395
InStringL    = 305
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

dedndave

never heard of LAMPS

we used KOPS
            MOPS
           FLOPS
kinda like "gallons per mile" - lol

they were really meaningless terms, because no one could agree on what constituted one operation
i think each benchmark i saw used a slightly different definition

many benchmarks measured how fast they could do a bunch of NOPs

the real-world standard was how fast you could re-calculate some pre-defined spreadsheet with Lotus 1-2-3 - lol


jj2007

Quote from: Mark Jones on May 08, 2009, 09:44:37 PM
Whatever happened with measuring code in "LAMPs"? :bg


Right now I am a bit confused with my macros :dazzled:

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

3078     InstrJJ (jj2007)
4006     BMLinDB (Lingo)
13948    MasmLib InstringL
InstrJJ : InstringL = 22 %
Code sizes:
InstrJJ      = 359
BM           = 199
BMLinDB      = 395
InStringL    = 305


I love it, thanks for the timings :bg

Anyway, it seems even LAMP-wise the algo is competitive - SSE2 rocks :toothy

I had to post a bugfix, see above. The algo reported wrong values when a one-byte pattern was found behind the main string's zero delimiter.

Mark Jones

Welcome. Keep in mind that Lingo optimizes for his hardware, so his timings are still valid. In fact, AMD has been continually losing market share for a long time now, and Intel is in the vast majority. Someday, there may not be any more AMD chips to worry about timing. ::)

Curious, why did you choose CLOCKS * SQRT(SIZE) for the LAMP rating?
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on May 09, 2009, 12:07:01 AM

Curious, why did you choose CLOCKS * SQRT(SIZE) for the LAMP rating?


There is no "exact science" behind this formula, but it reflects the philosophy of many who come to this forum: lean and mean performance, i.e. code that is small and fast - the equivalent of WIN32_LEAN_AND_MEAN. Of course, nobody would choose code that is half the size but twice as slow; on the other hand, many here would hesitate to put code into their libraries that is 2% faster but twice as big. So I chose the square root of the size, in order to reduce its influence.

In the end you will often opt for the faster code, but the LAMPs are an incentive to care about size, too.
:bg

MinLampSize = 200
LAMP$ MACRO cycles:REQ, csize:REQ ; Usage: print LAMP$(cycles, size), " LAMPs", 13, 10
  push ecx
  push cycles
  ffree ST(7)              ;; free the stack for pushing
  mov ecx, csize
  .if ecx == 0
      add ecx, MinLampSize
  .endif
  push ecx
  fild dword ptr [esp]     ;; move into ST(0)
  pop ecx
  fsqrt
  push cycles
  fild dword ptr [esp]     ;; push cycles on FPU
  fmul ST, ST(1)           ;; multiply with sqrt(size)
  fistp dword ptr [esp]
  pop eax
  ifndef LampsBuffer
    .data?
      LampsCycles dd ?
      LampsBuffer dd 80 dup (?)
    .code
  endif
  shr eax, 10              ;; kiloLAMPs
  invoke dwtoa, eax, addr LampsBuffer
  pop eax
  shr eax, 10
  mov LampsCycles, eax
  pop ecx
  EXITM <offset LampsBuffer>
ENDM


Usage:
print "getlines:", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke getlines, Win$, cbFile
mov esi, eax
counter_end
print LAMP$(eax, offset getlinesE-getlines), " kiloLAMPs"
print str$(LampsCycles), " kilocycles"

jj2007

Bugfix for the InstrJJ SSE2 version posted, see attachment above. Size increased a bit, speed is still the same:

Average cycle counts:
2485     InstrJJ (jj2007)
2990     BMLinDB (Lingo)
10139    InString (Masm32 Lib)
InstrJJ : InString = 24 %


Apologies to those who downloaded the previous versions.

jj2007

Hi everybody,
I have switched from the SSE2 word scanner to a byte scanner with improved non-SSE loops. It is now a few cycles slower on a P4, but I have a suspicion that it rocks on Core & AMD.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Average cycle counts:
4280     InstrJJ (jj2007)
16486    InString (Masm32 Lib)


Can I have some timings, please? Thanks.

[attachment deleted by admin]