InstrCi - case-insensitive string search

Started by jj2007, June 15, 2008, 10:09:46 PM


jj2007

> InstrCi is already a little bit faster than finstr

To my embarrassment, this is not true for my other machine, a P4 2.4 GHz. But here InstrCi has an edge over the Boyer-Moore algo, at least for short patterns. Any more benchmarks?

----------------------
9508 average crt_strstr res=4233787
7381 average finstr       res=2619
5539 average InStringL res=4233787
7499 average InstrCi 1s res=4233787
7462 average InstrCi 2s res=4233787
7691 average InstrCi 1L res=4233787
7683 average InstrCi 2L res=4233787
7583 * average InstrCi
-202 diff finstr-InstrCi (neg = finstr is faster)
798 diff BM s - InstrCi (neg = BM is faster)
8381 avg Boyer-Moore s  res=2619
7409 avg Boyer-Moore L  res=2616
11857 Hll istring       res=2620
1945881 average StrStrI res=4233787

Sizes:
596 finstr
305 InStringL
187 InstrCi
199 Boyer-Moore
----------------------

[attachment deleted by admin]

Ian_B

Hutch, I'm honoured you remember my optimisation experiments.  :red

Here is the original test piece I put together. I'm sure other people (Lingo) have better versions. My starting point was the existing MASM library, especially Hutch's Boyer-Moore, where I thought I saw scope for improvement. There was also a different Boyer-Moore design by Cresta floating around that was very poorly optimised but looked as if it had potential, especially once an existing bug had been removed.

I also wanted to confirm that the speed of the Boyer-Moore designs varies with the size of the shift table used. With a 255-byte table you can search for patterns up to 254 chars in length, with a 65535-entry table for patterns up to 65534 chars, and so on. Obviously the most efficient searching takes place with smaller tables: they take less time to set up, and because the algorithm has to do some memory operations using [ECX*n] modifiers, avoiding the *4 modifier needed for the 4 GB pattern-length table (256 DWORDs in the table) means less latency on those ops. For maximum efficiency you just select the version that handles patterns at least as long as the one you want to search for.

I wanted to see how the variations compared not just on classic random text, their prime use for most people I guess, but also on some synthetic data. The reason was that Hutch's original complaint about my reducing the table to 255 chars, IIRC, was that the BMs were good for viral pattern matching with extremely large search lengths, which obviously isn't the same as plain text searching. That's why the search strings are a little odd. There is a random-text search at the end of the test suite, though it really should be run on a much larger text sample for reliable results.

There are some comments in the file about the results on my P4 Northwood. The results I got indicate that the Cresta rewrites are definitely the best overall, due to their efficient DWORD test approach, and they might be a good basis for further optimisation if anyone is interested. They are also easily altered into compact dedicated search routines for specific text where you don't need a general algo: you can add the search table as pre-calculated data to avoid the setup overhead, which is useful if you need the same few searches over and over in an app.

Regarding case-insensitive searching: if I were using this code to create such a search, I'd add a setup section to the algo in which both the search string and the text to search are case-folded before running the search, using your OR 20202020h method. That way you do the OR only once for each byte, rather than speculatively, over and over, inside the algo.

Note my preference for more efficient functions with params passed in registers rather than stack params and frames; it also allows easier use of EBP as a spare reg. Sorry if that annoys people. You can recode to your preference if you like.  :P


[attachment deleted by admin]

jj2007

#17
Update on Instring:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Average cycle count:
5035     InstrJJ (jj2007)
4097     BMLinDB (Lingo)
16176    MasmLib InstringL
InstrJJ : InstringL = 31 %




Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

Average cycle counts:
2476     InstrJJ (jj2007)
2990     BMLinDB (Lingo)
6773     BM (Boyer-Moore Hutch)
10158    InString (Masm32 Lib)
InstrJJ : InString = 24 %

Code sizes:
InstrJJ      = 388
BM           = 199
BMLinDB      = 395
InStringL    = 305

Requires SSE2. I would be interested to see some AMD and Dual Core timings :bg

EDIT: Shaved off 7 bytes and >50 cycles.
EDIT(2): Minor bugfix in the one-byte pattern section, 2 bytes less (they caused the bug  :red).
EDIT(3): Attachment removed - will be back when certain oddities are fixed.
EDIT(4): Oddities are fixed, attachment is back, but code is longer now :bg
EDIT(5): Could not resist adding the find "Duplicate inc" in Windows.inc test :wink

I have tried to test quite a number of odd mainstring/pattern combinations, and it seems to be foolproof now, but there might still be surprises (also for the other included algos). Use at your own risk... and please give me feedback if it bounces.

[attachment deleted by admin]

lingo

#18
Latest result:
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)

Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJ:                         38 ; clocks: 104
InString - Lingo:                      37 ; clocks: 39

Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJ:                         1008 ; clocks: 23455
InString - Lingo:                      1007 ; clocks: 6395

Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 686
InString - Lingo:                      1008 ; clocks: 502

Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 4326
InString - Lingo:                      1008 ; clocks: 1417

Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 3367
InString - Lingo:                      1008 ; clocks: 1308

Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJ:                         1009 ; clocks: 641
InString - Lingo:                      1008 ; clocks: 498

Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJ:                         1010 ; clocks: 621
InString - Lingo:                      1009 ; clocks: 502

Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJ:                         1002 ; clocks: 324
InString - Lingo:                      1001 ; clocks: 102

Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJ:                         1002 ; clocks: 629
InString - Lingo:                      1001 ; clocks: 470

Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJ:                         1002 ; clocks: 638
InString - Lingo:                      1001 ; clocks: 448

Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJ:                         1002 ; clocks: 643
InString - Lingo:                      1001 ; clocks: 504

Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJ:                         1002 ; clocks: 787
InString - Lingo:                      1001 ; clocks: 639

Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJ:                         1127625 ; clocks: 679641
InString - Lingo:                      1127624 ; clocks: 542808

Press ENTER to exit...

dedndave

what makes it faster ?

(where's the beef?)

Mark Jones

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
...
Average cycle count:
2955     InstrJJ (jj2007)
3891     BMLinDB (Lingo)
7274     Boyer-Moore
13815    MasmLib InstringL
InstrJJ : InstringL = 21 %
Code size InstrJJ=366
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: dedndave on May 07, 2009, 10:56:07 PM
what makes it faster ?

(where's the beef?)

Here is the inner loop (simplified):

L1: movdqa xmm1, [esi] ; load 16 bytes from current aligned address
movdqu xmm4, [esi+1] ; load another 16 bytes UNALIGNED - must be in the middle
movdqa xmm2, [esi] ; save 16 bytes for testing the zero delimiter - faster than movdqa xmm2, xmm1
Lu: lea esi, [esi+16] ; position in mainstring (moving up/down lea or add costs cycles)
pcmpeqw xmm1, xmm3 ; compare packed words in xmm1 and xmm3 for equality
pcmpeqw xmm4, xmm3 ; compare packed words in xmm4 and xmm3 for equality
pcmpeqb xmm2, xmm1 ; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
pmovmskb eax, xmm2 ; set byte mask in eax for zero delimiter
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern word
pmovmskb ecx, xmm4 ; set byte mask in ecx for search pattern word
shl ecx, 1 ; adjust for esi+1 (add ecx, ecx is a lot slower)
test eax, eax ; zero byte found?
jnz @F ; check/clear ebp, then ChkNull
or edx, ecx ; one of them needs to have the word
je L1 ; 0=no pattern byte found, go back


The secret is in pcmpeqw:
Compare packed words in xmm2/m128 and xmm1 for equality

The SSE2 loop is very fast, much faster than all the rest, because it reads 16 bytes at a time. The art is not to leave it too often. Traditional Instring implementations use a byte scanner as the inner loop: take the first character of your search pattern, e.g. d for dedndave, and loop until you find it; then check the rest - if it's edndave, great, otherwise back to the inner loop. You can imagine that for frequent first characters like e, n or s you have to leave the inner loop very often, and that the algo gets slowed down drastically, in spite of the fast inner loop.

Now the trick here is to use a word scanner: de is a lot less frequent than a single d - statistically speaking, 256 times less frequent. The inner loop is a little slower, because you need some acrobatics (and one slow unaligned movdqu) to implement a word scanner, but it pays off in general. The only algo that comes close is Lingo's Boyer-Moore implementation (and it is even faster on a P4).

dedndave

very cool JJ - I have never used any of the SSE instructions
I am finding it hard to remember all the regular ones - lol
Not as young as I used to be, but I will get there
For my brain to remember, I have to use an instruction someplace at least once

Mark Jones

Whatever happened with measuring code in "LAMPs"? :bg

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Cycle counts:
3224    BMLinDB, addr Mainstr, addr TestSubEx, (SIZEOF TestSubEx)-1
4408    BMLinDB, addr Mainstr, addr TestSubDx, (SIZEOF TestSubDx)-1
4386    BMLinDB, addr Mainstr, addr TestSubXx, (SIZEOF TestSubXx)-1

11922   InString, 1, addr Mainstr, addr TestSubE
13401   InString, 1, addr Mainstr, addr TestSubD
16522   InString, 1, addr Mainstr, addr TestSubX

1774    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1770    InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1771    InstrJJ, 1, addr Mainstr, addr TestSubE, 0

1834    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1823    InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1835    InstrJJ, 1, addr Mainstr, addr TestSubD, 0

5656    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
5609    InstrJJ, 1, addr Mainstr, addr TestSubX, 0
5633    InstrJJ, 1, addr Mainstr, addr TestSubX, 0

Average cycle count:
3078     InstrJJ (jj2007)
4006     BMLinDB (Lingo)
13948    MasmLib InstringL
InstrJJ : InstringL = 22 %
Code sizes:
InstrJJ      = 359
BM           = 199
BMLinDB      = 395
InStringL    = 305
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

dedndave

never heard of LAMPS

we used KOPS
            MOPS
           FLOPS
kinda like "gallons per mile" - lol

they were really meaningless terms, because no one could agree on what constituted one operation
i think each benchmark i saw used a slightly different definition

many benchmarks measured how fast they could do a bunch of NOPs

the real-world standard was how fast you could re-calculate some pre-defined spreadsheet with Lotus 1-2-3 - lol


jj2007

Quote from: Mark Jones on May 08, 2009, 09:44:37 PM
Whatever happened with measuring code in "LAMPs"? :bg


Right now I am a bit confused with my macros :dazzled:

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

3078     InstrJJ (jj2007)
4006     BMLinDB (Lingo)
13948    MasmLib InstringL
InstrJJ : InstringL = 22 %
Code sizes:
InstrJJ      = 359
BM           = 199
BMLinDB      = 395
InStringL    = 305


I love it, thanks for the timings :bg

Anyway, it seems even LAMP-wise the algo is competitive - SSE2 rocks :toothy

I had to post a bugfix, see above. The algo reported wrong values when a one-byte pattern was found behind the main string's zero delimiter.

Mark Jones

Welcome. Keep in mind that Lingo optimizes for his hardware, so his timings are still valid. In fact, AMD has been continually losing market share for a long time now, and Intel is in the vast majority. Someday, there may not be any more AMD chips to worry about timing. ::)

Curious, why did you choose CLOCKS * SQRT(SIZE) for the LAMP rating?
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Mark Jones on May 09, 2009, 12:07:01 AM

Curious, why did you choose CLOCKS * SQRT(SIZE) for the LAMP rating?


There is no "exact science" behind this formula, but it reflects the philosophy of many who come to this forum: lean and mean performance, i.e. code that is small and fast - the equivalent of WIN32_LEAN_AND_MEAN. Of course, nobody would choose code that is half the size but twice as slow; on the other hand, many here would hesitate to put code into their libraries that is 2% faster but twice as big. So I chose the square root of the size, in order to reduce its influence.

In the end you will often opt for the faster code, but the LAMPs are an incentive to care about size, too.
:bg

MinLampSize = 200
LAMP$ MACRO cycles:REQ, csize:REQ ; Usage: print LAMP$(cycles, size), " LAMPs", 13, 10
  push ecx
  push cycles
  ffree ST(7)              ;; free the stack for pushing
  mov ecx, csize
  .if ecx == 0
      add ecx, MinLampSize
  .endif
  push ecx
  fild dword ptr [esp]     ;; move into ST(0)
  pop ecx
  fsqrt
  push cycles
  fild dword ptr [esp]     ;; push cycles on FPU
  fmul ST, ST(1)           ;; multiply with sqrt(size)
  fistp dword ptr [esp]
  pop eax
  ifndef LampsBuffer
    .data?
      LampsCycles dd ?
      LampsBuffer dd 80 dup (?)
    .code
  endif
  shr eax, 10              ;; kiloLAMPs
  invoke dwtoa, eax, addr LampsBuffer
  pop eax
  shr eax, 10
  mov LampsCycles, eax
  pop ecx
  EXITM <offset LampsBuffer>
ENDM


Usage:
print "getlines:", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke getlines, Win$, cbFile
mov esi, eax
counter_end
print LAMP$(eax, offset getlinesE-getlines), " kiloLAMPs"
print str$(LampsCycles), " kilocycles"

jj2007

Bugfix for the InstrJJ SSE2 version posted, see attachment above. Size increased a bit, speed is still the same:

Average cycle counts:
2485     InstrJJ (jj2007)
2990     BMLinDB (Lingo)
10139    InString (Masm32 Lib)
InstrJJ : InString = 24 %


Apologies to those who downloaded the previous versions.

jj2007

Hi everybody,
I have switched from the SSE2 word scanner to a byte scanner with improved non-SSE loops. It is now a few cycles slower on a P4, but I have a suspicion that it rocks on Core & AMD.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Average cycle counts:
4280     InstrJJ (jj2007)
16486    InString (Masm32 Lib)


Can I have some timings, please? Thanks.

[attachment deleted by admin]