News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

InstrCi - case-insensitive string search

Started by jj2007, June 15, 2008, 10:09:46 PM

Previous topic - Next topic

jj2007

I needed a case-insensitive string search and didn't find anything usable provided by the OS, so I got inspired by the old InStringL/finstr etc posts.
My new function InstrCi (instr case-insensitive) uses brute force and still has a limitation: the pattern should have 4 bytes, not 3, and not 88. However, that's minor polishing. Here are some results - see attachment for code:

----------------------
9441 average crt_strstr res=4217403
7332 average finstr     res=2619 (should be 2620)
5805 average InStringL  res=4217403
9760 average InstrCi    res=4217403
1942446 average StrStrI res=4217403

So it is far from Hutch & Lingo performance but can compete with msvcrt strstr. Interesting the performance of the shlwapi StrStrI, the only function offered by the OS - it's a factor 200 slower. The other code was taken from a 22 Nov 2006 post; finstr has a minor bug, it returns a position one too low (e.g. 0 if the pattern is already found in the first byte of the source).

Now I wonder, of course, what our champions here could tickle out of this code...
One more point: I use a simple OR 20h for getting rid of the case; there are two variants here:
1. strictly case-insensitive:  mov edi, 20202020h   ; lowercase bit mask for finding "TEST" in "This is a Test"
2. intelligently case-insensitive: mov edi, 20h      ; lowercase bit mask for finding "test" and "Test" but not "TEST" in "This is a Test"

InstrCi proc uses edi lpSource:DWORD, lpPattern:DWORD
  mov eax, lpSource
  mov edx, lpPattern
  mov edx, [edx]
; mov edi, 20202020h ; lowercase bit mask for finding "TEST" in "This is a Test"
  mov edi, 20h ; lowercase bit mask for finding "test" and "Test" but not "TEST" in "This is a Test"
  or edx, edi ; make pattern lowercase
@@: mov ecx, [eax]
test cl, cl ; source end?
jz @F
or ecx, edi ; make source lowercase
add eax, 1 ; inc eax would cost around 5%
cmp ecx, edx
jne @B
  dec eax
  ret
@@:
  xor eax, eax
  ret
InstrCi endp

[attachment deleted by admin]

hutch--

jj,

It looks like a good short algo which be useful in almost all applications of case insensitive searches. If you need an extended version I would be inclined to try a table based version with the option of providing your own table to suit different languages that have different characters.

One other comment, for doing repeat searches for text, its useful to be able to specify the end of the last result so you can just re-enter the algo and add that result to the start address of the source for the next search. You can just add the last search result to the start address yourself as the caller but its tidier to have the algo do it.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on June 16, 2008, 12:03:35 AM
It looks like a good short algo which be useful in almost all applications of case insensitive searches. If you need an extended version I would be inclined to try a table based version with the option of providing your own table to suit different languages that have different characters.
For searching non-English literature, yes. For tasks related to IT and programming, the simple or 32 algo should do, I guess.

QuoteOne other comment, for doing repeat searches for text, its useful to be able to specify the end of the last result so you can just re-enter the algo and add that result to the start address of the source for the next search. You can just add the last search result to the start address yourself as the caller but its tidier to have the algo do it.
I had taken that option out for making the benchmarks easier to understand. Attached the full version; it has grown in size but speed is still on par with crt_strstr:

----------------------
12414 average crt_strstr        res=4217403
10789 average finstr    res=2619
6538 average InStringL  res=4217403
12365 average InstrCi   res=4217403
1520117 average StrStrI res=4217403
----------------------

InstrCi knows three search modes, passed with the fourth para:

StartPos equ 1

1.   invoke InstrCi, StartPos, addr Mainstr, addr SubstrxCi, 0
   start at pos 1, search case-sensitive for SubstrxCi in Mainstr

2.   invoke InstrCi, StartPos, addr Mainstr, addr SubstrxCi, 1
   start at pos 1, search case-insensitive for SubstrxCi in Mainstr

3.   invoke InstrCi, StartPos, addr Mainstr, addr SubstrxCi, 2
   start at pos 1, search for SubstrxCi in Mainstr, ignore case of first character

The first para indicates the start position. Use 1 for a standard search.
If this para is 0, then InstrCi continues directly after the last match.

[attachment deleted by admin]

hutch--

jj,

I have had a good play with your example and it looks like your algo has the legs on the VCRT version by a very long way. I added a boyer moore search algo to the test bed just to show you the speed difference with an algo of this type. They get faster the longer the pattern is but they are usually slower on strings under about 6 or 7 chars long.

One of our members IanB did an optimisation on a BM design that made it a lot faster as well but I don't readily have a copy of Ian's algo.

With your algo there are some reasonable optimisations to do to get it faster. Replace the REPE CMPSB with incremented pointer code, try and aviod code like the following,


mov cx, [eax]
test cl, cl


Try to aviod following partial register read and writes, one of the exceptions is MOVZX to a 32 bit reg followed by a CMP AL, ???


These are the times I am getting with this attached test piece.


20342 timing crt_strstr
13686 timing finstr
9491 timing InStringL
21796 timing InstrCi
1862545 timing StrStrI
2307 timing BM

20940 timing crt_strstr
13475 timing finstr
10670 timing InStringL
21811 timing InstrCi
1863999 timing StrStrI
2327 timing BM

21103 timing crt_strstr
14070 timing finstr
11019 timing InStringL
21925 timing InstrCi
1797823 timing StrStrI
2354 timing BM

20406 timing crt_strstr
14026 timing finstr
10766 timing InStringL
21892 timing InstrCi
1791850 timing StrStrI
2344 timing BM

20554 timing crt_strstr
13678 timing finstr
10507 timing InStringL
21800 timing InstrCi
1797879 timing StrStrI
2310 timing BM

20038 timing crt_strstr
13881 timing finstr
10708 timing InStringL
21916 timing InstrCi
1820599 timing StrStrI
2353 timing BM

20139 timing crt_strstr
13461 timing finstr
10493 timing InStringL
21789 timing InstrCi
1811869 timing StrStrI
2300 timing BM

20382 timing crt_strstr
13860 timing finstr
10883 timing InStringL
22386 timing InstrCi
1800658 timing StrStrI
2368 timing BM

----------------------
20488 average crt_strstr        res=4221479
13767 average finstr    res=2597
10567 average InStringL res=4221479
21914 average InstrCi   res=4221479
1818402 average StrStrI res=4221479
2332 average BM res=2597
----------------------
Press any key to continue ...

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Hutch, thanks a lot for testing this. Really appreciated!

Quote from: hutch-- on June 16, 2008, 02:57:14 PM
(BM is) usually slower on strings under about 6 or 7 chars long.
In my experience, most real life searches are 2-5 characters...

QuoteWith your algo there are some reasonable optimisations to do to get it faster. Replace the REPE CMPSB
I know it's slow, but I use it only in an outer loop, so it doesn't affect speed.


mov cx, [eax]
test cl, cl


So what's the faster alternative here? Mov doesn't set the flag.

@@: mov ecx, [eax]
test cl, cl ; end of source string?
jz SrcEnd
or ecx, ebx ; make source lowercase
add eax, 1 ; (inc eax would cost around 5%)
cmp ecx, edx ; compare 4 bytes of source with pattern (faster than 2 bytes cmp)
jne @B


Quote
These are the times I am getting with this attached test piece.
No attachment in your post, sorry...

Minor change in my code:

InstrCi proc
  mov eax, lpSource
  .if StartPos         ; e.g. invoke InstrCi, 1, addr Mainstr, addr SubstrxCi, 0 means
   push StartPos   ; start at pos1, find Substr in Mainstr, case-sensitive
   pop CurPos      ; invoke InstrCi, 0, ... means continue with Current Position
   dec CurPos      ; 1-based index, so if 1 is passed, add nothing to eax
  .endif
  add eax, CurPos      ; start with current position or StartPos

It would not find the match if it was at position 0 of the source.

hutch--

Sorry about that, attached now.



mov cx, [eax]
test cl, cl

Becomes

mov ecx, [eax]
test ecx, ecx


This code is problematic as it uses two partial registers which must be obtained internally by masking.


  .if cl && ch ; pattern length >=4?


This code,


@@: lodsb ; mov al, [esi], esi++

replace with commented code or

movzx eax, BYTE PTR [esi]
add esi, 1
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on June 16, 2008, 03:36:19 PM

mov cx, [eax]
test cl, cl


Becomes


mov ecx, [eax]
test ecx, ecx


Replacing the mov cx with mov ecx does indeed increase the speed significantly, see new attachment. But I have to stick with the test cl, cl because only the 4th character will be the source's null terminator.


[attachment deleted by admin]

jj2007

Hutch,
Thanks for this. BM speeds are indeed impressive; when I reduce, however, the pattern to "wings",
I get these timings:
----------------------
12114 average crt_strstr        res=4221501
11082 average finstr    res=2619
6694 average InStringL  res=4221501
12296 average InstrCi   res=4221501
1534310 average StrStrI res=4221501
10121 average BM        res=2619

One minor point:
    mov slen, len(OFFSET Mainstr)
    mov plen, len(OFFSET Substrx)

I guess the BM would not suffer too much from taking these off - in real life, you don't know the size beforehand.

Is it difficult to make BM case-insensitive?

hutch--

jj,

> when I reduce, however, the pattern to "wings",

This is consistent with the design, the step size is not big enough to counter the extra overhead.

> Is it difficult to make BM case-insensitive?

I think it can be done with an extra table but it will be no joy to code. A long time ago I saw one written by Jeremy Collake but I don't know where it is any longer, it was about 9 years ago.

Re the lengths, the version of the algo was designed to do binary search as well but I have seen variants that read until the terminator in both source and pattern. I think there is an ancient Microsoft one in an early version of MASM for 32 bit winnt 3.5.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on June 16, 2008, 04:11:04 PM
I have seen variants that read until the terminator in both source and pattern

Not a big issue anyway. Len(pattern) is short and fast, while Len(source) might often be known through other sources, e.g. if you load a file for searching into memory then you know the file size.

Do you have any way to get statistics on the average length of search words here in the forum?

jj2007

Hutch,
I have refined the prog a little bit, allowing to compare long and short patterns.
It needs a minimum pattern of 2 bytes but has no further limitations afaik.
Here is one result - note that all InstrCi versions perform a
case-insensitive search, so they must be a bit slower:
----------------------
9600 average crt_strstr res=4229691
7341 average finstr     res=2619
5649 average InStringL  res=4229691
8443 average InstrCi 1s res=4229691
8643 average InstrCi 2s res=4229691
8663 average InstrCi 1L res=4229691
8612 average InstrCi 2L res=4229691
8225 avg Boyer-Moore s  res=2619
968 avg Boyer-Moore L   res=147

This is quite unexpected, so here is one that looks more "expected":
----------------------
9490 average crt_strstr res=4229691
7346 average finstr     res=2619
5786 average InStringL  res=4229691
8405 average InstrCi 1s res=4229691
8473 average InstrCi 2s res=4229691
8665 average InstrCi 1L res=4229691
8649 average InstrCi 2L res=4229691
8147 avg Boyer-Moore s  res=2619
8470 avg Boyer-Moore L  res=2619

Version A looks for "and whether pigs have wings and fly like elephants in the sky"
Version B looks for "wings and fly like elephants in the sky"
Not sure whether it's my fault, or whether the BM code has a little bug...


[attachment deleted by admin]

jj2007

More on the Boyer-Moore problem (see attachment above):
end of MainStr:
   db "and whether pigs have wings and fly like elephants in the sky...", 0

db "ave wings and fly like elephants in the sky", 0   ; yields fast but wrong result, 43 characters plus null
db "ve wings and fly like elephants in the sky", 0   ; yields correct result, 42 characters plus null

jj2007

New algo, slightly faster and shorter. Speedwise it is now on par with finstr, sizewise still very compact, see below (I could not calculate sizes for the external strstr and StrStrI functions).
The problem with the BM algo persists, see source.
In any case, if Lingo succeeds to make his search case-insensitive without sacrificing too much on speed, we might have a clear winner, at least for short patterns  :wink

----------------------
12031 average crt_strstr        res=4229691
11006 average finstr    res=2619
6615 average InStringL  res=4229691
10907 average InstrCi 1s        res=4229691
10563 average InstrCi 2s        res=4229691
11007 average InstrCi 1L        res=4229691
10744 average InstrCi 2L        res=4229691
10805 * average InstrCi
201 diff finstr-InstrCi (neg = finstr is faster)
9751 avg Boyer-Moore s  res=2619
7768 avg Boyer-Moore L  res=2616
1548024 average StrStrI res=4229691

Sizes:
596 finstr
305 InStringL
150 InstrCi
199 Boyer-Moore
----------------------

[attachment deleted by admin]

hutch--

jj,

I think you can do a reasonably fast version using a table depending on the design you use to do it. Once the table is in cache you don't keep getting the same penalty as first time. Even as a classic byte scanner if you convert the first character in the pattern to a single case that matches the table you get a memory read for the character and one from the table but it should only be marginally slower than the case sensitive scanner.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on June 17, 2008, 10:51:21 AM
I think you can do a reasonably fast version using a table depending on the design you use to do it. Once the table is in cache you don't keep getting the same penalty as first time. Even as a classic byte scanner if you convert the first character in the pattern to a single case that matches the table you get a memory read for the character and one from the table but it should only be marginally slower than the case sensitive scanner.

Sure it could be done, but why? InstrCi is already a little bit faster than finstr, and a lot faster than the Masm32 HLL istring - comparing apples and oranges because the latter perform only a case-sensitive search. If there is a true need for doing case-insensitive search on Chinese text, then the correct way is to copy the whole source to a buffer, call CharLower, do the same with the pattern and run Lingo's super fast algo over it; same with BM for long patterns.

----------------------
12176 average crt_strstr        res=4233
10966 average finstr    res=2619
7619 average InStringL  res=4233787
10708 average InstrCi 1s        res=4233
10785 average InstrCi 2s        res=4233
10757 average InstrCi 1L        res=4233
10935 average InstrCi 2L        res=4233
10796 * average InstrCi
170 diff finstr-InstrCi (neg = finstr is faster)
10247 avg Boyer-Moore s res=2619
7954 avg Boyer-Moore L  res=2616
15156 Hll istring       res=2620
1501106 average StrStrI res=4233787

Sizes:
596 finstr
305 InStringL
150 InstrCi
199 Boyer-Moore
----------------------