I needed a case-insensitive string search and didn't find anything usable provided by the OS, so I got inspired by the old InStringL/finstr (http://www.masm32.com/board/index.php?topic=6181.15) etc posts.
My new function InstrCi (instr case-insensitive) uses brute force and still has a limitation: the pattern should have 4 bytes, not 3, and not 88. However, that's minor polishing. Here are some results - see attachment for code:
----------------------
9441 average crt_strstr res=4217403
7332 average finstr res=2619 (should be 2620)
5805 average InStringL res=4217403
9760 average InstrCi res=4217403
1942446 average StrStrI res=4217403
So it is far from Hutch & Lingo performance but can compete with msvcrt strstr. Interesting the performance of the shlwapi StrStrI, the only function offered by the OS - it's a factor 200 slower. The other code was taken from a 22 Nov 2006 post; finstr has a minor bug, it returns a position one too low (e.g. 0 if the pattern is already found in the first byte of the source).
Now I wonder, of course, what our champions here could tickle out of this code...
One more point: I use a simple OR 20h for getting rid of the case; there are two variants here:
1. strictly case-insensitive: mov edi, 20202020h ; lowercase bit mask for finding "TEST" in "This is a Test"
2. intelligently case-insensitive: mov edi, 20h ; lowercase bit mask for finding "test" and "Test" but not "TEST" in "This is a Test"
InstrCi proc uses edi lpSource:DWORD, lpPattern:DWORD
mov eax, lpSource
mov edx, lpPattern
mov edx, [edx]
; mov edi, 20202020h ; lowercase bit mask for finding "TEST" in "This is a Test"
mov edi, 20h ; lowercase bit mask for finding "test" and "Test" but not "TEST" in "This is a Test"
or edx, edi ; make pattern lowercase
@@: mov ecx, [eax]
test cl, cl ; source end?
jz @F
or ecx, edi ; make source lowercase
add eax, 1 ; inc eax would cost around 5%
cmp ecx, edx
jne @B
dec eax
ret
@@:
xor eax, eax
ret
InstrCi endp
[attachment deleted by admin]
jj,
It looks like a good short algo which be useful in almost all applications of case insensitive searches. If you need an extended version I would be inclined to try a table based version with the option of providing your own table to suit different languages that have different characters.
One other comment, for doing repeat searches for text, its useful to be able to specify the end of the last result so you can just re-enter the algo and add that result to the start address of the source for the next search. You can just add the last search result to the start address yourself as the caller but its tidier to have the algo do it.
Quote from: hutch-- on June 16, 2008, 12:03:35 AM
It looks like a good short algo which be useful in almost all applications of case insensitive searches. If you need an extended version I would be inclined to try a table based version with the option of providing your own table to suit different languages that have different characters.
For searching non-English literature, yes. For tasks related to IT and programming, the simple
or 32 algo should do, I guess.
QuoteOne other comment, for doing repeat searches for text, its useful to be able to specify the end of the last result so you can just re-enter the algo and add that result to the start address of the source for the next search. You can just add the last search result to the start address yourself as the caller but its tidier to have the algo do it.
I had taken that option out for making the benchmarks easier to understand. Attached the full version; it has grown in size but speed is still on par with crt_strstr:
----------------------
12414 average crt_strstr res=4217403
10789 average finstr res=2619
6538 average InStringL res=4217403
12365 average InstrCi res=4217403
1520117 average StrStrI res=4217403
----------------------
Instr
Ci knows three search modes, passed with the fourth para:
StartPos equ 1
1. invoke InstrCi, StartPos, addr Mainstr, addr SubstrxCi,
0 start at pos 1, search case-
sensitive for SubstrxCi in Mainstr
2. invoke InstrCi, StartPos, addr Mainstr, addr SubstrxCi,
1 start at pos 1, search case-
insensitive for SubstrxCi in Mainstr
3. invoke InstrCi, StartPos, addr Mainstr, addr SubstrxCi,
2 start at pos 1, search for SubstrxCi in Mainstr,
ignore case of first characterThe first para indicates the start position. Use 1 for a standard search.
If this para is 0, then InstrCi continues directly after the last match.
[attachment deleted by admin]
jj,
I have had a good play with your example and it looks like your algo has the legs on the VCRT version by a very long way. I added a boyer moore search algo to the test bed just to show you the speed difference with an algo of this type. They get faster the longer the pattern is but they are usually slower on strings under about 6 or 7 chars long.
One of our members IanB did an optimisation on a BM design that made it a lot faster as well but I don't readily have a copy of Ian's algo.
With your algo there are some reasonable optimisations to do to get it faster. Replace the REPE CMPSB with incremented pointer code, try and aviod code like the following,
mov cx, [eax]
test cl, cl
Try to aviod following partial register read and writes, one of the exceptions is MOVZX to a 32 bit reg followed by a CMP AL, ???
These are the times I am getting with this attached test piece.
20342 timing crt_strstr
13686 timing finstr
9491 timing InStringL
21796 timing InstrCi
1862545 timing StrStrI
2307 timing BM
20940 timing crt_strstr
13475 timing finstr
10670 timing InStringL
21811 timing InstrCi
1863999 timing StrStrI
2327 timing BM
21103 timing crt_strstr
14070 timing finstr
11019 timing InStringL
21925 timing InstrCi
1797823 timing StrStrI
2354 timing BM
20406 timing crt_strstr
14026 timing finstr
10766 timing InStringL
21892 timing InstrCi
1791850 timing StrStrI
2344 timing BM
20554 timing crt_strstr
13678 timing finstr
10507 timing InStringL
21800 timing InstrCi
1797879 timing StrStrI
2310 timing BM
20038 timing crt_strstr
13881 timing finstr
10708 timing InStringL
21916 timing InstrCi
1820599 timing StrStrI
2353 timing BM
20139 timing crt_strstr
13461 timing finstr
10493 timing InStringL
21789 timing InstrCi
1811869 timing StrStrI
2300 timing BM
20382 timing crt_strstr
13860 timing finstr
10883 timing InStringL
22386 timing InstrCi
1800658 timing StrStrI
2368 timing BM
----------------------
20488 average crt_strstr res=4221479
13767 average finstr res=2597
10567 average InStringL res=4221479
21914 average InstrCi res=4221479
1818402 average StrStrI res=4221479
2332 average BM res=2597
----------------------
Press any key to continue ...
[attachment deleted by admin]
Hutch, thanks a lot for testing this. Really appreciated!
Quote from: hutch-- on June 16, 2008, 02:57:14 PM
(BM is) usually slower on strings under about 6 or 7 chars long.
In my experience, most real life searches are 2-5 characters...
QuoteWith your algo there are some reasonable optimisations to do to get it faster. Replace the REPE CMPSB
I know it's slow, but I use it only in an outer loop, so it doesn't affect speed.
mov cx, [eax]
test cl, cl
So what's the faster alternative here? Mov doesn't set the flag.
@@: mov ecx, [eax]
test cl, cl ; end of source string?
jz SrcEnd
or ecx, ebx ; make source lowercase
add eax, 1 ; (inc eax would cost around 5%)
cmp ecx, edx ; compare 4 bytes of source with pattern (faster than 2 bytes cmp)
jne @B
Quote
These are the times I am getting with this attached test piece.
No attachment in your post, sorry...
Minor change in my code:
InstrCi proc
mov eax, lpSource
.if StartPos ; e.g. invoke InstrCi, 1, addr Mainstr, addr SubstrxCi, 0 means
push StartPos ; start at pos1, find Substr in Mainstr, case-sensitive
pop CurPos ; invoke InstrCi, 0, ... means continue with Current Position
dec CurPos ; 1-based index, so if 1 is passed, add nothing to eax .endif
add eax, CurPos ; start with current position or StartPos
It would not find the match if it was at position 0 of the source.
Sorry about that, attached now.
mov cx, [eax]
test cl, cl
Becomes
mov ecx, [eax]
test ecx, ecx
This code is problematic as it uses two partial registers which must be obtained internally by masking.
.if cl && ch ; pattern length >=4?
This code,
@@: lodsb ; mov al, [esi], esi++
replace with commented code or
movzx eax, BYTE PTR [esi]
add esi, 1
Quote from: hutch-- on June 16, 2008, 03:36:19 PM
mov cx, [eax]
test cl, cl
Becomes
mov ecx, [eax]
test ecx, ecx
Replacing the mov cx with mov ecx does indeed increase the speed significantly, see new attachment. But I have to stick with the test cl, cl because only the 4th character will be the source's null terminator.
[attachment deleted by admin]
Hutch,
Thanks for this. BM speeds are indeed impressive; when I reduce, however, the pattern to "wings",
I get these timings:
----------------------
12114 average crt_strstr res=4221501
11082 average finstr res=2619
6694 average InStringL res=4221501
12296 average InstrCi res=4221501
1534310 average StrStrI res=4221501
10121 average BM res=2619
One minor point:
mov slen, len(OFFSET Mainstr)
mov plen, len(OFFSET Substrx)
I guess the BM would not suffer too much from taking these off - in real life, you don't know the size beforehand.
Is it difficult to make BM case-insensitive?
jj,
> when I reduce, however, the pattern to "wings",
This is consistent with the design, the step size is not big enough to counter the extra overhead.
> Is it difficult to make BM case-insensitive?
I think it can be done with an extra table but it will be no joy to code. A long time ago I saw one written by Jeremy Collake but I don't know where it is any longer, it was about 9 years ago.
Re the lengths, the version of the algo was designed to do binary search as well but I have seen variants that read until the terminator in both source and pattern. I think there is an ancient Microsoft one in an early version of MASM for 32 bit winnt 3.5.
Quote from: hutch-- on June 16, 2008, 04:11:04 PM
I have seen variants that read until the terminator in both source and pattern
Not a big issue anyway. Len(pattern) is short and fast, while Len(source) might often be known through other sources, e.g. if you load a file for searching into memory then you know the file size.
Do you have any way to get statistics on the average length of search words here in the forum?
Hutch,
I have refined the prog a little bit, allowing to compare long and short patterns.
It needs a minimum pattern of 2 bytes but has no further limitations afaik.
Here is one result - note that all InstrCi versions perform a
case-insensitive search, so they must be a bit slower:
----------------------
9600 average crt_strstr res=4229691
7341 average finstr res=2619
5649 average InStringL res=4229691
8443 average InstrCi 1s res=4229691
8643 average InstrCi 2s res=4229691
8663 average InstrCi 1L res=4229691
8612 average InstrCi 2L res=4229691
8225 avg Boyer-Moore s res=2619
968 avg Boyer-Moore L res=147
This is quite unexpected, so here is one that looks more "expected":
----------------------
9490 average crt_strstr res=4229691
7346 average finstr res=2619
5786 average InStringL res=4229691
8405 average InstrCi 1s res=4229691
8473 average InstrCi 2s res=4229691
8665 average InstrCi 1L res=4229691
8649 average InstrCi 2L res=4229691
8147 avg Boyer-Moore s res=2619
8470 avg Boyer-Moore L res=2619
Version A looks for "and whether pigs have wings and fly like elephants in the sky"
Version B looks for "wings and fly like elephants in the sky"
Not sure whether it's my fault, or whether the BM code has a little bug...
[attachment deleted by admin]
More on the Boyer-Moore problem (see attachment above):
end of MainStr:
db "and whether pigs have wings and fly like elephants in the sky...", 0
db "ave wings and fly like elephants in the sky", 0 ; yields fast but wrong result, 43 characters plus null
db "ve wings and fly like elephants in the sky", 0 ; yields correct result, 42 characters plus null
New algo, slightly faster and shorter. Speedwise it is now on par with finstr, sizewise still very compact, see below (I could not calculate sizes for the external strstr and StrStrI functions).
The problem with the BM algo persists, see source.
In any case, if Lingo succeeds to make his search case-insensitive without sacrificing too much on speed, we might have a clear winner, at least for short patterns :wink
----------------------
12031 average crt_strstr res=4229691
11006 average finstr res=2619
6615 average InStringL res=4229691
10907 average InstrCi 1s res=4229691
10563 average InstrCi 2s res=4229691
11007 average InstrCi 1L res=4229691
10744 average InstrCi 2L res=4229691
10805 * average InstrCi
201 diff finstr-InstrCi (neg = finstr is faster)
9751 avg Boyer-Moore s res=2619
7768 avg Boyer-Moore L res=2616
1548024 average StrStrI res=4229691
Sizes:
596 finstr
305 InStringL
150 InstrCi
199 Boyer-Moore
----------------------
[attachment deleted by admin]
jj,
I think you can do a reasonably fast version using a table depending on the design you use to do it. Once the table is in cache you don't keep getting the same penalty as first time. Even as a classic byte scanner if you convert the first character in the pattern to a single case that matches the table you get a memory read for the character and one from the table but it should only be marginally slower than the case sensitive scanner.
Quote from: hutch-- on June 17, 2008, 10:51:21 AM
I think you can do a reasonably fast version using a table depending on the design you use to do it. Once the table is in cache you don't keep getting the same penalty as first time. Even as a classic byte scanner if you convert the first character in the pattern to a single case that matches the table you get a memory read for the character and one from the table but it should only be marginally slower than the case sensitive scanner.
Sure it could be done, but why? InstrCi is already a little bit faster than finstr, and a lot faster than the Masm32 HLL istring - comparing apples and oranges because the latter perform only a case-sensitive search. If there is a true need for doing case-insensitive search on Chinese text, then the correct way is to copy the whole source to a buffer, call
CharLower, do the same with the pattern and run Lingo's super fast algo over it; same with BM for long patterns.
----------------------
12176 average crt_strstr res=4233
10966 average finstr res=2619
7619 average InStringL res=4233787
10708 average InstrCi 1s res=4233
10785 average InstrCi 2s res=4233
10757 average InstrCi 1L res=4233
10935 average InstrCi 2L res=4233
10796 * average InstrCi
170 diff finstr-InstrCi (neg = finstr is faster)
10247 avg Boyer-Moore s res=2619
7954 avg Boyer-Moore L res=2616
15156 Hll istring res=2620
1501106 average StrStrI res=4233787
Sizes:
596 finstr
305 InStringL
150 InstrCi199 Boyer-Moore
----------------------
> InstrCi is already a little bit faster than finstr
To my embarassment, this is not true for my other machine, a P4 2.4 GHz. But here InstrCi has an edge over the Boyer-Moore algo, at least for short patterns. Any more benchmarks?
----------------------
9508 average crt_strstr res=4233787
7381 average finstr res=2619
5539 average InStringL res=4233787
7499 average InstrCi 1s res=4233787
7462 average InstrCi 2s res=4233787
7691 average InstrCi 1L res=4233787
7683 average InstrCi 2L res=4233787
7583 * average InstrCi
-202 diff finstr-InstrCi (neg = finstr is faster)
798 diff BM s - InstrCi (neg = BM is faster)
8381 avg Boyer-Moore s res=2619
7409 avg Boyer-Moore L res=2616
11857 Hll istring res=2620
1945881 average StrStrI res=4233787
Sizes:
596 finstr
305 InStringL
187 InstrCi
199 Boyer-Moore
----------------------
[attachment deleted by admin]
Hutch, I'm honoured you remember my optimisation experiments. :red
Here is the original test piece I put together. I'm sure other people (Lingo) have better versions. My starting point was the existing MASM library, especially Hutch's Boyer-Moore which I thought I could see scope for improvement in. There was a different Boyer-Moore design floating around by Cresta that was extremely poorly optimised that looked as if it had potential, especially once an existing bug had been removed. I also wanted to confirm that the speed of the Boyer-Moore designs can be varied by the size of the shift-table used. For a 255-byte table, you can search strings up to 254 chars in length, or for 65535 for strings 65534 in length, etc. Obviously the most efficient searching takes place with smaller tables, because they take less time setting up, and because the algorithm has to do some memory operations using [ECX*n] modifiers, and if you can avoid using the *4 modifier needed for the 4Gb stringlength search table (256 DWORDs in table) you get less latency on those ops. For max efficiency you just need to select the version that handles strings at least as long as the one you want to search for.
I wanted to see how the variations compared not just on classic random text, their prime use for most people I guess, but also on some synthetic data. The reason was that Hutch's original complaint about my reducing the table to 255 chars IIRC was that the BMs were good for doing viral pattern matching with extremely large search lengths, which obviously isn't the same as plain text searching. That's why the search strings are a little odd. There's a random text search at the end of the test suite, though, but really it should be done with a much larger text sample for better results.
There's some comments in the file about the results on my P4 Northwood. The results I got indicated the Cresta rewrites are definitely the best overall, due to their efficient DWORD test approach, and they might be a good basis for further optimisation if anyone's interested. They are also easily altered to produce compact dedicated search routines for specific text where you don't need a general algo, and you can add the search table in as pre-calculated data that avoids the setup overhead, useful if you need the same few searches over and over in an app.
Regarding case-insensitive searching, if I was using this code to create such a search I'd add a setup section to the algo in which both the search string and the text to search were capitalised before running the search using your OR 20202020h method. That way you only do the OR once for each byte in the string, rather than speculatively over and over in the algo.
Note my preference for more efficient functions using params passed in registers over stack params and frames. It also allows easier use of EBP as a spare reg. Sorry if that annoys people. You can recode to your preference if you like. :P
[attachment deleted by admin]
Update on Instring:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Average cycle count:
5035 InstrJJ (jj2007)
4097 BMLinDB (Lingo)
16176 MasmLib InstringL
InstrJJ : InstringL = 31 %
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Average cycle counts:
2476 InstrJJ (jj2007)
2990 BMLinDB (Lingo)
6773 BM (Boyer-Moore Hutch)
10158 InString (Masm32 Lib)
InstrJJ : InString = 24 %
Code sizes:
InstrJJ = 388
BM = 199
BMLinDB = 395
InStringL = 305
Requires SSE2. I would be interested to see some AMD and Dual Core timings :bg
EDIT: Shaved off 7 bytes and >50 cycles.
EDIT(2): Minor bugfix in the one-byte pattern section, 2 bytes less (they caused the bug :red).
EDIT(3): Attachment removed - will be back when certain oddities are fixed.
EDIT(4): Oddities are fixed, attachment is back, but code is longer now :bg
EDIT(5): Could not resist to add the find "Duplicate inc" in Windows.inc test :wink
I have tried to test quite a number of odd mainstring/pattern combinations, and it seems to be foolproof now, but there might still be surprises (also for other included algos). Use at your own risk... and please give me feedback if it bounces.
[attachment deleted by admin]
Latest result:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJ: 38 ; clocks: 104
InString - Lingo: 37 ; clocks: 39
Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJ: 1008 ; clocks: 23455
InString - Lingo: 1007 ; clocks: 6395
Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 686
InString - Lingo: 1008 ; clocks: 502
Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 4326
InString - Lingo: 1008 ; clocks: 1417
Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 3367
InString - Lingo: 1008 ; clocks: 1308
Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 641
InString - Lingo: 1008 ; clocks: 498
Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJ: 1010 ; clocks: 621
InString - Lingo: 1009 ; clocks: 502
Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJ: 1002 ; clocks: 324
InString - Lingo: 1001 ; clocks: 102
Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJ: 1002 ; clocks: 629
InString - Lingo: 1001 ; clocks: 470
Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJ: 1002 ; clocks: 638
InString - Lingo: 1001 ; clocks: 448
Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJ: 1002 ; clocks: 643
InString - Lingo: 1001 ; clocks: 504
Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJ: 1002 ; clocks: 787
InString - Lingo: 1001 ; clocks: 639
Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJ: 1127625 ; clocks: 679641
InString - Lingo: 1127624 ; clocks: 542808
Press ENTER to exit...
what makes it faster ?
(where's the beef?)
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
...
Average cycle count:
2955 InstrJJ (jj2007)
3891 BMLinDB (Lingo)
7274 Boyer-Moore
13815 MasmLib InstringL
InstrJJ : InstringL = 21 %
Code size InstrJJ=366
Quote from: dedndave on May 07, 2009, 10:56:07 PM
what makes it faster ?
(where's the beef?)
Here is the inner loop (simplified):
L1: movdqa xmm1, [esi] ; load 16 bytes from current aligned address
movdqu xmm4, [esi+1] ; load another 16 bytes UNALIGNED - must be in the middle
movdqa xmm2, [esi] ; save 16 bytes for testing the zero delimiter - faster than movdqa xmm2, xmm1
Lu: lea esi, [esi+16] ; position in mainstring (moving up/down lea or add costs cycles)
pcmpeqw xmm1, xmm3 ; compare packed words in xmm1 and xmm3 for equality
pcmpeqw xmm4, xmm3 ; compare packed words in xmm4 and xmm3 for equality
pcmpeqb xmm2, xmm1 ; xmm1 is filled with either 0 or FF; if it's FF, the byte at that position cannot be zero
pmovmskb eax, xmm2 ; set byte mask in ecx for zero delimiter
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern word
pmovmskb ecx, xmm4 ; set byte mask in ecx for search pattern word
shl ecx, 1 ; adjust for esi+1 (add ecx, ecx is a lot slower)
test eax, eax ; zero byte found?
jnz @F ; check/clear ebp, then ChkNull
or edx, ecx ; one of them needs to have the word
je L1 ; 0=no pattern byte found, go back
The secret is in
pcmpeqw:
Compare packed words in xmm2/m128 and xmm1 for equality (http://www.intel.com/software/products/documentation/vlin/mergedprojects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc231.htm)
The SSE2 loop is very fast, much faster than all the rest, because it reads 16 bytes at a time. The art is
not to leave it too often. Traditional Instring implementations have a byte scanner as the inner loop: Take the first character of your search pattern, e.g.
d for dedndave, and loop until you find it; then check the rest, if it's
edndave, great, otherwise back to the inner loop. You can imagine that you have to leave the inner loop very often for e, n, s etc. - and that your algo gets slowed down drastically, in spite of the fast inner loop.
Now the trick here is to use a
word scanner:
de is a lot less frequent that a single
d - statistically speaking 256 times less frequent. The inner loop is a little bit slower, because you need some acrobatics (and one slow unaligned movedqu) to implement a word scanner, but it pays off in general. The only algo that comes close is Lingo's Boyer-Moore implementation (http://www.masm32.com/board/index.php?topic=3601.msg83512#msg83512) (and it is even faster on a P4).
very cool JJ - I have never used any of the SSE instructions
I am finding it hard to remember all the regular ones - lol
Not as young as I used to be, but I will get there
For my brain to remember, I have to use an instruction someplace at least once
Whatever happened with measuring code in "LAMPs"? :bg
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Cycle counts:
3224 BMLinDB, addr Mainstr, addr TestSubEx, (SIZEOF TestSubEx)-1
4408 BMLinDB, addr Mainstr, addr TestSubDx, (SIZEOF TestSubDx)-1
4386 BMLinDB, addr Mainstr, addr TestSubXx, (SIZEOF TestSubXx)-1
11922 InString, 1, addr Mainstr, addr TestSubE
13401 InString, 1, addr Mainstr, addr TestSubD
16522 InString, 1, addr Mainstr, addr TestSubX
1774 InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1770 InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1771 InstrJJ, 1, addr Mainstr, addr TestSubE, 0
1834 InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1823 InstrJJ, 1, addr Mainstr, addr TestSubD, 0
1835 InstrJJ, 1, addr Mainstr, addr TestSubD, 0
5656 InstrJJ, 1, addr Mainstr, addr TestSubX, 0
5609 InstrJJ, 1, addr Mainstr, addr TestSubX, 0
5633 InstrJJ, 1, addr Mainstr, addr TestSubX, 0
Average cycle count:
3078 InstrJJ (jj2007)
4006 BMLinDB (Lingo)
13948 MasmLib InstringL
InstrJJ : InstringL = 22 %
Code sizes:
InstrJJ = 359
BM = 199
BMLinDB = 395
InStringL = 305
never heard of LAMPS
we used KOPS
MOPS
FLOPS
kinda like "gallons per mile" - lol
they were really meaningless terms because noone could agree on what constituted one operation
i think each benchmark i saw used a slightly different definition
many benchmarks measured how fast they could do a bunch of NOP's
the real-world standard was how fast you could re-calculate some pre-defined spread-sheet with Lotus 123 - lol
Quote from: Mark Jones on May 08, 2009, 09:44:37 PM
Whatever happened with measuring code in "LAMPs"? :bg
Right now I am a bit confused with my macros :dazzled:
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
3078 InstrJJ (jj2007)
4006 BMLinDB (Lingo)
13948 MasmLib InstringL
InstrJJ : InstringL = 22 %
Code sizes:
InstrJJ = 359
BM = 199
BMLinDB = 395
InStringL = 305
I love it, thanks for the timings :bg
Anyway, it seems even LAMP (http://www.masm32.com/board/index.php?topic=9437.msg68698#msg68698)-wise the algo is competitive - SSE2 rocks :toothy
I had to post a bugfix, see above. The algo reported wrong values when a one-byte pattern was found behind the main strings zero delimiter.
Welcome. Keep in mind that Lingo optimizes for his hardware, so his timings are still valid. In fact, AMD has been continually losing market share (http://www.xbitlabs.com/news/cpu/display/20080407081556_AMD_Continues_to_Lose_Market_Share_to_Intel_Analysts.html) for a long time now, and AMD is in the vast majority. Someday, there may not be any more AMD chips (http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~131177,00.html) to worry about timing. ::)
Curious, why did you choose CLOCKS * SQRT(SIZE) for the LAMP rating?
Quote from: Mark Jones on May 09, 2009, 12:07:01 AM
Curious, why did you choose CLOCKS * SQRT(SIZE) for the LAMP rating?
There is no "exact science" behind this formula, but it reflects a bit the philosophy of many who come to this forum: Lean and mean performance, i.e. code that is small
and fast - the equivalent to WIN32_LEAN_AND_MEAN. Of course, nobody would choose code that is half in size but twice as slow, but on the other hand many here would hesitate to put into their libraries code that is 2% faster but twice as big. So I chose the square root of size, in order to reduce its influence.
In the end, you will often opt for the faster code, but the LAMP's are an incentive to care also for size.
:bg
MinLampSize = 200
LAMP$ MACRO cycles:REQ, csize:REQ ; Usage: print LAMP$(cycles, size), " LAMPs", 13, 10
push ecx
push cycles
ffree ST(7) ;; free the stack for pushing
mov ecx, csize
.if ecx==0
add ecx, MinLampSize
.endif
push ecx
fild dword ptr [esp] ;; move into ST (0)
pop ecx
fsqrt
push cycles
fild dword ptr [esp] ;; push cycles on FPU
fmul ST, ST(1) ;; multiply with sqrt(size)
fistp dword ptr [esp]
pop eax
ifndef LampsBuffer
.data?
LampsCycles dd ?
LampsBuffer dd 80 dup (?)
.code
endif
shr eax, 10 ;; kiloLAMPS
invoke dwtoa, eax, addr LampsBuffer
pop eax
shr eax, 10
mov LampsCycles, eax
pop ecx
EXITM <offset LampsBuffer>
ENDM
Usage:
print "getlines:", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke getlines, Win$, cbFile
mov esi, eax
counter_end
print LAMP$(eax, offset getlinesE-getlines), " kiloLAMPs"
print str$(LampsCycles), " kilocycles"
Bugfix for the InstrJJ SSE2 version posted, see attachment above (http://www.masm32.com/board/index.php?topic=9370.msg84533#msg84533). Size increased a bit, speed is still the same:
Average cycle counts:
2485 InstrJJ (jj2007)
2990 BMLinDB (Lingo)
10139 InString (Masm32 Lib)
InstrJJ : InString = 24 %
Apologies to those who downloaded the previous versions.
Hi everybody,
I have switched from SSE2 word scanner to byte scanner with improved non-SSE loops, it is now a few cycles slower on a P4 but I have a suspicion that it rocks on Core & AMD.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Average cycle counts:
4280 InstrJJ (jj2007)
16486 InString (Masm32 Lib)
Can I have some timings, please? Thanks.
[attachment deleted by admin]
c:\china>\masm32\bin\ml.exe /c /coff InstrCi11_BL_BSF.asm
Microsoft (R) Macro Assembler Version 6.15.8803
Patched for you by promethee [ECL] in the year 2001 - enjoy
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
Assembling: InstrCi11_BL_BSF.asm
c:\china>\masm32\bin\link.exe /subsystem:console InstrCi11_BL
_BSF.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
c:\china>InstrCi11_BL_BSF.exe
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
1748 InstrJJ (jj2007)
2817 BMLinDB (Lingo)
10305 InString (Masm32 Lib)
InstrJJ : InString = 16 %
Code sizes:
InstrJJ = 319
BMLinDB = 395
InStringL = 305
c:\china>
> InstrJJ : InString = 16 %
Thanks, UtillMasm. One third faster than the previous version, not too bad :8)
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
Test correctness for InstrJJ: OK
Average cycle counts:
1148 InstrJJ (jj2007)
2725 BMLinDB (Lingo)
9355 InString (Masm32 Lib)
InstrJJ : InString = 12 %
Code sizes:
InstrJJ = 319
BMLinDB = 395
InStringL = 305
Quote from: fearless on May 11, 2009, 03:41:20 PM
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
..
InstrJJ : InString = 12 %
Thanks. For a P4 I get now 25%, for the T2400 16%, and a Quad with SSE4 gives factor 8 compared to non-SSE2. Interesting pattern...
Quote
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
Test correctness for InstrJJ: OK
Average cycle counts:
1071 InstrJJ (jj2007)
2728 BMLinDB (Lingo)
9324 InString (Masm32 Lib)
InstrJJ : InString = 11 %
Code sizes:
InstrJJ = 319
BMLinDB = 395
InStringL = 305
Quote from: redskull on May 11, 2009, 04:16:54 PM
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
Test correctness for InstrJJ: OK
Average cycle counts:
1071 InstrJJ (jj2007)
2728 BMLinDB (Lingo)
9324 InString (Masm32 Lib)
InstrJJ : InString = 11 %
Thanks. Not a Quad but SSE4. On my old one-core CPU, I get much less of an improvement:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
1717 InstrJJ (jj2007)
2780 BMLinDB (Lingo)
10196 InString (Masm32 Lib)
InstrJJ : InString = 16 %
Lingo's algo is SSE2, too, and pretty stable, so I guess Intel has made progress on a few very specific instructions that I use in the inner loop.
L1: movdqa xmm2, [esi]
Lu: movdqa xmm1, xmm0 ; xmm0 has first byte of pattern
lea esi, [esi+16] ; position in mainstring
pcmpeqb xmm1, xmm2 ; compare memory to pattern byte
pmovmskb edx, xmm1 ; set byte mask in edx for search pattern byte
pcmpeqb xmm1, xmm2 ; look for zero byte (xmm1 is either 0 or ff)
pmovmskb eax, xmm1 ; set byte mask in eax for zero delimiter byte
test eax, eax ; db 02eh ; branch hint: not taken, no effect
jne ZBF ; zero byte found? check/clear ebp, then ChkNull
test edx, edx ; db 03eh ; branch hint: taken, no effect
je L1 ; pattern byte found? no=go back, yes=repeat the exercise with a word
Hi,
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
Test correctness for InstrJJ: OK
Average cycle counts:
3106 InstrJJ (jj2007)
3872 BMLinDB (Lingo)
13736 InString (Masm32 Lib)
InstrJJ : InString = 22 %
Code sizes:
InstrJJ = 319
BMLinDB = 395
InStringL = 305
Regards,
Steve N.
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
2985 InstrJJ (jj2007)
3872 BMLinDB (Lingo)
13725 InString (Masm32 Lib)
InstrJJ : InString = 21 %
Code sizes:
InstrJJ = 319
BMLinDB = 395
InStringL = 305
Thanks to all who tested this code. I have run into a very odd problem that's driving me nuts:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
2636 InstrJJ (jj2007)
2778 BMLinDB (Lingo)
10198 InString (Masm32 Lib)
InstrJJ : InString = 25 %
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
1719 InstrJJ (jj2007)
2778 BMLinDB (Lingo)
10088 InString (Masm32 Lib)
InstrJJ : InString = 17 %
The only difference is the IFNOP switch in line 6 of the attachment.
If set to 1, it inserts an apparently harmless
.if res1 ;; any global variable will do, but not .if eax
nop
.endif
into the main loop of the timings macro:
REPEAT REP_CT
if IFNOP
.if res1 ;; any global variable will do, but not .if eax
nop
.endif
endif
invoke Sleep, 50
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
tmp$
counter_end
add globcycles, eax
inc globcount
if DisplayCycles
.if eax>65536
shr eax, 10
.endif
out$ CATSTR <print str$(eax), 9, ">, Description, <", 13, 10>
out$
endif
ENDM
Otherwise the code is identical. Any ideas?
::)
[attachment deleted by admin]
Why are you so surprised? I see this all the time on amd.
You move the call to the routine by n bytes. I didn't actually assemble it since I don't have sse2, but check for odd/even address, etc.
Some instructions prefer to be on an even address while others an odd address.
that is a HUGE difference, though
it makes me wonder if it is worth the effort to optimize, beyond a certain point
get it running sufficiently fast, then insert nop's behind each instruction to tune it in - lol
Quote from: Jimg on May 11, 2009, 08:14:15 PM
Some instructions prefer to be on an even address while others an odd address.
Both data and code are aligned 16 plus a fixed offset. But it
is an "alignment" issue - I can reproduce the effect by inserting a given number of nops before the getkey+exit...
Do the two exes produce different results on
your CPUs??
:(
EDIT: I managed to get the difference between the slow and fast
calls to the algo:
;CPU Disasm FAST
;Address Hex dump Command
;004036D2 ? 68 6C614000 push offset InstrCi11_BL_BSF.0040616C
;004036D7 ? 68 47574000 push offset InstrCi11_BL_BSF.00405747
;004036DC . 6A 01 push 1
;004036DE . E8 3D0E0000 call 00404520 #######
;004036E3 . 832D 047A4000 01 sub dword ptr [407A04], 1
;CPU Disasm slow
;004036D2 ? 68 6C614000 push offset InstrCi11_BL_BSF.0040616C
;004036D7 ? 68 47574000 push offset InstrCi11_BL_BSF.00405747
;004036DC . 6A 01 push 1
;004036DE . E8 2D0E0000 call 00404510 #######
;004036E3 . 832D 047A4000 01 sub dword ptr [407A04], 1
A code caching issue maybe??
Quote from: dedndave on May 11, 2009, 09:10:33 PM
that is a HUGE difference, though
it makes me wonder if it is worth the effort to optimize, beyond a certain point
get it running sufficiently fast, then insert nop's behind each instruction to tune it in - lol
Inserting nops inside the algo helps sometimes :bg
But here it's from
where you call it that makes the difference. The algo itself, and the data addresses, are unchanged... same algo size, no sudden conversions from near to far jumps etc...
i don't think the stack cares if you push an odd number or an even one
it has to be related to the EIP, itself
perhaps the lag occurs upon returning from the call, as opposed to making the call
so that, after the routine is done, it wants an even addy to return to ?
it does have to re-fill the queue at that point
now, we just need an assembler that will even-align (16-byte align - ouch) the return addy of calls - what a difference !
hint - hint GoAsm guys
I haven't been following this thread, but if the alignment of the return address is a problem it can readily be aligned by any one of several methods. For example:
; push arguments
push OFFSET return_label
push OFFSET procedure
ret
align 16
return_label:
Note that the OFFSET directive is not necessary, it was just added for clarity.
Quote from: MichaelW on May 12, 2009, 03:59:54 AM
I haven't been following this thread, but if the alignment of the return address is a problem
Thanks. It is unlikely that this is the problem because the return address is 004036E3 in both the slow and fast case. Somewhere about 1000 cycles
per call get lost, and I would really like to understand why...
I can't see any reason for the difference, but your app is so complex that it would be easy to miss a problem. You could try isolating the problem components in a simple test app, where there are fewer opportunities for errors and side effects.
Quote from: MichaelW on May 12, 2009, 06:27:35 AM
I can't see any reason for the difference, but your app is so complex that it would be easy to miss a problem. You could try isolating the problem components in a simple test app, where there are fewer opportunities for errors and side effects.
Thanks, Michael, for looking at this. The app is 28k including the "Mainstring" it searches; it may be complex in a way, but algo start is para-aligned, data alignment is fixed, too, and it is so small that code and data should fit into the cache. So how can moving the algo 16 bytes further down cost 1,000 cycles per call?
Isolating the problem components is of course a good idea, but most probably I won't be able to reproduce the effect any more.
I am lost... ::)
EDIT: Just tested the two exes attached above on P4:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
InstrCi11_IfNop_OFF.exe
4233 InstrJJ (jj2007)
4020 BMLinDB (Lingo)
InstrCi11_IfNop_ON.exe
4275 InstrJJ (jj2007)
4045 BMLinDB (Lingo)
Not the slightest difference. On the Celeron, there is a consistent difference of 1,000 cycles for InstrJJ but none for BMLinDB. So it is even CPU-specific... and the P4 is usually more sensitive to alignment than the more recent Celeron M.
Quote from: Jimg on May 11, 2009, 08:14:15 PM
Why are you so surprised? I see this all the time on amd.
Jim,
Did you ever find an explanation why this can happen? Some pattern that would point to possible causes? Does it happen in your big and small apps alike? What is the order of magnitude you observed: +40%=1000 cycles difference, or much less?
Perhaps the code with the extra 16 bytes will not fit in the cache, or some critical part of it will not fit. If this is so, adding more bytes should eventually cause the code to misbehave on other processors.
Quote from: MichaelW on May 12, 2009, 08:37:24 AM
Perhaps the code with the extra 16 bytes will not fit in the cache, or some critical part of it will not fit. If this is so, adding more bytes should eventually cause the code to misbehave on other processors.
The Celeron M 420 has 32kb of L1 cache (http://en.kioskea.net/guide/details/989288-intel-celeron-420-1-60-ghz-socket-775-800-mhz-bus), so the 28k exe should comfortably fit into the cache. But I am not a hardware specialist...
By the way, I cross-checked with GetTickCount, and the cycle counts are correct. Calling the algo
is 1,000 cycles slower when it sits 16 bytes further
down up:boohoo:
EDIT: Found a good article on Optimizing for instruction caches (http://www.dspdesignline.com/howto/202601670):
(page 2): "The effective size of a function in the cache can vary depending on its alignment. Suppose a function requires less than one line of cache, but that it overlaps a cache line boundary. In this case, the function will require allocation of two cache lines, substantially increasing it effective size".
Sounds plausible, but can an executable with only 28k overlap with a cache line boundary if the L1 cache is 32k? Is that the explanation?? If I understand the article correctly, two cache lines may be allocated, fine. But only once, or for every call, costing 1000 cycles??
Quote from: jj2007 on May 12, 2009, 07:44:15 AM
Quote from: Jimg on May 11, 2009, 08:14:15 PM
Why are you so surprised? I see this all the time on amd.
Jim,
Did you ever find an explanation why this can happen? Some pattern that would point to possible causes? Does it happen in your big and small apps alike? What is the order of magnitude you observed: +40%=1000 cycles difference, or much less?
I only know it happens. In fact, a couple of years ago, I set up my time testing routines to actually copy the code under test to the same physical memory location before starting the test, so every call was from and to the same location. About the time I got this all working, I realized I would never be able to play with the big boys because I had an AMD and they were all Pentium-4 centric. Took all the fun out of it.
Here's a sample. I was getting rock solid repeatability with only ten loops.
[attachment deleted by admin]
I only see slight differences with a Prescott....
InstrCi11_IfNop_ON.exe
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
4295 InstrJJ (jj2007)
3995 BMLinDB (Lingo)
16355 InString (Masm32 Lib)
InstrJJ : InString = 26 %
InstrCi11_IfNop_OFF.exe
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Test correctness for InstrJJ: OK
Average cycle counts:
4286 InstrJJ (jj2007)
4045 BMLinDB (Lingo)
16671 InString (Masm32 Lib)
InstrJJ : InString = 25 %
Quote from: Jimg on May 12, 2009, 02:56:44 PM
I only know it happens. In fact, a couple of years ago, I set up my time testing routines to actually copy the code under test to the same physical memory location before starting the test, so every call was from and to the same location. About the time I got this all working, I realized I would never be able to play with the big boys because I had an AMD and they were all Pentium-4 centric. Took all the fun out of it.
Here's a sample. I was getting rock solid repeatability with only ten loops.
Thanks, very interesting. It seems to be linked to
instruction cache locality (http://www.google.it/search?hl=en&safe=off&num=50&newwindow=1&q=%22instruction+cache+locality%22&btnG=Search), although I still do not understand how this should affect a 300-byte algo in a 28k executable on a CPU with 32k L1 cache...
Quote from: dedndave on May 12, 2009, 02:59:40 PM
I only see slight differences with a Prescott....
My P4 is a Prescott, too - same result: no difference. It is CPU-specific.
c:\china>\masm32\bin\ml.exe /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 6.14.8444
Copyright (C) Microsoft Corp 1981-1997. All rights reserved.
Assembling: instrnew.asm
c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
c:\china>instrnew.exe
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mmword
======= ======== ======== ======== ======== ========
OVH 264 0 0 0 0
InStrL 88 132 363 462 1155
finstr 88 286 429 550 1496
finstr2 154 209 363 363 1617
finstr3 143 209 352 352 1507
finstr4 88 308 451 594 1496
finstr5 88 319 451 605 1496
finstr6 154 209 363 363 1617
Press enter to exit...
c:\china>
This subject brings to mind something that has troubled me a bit.
If a programmer tests/optimizes his/her code on one machine, as many of us would
during a normal session, the code is likely to become biased for that CPU.
As a group, it would be nice if we could standardize testing and document
most of our results in one easy-to-use reference. That way, if a programmer
(especially a n00b like me) wants to find an optimization that has been
tested on a variety of platforms, he/she doesn't have to search and read all
the forum threads concerning a single subject. I am not refering to major
algorithms, but simple little things that have been widely tested.
It seems impractical to have all the members test every algo we write, in
order to acquire a multi-platform sampling. On the other side of the coin,
we don't want finished code that is (sometimes unknowingly) optimized
for a specific processor. There must be a solution someplace.
As for Jochen's issue, it is machine specific, but all of us want to know what
the solution is so that we may attempt to avoid the issue in code that is
written and tested on a different CPU. It makes me wonder how many
programs out there suffer on one or more specific CPU(s).
JJ - This morning I was reading Agner's assembly optimization manual.
One of the things that made me think of your problem was his
statement "Make sure it isn't the test program". He went on to say
that many hours are spent trying to fix an algorithm, when it turns
out to be the testing method at fault. I am not saying your test-bed
has a flaw, I just wanted you to consider the possibilities. At our age,
we can't afford to pull out too much of the hair we have.
it's possible, we can make a custom cpu(or SoftCpu) for our program in furture! :toothy
Quote from: dedndave on May 12, 2009, 04:17:14 PM
..he/she doesn't have to search and read all
the forum threads concerning a single subject.
I agree. Also could there be a summary thread made for all the tested algos that would allow us just to refer to it for the very latest downloads and information. Another thing i would like to see is an indicator of the definitive best solution algo (not always possible i know) or in place of that a recommendation on the best one (or more than one - but try to keep it to one) to go with and a note on the one to avoid or at use the least.
All this in one place would be a very handy resource. Some threads are massive and/or there are multiple threads for similar topics. For example szLen optimize (http://www.masm32.com/board/index.php?topic=1807.0) has 25 pages, and there are two threads on string copying: lstrcpy vs szCopy (http://www.masm32.com/board/index.php?topic=10830.0) has 4 pages, SzCpy vs. lstrcpy (http://www.masm32.com/board/index.php?topic=1589.0) has 8 pages. Could the lab implement other subforums here for different categories: string algos & optimizing, memory algos & optimizing, etc etc. I realise that could take a lot of work, also the placement of topics into the correct subforums may coz moderators grief in moving and sorting etc. Maybe a post template could also be created that defines the information that should be provided when creating a new thread in the lab for purposes of presenting a new or modified algo.
Just some ideas, hope no one takes offence.
Quote from: Jimg on May 12, 2009, 02:56:44 PM
Here's a sample. I was getting rock solid repeatability with only ten loops.
My 2 cts - Celeron M. It took me a while because it choked with ml 9.0 ;-)
Well organised table, by the way :U
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mmword
======= ======== ======== ======== ======== ========
OVH 264 0 0 0 0
InStrL 96 132 348 468 1152
finstr 96 288 432 552 1500
finstr2 144 216 372 372 1608
finstr3 144 204 348 348 1512
finstr4 84 300 456 588 1500
finstr5 84 324 456 612 1488
finstr6 144 216 360 372 1608
Quote from: fearless on May 12, 2009, 06:21:20 PM
Just some ideas, hope no one takes offence.
Excellent ideas, fearless. Some of these threads are really too detailed (and I plead guilty :bg), and unfortunately not always on topic. Good coders tend to have strong personalities :wink
Standardisation could help a lot - Michael's timer macros are a fantastic achievement, but we might add some rules, e.g. how to pass parameters (we all know that fastcall is a few cycles faster...). You may have seen the
timings macro in my posts above, I have done that one to suit my own laziness, but I believe we could work out something along those lines. Not to mention LAMPs :wink
all good ideas
i agree with fearless - some of the threads require a lot of reading that does not pertain to what you want to find
perhaps we could start by outlining some goals and making a list of which members have which CPU's
that way, when a new piece of code is introduced, we can get it tested on several platforms, then hash over the optimization
it would give all members a chance to contribute - acknowledging that a few members are really good at it - some not so good (like me - lol)
along the way, we all get to learn tricks and tips
i know some old stuff that needs to be modernized by the optimizing wizards that hang out in here
but, for me, it will be mostly a learning experience
One more element to the never ending story: I tried another mainstring/pattern with otherwise identical settings.
Result: For the "slow" setting, execution is around half as fast and cycle counts are very erratic, but they never reach 1000-1200 cycles more as with the long mainstring. Which means a proportional pattern, i.e. the algo just runs with breaks on. The fast version runs very stable at 325 cycles.
Intel(R) Celeron(R) M CPU 420
Average cycle counts:
Slow version:
600-800 InstrJJ
Fast version:
325 InstrJJ
Quote from: jj2007 on May 12, 2009, 06:35:33 PM
Quote from: Jimg on May 12, 2009, 02:56:44 PM
Here's a sample. I was getting rock solid repeatability with only ten loops.
My 2 cts - Celeron M. It took me a while because it choked with ml 9.0 ;-)
Well organised table, by the way :U
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mmword
======= ======== ======== ======== ======== ========
OVH 264 0 0 0 0
InStrL 96 132 348 468 1152
finstr 96 288 432 552 1500
finstr2 144 216 372 372 1608
finstr3 144 204 348 348 1512
finstr4 84 300 456 588 1500
finstr5 84 324 456 612 1488
finstr6 144 216 360 372 1608
So where's instrjj ?
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mmword
======= ======== ======== ======== ======== ========
OVH 229 0 0 0 0
InStrL 51 94 383 340 842
finstr 51 153 238 281 816
finstr2 111 145 264 255 952
finstr3 111 153 247 238 986
finstr4 51 204 289 417 799
finstr5 60 187 272 383 808
finstr6 111 145 264 255 952
fearless CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
nice numbers
i like the core 2 quad q9550, i think
here are mine for a prescott - here we go with the funny minus signs, again
i hope some of the other numbers are wrong, as well - lol
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mmword
======= ======== ======== ======== ======== ========
OVH 555 -30 -15 -30 -23
InStrL 120 232 577 855 2482
finstr 142 405 578 750 2295
finstr2 158 337 750 690 2670
finstr3 120 412 712 735 2745
finstr4 82 420 712 893 2175
finstr5 127 487 773 1005 2408
finstr6 158 360 600 720 2760
Quote from: jj2007 on May 12, 2009, 11:55:25 AM
Sounds plausible, but can an executable with only 28k overlap with a cache line boundary if the L1 cache is 32k? Is that the explanation?? If I understand the article correctly, two cache lines may be allocated, fine. But only once, or for every call, costing 1000 cycles??
hmm, it is an alignment problem, but not the "normal alignment", not directly... a pentium M can't deal with more than 4 uops in one operation (it's under the abilities of a P3), with this "skill" (i see lingo laughting here...), a single byte can change lot of things (especially with recent instructions sets)...
Quote from: jj2007 on May 12, 2009, 07:01:53 PM
Standardisation could help a lot - Michael's timer macros are a fantastic achievement, but we might add some rules, e.g. how to pass parameters (we all know that fastcall is a few cycles faster...). You may have seen the timings macro in my posts above, I have done that one to suit my own laziness, but I believe we could work out something along those lines. Not to mention LAMPs :wink
hmm, look like a racism against real coders :'(
Quote from: fearless on May 12, 2009, 06:21:20 PM
All this in one place would be a very handy resource. Some threads are massive and/or there are multiple threads for similar topics.
it's not as simple, periodically someone found another way to do stuff, it's also very often that algos don't do exactly the same things... or don't use the same instructions sets... etc... for example, in my case (as french), a BM algo or even a case insensitive algo are useless, why ? because with all the accents, etc... (and possible errors linked to this) it can't give satisfying results. so i have to use something else (i post it coz it can be usefull on others languages, however you must correct the values of the table, depending of your language.) :
.DATA
;
; tables pour obtenir les caractères majuscules équivalents (fait 0.25 ko, cette table est utilisée pour réduire la recherche/comparaison aux
; caractères majuscules basiques, pour tenir compte des erreurs concernant les accents, trémas, cédilles, etc...)
;
_Search_Table_ BYTE 000,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015
BYTE 016,017,018,019,020,021,022,023,024,025,026,027,028,029,030,031
; le jeu de caractères classique :
BYTE 032,033,034,035,036,037,038,039,040,041,042,043,044,045,046,047
BYTE 048,049,050,051,052,053,054,055,056,057,058,059,060,061,062,063
BYTE 064,065,066,067,068,069,070,071,072,073,074,075,076,077,078,079
BYTE 080,081,082,083,084,085,086,087,088,089,090,091,092,093,094,095
BYTE 096,065,066,067,068,069,070,071,072,073,074,075,076,077,078,079
BYTE 080,081,082,083,084,085,086,087,088,089,090,123,124,125,126,127
; le jeu de caractères étendu :
BYTE 128,129,130,131,132,133,134,135,136,137,83,139,140,141,90,143
BYTE 144,145,146,147,148,149,150,151,152,153,83,155,140,157,90,89
BYTE 160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175
BYTE 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
BYTE 65,65,65,65,65,65,198,67,69,69,69,69,73,73,73,73
BYTE 68,78,79,79,79,79,79,215,216,85,85,85,85,89,222,223
BYTE 65,65,65,65,65,65,198,67,69,69,69,69,73,73,73,73
BYTE 68,78,79,79,79,79,79,247,216,85,85,85,85,89,222,89
.CODE
ALIGN 16
;
; trouver une chaine de caractères dans une chaîne/un texte ascii (la première occurrence)
; note : insensible à la casse, accents compris... attention, ne teste pas si la position est valide
;
; syntax :
; mov eax,starting position of the search (in the string to scan)
; mov esi,OFFSET (address of the string to find)
; mov edi,OFFSET (address of the string to scan)
; call FindNextString
;
; Return :
; eax = -1 if the string isn't found, otherwise eax has the position of the string
;
FindNextString PROC
push ebx ;; empiler ebx
push ecx ;; empiler ecx
push edx ;; empiler edx
add eax,edi ;; on ajoute la position à l'adresse de la chaîne
; on stocke le premier octet à trouver
movzx ebx,BYTE PTR [esi] ;; placer le premier octet à trouver dans bl
mov bl,BYTE PTR [_Search_Table_+ebx] ;; placer la majuscule de la table dans bl
dec eax ;; enlever 1 à l'adresse ( cause de la ligne qui suit...)
; maintenant, recherche du premier caractère identique
Label1: inc eax ;; ajouter 1 à l'adresse de loctet à comparer
movzx edx,BYTE PTR [eax] ;; lire l'octet à comparer, et le placer dans dl
test edx,edx ;; fixer les flags de dl
jz Label5 ;; si c'est égal à 0, aller Label5
mov dl,BYTE PTR [_Search_Table_+edx] ;; sinon, placer la majuscule de la table dans dl
cmp dl,bl ;; comparer ensuite à la majuscule à trouver
jne Label1 ;; si c'est différent, on continu la recherche, aller Label1
; sinon, on teste les caractères suivants
Label2: xor ecx,ecx ;; effacer ecx
Label3: inc ecx ;; augmenter ecx d'1
movzx edx,BYTE PTR [esi+ecx] ;; placer l'octet suivant dans dl
test edx,edx ;; fixer les flags de edx
jz Label4 ;; si c'est égal à 0, aller Label4
mov bh,BYTE PTR [_Search_Table_+edx] ;; sinon, placer la majuscule de la table dans bh
movzx edx,BYTE PTR [eax+ecx] ;; lire l'octet suivant à comparer
mov dl,BYTE PTR [_Search_Table_+edx] ;; placer la majuscule de la table dans dl
cmp dl,bh ;; comparer ensuite à la majuscule à trouver
je Label3 ;; si c'est égal à 0, aller Label3
jmp Label1 ;; aller Label1
; ici la chaîne est trouvée, alors on defini eax qui contiendra l'adresse
Label4: sub eax,edi ;; on soustrait l'adresse de la chaîne pour obtenir la position trouvée dans la chaîne
pop edx ;; désempiler edx
pop ecx ;; désempiler ecx
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
; chaîne non trouvée
Label5: mov eax,-1 ;; sinon on met -1 (chaîne non trouvée dans eax)
pop edx ;; désempiler edx
pop ecx ;; désempiler ecx
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
FindNextString ENDP
Quotehmm, look like a racism against real coders :'(
i understand what you mean by this, but you can't tell me you learned everything out of a book
somehwere along the line, you learned from someone else, also
i considered myself to be a pretty fair programmer in 16-bit
not that i was a pro, but i have done a lot of math that many cannot do
i'll show you mine if you show me yours :lol
Quote from: NightWare on May 13, 2009, 12:38:19 AM
a single byte can change lot of things (especially with recent instructions sets)...
Algo is unchanged, code start is 16-byte aligned in both cases. Just 16 bytes further up, and it slows down by 40% and timings get erratic...
Quote
Quote from: jj2007 on May 12, 2009, 07:01:53 PM
Standardisation could help a lot - Michael's timer macros are a fantastic achievement, but we might add some rules, e.g. how to pass parameters (we all know that fastcall is a few cycles faster...). You may have seen the timings macro in my posts above, I have done that one to suit my own laziness, but I believe we could work out something along those lines. Not to mention LAMPs :wink
hmm, look like a racism against real coders :'(
No racism intended. We want a testbed that makes algos comparable. Usually you do that with macros that allow a syntax of this type
timings crt_strstr, addr Mainstr, addr TestSubD
timings InstrCi, 1, addr Mainstr, addr TestSubE, 0
timings BMLinDB, addr Mainstr, addr TestSubEx, (SIZEOF TestSubEx)-1
where each line replaces
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke crt_strstr, addr Mainstr, addr TestSubD
counter_end
print str$(eax), 9, "cycles for rep stosd", 13, 10
Now it gets messy if one programmer says "my algo needs Mainstr in eax and Pattern in ecx, only then it will be really fast".
Quote from: Jimg on May 12, 2009, 09:27:29 PM
So where's instrjj ?
Here it is. I had to change quite a number of things to make it compatible ;-)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 264 0 0 0 0
InStrL 84 132 360 468 1152
finstr 96 288 432 552 1500
finstr2 144 216 372 372 1608
finstr3 144 204 348 348 1512
finstr4 84 324 456 624 1500
finstr5 84 324 456 612 1488
finstr6 156 216 360 372 1608
InstrJJ 84 120 204 132 1536
[attachment deleted by admin]
i find this very disturbing
i don't know what's going on, here
i get very different numbers from everyone else on my prescott
also, i get very different numbers if i redirect the output to a text file
and different numbers again if i open it with Windows Explorer
i am running XP MCE2005 SP2 (the OS shouldn't make much difference - who knows)
compare these results with the ones i pasted a few posts back ^
here, i open a command window and redirect the output with "instrnew>test.txt"
i get similar results by running it without redirection (some test programs, i get different results by redirecting)
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 487 0 0 0 0
InStrL 91 203 495 750 1433
finstr 90 278 420 511 1418
finstr2 105 270 675 540 1695
finstr3 105 285 548 540 1718
finstr4 98 285 473 638 1343
finstr5 105 353 623 705 1403
finstr6 105 270 428 533 1696
InstrJJ 188 180 451 248 2888
on these two, i ran it by clicking on it in Windows Explorer
notice the inconsistencies, especially the minus signs
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 517 -7 8 0 -7
InStrL 143 248 555 923 2303
finstr 158 443 675 788 2408
finstr2 188 398 668 765 2738
finstr3 188 413 818 736 2783
finstr4 188 450 908 938 2198
finstr5 188 533 781 1035 2258
finstr6 188 390 668 728 2775
InstrJJ 278 300 841 391 4531
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 495 23 15 23 22
InStrL 218 270 653 915 2400
finstr 180 443 622 780 2467
finstr2 232 405 773 795 2813
finstr3 202 458 780 802 2850
finstr4 173 488 892 960 2220
finstr5 165 548 787 1072 2288
finstr6 180 412 690 780 2745
InstrJJ 285 322 698 397 4417
Quote from: dedndave on May 13, 2009, 07:09:24 AM
i get very different numbers from everyone else on my prescott
Just launched my Prescott:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 459 17 17 17 -9
InStrL 110 229 382 535 1402
finstr 119 263 399 519 1428
finstr2 127 280 450 561 1708
finstr3 127 289 459 569 1725
finstr4 110 323 527 620 1360
finstr5 119 340 544 722 1428
finstr6 127 297 450 552 1708
InstrJJ 196 187 467 280 2661
This is with TimesToLoop equ 10000. With only 10, timings are pretty erratic...
I tried the redirection thing and got exactly the same numbers as from explorer (no joke, they all matched).
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 252 0 0 0 0
InStrL 63 108 270 360 882
finstr 54 180 243 297 819
finstr2 108 171 288 261 954
finstr3 117 180 261 252 954
finstr4 54 216 279 387 810
finstr5 45 216 279 387 810
finstr6 108 171 288 261 954
InstrJJ 63 72 135 126 1323
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 229 0 0 0 0
InStrL 51 102 187 340 842
finstr 51 153 238 281 808
finstr2 102 145 247 255 952
finstr3 102 162 255 247 986
finstr4 51 213 272 417 799
finstr5 43 196 264 374 808
finstr6 111 145 247 255 961
InstrJJ 60 68 136 77 1369
lol - fearless gets such nice numbers with that CPU
i hafta go buy one :U
i have been playing with selecting cores on a multi-core machine
i have tried to keep in mind that, in some cases, the cores may not be the same CPU type, especially in the case of overdrive chips
it seems that, if they have an overdrive chip, you have to use SetProcessAffinityMask to tell the system which core you want to look at
this includes the CPUID - how else would you select which processor to run CPUID, right ?
if they have different cores, you may have to run the CPUID routine more than once, right?
(also, for certain non-Intel/non-AMD CPU's, you have to enable the CPUID instruction)
you have to use SetProcessAffinityMask on a multi-core for the RDTSC instruction, also
now, here are a couple things that i have not seen in any of the test code that i have looked at to date
in order to use Get/SetProcessAffinityMask to control core selection.....
1) use GetSecurityInfo to see ... (of course, there is no documented "SetSecurityInfo" function - :P )
to use GetProcessAffinityMask on XP or Windows server 2003, you must have:
PROCESS_QUERY_INFORMATION access right
to use GetProcessAffinityMask on other OS's, you must have:
PROCESS_QUERY_INFORMATION or PROCESS_QUERY_LIMITED_INFORMATION access right
to use SetProcessAffinityMask (to manipulate core selection), you must have:
PROCESS_SET_INFORMATION access right
2) to use SetProcessAffinityMask (to manipulate core selection)
use SetProcessAffinityUpdateMode to set PROCESS_AFFINITY_ENABLE_AUTO_UPDATE
The system can adjust process affinity under various conditions, such as when a processor is added dynamically. By default, dynamic updates to the process affinity are disabled for each process.
Processes should use this function to indicate whether they can handle dynamic adjustment of process affinity by the system. After a process enables affinity update mode, it can call this function to disable it. However, a process cannot enable affinity update mode after it has used this function to disable it.
in other words, i don't think anyone is looking at the return values when they try to control core selection for RDTSC
if they were, they should get an error condition, no ?
at any rate, it should be verifiable by using GetProcessAffinityMask to see if they have selected a single core to read the counter
also, for a complete CPU ID routine, you need to ID the OS to see if you can even read the GetProcessAffinityMask value
and, you need to be able to control SetProcessAffinityMask in order to specify which core the CPUID instruction is executing on
running a test on a single core machine - easy
multi-core - not so easy
c:\china>"C:\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 8.00.50727.762
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: instrnew.asm
c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
c:\china>instrnew.exe
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch
Source Mainstr Mainstr Mainstr Mainstr mismatch
Pattern Substr1 Substr2 Substr3 Substr4 mXword
======= ======== ======== ======== ======== ========
OVH 264 0 0 0 0
InStrL 88 132 352 462 1155
finstr 88 286 429 550 1507
finstr2 143 209 363 363 1617
finstr3 143 209 352 352 1507
finstr4 88 308 451 594 1496
finstr5 88 319 462 605 1496
finstr6 143 209 363 363 1617
InstrJJ 88 121 209 132 1518
Press enter to exit...
c:\china>
Made a few changes to Jim's package:
- ShowCPU
- Overall cycles (helps to quickly identify problems with timing consistency etc)
- increased TimesToLoop to 10000
- added an "easy" mismatch case ("xillabong"; note the old mmword has become a reserved word)
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 459 17 17 17 17 17 544
InStrL 162 162 502 527 1326 366 3045
finstr 119 272 391 527 1411 722 3442
finstr2 119 280 442 552 1708 765 3866
finstr3 119 280 467 552 1717 773 3908
finstr4 111 306 535 671 1352 927 3902
finstr5 110 340 637 689 1411 1045 4232
finstr6 111 272 459 544 1708 765 3859
InstrJJ 204 187 476 280 2644 297 4088
EDIT: Timings on the Celeron are a lot more stable:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 84 132 360 468 1152 300 2496
finstr 108 288 432 552 1500 828 3708
finstr2 144 216 360 372 1608 540 3240
finstr3 144 204 348 348 1512 528 3084
finstr4 84 324 456 636 1500 924 3924
finstr5 84 336 456 612 1488 996 3972
finstr6 144 216 360 372 1608 540 3240
InstrJJ 84 120 204 132 1548 156 2244
[attachment deleted by admin]
much better
i get very close results either way i run it
now, can you make my prescott run like fearless's machine ? lol
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 487 0 0 0 0 0 487
InStrL 135 203 308 518 1290 360 2814
finstr 90 255 375 473 1403 683 3279
finstr2 113 270 435 495 1695 758 3766
finstr3 105 263 435 495 1695 750 3743
finstr4 98 278 540 601 1335 915 3767
finstr5 98 345 495 660 1403 1028 4029
finstr6 98 270 421 533 1673 758 3753
InstrJJ 188 180 450 248 2640 270 3976
i may have misunderstood the use of
SetProcessAffinityUpdateMode to set PROCESS_AFFINITY_ENABLE_AUTO_UPDATE
i need to research this one a little more
as i read it, it needed to be enabled to use SetProcessAffinityMask
after a re-read, it may need to be disabled to dis-allow the system from altering the ProcessAffinity dynamically
i can write a little program to test this - probably easier than reading MS docs
#c:\china>"\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff instrnew.asm
#Microsoft (R) Macro Assembler Version 8.00.50727.762
#Copyright (C) Microsoft Corporation. All rights reserved.
#
# Assembling: instrnew.asm
#\masm32\include\windows.inc(16610) : error A2138: invalid data initializer
#
#c:\china>
#
why?
#ImportRec record Reserved:11,NameType:3,Type2:2
#IMPORT_OBJECT_HEADER struct
# Sig1 dw ?
# Sig2 dw ?
# Version dw ?
# Machine dw ?
# TimeDateStamp dd ?
# SizeOfData dd ?
# union
# Ordinal dw ?
# Hint dw ?
# ends
# rImport ImportRec <> ;line 16610
#IMPORT_OBJECT_HEADER ends
no why (because GeneSys windows.inc v1.30 comment IMPORT_OBJECT_HEADER struct.), comment line 16610 to continue...
c:\china>"\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 8.00.50727.762
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: instrnew.asm
c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
c:\china>instrnew.exe
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 88 132 341 462 1133 297 2453
finstr 88 286 429 550 1496 836 3685
finstr2 143 209 352 363 1617 539 3223
finstr3 121 209 352 352 1507 517 3058
finstr4 88 297 451 594 1496 924 3850
finstr5 77 308 451 605 1496 935 3872
finstr6 143 209 363 363 1606 539 3223
InstrJJ 88 121 209 132 1474 154 2178
Press any key to exit...
c:\china>
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 229 0 0 0 0 0 229
InStrL 43 94 162 340 833 204 1676
finstr 43 153 230 281 808 434 1949
finstr2 111 145 247 255 952 383 2093
finstr3 102 162 255 247 978 349 2093
finstr4 43 204 289 417 799 663 2415
finstr5 43 187 264 366 800 587 2247
finstr6 102 145 239 255 952 375 2068
InstrJJ 60 68 128 77 1386 85 1804
I ran it a few times and noticed a few minor variations in some of the categories.
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 229 0 0 0 0 0 229
InStrL 43 102 230 332 842 204 1753
finstr 43 153 230 281 808 434 1949
finstr2 102 145 230 247 952 374 2050
finstr3 111 162 255 238 978 357 2101
finstr4 43 204 289 417 799 663 2415
finstr5 43 187 264 366 808 587 2255
finstr6 102 145 247 247 952 366 2059
InstrJJ 60 68 128 77 1386 85 1804
Press any key to exit...
C:\Downloads\InstrTable>instrnew
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 229 0 0 0 0 0 229
InStrL 51 102 179 340 842 204 1718
finstr 51 153 230 281 808 434 1957
finstr2 103 145 238 255 952 374 2067
finstr3 111 162 255 247 978 357 2110
finstr4 43 204 289 417 799 663 2415
finstr5 43 188 272 366 799 604 2272
finstr6 102 145 247 255 952 374 2075
InstrJJ 60 68 128 77 1386 85 1804
Press any key to exit...
C:\Downloads\InstrTable>instrnew
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 229 0 0 0 0 0 229
InStrL 43 94 162 340 842 204 1685
finstr 43 153 238 281 808 434 1957
finstr2 103 145 247 255 952 383 2085
finstr3 111 153 255 230 986 357 2092
finstr4 43 204 289 417 799 663 2415
finstr5 43 187 264 383 799 587 2263
finstr6 102 145 230 255 952 383 2067
InstrJJ 60 68 128 77 1386 85 1804
to simplify the security access rights issue.....
because the requirements are different for different operating systems,
rather than using GetSecurityInfo to determine of you have sufficient access rights,
it is easier to just use something like this:
1) GetProcessAffinityMask (save the values)
2) SetProcessAffinityMask (select a single core mask)
3) GetProcessAffinityMask (test to see if you changed it)
using the Get/SetPriorityClass has the same access rights requirements
as Get/SetProcessAffinityMask and may be tested in a similar manner
i guess if you can change PriorityClass level, then you can change ProcessAffinity as well
only need to test one
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======= ======== ======== ======== ======== ======== ======== ========
OVH 71 0 0 0 0 0 71
InStrL 100 110 328 425 905 263 2131
finstr 88 235 471 444 1382 663 3283
finstr2 99 201 407 335 1504 468 3014
finstr3 98 188 371 320 1467 457 2901
finstr4 79 214 439 420 1262 609 3023
finstr5 74 215 449 403 1274 593 3008
finstr6 99 202 388 335 1505 460 2989
InstrJJ 136 134 358 161 2006 204 2999
Run2:
======= ======== ======== ======== ======== ======== ======== ========
OVH 71 0 0 0 0 0 71
InStrL 82 110 328 424 905 263 2112
finstr 88 235 471 444 1382 663 3283
finstr2 99 202 392 319 1501 465 2978
finstr3 98 188 386 320 1467 457 2916
finstr4 79 214 439 420 1262 616 3030
finstr5 74 215 449 403 1274 593 3008
finstr6 99 202 388 335 1504 468 2996
InstrJJ 136 134 362 161 2006 206 3005
Run3:
======= ======== ======== ======== ======== ======== ======== ========
OVH 71 0 0 0 0 0 71
InStrL 83 110 328 425 905 263 2114
finstr 88 235 469 444 1380 663 3279
finstr2 99 202 392 327 1507 462 2989
finstr3 98 188 386 320 1466 457 2915
finstr4 79 214 423 420 1261 616 3013
finstr5 74 215 449 403 1274 593 3008
finstr6 99 202 407 334 1507 468 3017
InstrJJ 136 134 349 161 2006 206 2992
More than 2000 cycles on my ultrafast Celeron was just not acceptable, so I decided to dig out my trusty old word scanner:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 96 144 252 468 1152 300 2412
finstr 84 288 444 588 1500 828 3732
finstr2 144 216 360 372 1608 540 3240
finstr3 144 204 348 348 1512 528 3084
finstr4 84 300 456 648 2352 924 4764
finstr5 84 336 492 648 1488 1008 4056
finstr6 144 216 360 372 1608 540 3240
InstrJJ 84 120 204 132 1536 156 2232
InstrJJw 60 120 180 156 1164 240 1920
:bg
[attachment deleted by admin]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======== ======== ======== ======== ======== ======== ======== ========
OVH 487 0 0 0 0 0 487
InStrL 90 150 480 525 1290 315 2850
finstr 90 270 383 473 1403 698 3317
finstr2 106 248 428 540 1673 758 3753
finstr3 98 263 510 495 1695 750 3811
finstr4 91 270 458 600 1343 915 3677
finstr5 90 323 608 661 1403 1028 4113
finstr6 98 255 428 495 1680 758 3714
InstrJJ 188 180 450 248 2648 278 3992
InstrJJw 90 180 338 293 2730 405 4036
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======== ======== ======== ======== ======== ======== ======== ========
OVH 229 0 0 0 0 0 229
InStrL 51 94 162 332 842 204 1685
finstr 43 153 230 281 808 434 1949
finstr2 102 145 247 255 952 374 2075
finstr3 111 145 247 238 978 349 2068
finstr4 51 179 264 374 799 587 2254
finstr5 43 187 272 383 808 595 2288
finstr6 102 145 247 255 952 374 2075
InstrJJ 60 68 128 77 1318 85 1736
InstrJJw 43 77 153 136 910 187 1506
Just for the fun of it, I wrote a little proggie that monitors how INSTR gets used in my main baby, simulating a typical user session. Wow what a surprise...!
Avg Mainstr= 55.1 bytes
Avg Pattern= 3.46 bytes
Total calls= 13,237,796
Looks as if we should concentrate more on short strings around 55 bytes long, and ultrashort patterns ::)
EDIT: In the light of these findings, I added a switch called OptiStrings, which forces shorter Mainstrings - see below. I also added the current Masm32 library InString for comparison, as InstrM32 (that was a bit tricky with JimG's code copy scheme :8))
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 36 36 72 264 12 120 540
InstrM32 192 204 336 408 168 396 1704
finstr 48 84 132 300 24 312 900
finstr2 48 72 120 204 24 216 684
finstr3 48 72 120 204 36 192 672
finstr4 48 96 144 348 24 336 996
finstr5 48 84 132 336 24 324 948
finstr6 48 72 120 204 24 216 684
InstrJJ 48 60 96 96 48 84 432
InstrJJw 36 60 84 108 48 120 456
Sizes:
Mainstr 100
Substr1 4
Substr2 7
Substr3 14
Substr4 9
mismatch 100
mmStr_b 5
mmStr_x 5
[attachment deleted by admin]
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 88 132 253 462 1133 297 2365
finstr 88 286 429 550 1496 825 3674
finstr2 132 209 363 363 1606 539 3212
finstr3 132 209 352 352 1507 528 3080
finstr4 88 297 451 594 1496 924 3850
finstr5 77 308 451 605 1496 935 3872
finstr6 143 209 363 363 1606 539 3223
InstrJJ 88 121 209 132 1518 165 2233
InstrJJw 55 110 198 154 1155 231 1903
Interesting...
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======== ======== ======== ======== ======== ======== ======== ========
OVH 71 0 0 0 0 0 71
InStrL 30 32 63 235 21 111 492
InstrM32 177 249 354 558 168 520 2026
finstr 42 70 110 238 29 225 714
finstr2 45 64 109 186 33 162 599
finstr3 47 65 106 179 33 159 589
finstr4 42 67 111 228 26 216 690
finstr5 39 66 110 219 24 205 663
finstr6 45 64 109 186 33 166 603
InstrJJ 60 72 133 108 62 126 561
InstrJJw 53 76 122 130 58 148 587
I'm wondering why the last version of InString-JJ is so slow
for my Test 13 --Find 'Duplicate inc' in 'windows.inc' -
clocks: 844498 against clocs: 679641 of previous version
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJ: 38 ; clocks: 84
SrchBytes - Lingo: 37 ; clocks: 49
Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJ: 1008 ; clocks: 13623
SrchBytes - Lingo: 1007 ; clocks: 11672
Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 292
SrchBytes - Lingo: 1008 ; clocks: 134
Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 2979
SrchBytes - Lingo: 1008 ; clocks: 1400
Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 1829
SrchBytes - Lingo: 1008 ; clocks: 1285
Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJ: 1009 ; clocks: 292
SrchBytes - Lingo: 1008 ; clocks: 129
Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJ: 1010 ; clocks: 544
SrchBytes - Lingo: 1009 ; clocks: 521
Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJ: 1002 ; clocks: 260
SrchBytes - Lingo: 1001 ; clocks: 92
Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJ: 1002 ; clocks: 254
SrchBytes - Lingo: 1001 ; clocks: 95
Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJ: 1002 ; clocks: 257
SrchBytes - Lingo: 1001 ; clocks: 96
Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJ: 1002 ; clocks: 257
SrchBytes - Lingo: 1001 ; clocks: 96
Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJ: 1002 ; clocks: 267
SrchBytes - Lingo: 1001 ; clocks: 98
Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJ: 1127625 ; clocks: 844498
SrchBytes - Lingo: 1127624 ; clocks: 510066
Press ENTER to exit...
Microsoft Windows [Version 6.0.6001]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.
c:\Users\Administrator>cd\china
c:\china>\masm32\bin\ml.exe /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 6.14.8444
Copyright (C) Microsoft Corp 1981-1997. All rights reserved.
Assembling: instrnew.asm
instrnew.asm(1315) : error A2008: syntax error : pshufd
instrnew.asm(1341) : error A2008: syntax error : xmm2
instrnew.asm(1348) : error A2008: syntax error : movdqa
instrnew.asm(1349) : error A2008: syntax error : movdqa
instrnew.asm(1495) : error A2008: syntax error : pshufd
instrnew.asm(1517) : error A2008: syntax error : xmm1
instrnew.asm(1518) : error A2008: syntax error : xmm4
instrnew.asm(1521) : error A2008: syntax error : movdqa
instrnew.asm(1527) : error A2008: syntax error : movdqa
instrnew.asm(1528) : error A2008: syntax error : xmm4
instrnew.asm(1529) : error A2008: syntax error : xmm2
instrnew.asm(1637) : error A2008: syntax error : pshufd
instrnew.asm(1638) : error A2008: syntax error : movdqu
instrnew.asm(1639) : error A2008: syntax error : movdqa
instrnew.asm(1314) : error A2006: undefined symbol : xmm0
instrnew.asm(1352) : error A2006: undefined symbol : xmm1
instrnew.asm(1353) : error A2006: undefined symbol : xmm1
instrnew.asm(1354) : error A2006: undefined symbol : xmm1
instrnew.asm(1355) : error A2006: undefined symbol : xmm1
instrnew.asm(1492) : error A2006: undefined symbol : xmm3
instrnew.asm(1531) : error A2006: undefined symbol : xmm1
instrnew.asm(1532) : error A2006: undefined symbol : xmm1
instrnew.asm(1533) : error A2006: undefined symbol : xmm4
instrnew.asm(1535) : error A2006: undefined symbol : xmm4
instrnew.asm(1536) : error A2006: undefined symbol : xmm2
instrnew.asm(1537) : error A2006: undefined symbol : xmm2
instrnew.asm(1636) : error A2006: undefined symbol : xmm3
instrnew.asm(1641) : error A2006: undefined symbol : xmm1
instrnew.asm(1642) : error A2006: undefined symbol : xmm2
instrnew.asm(1643) : error A2006: undefined symbol : xmm1
instrnew.asm(1644) : error A2006: undefined symbol : xmm2
c:\china>
replace ml.exe by 615 to continue.
c:\china>\masm32\bin\ml.exe /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 6.15.8803
Patched for you by promethee [ECL] in the year 2001 - enjoy
Copyright (C) Microsoft Corp 1981-2000. All rights reserved.
Assembling: instrnew.asm
instrnew.asm(1314) : error A2006: undefined symbol : xmm0
instrnew.asm(1315) : error A2006: undefined symbol : xmm0
instrnew.asm(1342) : error A2006: undefined symbol : xmm2
instrnew.asm(1349) : error A2006: undefined symbol : xmm2
instrnew.asm(1350) : error A2006: undefined symbol : xmm1
instrnew.asm(1352) : error A2006: undefined symbol : xmm1
instrnew.asm(1353) : error A2006: undefined symbol : xmm1
instrnew.asm(1354) : error A2006: undefined symbol : xmm1
instrnew.asm(1355) : error A2006: undefined symbol : xmm1
instrnew.asm(1492) : error A2006: undefined symbol : xmm3
instrnew.asm(1495) : error A2006: undefined symbol : xmm3
instrnew.asm(1518) : error A2006: undefined symbol : xmm1
instrnew.asm(1519) : error A2006: undefined symbol : xmm4
instrnew.asm(1522) : error A2006: undefined symbol : xmm2
instrnew.asm(1528) : error A2006: undefined symbol : xmm1
instrnew.asm(1529) : error A2006: undefined symbol : xmm4
instrnew.asm(1530) : error A2006: undefined symbol : xmm2
instrnew.asm(1531) : error A2006: undefined symbol : xmm1
instrnew.asm(1532) : error A2006: undefined symbol : xmm1
instrnew.asm(1533) : error A2006: undefined symbol : xmm4
instrnew.asm(1535) : error A2006: undefined symbol : xmm4
instrnew.asm(1536) : error A2006: undefined symbol : xmm2
instrnew.asm(1537) : error A2006: undefined symbol : xmm2
instrnew.asm(1636) : error A2006: undefined symbol : xmm3
instrnew.asm(1637) : error A2006: undefined symbol : xmm3
instrnew.asm(1638) : error A2006: undefined symbol : xmm1
instrnew.asm(1639) : error A2006: undefined symbol : xmm2
instrnew.asm(1641) : error A2006: undefined symbol : xmm1
instrnew.asm(1642) : error A2006: undefined symbol : xmm2
instrnew.asm(1643) : error A2006: undefined symbol : xmm1
instrnew.asm(1644) : error A2006: undefined symbol : xmm2
oh my fo!
jump to another way.
c:\china>"\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 8.00.50727.762
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: instrnew.asm
\masm32\include\windows.inc(16569) : error A2191: cannot include structure in se
lf
\masm32\include\windows.inc(16610) : error A2138: invalid data initializer
oh my masm32!
comment 'FPO_DATA struct' (16569) and 'IMPORT_OBJECT_HEADER struct' (16610) to continue.
c:\china>"\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 8.00.50727.762
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: instrnew.asm
c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
c:\china>instrnew.exe
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium Long HighAnsi MisMatch MisMatch Overall
Source Mainstr Mainstr Mainstr Mainstr mismatch mismatch sum of
Pattern Substr1 Substr2 Substr3 Substr4 mmStr_b mmStr_x cycles
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 33 33 77 253 22 121 539
InstrM32 198 231 330 418 176 407 1760
finstr 44 77 132 297 22 297 869
finstr2 44 66 121 209 33 198 671
finstr3 55 66 110 198 33 198 660
finstr4 44 77 132 319 22 319 913
finstr5 44 77 132 330 22 330 935
finstr6 55 66 110 209 33 198 671
InstrJJ 55 66 99 99 55 88 462
InstrJJw 44 55 88 110 44 121 462
Press any key to exit...
c:\china>
Quote from: lingo on May 13, 2009, 11:44:46 PM
I'm wondering why the last version of InString-JJ is so slow
for my Test 13 --Find 'Duplicate inc' in 'windows.inc' -
clocks: 844498 against clocs: 679641 of previous version
Latest version is a byte scanner again, so it runs an ultrafast inner SSE2 loop with D instead of Du - at the expense of many false hits that forces it to run slow code in the bsf loop (there are more than 18,000 uppercase D in windows.inc). Try replacing Duplicate with WARNING...
The new package (the one that produces the table) has both versions. InStringJJw is the word scanner explained here (http://www.masm32.com/board/index.php?topic=9370.msg84571#msg84571), while the first example of the more recent byte scanner version is at the bottom of that page. As the tables show, they are roughly equivalent. If the search string starts with a rare character, the byte scanner rocks, of course.
The version attached contains the file test. Try playing with the WindInc patterns :wink
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 36 36 264 12 120 1558560 468
InstrM32 192 240 408 168 384 3035448 1392
finstr 48 84 300 24 300 2806320 756
finstr2 48 60 204 24 204 2342100 540
finstr3 48 60 204 24 192 2353740 528
finstr4 36 72 312 24 312 3027360 756
finstr5 36 96 372 24 348 3315468 876
finstr6 48 60 204 24 204 2342112 540
InstrJJ 48 72 96 60 108 815556 384
InstrJJw 36 60 108 48 120 667620 372
* not including file test
[attachment deleted by admin]
c:\china>"C:\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff
instrnew.asm
Microsoft (R) Macro Assembler Version 8.00.50727.762
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: instrnew.asm
c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
c:\china>instrnew.exe
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InStrL 33 33 264 11 110 1577950 451
InstrM32 253 275 495 187 407 3063731 1617
finstr 44 88 297 22 297 2874828 748
finstr2 55 66 209 33 198 2409429 561
finstr3 44 66 198 33 187 2414093 528
finstr4 44 77 319 22 308 3129148 770
finstr5 44 77 330 22 330 3298768 803
finstr6 55 66 209 33 198 2410782 561
InstrJJ 44 66 99 55 110 808071 374
InstrJJw 33 55 99 44 121 687203 352
* not including file test
Sizes:
Mainstr 100
Substr1 4
Substr2 7
Substr4 9
mismatch 100
mmStr_b 5
mmStr_x 5
File 874032
WindInc 13 = Duplicate inc
Press any key to exit...
c:\china>
I can't start included exe from InstrNewSSE_FileTest.zip (see attachment)
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJb: 38 ; clocks: 86
InString - JJw: 38 ; clocks: 92
SrchBytes - Lingo: 37 ; clocks: 49
Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJb: 1008 ; clocks: 13628
InString - JJw: 1008 ; clocks: 11236
SrchBytes - Lingo: 1007 ; clocks: 11592
Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 303
InString - JJw: 1009 ; clocks: 658
SrchBytes - Lingo: 1008 ; clocks: 135
Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 3108
InString - JJw: 1009 ; clocks: 3634
SrchBytes - Lingo: 1008 ; clocks: 1544
Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 1822
InString - JJw: 1009 ; clocks: 2470
SrchBytes - Lingo: 1008 ; clocks: 1273
Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 303
InString - JJw: 1009 ; clocks: 640
SrchBytes - Lingo: 1008 ; clocks: 130
Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJb: 1010 ; clocks: 567
InString - JJw: 1010 ; clocks: 597
SrchBytes - Lingo: 1009 ; clocks: 523
Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJb: 1002 ; clocks: 265
InString - JJw: 1002 ; clocks: 326
SrchBytes - Lingo: 1001 ; clocks: 93
Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJb: 1002 ; clocks: 256
InString - JJw: 1002 ; clocks: 611
SrchBytes - Lingo: 1001 ; clocks: 96
Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJb: 1002 ; clocks: 257
InString - JJw: 1002 ; clocks: 609
SrchBytes - Lingo: 1001 ; clocks: 97
Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJb: 1002 ; clocks: 257
InString - JJw: 1002 ; clocks: 613
SrchBytes - Lingo: 1001 ; clocks: 96
Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJb: 1002 ; clocks: 265
InString - JJw: 1002 ; clocks: 771
SrchBytes - Lingo: 1001 ; clocks: 97
Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJb: 1127625 ; clocks: 845967
InString - JJw: 1127625 ; clocks: 665849
SrchBytes - Lingo: 1127624 ; clocks: 510086
Search Test 14 --Find 'warning or' in 'windows.inc' ;lenSrchPattern ->10
InString - JJb: 1127404 ; clocks: 362831
InString - JJw: 1127404 ; clocks: 665390
SrchBytes - Lingo: 1127403 ; clocks: 207779
Press ENTER to exit...
[attachment deleted by admin]
Quote from: lingo on May 14, 2009, 11:49:45 AM
I can't start included exe from InstrNewSSE_FileTest.zip (see attachment)
Strange. It's your own algo - I swear I didn't modify a single byte. Maybe a problem with Vista?
Your timings look good. When will you post code? I guess I can speak for everybody when I say we are curious :bg
on my 'Microsoft Windows Vista Ultimate 32bit english with Service Pack 1' is ok.
"Strange. It's your own algo - I swear I didn't modify a single byte."
I'm not a newbie to talk with me this way...
Your "old" instrings work for me OK (see the results.txt in every subdir)
Your last two tests do not work for me and I have no time to dig in
your macros.
UtillMasm,
I am with Vista Ultimate 64 bit with SP 1
[attachment deleted by admin]
oh my fo!
i have assume you with 32bit. :lol
Quote from: lingo on May 14, 2009, 01:28:27 PM
"Strange. It's your own algo - I swear I didn't modify a single byte."
I have no time to dig in your macros.
Just to avoid embarrassing copyright battles: the macros are JimG (TM). And they work on almost all PCs... :bg
Just to remove any misunderstanding, they are not trademarked, they are free for any use whatsoever, and the printing macros were mods of some code originally written by Phil. (wonder what ever happen to Phil?)
we don't support 64 bit OS's here at MASM32
MS doesn't support us - we don't have to support them :P
It was a bit tricky, but I managed to get Lingo's byte shift Boyer-Moore algo (http://www.masm32.com/board/index.php?topic=3601.msg83057#msg83057) on board:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InstrM32 192 240 408 168 384 3034608 1392
finstr2 48 60 204 24 216 2342292 552
finstr3 48 60 204 24 216 2355540 552
finstr6 48 60 204 24 204 2342484 540
InStrL 36 36 264 12 120 1557576 468
InstrJJ 48 72 108 60 96 815124 384
InstrJJw 36 60 108 48 120 665688 372
BMLinDB 0 0 168 0 0 941484 168
Note:
a) 0 cycles means the pattern was too short for Lingo's algo - minimum is 8 bytes.
b) Lingo's code had to be modified for purely technical reasons - the macros supplied by JimG/Phil require 2 args instead of 3.
Warning: The executable has no idea on which drive your \masm32\include\windows.inc sits. Therefore extract the exe's to the same drive as your Masm32 installation, and run it from there.
EDIT: Error check added. Watch out for e after the cycle count.
[attachment deleted by admin]
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (SSE4)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 229 0 0 0 0 0 229
InstrM32 128 162 324 111 306 2664793 1031
finstr2 26 43 187 18 136 1876070 410
finstr3 34 43 128 17 128 1918850 350
finstr6 26 43 145 17 128 1873902 359
InStrL 17 26 179 9 68 1328491 299
InstrJJ 26 34 51 34 43 651534 188
InstrJJw 17 43 68 26 85 498678 239
BMLinDB 0 0 102 0 0 847875 102
* not including file test
Sizes:
Mainstr 100
Substr1 4
Substr2 7
Substr4 9
mismatch 100
mmStr_b 5
mmStr_x 5
File 851328
WindInc 13 = Duplicate inc
c:\china>"\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 8.00.50727.762
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: instrnew.asm
c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
c:\china>instrnew.exe
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InstrM32 209 242 429 187 407 3089163 1474
finstr2 55 66 242 33 209 2425984 605
finstr3 44 66 198 33 187 2434069 528
finstr6 55 66 198 33 198 2423784 550
InStrL 44 33 253 11 121 1559305 462
InstrJJ 44 66 99 55 110 812658 374
InstrJJw 33 55 99 44 110 697653 341
BMLinDB -11 0 165 0 0 660748 154
* not including file test
Sizes:
Mainstr 100
Substr1 4
Substr2 7
Substr4 9
mismatch 100
mmStr_b 5
mmStr_x 5
File 874031
WindInc 13 = Duplicate inc
Press any key to exit...
c:\china>
Error check added, see attachment above. Watch out for e after the cycle count:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 0 0 0 0 0 264
InstrM32 192 216 408 168 384 3034320 1368
finstr2 48 72 204 24 204 2343300 552
finstr3 48 72 204 24 192 2350572 540
finstr6 48 72 204 24 204 2342664 552
InStrL 36 48 264 12 120 1558824e 480
InstrJJ 48 96 120 48 84 813828 396
InstrJJw 36 60 108 36 108 667296 348
BMLinDB 0e 132 168 0 0e 947220 300
* not including file test
Sizes:
Mainstr 100
Substr1 4
Substr2 9
Substr4 9
mismatch 100
mmStr_b 5
mmStr_x 5
File 849788
WindInc 13 = Duplicate inc
Note 0e is not an error - this algo was not designed for simple string searches and needs at least an 8-byte pattern. The overall cycle count is not valid in this case, of course.
c:\china>instrnew.exe
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 264 -11 -11 0 0 0 242
InstrM32 198 253 429 187 396 3053754 1463
finstr2 55 66 209 33 198 2415127 561
finstr3 44 66 198 22 187 2429416 517
finstr6 55 66 198 33 198 2424917 550
InStrL 33 44 253 11 121 1577477 462
InstrJJ 44 66 99 44 77 808027 330
InstrJJw 33 66 99 33 99 683859 330
BMLinDB 0e 121 165 -11 0e 649583 275
* not including file test
Sizes:
Mainstr 100
Substr1 4
Substr2 9
Substr4 9
mismatch 100
mmStr_b 5
mmStr_x 5
File 874029
WindInc 13 = Duplicate inc
Press any key to exit...
c:\china>
Quote from: UtillMasm on May 15, 2009, 07:11:39 AM
Genuine Intel(R) CPU T2400 @ 1.83GHz (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
InStrL 33 44 253 11 121 1577477 462
File 874029
Type Short Medium HighAnsi MisMatch MisMatch File Overall
InStrL 26 43 298 1 170 2512065e <------------------
File 849788
Strange. Which Masm32/Windows.inc version do you use? For my version, the algo produces an error...
Directory of D:\masm32\include
05.06.08 03:57 849,788 windows.inc
yes, because i comment two lines.
\masm32\include\windows.inc(16569)
\masm32\include\windows.inc(16610)
atachment is before comment.
[attachment deleted by admin]
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
Type Short Medium HighAnsi MisMatch MisMatch File Overall
Source Mainstr Mainstr Mainstr mismatch mismatch fbuffer sum of
Pattern Substr1 Substr2 Substr4 mmStr_b mmStr_x WindInc cycles*
======== ======== ======== ======== ======== ======== ======== ========
OVH 71 0 0 0 0 0 71
InstrM32 177 262 555 168 520 4635503 1682
finstr2 45 74 186 33 165 2235211 503
finstr3 47 71 179 33 158 2177821 488
finstr6 45 71 186 33 165 2232679 500
InStrL 30 36 197 21 109 1517858e 393
InstrJJ 61 80 108 59 116 1163430 424
InstrJJw 53 86 128 54 135 832986 456
BMLinDB 11e 125 199 11 11e 1446577 357
* not including file test
Sizes:
Mainstr 100
Substr1 4
Substr2 9
Substr4 9
mismatch 100
mmStr_b 5
mmStr_x 5
File 849788
WindInc 13 = Duplicate inc
New CPU with new instructions and results: :lol
comment * SrchBytes by Lingo *
align 16
SrchBytes proc lpBuffer:dword, lpSrchPatt:dword
mov edx, [esp+2*4]
mov eax, [esp+1*4]
movdqu xmm2, xmmword ptr[edx]
pxor xmm3, xmm3
@Loop:
pcmpistri xmm2, xmmword ptr [eax], 0Ch
jna @f
pcmpistri xmm2, xmmword ptr [eax+16], 0Ch
lea eax, [eax+32]
ja @Loop
lea ecx, [ecx-16]
@@:
lea eax, [eax+ecx]
jnc L_not_found
sub edx, eax
movd xmm7, eax
@@:
movdqu xmm1, xmmword ptr[eax + edx]
pcmpistrm xmm3, xmm1, 58h
movdqu xmm4, xmmword ptr[eax]
add eax, 16
pand xmm4, xmm0
pcmpistri xmm1, xmm4, 18h
ja @b
movd eax, xmm7
jnc L_found
add edx, eax
add eax, 1
jne @Loop
align 8
L_not_found:
or eax, -1
ret 2*4
align 8
L_found:
sub eax, [esp+1*4]
ret 2*4
SrchBytes endp
The results are:
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (SSE4)
Search Test 1 - value expected 37; lenSrchPattern ->22
InString - JJb: 38 ; clocks: 52
InString - JJw: 38 ; clocks: 60
SrchBytes - Lingo: 37 ; clocks: 23
Search Test 2 - value expected 1007; lenSrchPattern ->17
InString - JJb: 1008 ; clocks: 10432
InString - JJw: 1008 ; clocks: 7897
SrchBytes - Lingo: 1007 ; clocks: 1710
Search Test 3 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 208
InString - JJw: 1009 ; clocks: 284
SrchBytes - Lingo: 1008 ; clocks: 166
Search Test 4 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 1994
InString - JJw: 1009 ; clocks: 2183
SrchBytes - Lingo: 1008 ; clocks: 1409
Search Test 5 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 1284
InString - JJw: 1009 ; clocks: 1377
SrchBytes - Lingo: 1008 ; clocks: 1050
Search Test 6 - value expected 1008 ;lenSrchPattern ->16
InString - JJb: 1009 ; clocks: 205
InString - JJw: 1009 ; clocks: 289
SrchBytes - Lingo: 1008 ; clocks: 174
Search Test 7 - value expected 1009 ;lenSrchPattern ->14
InString - JJb: 1010 ; clocks: 296
InString - JJw: 1010 ; clocks: 281
SrchBytes - Lingo: 1009 ; clocks: 214
Search Test 8 - value expected 1001 ;lenSrchPattern ->1
InString - JJb: 1002 ; clocks: 175
InString - JJw: 1002 ; clocks: 186
SrchBytes - Lingo: 1001 ; clocks: 138
Search Test 9 - value expected 1001 ;lenSrchPattern ->2
InString - JJb: 1002 ; clocks: 181
InString - JJw: 1002 ; clocks: 261
SrchBytes - Lingo: 1001 ; clocks: 158
Search Test 10 - value expected 1001 ;lenSrchPattern ->3
InString - JJb: 1002 ; clocks: 180
InString - JJw: 1002 ; clocks: 264
SrchBytes - Lingo: 1001 ; clocks: 151
Search Test 11 - value expected 1001 ;lenSrchPattern ->4
InString - JJb: 1002 ; clocks: 176
InString - JJw: 1002 ; clocks: 269
SrchBytes - Lingo: 1001 ; clocks: 159
Search Test 12 - value expected 1001 ;lenSrchPattern ->5
InString - JJb: 1002 ; clocks: 188
InString - JJw: 1002 ; clocks: 264
SrchBytes - Lingo: 1001 ; clocks: 171
Search Test 13 --Find 'Duplicate inc' in 'windows.inc' ;lenSrchPattern ->13
InString - JJb: 1127625 ; clocks: 550930
InString - JJw: 1127625 ; clocks: 249918
SrchBytes - Lingo: 1127624 ; clocks: 202321
Search Test 14 --Find 'warning or' in 'windows.inc' ;lenSrchPattern ->10
InString - JJb: 1127404 ; clocks: 196414
InString - JJw: 1127404 ; clocks: 250335
SrchBytes - Lingo: 1127403 ; clocks: 171055
Press ENTER to exit...
Since I fully trust my old friend Lingo, I double-clicked the exe - and wow, that's probably called a "SSE4.2 GPF" :green2