News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

InstrCi - case-insensitive string search

Started by jj2007, June 15, 2008, 10:09:46 PM

Previous topic - Next topic

dedndave

all good ideas
i agree with fearless - some of the threads require a lot of reading that does not pertain to what you want to find
perhaps we could start by outlining some goals and making a list of which members have which CPU's
that way, when a new piece of code is introduced, we can get it tested on several platforms, then hash over the optimization
it would give all members a chance to contribute - acknowledging that a few members are really good at it - some not so good (like me - lol)
along the way, we all get to learn tricks and tips
i know some old stuff that needs to be modernized by the optimizing wizards that hang out in here
but, for me, it will be mostly a learning experience

jj2007

One more element to the never ending story: I tried another mainstring/pattern with otherwise identical settings.

Result: For the "slow" setting, execution is around half as fast and cycle counts are very erratic, but they never reach 1000-1200 cycles more as with the long mainstring. Which means a proportional pattern, i.e. the algo just runs with breaks on. The fast version runs very stable at 325 cycles.

Intel(R) Celeron(R) M CPU 420
Average cycle counts:

Slow version:
600-800     InstrJJ

Fast version:
325      InstrJJ

Jimg

Quote from: jj2007 on May 12, 2009, 06:35:33 PM
Quote from: Jimg on May 12, 2009, 02:56:44 PM

Here's a sample.  I was getting rock solid repeatability with only ten loops.

My 2 cts - Celeron M. It took me a while because it choked with ml 9.0 ;-)
Well organised table, by the way :U

Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mmword
======= ======== ======== ======== ======== ========
OVH          264        0        0        0        0
InStrL        96      132      348      468     1152
finstr        96      288      432      552     1500
finstr2      144      216      372      372     1608
finstr3      144      204      348      348     1512
finstr4       84      300      456      588     1500
finstr5       84      324      456      612     1488
finstr6      144      216      360      372     1608


So where's instrjj ?

fearless

Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mmword
======= ======== ======== ======== ======== ========
OVH          229        0        0        0        0
InStrL        51       94      383      340      842
finstr        51      153      238      281      816
finstr2      111      145      264      255      952
finstr3      111      153      247      238      986
finstr4       51      204      289      417      799
finstr5       60      187      272      383      808
finstr6      111      145      264      255      952
ƒearless

dedndave

fearless CPU: Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz (SSE4)
nice numbers
i like the core 2 quad q9550, i think


here are mine for a prescott - here we go with the funny minus signs, again
i hope some of the other numbers are wrong, as well - lol

Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mmword
======= ======== ======== ======== ======== ========
OVH          555      -30      -15      -30      -23
InStrL       120      232      577      855     2482
finstr       142      405      578      750     2295
finstr2      158      337      750      690     2670
finstr3      120      412      712      735     2745
finstr4       82      420      712      893     2175
finstr5      127      487      773     1005     2408
finstr6      158      360      600      720     2760

NightWare

Quote from: jj2007 on May 12, 2009, 11:55:25 AM
Sounds plausible, but can an executable with only 28k overlap with a cache line boundary if the L1 cache is 32k? Is that the explanation?? If I understand the article correctly, two cache lines may be allocated, fine. But only once, or for every call, costing 1000 cycles??
hmm, it is an alignment problem, but not the "normal alignment", not directly... a pentium M can't deal with more than 4 uops in one operation (it's under the abilities of a P3), with this "skill" (i see lingo laughting here...), a single byte can change lot of things (especially with recent instructions sets)...

Quote from: jj2007 on May 12, 2009, 07:01:53 PM
Standardisation could help a lot - Michael's timer macros are a fantastic achievement, but we might add some rules, e.g. how to pass parameters (we all know that fastcall is a few cycles faster...). You may have seen the timings macro in my posts above, I have done that one to suit my own laziness, but I believe we could work out something along those lines. Not to mention LAMPs :wink
hmm, look like a racism against real coders  :'(

Quote from: fearless on May 12, 2009, 06:21:20 PM
All this in one place would be a very handy resource. Some threads are massive and/or there are multiple threads for similar topics.
it's not as simple, periodically someone found another way to do stuff, it's also very often that algos don't do exactly the same things... or don't use the same instructions sets... etc...  for example, in my case (as french), a BM algo or even a case insensitive algo are useless, why ? because with all the accents, etc... (and possible errors linked to this) it can't give satisfying results. so i have to use something else (i post it coz it can be usefull on others languages, however you must correct the values of the table, depending of your language.) :
.DATA
;
; tables pour obtenir les caractères majuscules équivalents (fait 0.25 ko, cette table est utilisée pour réduire la recherche/comparaison aux
; caractères majuscules basiques, pour tenir compte des erreurs concernant les accents, trémas, cédilles, etc...)
;
_Search_Table_ BYTE 000,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015
BYTE 016,017,018,019,020,021,022,023,024,025,026,027,028,029,030,031
; le jeu de caractères classique :
BYTE 032,033,034,035,036,037,038,039,040,041,042,043,044,045,046,047
BYTE 048,049,050,051,052,053,054,055,056,057,058,059,060,061,062,063
BYTE 064,065,066,067,068,069,070,071,072,073,074,075,076,077,078,079
BYTE 080,081,082,083,084,085,086,087,088,089,090,091,092,093,094,095
BYTE 096,065,066,067,068,069,070,071,072,073,074,075,076,077,078,079
BYTE 080,081,082,083,084,085,086,087,088,089,090,123,124,125,126,127
; le jeu de caractères étendu :
BYTE 128,129,130,131,132,133,134,135,136,137,83,139,140,141,90,143
BYTE 144,145,146,147,148,149,150,151,152,153,83,155,140,157,90,89
BYTE 160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175
BYTE 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
BYTE 65,65,65,65,65,65,198,67,69,69,69,69,73,73,73,73
BYTE 68,78,79,79,79,79,79,215,216,85,85,85,85,89,222,223
BYTE 65,65,65,65,65,65,198,67,69,69,69,69,73,73,73,73
BYTE 68,78,79,79,79,79,79,247,216,85,85,85,85,89,222,89

.CODE
ALIGN 16
;
; trouver une chaine de caractères dans une chaîne/un texte ascii (la première occurrence)
; note : insensible à la casse, accents compris... attention, ne teste pas si la position est valide
;
; syntax :
; mov eax,starting position of the search (in the string to scan)
; mov esi,OFFSET (address of the string to find)
; mov edi,OFFSET (address of the string to scan)
; call  FindNextString
;
; Return :
; eax = -1 if the string isn't found, otherwise eax has the position of the string
;
FindNextString PROC
push ebx ;; empiler ebx
push ecx ;; empiler ecx
push edx ;; empiler edx

add eax,edi ;; on ajoute la position à l'adresse de la chaîne
; on stocke le premier octet à trouver
movzx ebx,BYTE PTR [esi] ;; placer le premier octet à trouver dans bl
mov bl,BYTE PTR [_Search_Table_+ebx] ;; placer la majuscule de la table dans bl
dec eax ;; enlever 1 à l'adresse ( cause de la ligne qui suit...)
; maintenant, recherche du premier caractère identique
Label1: inc eax ;; ajouter 1 à l'adresse de loctet à comparer
movzx edx,BYTE PTR [eax] ;; lire l'octet à comparer, et le placer dans dl
test edx,edx ;; fixer les flags de dl
jz Label5 ;; si c'est égal à 0, aller Label5
mov dl,BYTE PTR [_Search_Table_+edx] ;; sinon, placer la majuscule de la table dans dl
cmp dl,bl ;; comparer ensuite à la majuscule à trouver
jne Label1 ;; si c'est différent, on continu la recherche, aller Label1
; sinon, on teste les caractères suivants
Label2: xor ecx,ecx ;; effacer ecx
Label3: inc ecx ;; augmenter ecx d'1
movzx edx,BYTE PTR [esi+ecx] ;; placer l'octet suivant dans dl
test edx,edx ;; fixer les flags de edx
jz Label4 ;; si c'est égal à 0, aller Label4
mov bh,BYTE PTR [_Search_Table_+edx] ;; sinon, placer la majuscule de la table dans bh
movzx edx,BYTE PTR [eax+ecx] ;; lire l'octet suivant à comparer
mov dl,BYTE PTR [_Search_Table_+edx] ;; placer la majuscule de la table dans dl
cmp dl,bh ;; comparer ensuite à la majuscule à trouver
je Label3 ;; si c'est égal à 0, aller Label3
jmp Label1 ;; aller Label1
; ici la chaîne est trouvée, alors on defini eax qui contiendra l'adresse
Label4: sub eax,edi ;; on soustrait l'adresse de la chaîne pour obtenir la position trouvée dans la chaîne

pop edx ;; désempiler edx
pop ecx ;; désempiler ecx
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
; chaîne non trouvée
Label5: mov eax,-1 ;; sinon on met -1 (chaîne non trouvée dans eax)

pop edx ;; désempiler edx
pop ecx ;; désempiler ecx
pop ebx ;; désempiler ebx
ret ;; retourner (sortir de la procédure)
FindNextString ENDP


dedndave

Quotehmm, look like a racism against real coders   :'(

i understand what you mean by this, but you can't tell me you learned everything out of a book
somehwere along the line, you learned from someone else, also
i considered myself to be a pretty fair programmer in 16-bit
not that i was a pro, but i have done a lot of math that many cannot do
i'll show you mine if you show me yours   :lol

jj2007

Quote from: NightWare on May 13, 2009, 12:38:19 AM
a single byte can change lot of things (especially with recent instructions sets)...
Algo is unchanged, code start is 16-byte aligned in both cases. Just 16 bytes further up, and it slows down by 40% and timings get erratic...

Quote
Quote from: jj2007 on May 12, 2009, 07:01:53 PM
Standardisation could help a lot - Michael's timer macros are a fantastic achievement, but we might add some rules, e.g. how to pass parameters (we all know that fastcall is a few cycles faster...). You may have seen the timings macro in my posts above, I have done that one to suit my own laziness, but I believe we could work out something along those lines. Not to mention LAMPs :wink
hmm, look like a racism against real coders  :'(

No racism intended. We want a testbed that makes algos comparable. Usually you do that with macros that allow a syntax of this type
timings crt_strstr, addr Mainstr, addr TestSubD
timings InstrCi, 1, addr Mainstr, addr TestSubE, 0
timings BMLinDB, addr Mainstr, addr TestSubEx, (SIZEOF TestSubEx)-1

where each line replaces
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke crt_strstr, addr Mainstr, addr TestSubD
counter_end
print str$(eax), 9, "cycles for rep stosd", 13, 10

Now it gets messy if one programmer says "my algo needs Mainstr in eax and Pattern in ecx, only then it will be really fast".

jj2007

Quote from: Jimg on May 12, 2009, 09:27:29 PM

So where's instrjj ?


Here it is. I had to change quite a number of things to make it compatible ;-)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          264        0        0        0        0
InStrL        84      132      360      468     1152
finstr        96      288      432      552     1500
finstr2      144      216      372      372     1608
finstr3      144      204      348      348     1512
finstr4       84      324      456      624     1500
finstr5       84      324      456      612     1488
finstr6      156      216      360      372     1608
InstrJJ       84      120      204      132     1536



[attachment deleted by admin]

dedndave

i find this very disturbing
i don't know what's going on, here
i get very different numbers from everyone else on my prescott
also, i get very different numbers if i redirect the output to a text file
and different numbers again if i open it with Windows Explorer
i am running XP MCE2005 SP2 (the OS shouldn't make much difference - who knows)

compare these results with the ones i pasted a few posts back ^

here, i open a command window and redirect the output with "instrnew>test.txt"
i get similar results by running it without redirection (some test programs, i get different results by redirecting)

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          487        0        0        0        0
InStrL        91      203      495      750     1433
finstr        90      278      420      511     1418
finstr2      105      270      675      540     1695
finstr3      105      285      548      540     1718
finstr4       98      285      473      638     1343
finstr5      105      353      623      705     1403
finstr6      105      270      428      533     1696
InstrJJ      188      180      451      248     2888

on these two, i ran it by clicking on it in Windows Explorer
notice the inconsistencies, especially the minus signs

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          517       -7        8        0       -7
InStrL       143      248      555      923     2303
finstr       158      443      675      788     2408
finstr2      188      398      668      765     2738
finstr3      188      413      818      736     2783
finstr4      188      450      908      938     2198
finstr5      188      533      781     1035     2258
finstr6      188      390      668      728     2775
InstrJJ      278      300      841      391     4531

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          495       23       15       23       22
InStrL       218      270      653      915     2400
finstr       180      443      622      780     2467
finstr2      232      405      773      795     2813
finstr3      202      458      780      802     2850
finstr4      173      488      892      960     2220
finstr5      165      548      787     1072     2288
finstr6      180      412      690      780     2745
InstrJJ      285      322      698      397     4417

jj2007

Quote from: dedndave on May 13, 2009, 07:09:24 AM
i get very different numbers from everyone else on my prescott

Just launched my Prescott:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          459       17       17       17       -9
InStrL       110      229      382      535     1402
finstr       119      263      399      519     1428
finstr2      127      280      450      561     1708
finstr3      127      289      459      569     1725
finstr4      110      323      527      620     1360
finstr5      119      340      544      722     1428
finstr6      127      297      450      552     1708
InstrJJ      196      187      467      280     2661


This is with TimesToLoop equ 10000. With only 10, timings are pretty erratic...

sinsi

I tried the redirection thing and got exactly the same numbers as from explorer (no joke, they all matched).


Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          252        0        0        0        0
InStrL        63      108      270      360      882
finstr        54      180      243      297      819
finstr2      108      171      288      261      954
finstr3      117      180      261      252      954
finstr4       54      216      279      387      810
finstr5       45      216      279      387      810
finstr6      108      171      288      261      954
InstrJJ       63       72      135      126     1323
Light travels faster than sound, that's why some people seem bright until you hear them.

fearless

Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz (SSE
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          229        0        0        0        0
InStrL        51      102      187      340      842
finstr        51      153      238      281      808
finstr2      102      145      247      255      952
finstr3      102      162      255      247      986
finstr4       51      213      272      417      799
finstr5       43      196      264      374      808
finstr6      111      145      247      255      961
InstrJJ       60       68      136       77     1369
ƒearless

dedndave

lol - fearless gets such nice numbers with that CPU
i hafta go buy one  :U

i have been playing with selecting cores on a multi-core machine
i have tried to keep in mind that, in some cases, the cores may not be the same CPU type, especially in the case of overdrive chips
it seems that, if they have an overdrive chip, you have to use SetProcessAffinityMask to tell the system which core you want to look at
this includes the CPUID - how else would you select which processor to run CPUID, right ?
if they have different cores, you may have to run the CPUID routine more than once, right?
(also, for certain non-Intel/non-AMD CPU's, you have to enable the CPUID instruction)
you have to use SetProcessAffinityMask on a multi-core for the RDTSC instruction, also

now, here are a couple things that i have not seen in any of the test code that i have looked at to date

in order to use Get/SetProcessAffinityMask to control core selection.....

1) use GetSecurityInfo to see ...  (of course, there is no documented "SetSecurityInfo" function - :P )

      to use GetProcessAffinityMask on XP or Windows server 2003, you must have:
      PROCESS_QUERY_INFORMATION access right
      to use GetProcessAffinityMask on other OS's, you must have:
      PROCESS_QUERY_INFORMATION or PROCESS_QUERY_LIMITED_INFORMATION access right

      to use SetProcessAffinityMask (to manipulate core selection), you must have:
      PROCESS_SET_INFORMATION access right

2) to use SetProcessAffinityMask (to manipulate core selection)
     use SetProcessAffinityUpdateMode to set PROCESS_AFFINITY_ENABLE_AUTO_UPDATE

     The system can adjust process affinity under various conditions, such as when a processor is added dynamically. By default, dynamic updates to the process affinity are disabled for each process.

Processes should use this function to indicate whether they can handle dynamic adjustment of process affinity by the system. After a process enables affinity update mode, it can call this function to disable it. However, a process cannot enable affinity update mode after it has used this function to disable it.

in other words, i don't think anyone is looking at the return values when they try to control core selection for RDTSC
if they were, they should get an error condition, no ?
at any rate, it should be verifiable by using GetProcessAffinityMask to see if they have selected a single core to read the counter
also, for a complete CPU ID routine, you need to ID the OS to see if you can even read the GetProcessAffinityMask value
and, you need to be able to control SetProcessAffinityMask in order to specify which core the CPUID instruction is executing on


running a test on a single core machine - easy
multi-core - not so easy

UtillMasm

c:\china>"C:\Program Files\Microsoft SDKs\Windows\v6.0\VC\Bin\ml.exe" /c /coff instrnew.asm
Microsoft (R) Macro Assembler Version 8.00.50727.762
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: instrnew.asm

c:\china>\masm32\bin\link.exe /subsystem:console instrnew.obj
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.


c:\china>instrnew.exe
Genuine Intel(R) CPU           T2400  @ 1.83GHz (SSE3)
Type       Short   Medium     Long HighAnsi MisMatch
Source   Mainstr  Mainstr  Mainstr  Mainstr mismatch
Pattern  Substr1  Substr2  Substr3  Substr4   mXword
======= ======== ======== ======== ======== ========
OVH          264        0        0        0        0
InStrL        88      132      352      462     1155
finstr        88      286      429      550     1507
finstr2      143      209      363      363     1617
finstr3      143      209      352      352     1507
finstr4       88      308      451      594     1496
finstr5       88      319      462      605     1496
finstr6      143      209      363      363     1617
InstrJJ       88      121      209      132     1518

Press enter to exit...

c:\china>