News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Compare two strings

Started by yvansoftware, March 30, 2010, 07:40:20 PM

Previous topic - Next topic

clive

What about a binary file that contains some zero binary bytes and
those bytes are part of the string we are comparing?


They are NOT strings (PASCAL, C, '$' terminated), they are simply collections of bytes.

Yes, you want to do a memcmp(), not a strcmp().
It could be a random act of randomness. Those happen a lot as well.

jj2007

Strings are null-terminated, unless the contrary is explicitly stated. But that's not the problem.
Imagine you are comparing files. You have allocated two fat memory buffers with VirtualAlloc. You are loading two files into this buffer, and run the routine without checking the nullbytes. The following scenarios are possible:

- you know the string length, and stop comparing when the end is reached
- you do not know the string length, and stop comparing when the strings differ
- you do not know the string length, and the files are equal

In the latter case, you stop comparing with an access violation at the end of the buffer.

clive

#122
http://www.masm32.com/board/index.php?topic=13701.msg113195#msg113195
StrCompCheckNull.zip

Venice

AMD Athlon(tm) 64 Processor 3400+ (SSE3)
String comparison: short string 10 bytes, long string 5050
2883    cycles for SSE with null check, long string
3520    cycles for SSE with null check, long string, case-insensitive
3964    cycles for Lingo, long string with null check
56656   cycles for Frank, long string
54710   cycles for Rockoon, long string
61931   cycles for RockoonJ, long string
54668   cycles for RockoonJ2, long string
54667   cycles for RockoonJ3, long string
5061    cycles for RockoonCNB, long string, check nullbyte
10149   cycles for crt_strcmp, long string
20218   cycles for crt__stricmp, long string, case-insensitive
106631  cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
46      cycles for SSE with null check, 10 bytes, case-insensitive
13      cycles for Lingo, 10 bytes with null check
7       cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
8       cycles for RockoonJ, 10 bytes
8       cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
13      cycles for RockoonCNB, 10 bytes, check nullbyte
524     cycles for lstrcmp, 10 bytes

2888    cycles for SSE with null check, long string
3519    cycles for SSE with null check, long string, case-insensitive
3975    cycles for Lingo, long string with null check
56400   cycles for Frank, long string
55030   cycles for Rockoon, long string
61608   cycles for RockoonJ, long string
54774   cycles for RockoonJ2, long string
55538   cycles for RockoonJ3, long string
5169    cycles for RockoonCNB, long string, check nullbyte
10234   cycles for crt_strcmp, long string
20625   cycles for crt__stricmp, long string, case-insensitive
108163  cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
43      cycles for SSE with null check, 10 bytes, case-insensitive
8       cycles for Lingo, 10 bytes with null check
-12     cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
2       cycles for RockoonJ, 10 bytes
14      cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
13      cycles for RockoonCNB, 10 bytes, check nullbyte
558     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
RockoonJ3:      33
RockoonCNB:     84
SSE:    277     (MasmBasic)
--- ok ---


NewCastle

AMD Athlon(tm) 64 Processor 3200+ (SSE2)
String comparison: short string 10 bytes, long string 5050
2920    cycles for SSE with null check, long string
3541    cycles for SSE with null check, long string, case-insensitive
3985    cycles for Lingo, long string with null check
91701   cycles for Frank, long string
57463   cycles for Rockoon, long string
61822   cycles for RockoonJ, long string
57433   cycles for RockoonJ2, long string
57502   cycles for RockoonJ3, long string
5104    cycles for RockoonCNB, long string, check nullbyte
10184   cycles for crt_strcmp, long string
20326   cycles for crt__stricmp, long string, case-insensitive
107065  cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
46      cycles for SSE with null check, 10 bytes, case-insensitive
19      cycles for Lingo, 10 bytes with null check
6       cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
8       cycles for RockoonJ, 10 bytes
8       cycles for RockoonJ2, 10 bytes
9       cycles for RockoonJ3, 10 bytes
15      cycles for RockoonCNB, 10 bytes, check nullbyte
528     cycles for lstrcmp, 10 bytes

2925    cycles for SSE with null check, long string
3542    cycles for SSE with null check, long string, case-insensitive
4111    cycles for Lingo, long string with null check
57906   cycles for Frank, long string
57502   cycles for Rockoon, long string
61841   cycles for RockoonJ, long string
57448   cycles for RockoonJ2, long string
57577   cycles for RockoonJ3, long string
5143    cycles for RockoonCNB, long string, check nullbyte
10188   cycles for crt_strcmp, long string
20411   cycles for crt__stricmp, long string, case-insensitive
107784  cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
47      cycles for SSE with null check, 10 bytes, case-insensitive
14      cycles for Lingo, 10 bytes with null check
6       cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
8       cycles for RockoonJ, 10 bytes
8       cycles for RockoonJ2, 10 bytes
13      cycles for RockoonJ3, 10 bytes
15      cycles for RockoonCNB, 10 bytes, check nullbyte
528     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
RockoonJ3:      33
RockoonCNB:     84
SSE:    277     (MasmBasic)
--- ok ---


Atom

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
String comparison: short string 10 bytes, long string 5050
6787    cycles for SSE with null check, long string
7194    cycles for SSE with null check, long string, case-insensitive
13367   cycles for Lingo, long string with null check
131992  cycles for Frank, long string
95902   cycles for Rockoon, long string
104444  cycles for RockoonJ, long string
106118  cycles for RockoonJ2, long string
70027   cycles for RockoonJ3, long string
10839   cycles for RockoonCNB, long string, check nullbyte
18223   cycles for crt_strcmp, long string
23639   cycles for crt__stricmp, long string, case-insensitive
222617  cycles for lstrcmp, long string

73      cycles for SSE with null check, 10 bytes
70      cycles for SSE with null check, 10 bytes, case-insensitive
33      cycles for Lingo, 10 bytes with null check
23      cycles for Frank, 10 bytes
22      cycles for Rockoon, 10 bytes
20      cycles for RockoonJ, 10 bytes
30      cycles for RockoonJ2, 10 bytes
26      cycles for RockoonJ3, 10 bytes
29      cycles for RockoonCNB, 10 bytes, check nullbyte
913     cycles for lstrcmp, 10 bytes

4723    cycles for SSE with null check, long string
5298    cycles for SSE with null check, long string, case-insensitive
12200   cycles for Lingo, long string with null check
130570  cycles for Frank, long string
96277   cycles for Rockoon, long string
97053   cycles for RockoonJ, long string
98298   cycles for RockoonJ2, long string
68774   cycles for RockoonJ3, long string
9803    cycles for RockoonCNB, long string, check nullbyte
17967   cycles for crt_strcmp, long string
24884   cycles for crt__stricmp, long string, case-insensitive
216194  cycles for lstrcmp, long string

57      cycles for SSE with null check, 10 bytes
71      cycles for SSE with null check, 10 bytes, case-insensitive
38      cycles for Lingo, 10 bytes with null check
27      cycles for Frank, 10 bytes
29      cycles for Rockoon, 10 bytes
24      cycles for RockoonJ, 10 bytes
34      cycles for RockoonJ2, 10 bytes
30      cycles for RockoonJ3, 10 bytes
35      cycles for RockoonCNB, 10 bytes, check nullbyte
930     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
RockoonJ3:      33
RockoonCNB:     84
SSE:    277     (MasmBasic)
--- ok ---
It could be a random act of randomness. Those happen a lot as well.

KeepingRealBusy

Clive,

For this following test, which ZiP did you use?

Dave.

Quote from: clive on June 18, 2010, 11:54:59 PM
AMD Athlon(tm) 64 Processor 3400+ (SSE3)
String comparison: short string 10 bytes, long string 5050
2883    cycles for SSE with null check, long string
3520    cycles for SSE with null check, long string, case-insensitive
3964    cycles for Lingo, long string with null check
56656   cycles for Frank, long string
54710   cycles for Rockoon, long string
61931   cycles for RockoonJ, long string
54668   cycles for RockoonJ2, long string
54667   cycles for RockoonJ3, long string
5061    cycles for RockoonCNB, long string, check nullbyte
10149   cycles for crt_strcmp, long string
20218   cycles for crt__stricmp, long string, case-insensitive
106631  cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
46      cycles for SSE with null check, 10 bytes, case-insensitive
13      cycles for Lingo, 10 bytes with null check
7       cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
8       cycles for RockoonJ, 10 bytes
8       cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
13      cycles for RockoonCNB, 10 bytes, check nullbyte
524     cycles for lstrcmp, 10 bytes

2888    cycles for SSE with null check, long string
3519    cycles for SSE with null check, long string, case-insensitive
3975    cycles for Lingo, long string with null check
56400   cycles for Frank, long string
55030   cycles for Rockoon, long string
61608   cycles for RockoonJ, long string
54774   cycles for RockoonJ2, long string
55538   cycles for RockoonJ3, long string
5169    cycles for RockoonCNB, long string, check nullbyte
10234   cycles for crt_strcmp, long string
20625   cycles for crt__stricmp, long string, case-insensitive
108163  cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
43      cycles for SSE with null check, 10 bytes, case-insensitive
8       cycles for Lingo, 10 bytes with null check
-12     cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
2       cycles for RockoonJ, 10 bytes
14      cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
13      cycles for RockoonCNB, 10 bytes, check nullbyte
558     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
RockoonJ3:      33
RockoonCNB:     84
SSE:    277     (MasmBasic)
--- ok ---


clive

It could be a random act of randomness. Those happen a lot as well.

KeepingRealBusy


frktons

Quote from: jj2007 on June 18, 2010, 11:29:51 PM
Strings are null-terminated, unless the contrary is explicitly stated. But that's not the problem.
Imagine you are comparing files. You have allocated two fat memory buffers with VirtualAlloc. You are loading two files into this buffer, and run the routine without checking the nullbytes. The following scenarios are possible:

- you know the string length, and stop comparing when the end is reached
- you do not know the string length, and stop comparing when the strings differ
- you do not know the string length, and the files are equal

In the latter case, you stop comparing with an access violation at the end of the buffer.

When I read files, I have some chances to know the number of bytes I'm reading into
memory, so I can compare just that number of bytes or create by myself the sequence
of bytes that tell me I'm on the EOF.

As Clive pointed out, I'm going to do a memcmp() in that case, and it could be one
possible scenario for the use of this kind of comparing.

If they are strings, the NULL check is mandatory, being it the de facto EOF.
By the way, I assume you are getting those results with the methods tested
just because this rule is not explicit, so there are big differences due to this
simple fact, not because the algos are that much smarter.  :P
Mind is like a parachute. You know what to do in order to use it :-)

KeepingRealBusy

Here is my time for my dekstop (development) system:


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
String comparison: short string 10 bytes, long string 5050
2863    cycles for SSE with null check, long string
6233    cycles for SSE with null check, long string, case-insensitive
3948    cycles for Lingo, long string with null check
60190   cycles for Frank, long string
56336   cycles for Rockoon, long string
63878   cycles for RockoonJ, long string
55673   cycles for RockoonJ2, long string
54345   cycles for RockoonJ3, long string
5029    cycles for RockoonCNB, long string, check nullbyte
12533   cycles for crt_strcmp, long string
20673   cycles for crt__stricmp, long string, case-insensitive
101896  cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
55      cycles for SSE with null check, 10 bytes, case-insensitive
13      cycles for Lingo, 10 bytes with null check
6       cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
8       cycles for RockoonJ, 10 bytes
-6      cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
13      cycles for RockoonCNB, 10 bytes, check nullbyte
591     cycles for lstrcmp, 10 bytes

2864    cycles for SSE with null check, long string
3495    cycles for SSE with null check, long string, case-insensitive
6455    cycles for Lingo, long string with null check
56777   cycles for Frank, long string
60430   cycles for Rockoon, long string
65005   cycles for RockoonJ, long string
54752   cycles for RockoonJ2, long string
60138   cycles for RockoonJ3, long string
5272    cycles for RockoonCNB, long string, check nullbyte
10029   cycles for crt_strcmp, long string
20091   cycles for crt__stricmp, long string, case-insensitive
95502   cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
46      cycles for SSE with null check, 10 bytes, case-insensitive
13      cycles for Lingo, 10 bytes with null check
6       cycles for Frank, 10 bytes
7       cycles for Rockoon, 10 bytes
8       cycles for RockoonJ, 10 bytes
8       cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
13      cycles for RockoonCNB, 10 bytes, check nullbyte
504     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
RockoonJ3:      33
RockoonCNB:     84
SSE:    277     (MasmBasic)
--- ok ---


And here is my timing for my laptop (Internet):


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
String comparison: short string 10 bytes, long string 5050
5463    cycles for SSE with null check, long string
8446    cycles for SSE with null check, long string, case-insensitive
9214    cycles for Lingo, long string with null check
77152   cycles for Frank, long string
67687   cycles for Rockoon, long string
66298   cycles for RockoonJ, long string
69855   cycles for RockoonJ2, long string
66412   cycles for RockoonJ3, long string
9748    cycles for RockoonCNB, long string, check nullbyte
14225   cycles for crt_strcmp, long string
33251   cycles for crt__stricmp, long string, case-insensitive
135879  cycles for lstrcmp, long string

16      cycles for SSE with null check, 10 bytes
23      cycles for SSE with null check, 10 bytes, case-insensitive
37      cycles for Lingo, 10 bytes with null check
43      cycles for Frank, 10 bytes
6       cycles for Rockoon, 10 bytes
16      cycles for RockoonJ, 10 bytes
5       cycles for RockoonJ2, 10 bytes
16      cycles for RockoonJ3, 10 bytes
11      cycles for RockoonCNB, 10 bytes, check nullbyte
695     cycles for lstrcmp, 10 bytes

5105    cycles for SSE with null check, long string
9337    cycles for SSE with null check, long string, case-insensitive
8719    cycles for Lingo, long string with null check
80502   cycles for Frank, long string
67870   cycles for Rockoon, long string
65171   cycles for RockoonJ, long string
66735   cycles for RockoonJ2, long string
66174   cycles for RockoonJ3, long string
9802    cycles for RockoonCNB, long string, check nullbyte
14903   cycles for crt_strcmp, long string
34773   cycles for crt__stricmp, long string, case-insensitive
132146  cycles for lstrcmp, long string

22      cycles for SSE with null check, 10 bytes
32      cycles for SSE with null check, 10 bytes, case-insensitive
18      cycles for Lingo, 10 bytes with null check
16      cycles for Frank, 10 bytes
20      cycles for Rockoon, 10 bytes
13      cycles for RockoonJ, 10 bytes
10      cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
18      cycles for RockoonCNB, 10 bytes, check nullbyte
648     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
RockoonJ3:      33
RockoonCNB:     84
SSE:    277     (MasmBasic)
--- ok ---


frktons

Here it is a new approach to string comparing.

First difference: a parallel forward and backward scan of the strings.
This should be faster because if there is a difference toward the end
or near the beginning of the string it'll be detected without scanning
the whole.
Second difference: it checks all the spare bytes the previous version
did not.
I don't know if anyone has already used this approach, but I think it could
be useful in various occasions.

The routine doesn't return the position of the different bytes, nor anything else.
It is just a quick code, without any optimization in mind.

This is just to show the method.


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
1172    cycles for Parallel scan, 2600 bytes

--- ok ---


If anyone thinks it is worth the effort, optimize it, customize it, and enjoy.

Frank
Mind is like a parachute. You know what to do in order to use it :-)

dedndave

to do it that way, seems like you have to know the string lengths, first   :P

frktons

Quote from: dedndave on June 20, 2010, 02:49:54 PM
to do it that way, seems like you have to know the string lengths, first   :P

Compliments  :clap: :clap:

This is one the three assumptions the program does.  :P

Guess the next, please.  :lol
Mind is like a parachute. You know what to do in order to use it :-)

KeepingRealBusy

I have added some timing for my version of an SSE compare as KRBxxxx (4 tests)
(KRB for KeepingRealBusy). I use this procedure in a Quicksort Indirect for a
buffer (2 GB) full of variable length strings. These have been read from a 6+ GB
file and converted (2 GB block by block) to variable length structures
(WORD_WITH_BYTE_COUNT,BYTE_ARRAY) in-place in the input buffer. Then a separate
buffer is created pointing to these structures, and the pointers are
Quicksorted. Next, the pointers are written to an intermediate file followed by
the records in-order (data only, not the string length - the pointers are
adjusted to point to the moved data in the intermediate buffer, thus the string
length is defined by the difference between any pointer and it's following
pointer). I did reserve the last 16 bytes of the buffer to insure that an XMM
access of the last record in the buffer would not encounter an error. My sorting
requirements did not need the mismatch location (as has been supplied in some of
these other tests) but did require the type of the mismatch (below/equal/above,
and not just equal/not_equal). Thus - I made 4 tests with different optional
return choices: returning both values, one value, the other value, and neither.
My choice would be KRBNP (return compare result but no position return) for use
with the sort. I implemented some test code to verify the results of my first
long string test and my first short string test (returning both results), and
then removed the verification code with "if 0" "endif".

Of course, since my data needed a particular format (a leading size WORD), I
created that format with 4 new data structures (2 for long and 2 for short).

I had originally created fixed length copies of the input strings for direct
sorting, but this took too much time with data handling. The average string
length was 48 bytes but the longest strings were 160 bytes. This caused the
fixed length file to balloon to 20+ GB. The new method allows blocks of sorted
records to be created with only 2 extra bytes per string so that the merging of
the blocks is much faster (only 1/3 of the blocks). The two different formats
for the string lengths (a WORD preceeding the string during sort, and the
difference between two pointers during merging) require two different entries to
the compare, but the rest of the compare logic remains the same following the
entry.

I was a bit disappointed with some of the timing code. No attempt was made to
actually return a true string comparison indicator (a negative value, a zero
value, or a positive value in eax) (such as would be returned from lstrcmp)
except for the last three methods (crt_strcmp, crt__stricmp, and lstrcmp) and
Lingo. In fact, the returned flags for all methods other than those last three
returned the flags from subtracting the starting value from the mismatch point.
For my tests which return comparison results, the results are returned in the
flags. For my tests which return mismatch position, the result is returned in
eax without affecting the comparison flags.

Frank made no check to detect the end of a string and ran off the end of the
first string into the null and then the 50000 byte zero pad and then the string
of "xxx" and then the second string, but at the same time ran off the end of the
second string into the null and then the 50000 byte zero pad and then the string
of "xxx" and then the short strings which finally stopped the compare with a
mismatch (the second string data as part of the first string did not match the
short string data as part of the second string). Essentially, he compared 55036
bytes (5000+1+50000+34+1) before the mismatch. If he had made a null check, then
his counts would have been in the 5000 range instead of the 55000 range. Rockoon
(below) had the same problem until the last attempt where he added the null
check and his times show the huge difference. Neither Frank's method nor
Rockoon's method with DWORD compares will substitute for a lstrcmp even if it
returned the comparison result. What needs to be done to return a true lstrcmp
result (but using DWORD compares) is something like the following (but it
requires that the string be padded with zeros to a mod 4 bound and not just a
single null with trailing garbage in the last DWORD and this is not the normal
case unless the strings are edited):


    @@:
        mov eax[esi]
        cmp eax[edi]
        jnz @f
    ;
    ;   null check or length check here
    ;
         .
         .
         .
        lea esi,[esi+4]
        lea edi,[edi+4]
        jmp @b
    @@:
        mov ebx,[edi]
        bswap eax
        bswap ebx
        cmp eax,ebx


Rockoon depended on the two strings being adjacent such that the string length
of the first was the difference between the two pointers (and the second string
length was not checked) or maybe I am missing something. He was comparing the
strings, the null, the 50000 byte zero pad and then the string of "xxx". His
final test did make the null check and the counts show it:


    56620   cycles for RockoonJ3, long string
    5028    cycles for RockoonCNB, long string, check nullbyte



I have not completely analyzed the StrCmpSSE method, but at least it returns both
a mismatch position and the mismatched byte difference (it swaps the regs,
trying to get one string that is aligned, and I did not thoroughly analyze if
this difference was corrected for such swaps - probably the simplest test would
be to create test strings that differ in alignment and where the first string is
always less than the second string in the lstrcmp sense and verify that the
returned comparison result is always correct whether or not the strings might
have been swapped for alignment).

I have not analyzed Lingo's method completely, but what can I say. His code is
good, and fast, and good and fast. He even returns the result of the comparison
(I think it does, I did not spend too much time verifying all of the possible
conditions).

What can I say about crt_strcmp, crt__stricmp, and lstrcmp. The counts tell the
story of what you can expect with a general purpose solution:


    9670    cycles for crt_strcmp, long string
    20589   cycles for crt__stricmp, long string, case-insensitive
    95656   cycles for lstrcmp, long string


I was dismayed to see that the repe cmpsb test was conditionally not assembled.
I wanted to see how bad it could really get since they didn't include it, so I
re-enabled the tests. Actually, I created 4 tests with 4 sets of data. The first
set of two tests used unaligned data, the second set of two tests used aligned
data. The first test of each set was a repe cmpsb, the second test of each set
was a repe cmpsd (by DWORDS until mismatch), then backup the indices and retry
the last DWORD with a cmpsb to get an accurate compare result. Amazingly, the
counts are NOT ALL THAT BAD, especially for aligned data!

In all of tests (except mine) which make null byte checks, the null byte check
is done for each compare. Checking for the end of the string for each compare
seems to be a case of time wasted, at least for usage with a sort. This is
necessary for a generic strcmp, but you do not want to use a generic process for
a specific task that is repeated a billion times or more. What I have done is to
do this once (to begin with when I read the data and convert it to structures)
and then use the calculated lengths for the compares. It is hard to quantify how
much extra time my tests would take (considering that I really do a sort which
compares the strings more than once, more like a billion times, and the length
calculation is only done once and saved with the strings). A 2 GB buffer of
average length strings (48 byte) gives about 44,739,242 records per buffer. This
can be converted in 6 seconds per buffer, so that is about 7,456,540 records per
second, or .134 microseconds per record, or about 134 nanoseconds per record.
Does anyone have a clue of how to convert that to cycles, and then how to
quantify the sort condition where the strings get compared 44,739,242 *
lg(44,739,242) or 44,739,242 * 26 = 1,163,220,292 times. I think you need to
divide 134 nano by 1,163,220,292 * 2 and I do not think in numbers that small. I
know that if I did the strlen for each string during the compare, it would add
134 nano * 1,163,220,292 compares * 2 (both strings) or 311,743,038,256 nano or
312 seconds to the sort time - that is 5 extra minutes, and that is per2 GB buffer,
and there are 3+ buffers to sort and merge.

Case insensitive compares are also time consuming, especially for sorts where
the same string is accessed many times. I find it highly beneficial to do a
lowercase conversion once (table lookup if ASCII - only 3 instructions per
character) and preface the string with the lowercase data, sort the
lowercase/uppercase data, then strip off the lowercase data when writing the
result strings.

Without any further ado, here are my timings.


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
String comparison: short string 10 bytes, long string 5050
3835    cycles for SSE with null check, long string
2738    cycles for SSE with null check, long string, case-insensitive
4062    cycles for Lingo, long string with null check
59170   cycles for Frank, long string
3189    cycles for KRB, long string
3171    cycles for KRBNF, long string no flags returned
2855    cycles for KRBNR, long string no return position
2542    cycles for KRBNFNR, long string no flags returned no return position
11172   cycles for repe cmpsb, unaligned long string
3849    cycles for repe cmpsd, unaligned long string
12140   cycles for repe cmpsb, aligned long string
1868    cycles for repe cmpsd, aligned long string
60221   cycles for Rockoon, long string
61258   cycles for RockoonJ, long string
55305   cycles for RockoonJ2, long string
57307   cycles for RockoonJ3, long string
5294    cycles for RockoonCNB, long string, check nullbyte
9670    cycles for crt_strcmp, long string
20589   cycles for crt__stricmp, long string, case-insensitive
95656   cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
46      cycles for SSE with null check, 10 bytes, case-insensitive
13      cycles for Lingo, 10 bytes with null check
6       cycles for Frank, 10 bytes
56      cycles for KRB, 10 bytes
42      cycles for KRBNF, 10 bytes no flags returned
38      cycles for KRBNR, 10 bytes no return position
38      cycles for KRBNFNR, 10 bytes no flags returned no return position
36      cycles for repe cmpsb, unaligned 10 bytes
50      cycles for repe cmpsd, unaligned 10 bytes
36      cycles for repe cmpsb, aligned 10 bytes
46      cycles for repe cmpsd, aligned 10 bytes
7       cycles for Rockoon, aligned 10 bytes
9       cycles for RockoonJ, 10 bytes
8       cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
13      cycles for RockoonCNB, 10 bytes, check nullbyte
504     cycles for lstrcmp, 10 bytes

2866    cycles for SSE with null check, long string
3497    cycles for SSE with null check, long string, case-insensitive
3938    cycles for Lingo, long string with null check
58561   cycles for Frank, long string
3362    cycles for KRB, long string
3340    cycles for KRBNF, long string no flags returned
2873    cycles for KRBNR, long string no return position
2694    cycles for KRBNFNR, long string no flags returned no return position
12600   cycles for repe cmpsb, unaligned long string
3835    cycles for repe cmpsd, unaligned long string
11129   cycles for repe cmpsb, aligned long string
2825    cycles for repe cmpsd, aligned long string
57165   cycles for Rockoon, long string
62959   cycles for RockoonJ, long string
56479   cycles for RockoonJ2, long string
55263   cycles for RockoonJ3, long string
4118    cycles for RockoonCNB, long string, check nullbyte
11211   cycles for crt_strcmp, long string
21257   cycles for crt__stricmp, long string, case-insensitive
91616   cycles for lstrcmp, long string

29      cycles for SSE with null check, 10 bytes
29      cycles for SSE with null check, 10 bytes, case-insensitive
13      cycles for Lingo, 10 bytes with null check
6       cycles for Frank, 10 bytes
72      cycles for KRB, 10 bytes
42      cycles for KRBNF, 10 bytes no flags returned
38      cycles for KRBNR, 10 bytes no return position
47      cycles for KRBNFNR, 10 bytes no flags returned no return position
36      cycles for repe cmpsb, unaligned 10 bytes
50      cycles for repe cmpsd, unaligned 10 bytes
36      cycles for repe cmpsb, aligned 10 bytes
46      cycles for repe cmpsd, aligned 10 bytes
7       cycles for Rockoon, aligned 10 bytes
9       cycles for RockoonJ, 10 bytes
8       cycles for RockoonJ2, 10 bytes
10      cycles for RockoonJ3, 10 bytes
26      cycles for RockoonCNB, 10 bytes, check nullbyte
541     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
KRB:    115
KRBNF:  99
KRBNR:  105
KRBNFNR:        91
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
RockoonJ3:      33
RockoonCNB:     84
SSE:    277     (MasmBasic)
--- ok ---


jj2007

Good job :U

QuoteChecking for the end of the string for each compare seems to be a case of time wasted, at least for usage with a sort.
The leading word is a solution for this particular case, where you compare over and over the same strings. However, in a different setting checking the len first might imply that the cache has to be reloaded, i.e. once for Len() and again for the actual comparison.

QuoteStrCmpSSE method, ... the simplest test would be to create test strings that differ in alignment
The two dummy strings are meant for doing this test.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2930    cycles for SSE with null check, long string
3662    cycles for SSE with null check, long string, case-insensitive
5533    cycles for Lingo, long string with null check
55931   cycles for Frank, long string
14716   cycles for KRB, long string
3192    cycles for KRBNF, long string no flags returned
3210    cycles for KRBNR, long string no return position
3141    cycles for KRBNFNR, long string no flags returned no return position
20961   cycles for repe cmpsb, unaligned long string
6982    cycles for repe cmpsd, unaligned long string
26079   cycles for repe cmpsb, aligned long string
5135    cycles for repe cmpsd, aligned long string
44509   cycles for Rockoon, long string
44180   cycles for RockoonJ, long string
44074   cycles for RockoonJ2, long string
44162   cycles for RockoonJ3, long string
7864    cycles for RockoonCNB, long string, check nullbyte
13922   cycles for crt_strcmp, long string
16095   cycles for crt__stricmp, long string, case-insensitive
89770   cycles for lstrcmp, long string

18      cycles for SSE with null check, 10 bytes
25      cycles for SSE with null check, 10 bytes, case-insensitive
16      cycles for Lingo, 10 bytes with null check
9       cycles for Frank, 10 bytes
53      cycles for KRB, 10 bytes
21      cycles for KRBNF, 10 bytes no flags returned
20      cycles for KRBNR, 10 bytes no return position
20      cycles for KRBNFNR, 10 bytes no flags returned no return position
88      cycles for repe cmpsb, unaligned 10 bytes
125     cycles for repe cmpsd, unaligned 10 bytes
88      cycles for repe cmpsb, aligned 10 bytes
121     cycles for repe cmpsd, aligned 10 bytes
8       cycles for Rockoon, aligned 10 bytes
7       cycles for RockoonJ, 10 bytes
7       cycles for RockoonJ2, 10 bytes
8       cycles for RockoonJ3, 10 bytes
16      cycles for RockoonCNB, 10 bytes, check nullbyte
453     cycles for lstrcmp, 10 bytes

ecube

KeepingRealBusy i can't compile your example I get


strcompkrb.asm(757) : error A2085: instruction or register not accepted in curre
nt CPU mode
strcompkrb.asm(758) : error A2085: instruction or register not accepted in curre
nt CPU mode
strcompkrb.asm(759) : error A2085: instruction or register not accepted in curre
nt CPU mode
etc...
strcompkrb.asm(1143) : error A2006: undefined symbol : aE2
strcompkrb.asm(1144) : error A2006: undefined symbol : uE2
strcompkrb.asm(1148) : error A2006: undefined symbol : ciE2
strcompkrb.asm(1172) : error A2006: undefined symbol : aL1
strcompkrb.asm(1203) : error A2006: undefined symbol : aL1


but according http://www.xbitlabs.com/images/cpu/athlon64-3000/cpuz.png my cpu supports mmx :\

frktons

I hope I'll have time enough to modify the routine I proposed for a different context,
and let it suite the context you are using, so in a couple of days the new string compare
routine should be out.  :P

If I have understood the rules of the game, the routine has to be built with the assumptions
that:
1 - the strings are NULL terminated
2 - we don't know the length of the strings

And the next point that I would like to understand is:
3 - the strings are of the same length or it doesn't matter?
If the latter is true what do we check?

Let me know  so I can full enjoy the game for the Beginners Section  :P
Mind is like a parachute. You know what to do in order to use it :-)