News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Compare two strings

Started by yvansoftware, March 30, 2010, 07:40:20 PM

Previous topic - Next topic

jj2007

Just replaced the attachment above - I could shave off around 30 cycles and a few bytes. Here is the 33 bytes version.

align 16
RockoonJ2 proc
pop edx
pop eax
pop ecx
push edx
sub ecx, eax
align 8
@@: mov edx, [eax+ecx]
cmp edx, [eax]
jne @F
lea eax, [eax+4]
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]
je @B
@@: sub eax, [esp-4] ; return the relative position in ecx
ret
RockoonJ2 endp
RockoonJ2_endp:

qWord

1 byte shorter and remove one dependency:

mov edx, [eax+ecx+4]
cmp edx, [eax+4]
lea eax, [eax+8]

instead of:
lea eax, [eax+4]
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]

:bg
FPU in a trice: SmplMath
It's that simple!

jj2007

Quote from: qWord on June 18, 2010, 08:55:22 PM
1 byte shorter and remove one dependency:

Thanks, qWord. I had tested that one, it's a whopping 80 cycles slower on my Celeron - but of course, it might be faster on other CPU's. Did you test it?

qWord

no - can you post your current test bed?
FPU in a trice: SmplMath
It's that simple!

jj2007

Here it is. Search for "qWord" :bg

qWord

sorry jj,
I've found your previous test bed in the mean time  ::)
FPU in a trice: SmplMath
It's that simple!

Rockoon

Quote from: jj2007 on June 18, 2010, 08:29:38 PM
Cute idea :U

'twas nothing. Just one of those things in my bag of tricks. I've got many :)

Now, thinking about this further...

...operations on eax are usually encoded shorter, and that is true for both MOV EAX, [...] and ADD EAX, constant

so maybe make the comparator register EAX or conversely make the incrementing register EAX. I doubt it will have a performance impact (so doubtful that I wont bother) but it should shave a byte or so (??) off the loop.

as far as detecting the null terminator, a SWAR method would probably be most efficient for the general purpose regs. Might become register constrained tho.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

qWord

jj,
for my core2duo it seems a bit faster for long strings, but no difference for the short ones.
FPU in a trice: SmplMath
It's that simple!

Rockoon


Am I the only guy with AMD's?


AMD Phenom(tm) II X6 1055T Processor (SSE3)
String comparison: short string 10 bytes, long string 5050
1609    cycles for SSE with null check, long string
1921    cycles for SSE with null check, long string, case-insensitive
3776    cycles for Lingo, long string
2732    cycles for Frank, long string
2522    cycles for Rockoon, long string
2018    cycles for RockoonJ, long string
2211    cycles for RockoonJ2, long string
10046   cycles for crt_strcmp, long string
20065   cycles for crt__stricmp, long string, case-insensitive
25050   cycles for movzx check null, long string
11026   cycles for repe cmpsb, long string
78248   cycles for lstrcmp, long string

23      cycles for SSE with null check, 10 bytes
34      cycles for SSE with null check, 10 bytes, case-insensitive
17      cycles for Lingo, 10 bytes
8       cycles for Frank, 10 bytes
9       cycles for Rockoon, 10 bytes
10      cycles for RockoonJ, 10 bytes
10      cycles for RockoonJ2, 10 bytes
23      cycles for crt_strcmp, 10 bytes
60      cycles for crt__stricmp, 10 bytes, case-insensitive
41      cycles for movzx check null, 10 bytes
34      cycles for repe cmpsb, 10 bytes
777     cycles for lstrcmp, 10 bytes

1608    cycles for SSE with null check, long string
1920    cycles for SSE with null check, long string, case-insensitive
3776    cycles for Lingo, long string
2732    cycles for Frank, long string
2522    cycles for Rockoon, long string
2018    cycles for RockoonJ, long string
2210    cycles for RockoonJ2, long string
10037   cycles for crt_strcmp, long string
20072   cycles for crt__stricmp, long string, case-insensitive
25048   cycles for movzx check null, long string
11031   cycles for repe cmpsb, long string
77537   cycles for lstrcmp, long string

23      cycles for SSE with null check, 10 bytes
34      cycles for SSE with null check, 10 bytes, case-insensitive
17      cycles for Lingo, 10 bytes
8       cycles for Frank, 10 bytes
9       cycles for Rockoon, 10 bytes
9       cycles for RockoonJ, 10 bytes
10      cycles for RockoonJ2, 10 bytes
20      cycles for crt_strcmp, 10 bytes
60      cycles for crt__stricmp, 10 bytes, case-insensitive
41      cycles for movzx check null, 10 bytes
34      cycles for repe cmpsb, 10 bytes
755     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        40
RockoonJ:       42
RockoonJ2:      33
SSE:    277     (MasmBasic)

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

jj2007

Quote from: Rockoon on June 18, 2010, 09:22:09 PM
...operations on eax are usually encoded shorter, and that is true for both MOV EAX, [...] and ADD EAX, constant
so maybe make the comparator register EAX or conversely make the incrementing register EAX.

I'm afraid there is no difference - sizes are identical. This variant seems a tick faster, though:

align 16
RockoonJ3 proc
pop edx
pop eax
pop ecx
sub ecx, eax
push edx
align 8 ; aka mov edi, edi
@@:
REPEAT 1 ; further unrolling slows it down
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]
jne @F
ENDM
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]
je @B
@@: sub eax, [esp-4] ; return the relative position in eax
ret
RockoonJ3 endp


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2909    cycles for SSE with null check, long string
3826    cycles for Rockoon, long string
3836    cycles for RockoonJ, long string
3788    cycles for RockoonJ2, long string
3723    cycles for RockoonJ3, long string

clive

Quote from: Rockoon
Am I the only guy with AMD's?

I have a couple of Venice and Newcastle boxes, but unfortunately nothing newer. Will also try the code on an Atom later, because that is more like the P3.
It could be a random act of randomness. Those happen a lot as well.

Rockoon

SWAR method for detecting NULL bytes...

assuming eax contains 4 bytes to be checked

mov esi, eax
sub eax, 1010101h
not esi
and eax, 80808080h
and eax, esi

ZF is now set if EAX contained no null's

can be extended in the obvious manner for 64-bit registers

In C, this is

bool hasZeroByte = (v - 0x01010101UL) & ~v & 0x80808080UL;

from http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

jj2007

Quote from: Rockoon on June 18, 2010, 09:55:33 PM
SWAR method for detecting NULL bytes...

It makes a difference :bg

For these timings, I have been mean - I added 50,000 nullbytes to the long test strings.
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050 (+50000)
2932    cycles for SSE with null check, long string
3665    cycles for SSE with null check, long string, case-insensitive
5545    cycles for Lingo, long string with null check
59617   cycles for Frank, long string
44534   cycles for Rockoon, long string
44263   cycles for RockoonJ, long string
44044   cycles for RockoonJ2, long string
44192   cycles for RockoonJ3, long string
7875    cycles for RockoonCNB, long string, check nullbyte
13922   cycles for crt_strcmp, long string
16129   cycles for crt__stricmp, long string, case-insensitive
90185   cycles for lstrcmp, long string

jj2007

Quote from: Rockoon on June 18, 2010, 09:55:33 PM
SWAR method for detecting NULL bytes...
...
bool hasZeroByte = (v - 0x01010101UL) & ~v & 0x80808080UL;

from http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord

"was written in a newsgroup post on April 27, 1987 by Alan Mycroft" - I had just abandoned my Sinclair ZX at that time ::)

frktons

So what? Are you checking if one or more bytes inside the register
contain a NULL [zero binary?].

And then you stop confronting the two strings?

What about a binary file that contains some zero binary bytes and
those bytes are part of the string we are comparing?

If all the strings are NULL terminated like in the C Language, let's
assume we are comparing just two memory areas with that content.

I hope I understood what you are meaning.  ::)
Mind is like a parachute. You know what to do in order to use it :-)