News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Compare two strings

Started by yvansoftware, March 30, 2010, 07:40:20 PM

Previous topic - Next topic

jj2007

Try this...
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
        mov edx, [ecx] ; Comparing two strings in edx and [eax]
        cmp edx, [eax]
        jne @f
;        add ecx, 4
;        add eax,4
        mov edx, [ecx+4]
        cmp edx, [eax+4]
        jne @f
        add ecx, 4+4
        add eax, 4+4
        dec ebx
        jnz @b 
@@:

frktons

Quote from: jj2007 on June 18, 2010, 12:54:10 PM
Try this...
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
        mov edx, [ecx] ; Comparing two strings in edx and [eax]
        cmp edx, [eax]
        jne @f
;        add ecx, 4
;        add eax,4
        mov edx, [ecx+4]
        cmp edx, [eax+4]
        jne @f
        add ecx, 4+4
        add eax, 4+4
        dec ebx
        jnz @b 
@@:


A very tiny improvement 1 or 2 cycles, I expected more to be honest.


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
3269    cycles for LingoCMP, 2600 bytes
1329    cycles for frktons with XOR , 2600 bytes
1326    cycles for frktons with CMP , 2600 bytes
1316    cycles for frktons with CMP II , 2600 bytes
1314    cycles for frktons with CMP III , 2600 bytes

--- ok ---


I only changed:

        add ecx, 4+4
        add eax, 4+4

with:

        add ecx, 8
        add eax, 8


But it is somewhat smaller and smarter.  :U
Mind is like a parachute. You know what to do in order to use it :-)

frktons

I tried to unroll the loop, 10 times, but no misurable gain.
Probably this method has been exploited enough  :P

Enjoy
Mind is like a parachute. You know what to do in order to use it :-)

Rockoon

Overclocked to 3.46ghz:


AMD Phenom(tm) II X6 1055T Processor (SSE3)
3268    cycles for LingoCMP, 2600 bytes
1642    cycles for frktons with XOR , 2600 bytes
1642    cycles for frktons with CMP , 2600 bytes
1966    cycles for frktons with CMP II , 2600 bytes
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

frktons

Quote from: Rockoon on June 18, 2010, 05:00:42 PM
Overclocked to 3.46ghz:


AMD Phenom(tm) II X6 1055T Processor (SSE3)
3268    cycles for LingoCMP, 2600 bytes
1642    cycles for frktons with XOR , 2600 bytes
1642    cycles for frktons with CMP , 2600 bytes
1966    cycles for frktons with CMP II , 2600 bytes


:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol

AMD has a different philsophy about optimization.

:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol
Mind is like a parachute. You know what to do in order to use it :-)

Rockoon

Quote from: frktons on June 18, 2010, 05:40:13 PM
Quote from: Rockoon on June 18, 2010, 05:00:42 PM
Overclocked to 3.46ghz:


AMD Phenom(tm) II X6 1055T Processor (SSE3)
3268    cycles for LingoCMP, 2600 bytes
1642    cycles for frktons with XOR , 2600 bytes
1642    cycles for frktons with CMP , 2600 bytes
1966    cycles for frktons with CMP II , 2600 bytes


:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol

AMD has a different philsophy about optimization.

:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol

Me too. It can be faster, even on Intel :)
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Rockoon

Added CMP III


AMD Phenom(tm) II X6 1055T Processor (SSE3)
3265    cycles for LingoCMP, 2600 bytes
1640    cycles for frktons with XOR , 2600 bytes
1640    cycles for frktons with CMP , 2600 bytes
1965    cycles for frktons with CMP II , 2600 bytes
1310    cycles for frktons with CMP III, 2600 bytes

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

theunknownguy

Quote from: jj2007 on June 18, 2010, 12:54:10 PM
Try this...
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
        mov edx, [ecx] ; Comparing two strings in edx and [eax]
        cmp edx, [eax]
        jne @f
;        add ecx, 4
;        add eax,4
        mov edx, [ecx+4]
        cmp edx, [eax+4]
        jne @f
        add ecx, 4+4
        add eax, 4+4
        dec ebx
        jnz @b 
@@:


Dec EBX? what happen with dec being slower than sub?.

It shouldnt been sub ebx, 1. At least for what i remember from agner fog...

Also using 1 reg could avoid 2 adds.

mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
xor edi, edi
@@:
       mov edx, [ecx+edi] ; Comparing two strings in edx and [eax]
       cmp edx, [eax+edi]
       jne @f
       mov edx, [ecx+edi+4]
       cmp edx, [eax+edi+4]
       jne @f
       add edi, 4+4
       sub ebx, 1
       jnz @b  
@@:


Testing the CMP REG32*4 SCALAR give worst results, but on XOR give betters... odd

Anyway this should run faster some body test it?




frktons

Quote from: Rockoon on June 18, 2010, 07:03:01 PM
Added CMP III


AMD Phenom(tm) II X6 1055T Processor (SSE3)
3265    cycles for LingoCMP, 2600 bytes
1640    cycles for frktons with XOR , 2600 bytes
1640    cycles for frktons with CMP , 2600 bytes
1965    cycles for frktons with CMP II , 2600 bytes
1310    cycles for frktons with CMP III, 2600 bytes



Where is the code? Does it work on Intel as well?

Let me know.

Frank
Mind is like a parachute. You know what to do in order to use it :-)

frktons

Added method proposed by theunknownguy:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
3372    cycles for LingoCMP, 2600 bytes
1331    cycles for frktons with XOR , 2600 bytes
1326    cycles for frktons with CMP , 2600 bytes
1318    cycles for frktons with CMP II , 2600 bytes
1316    cycles for frktons with CMP III , 2600 bytes
1317    cycles for frktons with CMP unrolled , 2600 bytes
1316    cycles for theunknown with CMP , 2600 bytes

--- ok ---


No gain.  ::)
Mind is like a parachute. You know what to do in order to use it :-)

Rockoon

aligning the top of all the inner loops by 16, and introducing my own method:


AMD Phenom(tm) II X6 1055T Processor (SSE3)
2834    cycles for LingoCMP, 2600 bytes
1553    cycles for frktons with XOR , 2600 bytes
1542    cycles for frktons with CMP , 2600 bytes
1319    cycles for frktons with CMP II , 2600 bytes
1146    cycles for frktons with CMP III, 2600 bytes
989     cycles for RockoonCMP, 2600 bytes


and my method (please check for errors?)

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE

mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
        sub eax, ecx
align 16
@@:
        mov edx, [ecx]
        cmp edx, [ecx + eax]
        jne @f
        mov edx, [ecx + 4]
        cmp edx, [ecx + eax + 4]
        jne @f
        add ecx, 8
        dec ebx
        jnz @b 
@@:
counter_end

print str$(eax), 9, "cycles for RockoonCMP, 2600 bytes", 13, 10

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

dedndave

i doubt that ALIGN 16 makes much difference   :P
i find that, for smaller loops (where the branch at the end is SHORT), alignment can actually slow you down a bit
the added padding takes time to execute, too

if the branch at the end of the loop is NEAR, 16-alignment definately helps

Rockoon

Well, it does make a difference on AMD.

On Intels with a trace cache, then probably not. 1 byte instructions might be stored as 32-bits, for instance. I'm sure Intel made some good choices with it, so that all the uops in it fall on boundaries that make the most sense for the decoder.

AMD hasnt really changed its architecture in quite awhile. No trace caches or anything like that. Alignment still matters there. In a lot of ways, its like programming for the old Pentium III architecture. The same can be said for the Core2's tho, with the P4's netburst behind us I think that we are all happy.
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

jj2007

Quote from: Rockoon on June 18, 2010, 07:23:57 PM
aligning the top of all the inner loops by 16, and introducing my own method:

...
        sub eax, ecx
align 16
@@:
        mov edx, [ecx]
        cmp edx, [ecx + eax]
        jne @f
        mov edx, [ecx + 4]
        cmp edx, [ecx + eax + 4]
        jne @f
        add ecx, 8
        dec ebx
        jnz @b 
@@:

Cute idea :U

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2911    cycles for SSE with null check, long string
3684    cycles for SSE with null check, long string, case-insensitive
5854    cycles for Lingo, long string
4878    cycles for Frank, long string
3803    cycles for Rockoon, long string
3811    cycles for RockoonJ, long string
3770    cycles for RockoonJ2, long string
13906   cycles for crt_strcmp, long string
16493   cycles for crt__stricmp, long string, case-insensitive
30072   cycles for movzx check null, long string
20088   cycles for repe cmpsb, long string
89965   cycles for lstrcmp, long string

18      cycles for SSE with null check, 10 bytes
24      cycles for SSE with null check, 10 bytes, case-insensitive
16      cycles for Lingo, 10 bytes
9       cycles for Frank, 10 bytes
48      cycles for Rockoon, 10 bytes
7       cycles for RockoonJ, 10 bytes
7       cycles for RockoonJ2, 10 bytes
32      cycles for crt_strcmp, 10 bytes
108     cycles for crt__stricmp, 10 bytes, case-insensitive
51      cycles for movzx check null, 10 bytes
88      cycles for repe cmpsb, 10 bytes
453     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
Frank:  45
Rockoon:        44
RockoonJ:       42
RockoonJ2:      33
SSE:    277     (MasmBasic)


It can compete with the SSE algo, but bear in mind there is no null check, and granularity is 8 bytes.
2911    cycles for SSE with null check, long string
3803    cycles for Rockoon, long string

frktons

Well on my pc I got these results:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
3299    cycles for LingoCMP, 2600 bytes
1331    cycles for frktons with XOR , 2600 bytes
1370    cycles for frktons with CMP , 2600 bytes
1317    cycles for frktons with CMP II , 2600 bytes
1317    cycles for frktons with CMP III , 2600 bytes
1314    cycles for frktons with CMP unrolled , 2600 bytes
1334    cycles for theunknown with CMP , 2600 bytes
1316    cycles for RockoonCMP, 2600 bytes

--- ok ---


Doesn't seem different from others method regarding performance.  ::)
Mind is like a parachute. You know what to do in order to use it :-)