News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Compare two strings

Started by yvansoftware, March 30, 2010, 07:40:20 PM

Previous topic - Next topic

frktons

Quote from: lingo on June 17, 2010, 10:53:23 PM
Take a look that cmp is faster than xor instruction... :wink
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
3274    cycles for LingoCMP, 2600 bytes
1318    cycles for NewfrktonsWithoutXOR  , 2600 bytes
1334    cycles for frktonsCMP , 2600 bytes

--- ok ---


Not really that fast 1318 vs 1334. You have surely spared some
cycles skipping a passage, maybe one MOV or ADD less somewhere.  ::)

I think that an experienced ASM programmer like you can cut down
the cycles by 30-40% wisely using  xmm instructions and registers I'll
be aware of in my 4th or 5th semester.  :eek

Mind is like a parachute. You know what to do in order to use it :-)

theunknownguy

Quote from: lingo on June 17, 2010, 10:53:23 PM
Take a look that cmp is faster than xor instruction... :wink
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
3274    cycles for LingoCMP, 2600 bytes
1318    cycles for NewfrktonsWithoutXOR  , 2600 bytes
1334    cycles for frktonsCMP , 2600 bytes

--- ok ---


1330 withouth xor
1331 CMP


Indeed is CMP more faster but it was needed to do such a test for this? (Not to be rude) But i think it was clear that CMP was faster. lol...

Funny second test and:


1330 withouth xor
1330 CMP

frktons

Well I tried to use CMP instead of XOR and there is
a very slight difference in performance:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
3738    cycles for LingoCMP, 2600 bytes
1319    cycles for NewfrktonsWithout XOR  , 2600 bytes
1329    cycles for frktons XOR , 2600 bytes
1325    cycles for frktons with CMP , 2600 bytes

--- ok ---


So the extra 5-6 cycles that Lingo spared depends on something else.
The new routine I tried is:

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE

mov ecx, offset Src1
mov eax, offset Src2
         
        mov ebx, 325
@@:
mov edx, [ecx] ; Comparing two strings in edx and esi
        mov    esi, [eax]
        cmp    edx, esi
        jne    @f
add     ecx, 4
        add    eax,4
        mov    edx, [ecx]
        mov    esi, [eax]
        cmp    edx, esi
        jne    @f
        add    ecx, 4
        add    eax, 4
        dec    ebx
        jnz    @b 
   


@@:
counter_end
print str$(eax), 9, "cycles for frktons with CMP , 2600 bytes", 13, 10


Quote from: theunknownguy on June 18, 2010, 12:12:43 AM

Indeed is CMP more faster but it was needed to do such a test for this? (Not to be rude) But i think it was clear that CMP was faster. lol...


Probably it is clear for you, but the same I can't say about me  :P
Lingo was so kind to post an example to clear this point and I am
grateful he did.  :U
Mind is like a parachute. You know what to do in order to use it :-)

lingo

"So why didn't you use that in your code?"
and
"...you can cut down the cycles by 30-40% wisely using  xmm instructions and registers

Yes, I have similar algo named SrchBytes with times here.. :wink

frktons

Quote from: lingo on June 18, 2010, 12:29:50 AM
"So why didn't you use that in your code?"
and
"...you can cut down the cycles by 30-40% wisely using  xmm instructions and registers

Yes, I have similar algo named SrchBytes with times here.. :wink


My, I just need to be 30 years younger and a lot of free time and I can
have a nice time between bytes and mnemonics  :lol

Great, that is really good to know that we have people still using
their brain today.  :P
Mind is like a parachute. You know what to do in order to use it :-)

theunknownguy

Quote

Probably it is clear for you, but the same I can't say about me  :P
Lingo was so kind to post an example to clear this point and I am
grateful he did.  :U

Quote

I think you can get it from logic itself:

"CMP subtracts the second operand from the first but, unlike the SUB instruction, does not store the result; only the flags are changed"

XOR computes the exclusive OR of the two operands. Each bit of the result is 1 if the corresponding bits of the operands are different; each bit is 0 if the corresponding bits are the same. The answer replaces the first operand.

But the example is quiet good. Still i am waiting for some PowerPC for run test like this...  :dazzled:

Ghandi

Quote
You set EBX to 325 and then decrement by one on each iteration, right? With a multiple of 4, this gives us 1300 bytes scanned, not the 2600 which you've logged, please correct me if i'm wrong?


8 bytes per iteration

So i saw after re-reading the source and noticing that there are 2x INC EBX per iteration, making it 8 bytes per iteration. My bad, o0

HR,
Ghandi

frktons

Quote from: theunknownguy on June 18, 2010, 12:58:27 AM

I think you can get it from logic itself:


"CMP subtracts the second operand from the first but,
unlike the SUB instruction, does not store the result;
only the flags are changed"



XOR computes the exclusive OR of the two operands. Each bit of the result is 1
if the corresponding bits of the operands are different; each bit is 0 if the
corresponding bits are the same. The answer replaces the first operand.



:U well, I am in the path of learning Assembly, so these definitions
are studying matter. Where did you get them? I'll surely need them in many
future occasions.
Mind is like a parachute. You know what to do in order to use it :-)

frktons

Quote from: Ghandi on June 18, 2010, 01:17:06 AM
Quote
You set EBX to 325 and then decrement by one on each iteration, right? With a multiple of 4, this gives us 1300 bytes scanned, not the 2600 which you've logged, please correct me if i'm wrong?


8 bytes per iteration

So i saw after re-reading the source and noticing that there are 2x INC EBX per iteration, making it 8 bytes per iteration. My bad, o0

HR,
Ghandi

Yes, Clive already said that, so I didn't feel to add anything else.  :U
Mind is like a parachute. You know what to do in order to use it :-)

theunknownguy

Quote from: frktons on June 18, 2010, 01:31:06 AM
Quote from: theunknownguy on June 18, 2010, 12:58:27 AM

I think you can get it from logic itself:


"CMP subtracts the second operand from the first but,
unlike the SUB instruction, does not store the result;
only the flags are changed"



XOR computes the exclusive OR of the two operands. Each bit of the result is 1
if the corresponding bits of the operands are different; each bit is 0 if the
corresponding bits are the same. The answer replaces the first operand.



:U well, I am in the path of learning Assembly, so these definitions
are studying matter. Where did you get them? I'll surely need them in many
future occasions.

http://pdos.csail.mit.edu/6.828/2004/readings/i386/

Or IBM manual

jj2007

Quote from: frktons on June 18, 2010, 12:04:27 AM

Not really that fast 1318 vs 1334. You have surely spared some
cycles skipping a passage, maybe one MOV or ADD less somewhere.  ::)


If you are not sure, why don't you simple open his attachment and look at your code? You would have found out that butcher lingo has done an impressive job - see below, original copy & paste....
By the way: I had tested that question by simply replacing xor with cmp, and found out there is absolutely no difference in timings.
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE

mov ecx, offset Src1
        mov ebx, 325
lea eax,  Src2
       
@@:
  mov edx, [ecx] ; Comparing two strings in ecx and eax
  add ecx, 8
                                      cmp          edx, [eax]
                                      mov          edx, [ecx-4]
                                      lea            eax, [eax+8]
                                      jnz            @f
  cmp          edx, [eax-4]
                                      jnz            @f
                                      sub           ebx, 1
                                      jnz            @b 
@@:
counter_end
print str$(eax), 9, "cycles for NewfrktonsWithoutXOR  , 2600 bytes", 13, 10

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE

mov ecx, offset Src1
mov eax, offset Src2
         
        mov ebx, 325
@@:
  mov edx, [ecx] ; Comparing two strings in edx and esi
                                      mov          esi, [eax]
                                      xor            edx, esi
                                      jnz            @f
add ecx, 4
                                      add            eax,4
                                      mov            edx, [ecx]
                                      mov            esi, [eax]
                                      xor            edx, esi
                                      jnz            @f
                                      add            ecx, 4
                                      add            eax, 4
                                      dec            ebx
                                      jnz            @b 
   


@@:
counter_end
print str$(eax), 9, "cycles for frktonsCMP , 2600 bytes", 13, 10

frktons

Yes, Jochen, of course I looked inside the code Lingo posted, and
I noticed That he changed many things to get those 5-6 cycles, but
I am not able, so far, to understand which of those instructions got
those cycles.   ::)

By the way it's good to see people getting better results with a little
bit crunching, I hoped this could be done in our complex and sometime
difficult daily life.

Enjoy

Mind is like a parachute. You know what to do in order to use it :-)

frktons

Mind is like a parachute. You know what to do in order to use it :-)

theunknownguy

An straight cmp comparation runned on the same test of 2600 bytes:

1338 clocks

mov ecx, offset Src1
mov eax, offset Src2
xor ebx, ebx


mov  edx, [ecx+ebx]
cmp [eax+ebx], edx
jnz @2
add ebx, 4
jnz @1

1365 ~ 1379?:

mov  edx, [ecx+ebx*4]
cmp [eax+ebx*4], edx
jnz @2
add ebx, 1
jnz @1

Against:

1370 clocks

mov  edx, [ecx+ebx]
xor edx, [eax+ebx]
jnz @2
add ebx, 4
jnz @1

1378 clocks

mov  edx, [ecx+ebx]
mov  edi, [eax+ebx]
xor edx, edi
jnz @2
add ebx, 4
jnz @1
1369 clocks
mov  edx, [ecx+ebx*4]
mov  edi, [eax+ebx*4]
xor edx, edi
jnz @2
add ebx, 1
jnz @1
1365 clocks

mov  edx, [ecx+ebx*4]
xor edx, [eax+ebx*4]
jnz @2
add ebx, 1
jnz @1


Surprise me the CMP SCALAR... meaby something wrong with the code i dont know... 5:14 AM here my eyes could be cheating on me.

frktons

A slighter faster version of string_compare:

I skipped a couple of MOV and I got a slight improvement in the overall
performance of the routine.


Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
3369    cycles for LingoCMP, 2600 bytes
1330    cycles for frktons with XOR , 2600 bytes
1325    cycles for frktons with CMP , 2600 bytes
1316    cycles for frktons with CMP II , 2600 bytes

--- ok ---


Source and EXE attached.

Enjoy

Frank
Mind is like a parachute. You know what to do in order to use it :-)