SSE Integer Comparisons

Started by johnsa, April 19, 2010, 09:36:09 AM

johnsa

Hey all,

So I'm working on a piece of code to compare two 32-byte (aligned) arrays using SSE. I've done a lot of work with SSE on the FP side, but not much on the integer side, save for the usual bulk moves/writes etc.
So this is what I've got so far. I initially thought using ptest xmm,xmm would be better, but PTEST is SSE4.1 and my requirement is that this must run on SSE3.
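
For reference, the SSE4.1 route being ruled out might look like the sketch below (an editor's illustration, not code from the thread). PTEST xmm,xmm sets ZF when the register is all zero, so XOR-ing the two blocks and OR-ing the differences together gives a 32-byte equality test.

align 16
; hypothetical SSE4.1 variant, unusable here since PTEST needs SSE4.1
CompareArraysSSE41 PROC arr1Ptr:DWORD, arr2Ptr:DWORD
mov edx,arr1Ptr
mov ecx,arr2Ptr
movdqa xmm0,[edx]
movdqa xmm1,[edx+16]
pxor xmm0,oword ptr [ecx]     ; zero where the dwords match
pxor xmm1,oword ptr [ecx+16]
por xmm0,xmm1                 ; any difference survives the OR
xor eax,eax
ptest xmm0,xmm0               ; ZF=1 iff xmm0 is all zero
setnz al                      ; eax = 0 if equal, 1 if not
ret
CompareArraysSSE41 ENDP

His actual SSE3 code follows.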


align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD

push esi
push edi
push ebx ; ebx is clobbered by pmovmskb below

mov esi,arr1Ptr
mov edi,arr2Ptr

movaps xmm0,[esi]
movaps xmm1,[esi+16]

movaps xmm2,[edi]
movaps xmm3,[edi+16]

psubd xmm0,xmm2
psubd xmm1,xmm3

pcmpeqd xmm0,oword ptr sse_zero ; sse_zero = an aligned OWORD of zeros in .data
pcmpeqd xmm1,oword ptr sse_zero ; equal dwords -> lane set to all-ones

xor eax,eax ; eax = 0 : 'equal' result
mov edx,eax
sub edx,1   ; edx = -1 : 'not equal' result for cmovne

pmovmskb ebx,xmm0 ; byte mask: 0FFFFh iff the first 16 bytes match
cmp bx,0ffffh
je short @F       ; first half equal, go test the second half

mov eax,-1        ; first half differs: not equal
pop ebx
pop edi
pop esi
ret
@@:
pmovmskb ebx,xmm1 ; byte mask for the second 16 bytes
cmp bx,0ffffh
cmovne eax,edx    ; any difference -> eax = -1

pop ebx
pop edi
pop esi

ret

CompareArrays ENDP




I'm not happy with the mask operations and subsequent comparison/branch. Can anyone think of ways to improve this?

johnsa

Attempt 2:


align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD

push esi
push edi
push ebx
push edx

mov esi,arr1Ptr
mov edi,arr2Ptr

movaps xmm0,[esi]
movaps xmm1,[esi+16]

psubd xmm0,oword ptr [edi]
psubd xmm1,oword ptr [edi+16]

pcmpeqd xmm0,oword ptr sse_zero
pcmpeqd xmm1,oword ptr sse_zero

xor eax,eax
mov edx,eax
sub edx,1

pmovmskb ebx,xmm0
cmp bx,0ffffh
cmovne eax,edx

pmovmskb ebx,xmm1
cmp bx,0ffffh
cmovne eax,edx

pop edx
pop ebx
pop edi
pop esi

ret

CompareArrays ENDP

jj2007

Why not a direct comparison of esi and edi?

@@: movapd xmm0, [esi]
pcmpeqb xmm0, [edi] ; compare [esi] and [edi]
lea edi, [edi+16] ; increase position
lea esi, [esi+16] ; in memory
pmovmskb edx, xmm0 ; set byte mask in edx
cmp edx, 65535
jne @F
dec ecx
jns @B
@@:
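
A minimal setup sketch for the loop above (an assumption about the intended usage, not from jj2007's post; array1, array2 and arrays_equal are placeholder names). ecx must hold the number of 16-byte blocks minus one, i.e. 1 for a 32-byte compare:

mov esi, offset array1
mov edi, offset array2
mov ecx, 1          ; two 16-byte blocks; loop runs while ecx >= 0
; ... loop above ...
test ecx, ecx       ; ecx reached -1 only if every block matched
js arrays_equal     ; otherwise jne @F exited early on a mismatch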

lingo

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh      ; (0FFFFFFFFh with the commented-out shl/add combine)
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx ; return address (invoke pushed the args)
pop eax ; arr1Ptr
pop edx ; arr2Ptr
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd  xmm0,[edx]
pcmpeqd  xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
and eax, edx ; 0FFFFh iff both halves matched
;shl eax, 16
;add eax, edx
jmp ecx ; return through the saved address; the stack is already clean
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef

johnsa

lingo, nice improvement!

My test of 100 million comparisons went down from 1400ms to 1000ms with my second attempt, my 3rd attempt got it to 800ms, and yours runs in 400ms for me. Nice! :)


clive

Pretty sweet Lingo. A BSWAP EAX; OR EAX,EDX combine would be a wash, performance-wise.

The AND is fine if the bit ordering is unimportant and only the (in)equality matters.
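
A sketch of that BSWAP alternative (an editor's reading of clive's remark, not code from his post): the first mask lands in the high word byte-reversed, which is harmless when all that matters is all-ones versus not.

pmovmskb eax, xmm0
pmovmskb edx, xmm1
bswap eax           ; 0000FFFFh -> FFFF0000h (bytes reversed)
or eax, edx         ; eax = 0FFFFFFFFh iff both halves matched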

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh ; Note only looking at 16 bits, as high/low words and'ed, both bits must be high to propagate
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd  xmm0,[edx]
pcmpeqd  xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef


Or a tad faster

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm2,[edx]
movdqa xmm1,[eax+16]
movdqa xmm3,[edx+16]
pcmpeqd xmm0,xmm2
pmovmskb eax,xmm0
pcmpeqd xmm1,xmm3
pmovmskb edx,xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef

jj2007

Testbed. 6 cycles for 15 instructions is not so bad...

Celeron M:
6       cycles for CmpArray1
8       cycles for CmpArray2
7       cycles for CmpArray3

clive

Not to appear argumentative, but I think we lose something with the integer result from the division for the average; the differences between better and worse are going to be fractional cycles.
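
(A worked instance of the point, using the numbers below: 18165 ticks over 1000 calls is 18.165 cycles per call, but an integer divide reports just 18, so two variants half a cycle apart can print the same whole number.)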

Empty
      1612 Ticks (min),  0.054 cycles

FOO (1) Lingo
     18165 Ticks (min), 18.165 cycles
     21177 Ticks (avg), 21.177 cycles
     21950 Ticks (rms), 21.951 cycles

FOO (2) Clive#1
     18158 Ticks (min), 18.158 cycles
     19113 Ticks (avg), 19.113 cycles
     19564 Ticks (rms), 19.565 cycles

FOO (3)
     17656 Ticks (min), 17.656 cycles
     19291 Ticks (avg), 19.291 cycles
     19837 Ticks (rms), 19.837 cycles

FOO (4)
     18158 Ticks (min), 18.158 cycles
     20207 Ticks (avg), 20.207 cycles
     20901 Ticks (rms), 20.902 cycles

FOO (5) Clive#2
     17648 Ticks (min), 17.648 cycles
     24062 Ticks (avg), 24.062 cycles
     22784 Ticks (rms), 22.785 cycles

FOO (6) JJ (lingo w/ret)
     17003 Ticks (min), 17.003 cycles
     17805 Ticks (avg), 17.805 cycles
     18128 Ticks (rms), 18.128 cycles

FOO (7) Clive#2 w/ret
     15503 Ticks (min), 15.503 cycles
     17790 Ticks (avg), 17.790 cycles
     18634 Ticks (rms), 18.634 cycles

FOO (8) johnsa#2
     36510 Ticks (min), 36.510 cycles
     39217 Ticks (avg), 39.217 cycles
     36707 Ticks (rms), 36.708 cycles

lingo

"Not to appear argumentative,"

In this forum, any numbers published without files are obvious manipulation, or just garbage...

clive

As you will. The numbers seem to track the real-world tests johnsa did of yours/his. Your algorithm works really nicely (clean/efficient), and using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted.

-Clive

lingo

johnsa,
Clive is right that:

shl    eax, 16
add    eax, edx

is slower than

and    eax, edx

so I edited my algo and the Usage too...

Clive,
"using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted."

It is just NOT TRUE (see jj's times), and he really can "improve" my algo only by preserving ecx and edx... :lol

clive

On an Atom N270 @ 1.6 GHz the gain from RET vs register linking (JMP ECX) is quite profound. Looks decidedly like a pipeline flush. It looks particularly bad because the test loops back so close to the return point.
http://www.anandtech.com/show/2493/11

67      cycles for CmpArray1
50      cycles for CmpArray2
67      cycles for CmpArray3

--- ok ---


CmpArray2 proc strA, strB ; Lingo, but with a "normal" ret (prologue suppressed, so the args sit at [esp+4]/[esp+8])
mov eax, [esp+4]
mov edx, [esp+8]
movdqa xmm0, [eax]
movdqa xmm1, [eax+16]
pcmpeqd xmm0, [edx]
pcmpeqd xmm1, [edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8
CmpArray2 endp

johnsa

The various versions here all average roughly the same for me over 100,000,000 iterations: about 375-390ms.
I think this is probably about as good as it can get! Thanks for the hugely useful input on this, guys, much appreciated.


dioxin

Is it not quicker to merge the results of the SSE compares before transferring them to CPU registers?

Instead of this:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8


Try something like this:
pand xmm0,xmm1      ; merge the two compare results in the SSE domain
pmovmskb eax, xmm0  ; eax now contains 0FFFFh if the 32 bytes matched
ret 8


   Fewer instructions, and it avoids one of the slow pmovmskb instructions.

Paul.
   

clive

On a Prescott, dropping the PMOVMSKB EDX,XMM1 / AND EAX,EDX pair appears to save a cycle. Loading all the values into XMM registers first is faster on the Prescott; no net gain on Atom.

Dioxin code fragment being sampled.

   mov ebx,1000
@@: ; ALIGN 16
    push  edi
    push  esi
    call  callthis
    cmp   eax,0FFFFh

   dec ebx
   jnz @B

..

  ALIGN 16

callthis:
    mov eax, [esp+4]
    mov edx, [esp+8]
    movdqa xmm0,[eax]
    movdqa xmm2,[edx]
    movdqa xmm1,[eax+16]
    movdqa xmm3,[edx+16]
    pcmpeqd xmm0,xmm2
    pcmpeqd xmm1,xmm3
    pand xmm0,xmm1
    pmovmskb eax,xmm0
   ret 8


Prescott P4 3 GHz
FOO (1) Lingo
     18158 Ticks (min), 18.158 cycles
     19623 Ticks (avg), 19.623 cycles
     20173 Ticks (rms), 20.173 cycles

FOO (2) Clive#1
     18158 Ticks (min), 18.158 cycles
     20262 Ticks (avg), 20.262 cycles
     20931 Ticks (rms), 20.932 cycles

FOO (3)
     17641 Ticks (min), 17.641 cycles
     20868 Ticks (avg), 20.868 cycles
     21710 Ticks (rms), 21.710 cycles

FOO (4)
     18158 Ticks (min), 18.158 cycles
     20407 Ticks (avg), 20.407 cycles
     21118 Ticks (rms), 21.119 cycles

FOO (5) Clive#2
     17655 Ticks (min), 17.655 cycles
     19658 Ticks (avg), 19.658 cycles
     20321 Ticks (rms), 20.322 cycles

FOO (6) JJ (lingo w/ret)
     16988 Ticks (min), 16.988 cycles
     18490 Ticks (avg), 18.490 cycles
     19292 Ticks (rms), 19.293 cycles

FOO (7) Clive#2 w/ret
     15503 Ticks (min), 15.503 cycles
     16920 Ticks (avg), 16.920 cycles
     17497 Ticks (rms), 17.497 cycles

FOO (8) johnsa#2
     36510 Ticks (min), 36.510 cycles
     40207 Ticks (avg), 40.207 cycles
     41457 Ticks (rms), 41.458 cycles

FOO (9) dioxin
     14505 Ticks (min), 14.505 cycles
     15793 Ticks (avg), 15.793 cycles
     16409 Ticks (rms), 16.410 cycles

