The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: johnsa on April 19, 2010, 09:36:09 AM

Title: SSE Integer Comparisons
Post by: johnsa on April 19, 2010, 09:36:09 AM
Hey all,

So i'm working on a piece of code to compare to 32byte (aligned) arrays using SSE. I've done a lot of work with SSE for FP stuff but not too much on the integer side save for the usual bulk moves/writes etc.
So this is what I've got sofar.. i initially thought using ptest xmm,xmm would be better but my requirement is that it must run on SSE3.


align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD

push esi
push edi

mov esi,arr1Ptr
mov edi,arr2Ptr

movaps xmm0,[esi]
movaps xmm1,[esi+16]

movaps xmm2,[edi]
movaps xmm3,[edi+16]

psubd xmm0,xmm2
psubd xmm1,xmm3

pcmpeqd xmm0,oword ptr sse_zero
pcmpeqd xmm1,oword ptr sse_zero

xor eax,eax
mov edx,eax
sub edx,1

pmovmskb ebx,xmm0
cmp bx,0ffffh
je short @F

mov eax,-1
pop edi
pop esi
ret
@@:
pmovmskb ebx,xmm1
cmp bx,0ffffh
cmovne eax,edx

pop edi
pop esi

ret

CompareArrays ENDP




I'm not happy with the mask operations and subsequent comparison/branch. Can anyone think of ways to improve this?
Title: Re: SSE Integer Comparisons
Post by: johnsa on April 19, 2010, 09:43:02 AM
Attempt 2:


align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD

push esi
push edi
push ebx
push edx

mov esi,arr1Ptr
mov edi,arr2Ptr

movaps xmm0,[esi]
movaps xmm1,[esi+16]

psubd xmm0,oword ptr [edi]
psubd xmm1,oword ptr [edi+16]

pcmpeqd xmm0,oword ptr sse_zero
pcmpeqd xmm1,oword ptr sse_zero

xor eax,eax
mov edx,eax
sub edx,1

pmovmskb ebx,xmm0
cmp bx,0ffffh
cmovne eax,edx

pmovmskb ebx,xmm1
cmp bx,0ffffh
cmovne eax,edx

pop edx
pop ebx
pop edi
pop esi

ret

CompareArrays ENDP
Title: Re: SSE Integer Comparisons
Post by: jj2007 on April 19, 2010, 10:15:40 AM
Why not a direct comparison of esi and edi?

@@: movapd xmm0, [esi]
pcmpeqb xmm0, [edi] ; compare [esi] and [edi]
lea edi, [edi+16] ; increase position
lea esi, [esi+16] ; in memory
pmovmskb edx, xmm0 ; set byte mask in edx
cmp edx, 65535
jne @F
dec ecx
jns @B
@@:
Title: Re: SSE Integer Comparisons
Post by: lingo on April 19, 2010, 12:16:29 PM
 :wink
Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh      ;0FFFFFFFFh
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd  xmm0,[edx]
pcmpeqd  xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
        and      eax, edx
;shl eax, 16
;add eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef
Title: Re: SSE Integer Comparisons
Post by: johnsa on April 20, 2010, 09:27:51 AM
lingo, nice improvement!

my test of 100 million comparison went down from 1400ms to 1000ms with my second attempt, then my 3rd attempt got it to 800ms and yours runs in 400ms for me. Nice! :)

Title: Re: SSE Integer Comparisons
Post by: clive on April 20, 2010, 03:50:39 PM
Pretty sweet Lingo. BSWAP EAX; OR EAX,EDX would be a wash

If the bit ordering is unimportant, only the inequity

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh ; Note only looking at 16 bits, as high/low words and'ed, both bits must be high to propagate
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd  xmm0,[edx]
pcmpeqd  xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef


Or a tad faster

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm2,[edx]
movdqa xmm1,[eax+16]
movdqa xmm3,[edx+16]
pcmpeqd xmm0,xmm2
pmovmskb eax,xmm0
pcmpeqd xmm1,xmm3
pmovmskb edx,xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef
Title: Re: SSE Integer Comparisons
Post by: jj2007 on April 20, 2010, 06:13:11 PM
Testbed. 6 cycles for 15 instructions is not so bad...

Celeron M:
6       cycles for CmpArray1
8       cycles for CmpArray2
7       cycles for CmpArray3
Title: Re: SSE Integer Comparisons
Post by: clive on April 20, 2010, 07:36:10 PM
Not to appear argumentative, but I think we loose something with the integer result from the division of the average. The better/worse is going to be fractional cycles.

Empty
      1612 Ticks (min),  0.054 cycles

FOO (1) Lingo
     18165 Ticks (min), 18.165 cycles
     21177 Ticks (avg), 21.177 cycles
     21950 Ticks (rms), 21.951 cycles

FOO (2) Clive#1
     18158 Ticks (min), 18.158 cycles
     19113 Ticks (avg), 19.113 cycles
     19564 Ticks (rms), 19.565 cycles

FOO (3)
     17656 Ticks (min), 17.656 cycles
     19291 Ticks (avg), 19.291 cycles
     19837 Ticks (rms), 19.837 cycles

FOO (4)
     18158 Ticks (min), 18.158 cycles
     20207 Ticks (avg), 20.207 cycles
     20901 Ticks (rms), 20.902 cycles

FOO (5) Clive#2
     17648 Ticks (min), 17.648 cycles
     24062 Ticks (avg), 24.062 cycles
     22784 Ticks (rms), 22.785 cycles

FOO (6) JJ (lingo w/ret)
     17003 Ticks (min), 17.003 cycles
     17805 Ticks (avg), 17.805 cycles
     18128 Ticks (rms), 18.128 cycles

FOO (7) Clive#2 w/ret
     15503 Ticks (min), 15.503 cycles
     17790 Ticks (avg), 17.790 cycles
     18634 Ticks (rms), 18.634 cycles

FOO (8) johnsa#2
     36510 Ticks (min), 36.510 cycles
     39217 Ticks (avg), 39.217 cycles
     36707 Ticks (rms), 36.708 cycles
Title: Re: SSE Integer Comparisons
Post by: lingo on April 20, 2010, 09:37:45 PM
"Not to appear argumentative,"

In this forum every numbers published without files are obvious manipulation or just garbage....
Title: Re: SSE Integer Comparisons
Post by: clive on April 20, 2010, 09:59:38 PM
As you will. The numbers seem to track the real world tests johnsa did of yours/his. Your algorithm works really nicely (clean/efficient), using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted.

-Clive
Title: Re: SSE Integer Comparisons
Post by: lingo on April 20, 2010, 11:36:26 PM
johnsa,
Clive is right about:

shl    eax, 16
add    eax, edx

is slower than
and eax, edx
SO,I edited my algo and Usage too..

Clive,
"using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted."

It is just NOT TRUE ( see jj times) and he really can "improve" my algo with ecx and edx preservation only... :lol
Title: Re: SSE Integer Comparisons
Post by: clive on April 21, 2010, 05:01:31 AM
Atom N270 @ 1.6 GHz, the use of RET is quite profound vs register linking. Looks decidedly like a pipeline flush. It looks particularly bad because it loops so close to the return point.
http://www.anandtech.com/show/2493/11

67      cycles for CmpArray1
50      cycles for CmpArray2
67      cycles for CmpArray3

--- ok ---


CmpArray2 proc strA, strB ; Lingo but "normal" ret
mov eax, [esp+4]
mov edx, [esp+8]
movdqa xmm0, [eax]
movdqa xmm1, [eax+16]
pcmpeqd xmm0, [edx]
pcmpeqd xmm1, [edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8
CmpArray2 endp
Title: Re: SSE Integer Comparisons
Post by: johnsa on April 21, 2010, 09:57:47 AM
The various different versions here all seem to average roughly the same for me over 100,000,000 iterations of about 375-390ms
I think this is probably about as good as it can possibly get! Thanks for the hugely useful input on this guys, much appreciated.

Title: Re: SSE Integer Comparisons
Post by: dioxin on April 21, 2010, 05:13:42 PM
Is it not quicker to merge the results of the SSE compares before transferring them to CPU registers?

Instead of this:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8


Try something like this:
pand xmm0,xmm1
pmovmskb eax, xmm0  'eax now contains &hffff if the 32 bytes matched
ret 8


   Less instructions and avoids one of the slow pmovmskb instructions.

Paul.
   
Title: Re: SSE Integer Comparisons
Post by: clive on April 21, 2010, 06:59:45 PM
On a Prescott losing the PMOVMSKB EDX,XMM1; AND EAX,EDX appears to save a cycle. Loading all the values into XMM register first is faster on the Prescott, no net gain on Atom.

Dioxin code fragment being sampled.

   mov ebx,1000
@@: ; ALIGN 16
    push  edi
    push  esi
    call  callthis
    cmp   eax,0xFFFF

   dec ebx
   jnz @B

..

  ALIGN 16

callthis:
    mov eax, [esp+4]
    mov edx, [esp+8]
    movdqa xmm0,[eax]
    movdqa xmm2,[edx]
    movdqa xmm1,[eax+16]
    movdqa xmm3,[edx+16]
    pcmpeqd xmm0,xmm2
    pcmpeqd xmm1,xmm3
    pand xmm0,xmm1
    pmovmskb eax,xmm0
   ret 8


Prescott P4 3 GHz
FOO (1) Lingo
     18158 Ticks (min), 18.158 cycles
     19623 Ticks (avg), 19.623 cycles
     20173 Ticks (rms), 20.173 cycles

FOO (2) Clive#1
     18158 Ticks (min), 18.158 cycles
     20262 Ticks (avg), 20.262 cycles
     20931 Ticks (rms), 20.932 cycles

FOO (3)
     17641 Ticks (min), 17.641 cycles
     20868 Ticks (avg), 20.868 cycles
     21710 Ticks (rms), 21.710 cycles

FOO (4)
     18158 Ticks (min), 18.158 cycles
     20407 Ticks (avg), 20.407 cycles
     21118 Ticks (rms), 21.119 cycles

FOO (5) Clive#2
     17655 Ticks (min), 17.655 cycles
     19658 Ticks (avg), 19.658 cycles
     20321 Ticks (rms), 20.322 cycles

FOO (6) JJ (lingo w/ret)
     16988 Ticks (min), 16.988 cycles
     18490 Ticks (avg), 18.490 cycles
     19292 Ticks (rms), 19.293 cycles

FOO (7) Clive#2 w/ret
     15503 Ticks (min), 15.503 cycles
     16920 Ticks (avg), 16.920 cycles
     17497 Ticks (rms), 17.497 cycles

FOO (8) johnsa#2
     36510 Ticks (min), 36.510 cycles
     40207 Ticks (avg), 40.207 cycles
     41457 Ticks (rms), 41.458 cycles

FOO (9) dioxin
     14505 Ticks (min), 14.505 cycles
     15793 Ticks (avg), 15.793 cycles
     16409 Ticks (rms), 16.410 cycles


Title: Re: SSE Integer Comparisons
Post by: jj2007 on April 21, 2010, 07:01:38 PM
My version, Celeron M, with source.

416     ticks for CmpArray1, Lingo      -1
518     ticks for CmpArray2, Lingo+ret  -1
418     ticks for CmpArray3, Clive      65535
404     ticks for CmpArray4, dioxin     65535
Title: Re: SSE Integer Comparisons
Post by: dioxin on April 21, 2010, 11:53:17 PM
I find it faster to do this on a Phenom II:
!movdqu xmm0,[esi]     'get the first 32 bytes
!movdqu xmm1,[esi+16]

!pcmpeqd xmm0,[edi]    'compare with the second 32 bytes
!pcmpeqd xmm1,[edi+16]
         
!pand xmm0,xmm1
                 
!pmovmskb eax, xmm0  'eax now contains &hffff if the 32 bytes matched

Not loading the data to register before comparing, instead just compare with the data in memory.

Since the code is now so short, wouldn't it make sense to inline it to remove the CALL/RET overhead?

Paul.
Title: Re: SSE Integer Comparisons
Post by: jj2007 on April 22, 2010, 06:27:41 AM
Quote from: dioxin on April 21, 2010, 11:53:17 PM
I find it faster to do this on a Phenom II:...
Paul.


Paul,

That's what I implemented in the attachment as CmpArray4. It is marginally faster than the others, even though I used movdqa.

Regards,
Jochen
Title: Re: SSE Integer Comparisons
Post by: johnsa on April 22, 2010, 09:11:59 AM
This update: moving from proc to macro and and'ing the values in xmm regs first doubles the speed for me on my core2 duo 2.0ghz.

It's now doing 600,000,000 per second instead of 300,000,000. Huge boost.
Title: Re: SSE Integer Comparisons
Post by: clive on April 25, 2010, 08:13:10 PM
On a Willamette P4, its all pretty much a wash.

Empty
      1592 Ticks (min),  0.053 cycles

FOO (1) Lingo
     13016 Ticks (min), 13.016 cycles
     13607 Ticks (avg), 13.607 cycles
     20153 Ticks (rms), 20.154 cycles

FOO (2) Clive#1
     12520 Ticks (min), 12.520 cycles
     12568 Ticks (avg), 12.568 cycles
     12591 Ticks (rms), 12.592 cycles

FOO (3)
     12520 Ticks (min), 12.520 cycles
     12539 Ticks (avg), 12.539 cycles
     12539 Ticks (rms), 12.539 cycles

FOO (4)
     12520 Ticks (min), 12.520 cycles
     12542 Ticks (avg), 12.542 cycles
     12542 Ticks (rms), 12.543 cycles

FOO (5) Clive#2
     12520 Ticks (min), 12.520 cycles
     12960 Ticks (avg), 12.960 cycles
     18549 Ticks (rms), 18.550 cycles

FOO (6) JJ (lingo w/ret)
     12516 Ticks (min), 12.516 cycles
     12537 Ticks (avg), 12.537 cycles
     12537 Ticks (rms), 12.537 cycles

FOO (7) Clive#2 w/ret
     12500 Ticks (min), 12.500 cycles
     12532 Ticks (avg), 12.532 cycles
     12561 Ticks (rms), 12.561 cycles

FOO (8) johnsa#2
     25524 Ticks (min), 25.524 cycles
     27684 Ticks (avg), 27.684 cycles
     53286 Ticks (rms), 53.286 cycles

FOO (9) dioxin
     12496 Ticks (min), 12.496 cycles
     13210 Ticks (avg), 13.210 cycles
     24524 Ticks (rms), 24.525 cycles