SSE Integer Comparisons

Started by johnsa, April 19, 2010, 09:36:09 AM

johnsa

Hey all,

So I'm working on a piece of code to compare two 32-byte (aligned) arrays using SSE. I've done a lot of work with SSE on the FP side, but not much on the integer side, save for the usual bulk moves/writes etc.
So this is what I've got so far. I initially thought using ptest xmm,xmm would be better, but PTEST is SSE4.1 and my requirement is that this must run on SSE3.
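
For reference, the SSE4.1 route being ruled out might look like the sketch below (an editor's illustration, not code from the thread). PTEST xmm,xmm sets ZF when the register is all zero, so XOR-ing the two blocks and OR-ing the differences together gives a 32-byte equality test.

align 16
; hypothetical SSE4.1 variant, unusable here since PTEST needs SSE4.1
CompareArraysSSE41 PROC arr1Ptr:DWORD, arr2Ptr:DWORD
mov edx,arr1Ptr
mov ecx,arr2Ptr
movdqa xmm0,[edx]
movdqa xmm1,[edx+16]
pxor xmm0,oword ptr [ecx]     ; zero where the dwords match
pxor xmm1,oword ptr [ecx+16]
por xmm0,xmm1                 ; any difference survives the OR
xor eax,eax
ptest xmm0,xmm0               ; ZF=1 iff xmm0 is all zero
setnz al                      ; eax = 0 if equal, 1 if not
ret
CompareArraysSSE41 ENDP

His actual SSE3 code follows.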


align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD

push esi
push edi
push ebx ; ebx is clobbered by pmovmskb below

mov esi,arr1Ptr
mov edi,arr2Ptr

movaps xmm0,[esi]
movaps xmm1,[esi+16]

movaps xmm2,[edi]
movaps xmm3,[edi+16]

psubd xmm0,xmm2
psubd xmm1,xmm3

pcmpeqd xmm0,oword ptr sse_zero ; sse_zero = an aligned OWORD of zeros in .data
pcmpeqd xmm1,oword ptr sse_zero ; equal dwords -> lane set to all-ones

xor eax,eax ; eax = 0 : 'equal' result
mov edx,eax
sub edx,1   ; edx = -1 : 'not equal' result for cmovne

pmovmskb ebx,xmm0 ; byte mask: 0FFFFh iff the first 16 bytes match
cmp bx,0ffffh
je short @F       ; first half equal, go test the second half

mov eax,-1        ; first half differs: not equal
pop ebx
pop edi
pop esi
ret
@@:
pmovmskb ebx,xmm1 ; byte mask for the second 16 bytes
cmp bx,0ffffh
cmovne eax,edx    ; any difference -> eax = -1

pop ebx
pop edi
pop esi

ret

CompareArrays ENDP




I'm not happy with the mask operations and subsequent comparison/branch. Can anyone think of ways to improve this?

johnsa

Attempt 2:


align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD

push esi
push edi
push ebx
push edx

mov esi,arr1Ptr
mov edi,arr2Ptr

movaps xmm0,[esi]
movaps xmm1,[esi+16]

psubd xmm0,oword ptr [edi]
psubd xmm1,oword ptr [edi+16]

pcmpeqd xmm0,oword ptr sse_zero
pcmpeqd xmm1,oword ptr sse_zero

xor eax,eax
mov edx,eax
sub edx,1

pmovmskb ebx,xmm0
cmp bx,0ffffh
cmovne eax,edx

pmovmskb ebx,xmm1
cmp bx,0ffffh
cmovne eax,edx

pop edx
pop ebx
pop edi
pop esi

ret

CompareArrays ENDP

jj2007

Why not a direct comparison of esi and edi?

@@: movapd xmm0, [esi]
pcmpeqb xmm0, [edi] ; compare [esi] and [edi]
lea edi, [edi+16] ; increase position
lea esi, [esi+16] ; in memory
pmovmskb edx, xmm0 ; set byte mask in edx
cmp edx, 65535
jne @F
dec ecx
jns @B
@@:
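
A minimal setup sketch for the loop above (an assumption about the intended usage, not from jj2007's post; array1, array2 and arrays_equal are placeholder names). ecx must hold the number of 16-byte blocks minus one, i.e. 1 for a 32-byte compare:

mov esi, offset array1
mov edi, offset array2
mov ecx, 1          ; two 16-byte blocks; loop runs while ecx >= 0
; ... loop above ...
test ecx, ecx       ; ecx reached -1 only if every block matched
js arrays_equal     ; otherwise jne @F exited early on a mismatch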

lingo

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh      ; (0FFFFFFFFh with the commented-out shl/add combine)
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx ; return address (invoke pushed the args)
pop eax ; arr1Ptr
pop edx ; arr2Ptr
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd  xmm0,[edx]
pcmpeqd  xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
and eax, edx ; 0FFFFh iff both halves matched
;shl eax, 16
;add eax, edx
jmp ecx ; return through the saved address; the stack is already clean
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef

johnsa

lingo, nice improvement!

My test of 100 million comparisons went down from 1400ms to 1000ms with my second attempt, my 3rd attempt got it to 800ms, and yours runs in 400ms for me. Nice! :)


clive

Pretty sweet Lingo. A BSWAP EAX; OR EAX,EDX combine would be a wash, performance-wise.

The AND is fine if the bit ordering is unimportant and only the (in)equality matters.
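
A sketch of that BSWAP alternative (an editor's reading of clive's remark, not code from his post): the first mask lands in the high word byte-reversed, which is harmless when all that matters is all-ones versus not.

pmovmskb eax, xmm0
pmovmskb edx, xmm1
bswap eax           ; 0000FFFFh -> FFFF0000h (bytes reversed)
or eax, edx         ; eax = 0FFFFFFFFh iff both halves matched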

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh ; Note only looking at 16 bits, as high/low words and'ed, both bits must be high to propagate
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd  xmm0,[edx]
pcmpeqd  xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef


Or a tad faster

Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm2,[edx]
movdqa xmm1,[eax+16]
movdqa xmm3,[edx+16]
pcmpeqd xmm0,xmm2
pmovmskb eax,xmm0
pcmpeqd xmm1,xmm3
pmovmskb edx,xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef

jj2007

Testbed. 6 cycles for 15 instructions is not so bad...

Celeron M:
6       cycles for CmpArray1
8       cycles for CmpArray2
7       cycles for CmpArray3

clive

Not to appear argumentative, but I think we lose something with the integer result from the division for the average; the differences between better and worse are going to be fractional cycles.
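
(A worked instance of the point, using the numbers below: 18165 ticks over 1000 calls is 18.165 cycles per call, but an integer divide reports just 18, so two variants half a cycle apart can print the same whole number.)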

Empty
      1612 Ticks (min),  0.054 cycles

FOO (1) Lingo
     18165 Ticks (min), 18.165 cycles
     21177 Ticks (avg), 21.177 cycles
     21950 Ticks (rms), 21.951 cycles

FOO (2) Clive#1
     18158 Ticks (min), 18.158 cycles
     19113 Ticks (avg), 19.113 cycles
     19564 Ticks (rms), 19.565 cycles

FOO (3)
     17656 Ticks (min), 17.656 cycles
     19291 Ticks (avg), 19.291 cycles
     19837 Ticks (rms), 19.837 cycles

FOO (4)
     18158 Ticks (min), 18.158 cycles
     20207 Ticks (avg), 20.207 cycles
     20901 Ticks (rms), 20.902 cycles

FOO (5) Clive#2
     17648 Ticks (min), 17.648 cycles
     24062 Ticks (avg), 24.062 cycles
     22784 Ticks (rms), 22.785 cycles

FOO (6) JJ (lingo w/ret)
     17003 Ticks (min), 17.003 cycles
     17805 Ticks (avg), 17.805 cycles
     18128 Ticks (rms), 18.128 cycles

FOO (7) Clive#2 w/ret
     15503 Ticks (min), 15.503 cycles
     17790 Ticks (avg), 17.790 cycles
     18634 Ticks (rms), 18.634 cycles

FOO (8) johnsa#2
     36510 Ticks (min), 36.510 cycles
     39217 Ticks (avg), 39.217 cycles
     36707 Ticks (rms), 36.708 cycles

lingo

"Not to appear argumentative,"

In this forum, any numbers published without files are obvious manipulation, or just garbage...

clive

As you will. The numbers seem to track the real-world tests johnsa did of yours/his. Your algorithm works really nicely (clean/efficient), and using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted.

-Clive

lingo

johnsa,
Clive is right that:

shl    eax, 16
add    eax, edx

is slower than

and    eax, edx

so I edited my algo and the Usage too...

Clive,
"using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted."

It is just NOT TRUE (see jj's times), and he really can "improve" my algo only by preserving ecx and edx... :lol

clive

On an Atom N270 @ 1.6 GHz the gain from RET vs register linking (JMP ECX) is quite profound. Looks decidedly like a pipeline flush. It looks particularly bad because the test loops back so close to the return point.
http://www.anandtech.com/show/2493/11

67      cycles for CmpArray1
50      cycles for CmpArray2
67      cycles for CmpArray3

--- ok ---


CmpArray2 proc strA, strB ; Lingo, but with a "normal" ret (prologue suppressed, so the args sit at [esp+4]/[esp+8])
mov eax, [esp+4]
mov edx, [esp+8]
movdqa xmm0, [eax]
movdqa xmm1, [eax+16]
pcmpeqd xmm0, [edx]
pcmpeqd xmm1, [edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8
CmpArray2 endp

johnsa

The various versions here all average roughly the same for me over 100,000,000 iterations: about 375-390ms.
I think this is probably about as good as it can get! Thanks for the hugely useful input on this, guys, much appreciated.


dioxin

Is it not quicker to merge the results of the SSE compares before transferring them to CPU registers?

Instead of this:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8


Try something like this:
pand xmm0,xmm1      ; merge the two compare results in the SSE domain
pmovmskb eax, xmm0  ; eax now contains 0FFFFh if the 32 bytes matched
ret 8


   Fewer instructions, and it avoids one of the slow pmovmskb instructions.

Paul.
   

clive

On a Prescott, dropping the PMOVMSKB EDX,XMM1 / AND EAX,EDX pair appears to save a cycle. Loading all the values into XMM registers first is faster on the Prescott; no net gain on Atom.

Dioxin code fragment being sampled.

   mov ebx,1000
@@: ; ALIGN 16
    push  edi
    push  esi
    call  callthis
    cmp   eax,0FFFFh

   dec ebx
   jnz @B

..

  ALIGN 16

callthis:
    mov eax, [esp+4]
    mov edx, [esp+8]
    movdqa xmm0,[eax]
    movdqa xmm2,[edx]
    movdqa xmm1,[eax+16]
    movdqa xmm3,[edx+16]
    pcmpeqd xmm0,xmm2
    pcmpeqd xmm1,xmm3
    pand xmm0,xmm1
    pmovmskb eax,xmm0
   ret 8


Prescott P4 3 GHz
FOO (1) Lingo
     18158 Ticks (min), 18.158 cycles
     19623 Ticks (avg), 19.623 cycles
     20173 Ticks (rms), 20.173 cycles

FOO (2) Clive#1
     18158 Ticks (min), 18.158 cycles
     20262 Ticks (avg), 20.262 cycles
     20931 Ticks (rms), 20.932 cycles

FOO (3)
     17641 Ticks (min), 17.641 cycles
     20868 Ticks (avg), 20.868 cycles
     21710 Ticks (rms), 21.710 cycles

FOO (4)
     18158 Ticks (min), 18.158 cycles
     20407 Ticks (avg), 20.407 cycles
     21118 Ticks (rms), 21.119 cycles

FOO (5) Clive#2
     17655 Ticks (min), 17.655 cycles
     19658 Ticks (avg), 19.658 cycles
     20321 Ticks (rms), 20.322 cycles

FOO (6) JJ (lingo w/ret)
     16988 Ticks (min), 16.988 cycles
     18490 Ticks (avg), 18.490 cycles
     19292 Ticks (rms), 19.293 cycles

FOO (7) Clive#2 w/ret
     15503 Ticks (min), 15.503 cycles
     16920 Ticks (avg), 16.920 cycles
     17497 Ticks (rms), 17.497 cycles

FOO (8) johnsa#2
     36510 Ticks (min), 36.510 cycles
     40207 Ticks (avg), 40.207 cycles
     41457 Ticks (rms), 41.458 cycles

FOO (9) dioxin
     14505 Ticks (min), 14.505 cycles
     15793 Ticks (avg), 15.793 cycles
     16409 Ticks (rms), 16.410 cycles

