Hey all,
So I'm working on a piece of code to compare two 32-byte (aligned) arrays using SSE. I've done a lot of work with SSE on the FP side, but not much on the integer side, save for the usual bulk moves/writes etc.
This is what I've got so far. I initially thought using ptest xmm,xmm would be better, but my requirement is that it must run on SSE3.
align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD
push esi
push edi
mov esi,arr1Ptr
mov edi,arr2Ptr
movaps xmm0,[esi]
movaps xmm1,[esi+16]
movaps xmm2,[edi]
movaps xmm3,[edi+16]
psubd xmm0,xmm2
psubd xmm1,xmm3
pcmpeqd xmm0,oword ptr sse_zero
pcmpeqd xmm1,oword ptr sse_zero
xor eax,eax           ; eax = 0  (return value: equal)
mov edx,eax
sub edx,1             ; edx = -1 (return value: not equal)
pmovmskb ebx,xmm0     ; 16-bit byte mask of the compare result
cmp bx,0ffffh         ; all bits set -> first 16 bytes equal
je short @F
mov eax,-1
pop edi
pop esi
ret
@@:
pmovmskb ebx,xmm1
cmp bx,0ffffh
cmovne eax,edx
pop edi
pop esi
ret
CompareArrays ENDP
I'm not happy with the mask operations and the subsequent comparison/branch. Can anyone think of ways to improve this?
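For anyone following along in C, the logic of the routine above maps onto SSE2 intrinsics like this. This is only a sketch: the function name and variable names are my own, and it assumes both inputs are 16-byte aligned, as in the assembly version.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper mirroring CompareArrays' logic:
   returns 0 if the two 32-byte blocks are equal, -1 otherwise. */
int compare_arrays_32(const void *a, const void *b)
{
    __m128i zero = _mm_setzero_si128();
    /* psubd: the difference is all-zero only where the dwords match */
    __m128i d0 = _mm_sub_epi32(_mm_load_si128((const __m128i *)a),
                               _mm_load_si128((const __m128i *)b));
    __m128i d1 = _mm_sub_epi32(_mm_load_si128((const __m128i *)a + 1),
                               _mm_load_si128((const __m128i *)b + 1));
    /* pcmpeqd against zero, then pmovmskb to pull a 16-bit byte mask */
    int m0 = _mm_movemask_epi8(_mm_cmpeq_epi32(d0, zero));
    int m1 = _mm_movemask_epi8(_mm_cmpeq_epi32(d1, zero));
    return (m0 == 0xFFFF && m1 == 0xFFFF) ? 0 : -1;
}
```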
Attempt 2:
align 16
CompareArrays PROC arr1Ptr:DWORD, arr2Ptr:DWORD
push esi
push edi
push ebx
push edx
mov esi,arr1Ptr
mov edi,arr2Ptr
movaps xmm0,[esi]
movaps xmm1,[esi+16]
psubd xmm0,oword ptr [edi]
psubd xmm1,oword ptr [edi+16]
pcmpeqd xmm0,oword ptr sse_zero
pcmpeqd xmm1,oword ptr sse_zero
xor eax,eax
mov edx,eax
sub edx,1
pmovmskb ebx,xmm0
cmp bx,0ffffh
cmovne eax,edx
pmovmskb ebx,xmm1
cmp bx,0ffffh
cmovne eax,edx
pop edx
pop ebx
pop edi
pop esi
ret
CompareArrays ENDP
Why not a direct comparison of [esi] and [edi]?
@@: movapd xmm0, [esi]
pcmpeqb xmm0, [edi] ; compare [esi] and [edi]
lea edi, [edi+16] ; increase position
lea esi, [esi+16] ; in memory
pmovmskb edx, xmm0 ; set byte mask in edx
cmp edx, 65535
jne @F
dec ecx
jns @B
@@:
:wink
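That loop shape also generalizes beyond 32 bytes to any multiple of 16. A rough C intrinsics equivalent of it (my own naming; it assumes aligned buffers and takes a 16-byte block count in the style of the ecx counter):

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>

/* Hypothetical helper: returns 1 if two aligned buffers of
   nblocks * 16 bytes compare equal, 0 at the first mismatch. */
int blocks_equal(const void *a, const void *b, size_t nblocks)
{
    const __m128i *pa = (const __m128i *)a;
    const __m128i *pb = (const __m128i *)b;
    for (size_t i = 0; i < nblocks; i++) {
        /* pcmpeqb + pmovmskb: 0xFFFF means all 16 bytes matched */
        __m128i eq = _mm_cmpeq_epi8(_mm_load_si128(pa + i),
                                    _mm_load_si128(pb + i));
        if (_mm_movemask_epi8(eq) != 0xFFFF)
            return 0;
    }
    return 1;
}
```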
Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd xmm0,[edx]
pcmpeqd xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
and eax, edx
;shl eax, 16
;add eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef
lingo, nice improvement!
my test of 100 million comparisons went down from 1400 ms to 1000 ms with my second attempt, then my 3rd attempt got it to 800 ms, and yours runs in 400 ms for me. Nice! :)
Pretty sweet Lingo. BSWAP EAX; OR EAX,EDX would be a wash
If the bit ordering is unimportant and only the equality/inequality matters.
Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh ; Note only looking at 16 bits, as high/low words and'ed, both bits must be high to propagate
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm1,[eax+16]
pcmpeqd xmm0,[edx]
pcmpeqd xmm1,[edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef
Or a tad faster
Usage:
invoke Comp32Array, addr array1, addr array2
cmp eax, 0FFFFh
jne NotEqual
;Equal:
......
.....
NotEqual:
......
option prologue:none
option epilogue:none
align 16
Comp32Array proc arr1Ptr:dword, arr2Ptr:dword
pop ecx
pop eax
pop edx
movdqa xmm0,[eax]
movdqa xmm2,[edx]
movdqa xmm1,[eax+16]
movdqa xmm3,[edx+16]
pcmpeqd xmm0,xmm2
pmovmskb eax,xmm0
pcmpeqd xmm1,xmm3
pmovmskb edx,xmm1
and eax, edx
jmp ecx
Comp32Array endp
option prologue:PrologueDef
option epilogue:EpilogueDef
Testbed. 6 cycles for 15 instructions is not so bad...
Celeron M:
6 cycles for CmpArray1
8 cycles for CmpArray2
7 cycles for CmpArray3
Not to appear argumentative, but I think we lose something with the integer result from the division of the average. The better/worse is going to be fractional cycles.
Empty
1612 Ticks (min), 0.054 cycles
FOO (1) Lingo
18165 Ticks (min), 18.165 cycles
21177 Ticks (avg), 21.177 cycles
21950 Ticks (rms), 21.951 cycles
FOO (2) Clive#1
18158 Ticks (min), 18.158 cycles
19113 Ticks (avg), 19.113 cycles
19564 Ticks (rms), 19.565 cycles
FOO (3)
17656 Ticks (min), 17.656 cycles
19291 Ticks (avg), 19.291 cycles
19837 Ticks (rms), 19.837 cycles
FOO (4)
18158 Ticks (min), 18.158 cycles
20207 Ticks (avg), 20.207 cycles
20901 Ticks (rms), 20.902 cycles
FOO (5) Clive#2
17648 Ticks (min), 17.648 cycles
24062 Ticks (avg), 24.062 cycles
22784 Ticks (rms), 22.785 cycles
FOO (6) JJ (lingo w/ret)
17003 Ticks (min), 17.003 cycles
17805 Ticks (avg), 17.805 cycles
18128 Ticks (rms), 18.128 cycles
FOO (7) Clive#2 w/ret
15503 Ticks (min), 15.503 cycles
17790 Ticks (avg), 17.790 cycles
18634 Ticks (rms), 18.634 cycles
FOO (8) johnsa#2
36510 Ticks (min), 36.510 cycles
39217 Ticks (avg), 39.217 cycles
36707 Ticks (rms), 36.708 cycles
"Not to appear argumentative,"
In this forum, any numbers published without files are obvious manipulation or just garbage....
As you will. The numbers seem to track the real-world tests johnsa did of your code and his. Your algorithm works really nicely (clean/efficient), and using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted.
-Clive
johnsa,
Clive is right about:
shl eax, 16
add eax, edx
is slower than
and eax, edx
So I edited my algo and the Usage, too..
Clive,
"using the RET method that JJ applied improves the observed performance, probably because the RET can be predicted."
It is just NOT TRUE (see JJ's times), and he really can "improve" my algo with ecx and edx preservation only... :lol
On an Atom N270 @ 1.6 GHz, the use of RET vs. register linking is quite profound. Looks decidedly like a pipeline flush, and it looks particularly bad because the loop sits so close to the return point.
http://www.anandtech.com/show/2493/11
67 cycles for CmpArray1
50 cycles for CmpArray2
67 cycles for CmpArray3
--- ok ---
CmpArray2 proc strA, strB ; Lingo but "normal" ret
mov eax, [esp+4]
mov edx, [esp+8]
movdqa xmm0, [eax]
movdqa xmm1, [eax+16]
pcmpeqd xmm0, [edx]
pcmpeqd xmm1, [edx+16]
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8
CmpArray2 endp
The various versions here all seem to average roughly the same for me over 100,000,000 iterations, at about 375-390 ms.
I think this is probably about as good as it can possibly get! Thanks for the hugely useful input on this guys, much appreciated.
Is it not quicker to merge the results of the SSE compares before transferring them to CPU registers?
Instead of this:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
shl eax, 16
add eax, edx
ret 8
Try something like this:
pand xmm0,xmm1
pmovmskb eax, xmm0 'eax now contains &hffff if the 32 bytes matched
ret 8
Fewer instructions, and it avoids one of the slow pmovmskb instructions.
Paul.
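Dioxin's merge-in-xmm idea, expressed as a C intrinsics sketch (my own naming, aligned inputs assumed): one pand in the SSE domain replaces the second pmovmskb and the integer AND.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Sketch: AND the two 16-byte compare results in an xmm register,
   then do a single pmovmskb. 0xFFFF means all 32 bytes matched. */
int comp32_pand(const void *a, const void *b)
{
    __m128i e0 = _mm_cmpeq_epi32(_mm_load_si128((const __m128i *)a),
                                 _mm_load_si128((const __m128i *)b));
    __m128i e1 = _mm_cmpeq_epi32(_mm_load_si128((const __m128i *)a + 1),
                                 _mm_load_si128((const __m128i *)b + 1));
    return _mm_movemask_epi8(_mm_and_si128(e0, e1));
}
```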
On a Prescott, losing the PMOVMSKB EDX,XMM1 / AND EAX,EDX appears to save a cycle. Loading all the values into XMM registers first is faster on the Prescott; no net gain on Atom.
Dioxin code fragment being sampled.
mov ebx,1000
@@: ; ALIGN 16
push edi
push esi
call callthis
cmp eax,0FFFFh
dec ebx
jnz @B
..
ALIGN 16
callthis:
mov eax, [esp+4]
mov edx, [esp+8]
movdqa xmm0,[eax]
movdqa xmm2,[edx]
movdqa xmm1,[eax+16]
movdqa xmm3,[edx+16]
pcmpeqd xmm0,xmm2
pcmpeqd xmm1,xmm3
pand xmm0,xmm1
pmovmskb eax,xmm0
ret 8
Prescott P4 3 GHz
FOO (1) Lingo
18158 Ticks (min), 18.158 cycles
19623 Ticks (avg), 19.623 cycles
20173 Ticks (rms), 20.173 cycles
FOO (2) Clive#1
18158 Ticks (min), 18.158 cycles
20262 Ticks (avg), 20.262 cycles
20931 Ticks (rms), 20.932 cycles
FOO (3)
17641 Ticks (min), 17.641 cycles
20868 Ticks (avg), 20.868 cycles
21710 Ticks (rms), 21.710 cycles
FOO (4)
18158 Ticks (min), 18.158 cycles
20407 Ticks (avg), 20.407 cycles
21118 Ticks (rms), 21.119 cycles
FOO (5) Clive#2
17655 Ticks (min), 17.655 cycles
19658 Ticks (avg), 19.658 cycles
20321 Ticks (rms), 20.322 cycles
FOO (6) JJ (lingo w/ret)
16988 Ticks (min), 16.988 cycles
18490 Ticks (avg), 18.490 cycles
19292 Ticks (rms), 19.293 cycles
FOO (7) Clive#2 w/ret
15503 Ticks (min), 15.503 cycles
16920 Ticks (avg), 16.920 cycles
17497 Ticks (rms), 17.497 cycles
FOO (8) johnsa#2
36510 Ticks (min), 36.510 cycles
40207 Ticks (avg), 40.207 cycles
41457 Ticks (rms), 41.458 cycles
FOO (9) dioxin
14505 Ticks (min), 14.505 cycles
15793 Ticks (avg), 15.793 cycles
16409 Ticks (rms), 16.410 cycles
My version, Celeron M, with source.
416 ticks for CmpArray1, Lingo -1
518 ticks for CmpArray2, Lingo+ret -1
418 ticks for CmpArray3, Clive 65535
404 ticks for CmpArray4, dioxin 65535
I find it faster to do this on a Phenom II:
!movdqu xmm0,[esi] 'get the first 32 bytes
!movdqu xmm1,[esi+16]
!pcmpeqd xmm0,[edi] 'compare with the second 32 bytes
!pcmpeqd xmm1,[edi+16]
!pand xmm0,xmm1
!pmovmskb eax, xmm0 'eax now contains &hffff if the 32 bytes matched
Not loading the data into a register before comparing; instead, just compare against the data in memory.
Since the code is now so short, wouldn't it make sense to inline it to remove the CALL/RET overhead?
Paul.
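In C terms, the inlining suggestion could look like the sketch below (hypothetical name, aligned inputs assumed): with the body this small, marking it static inline lets the compiler paste it at each call site, removing the CALL/RET pair entirely, much like the proc-to-macro change discussed later in the thread.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Sketch of the inlined 32-byte compare: returns 1 if equal, 0 if not.
   static inline removes the call overhead at each use site. */
static inline int comp32_inline(const void *a, const void *b)
{
    __m128i e0 = _mm_cmpeq_epi32(_mm_load_si128((const __m128i *)a),
                                 _mm_load_si128((const __m128i *)b));
    __m128i e1 = _mm_cmpeq_epi32(_mm_load_si128((const __m128i *)a + 1),
                                 _mm_load_si128((const __m128i *)b + 1));
    return _mm_movemask_epi8(_mm_and_si128(e0, e1)) == 0xFFFF;
}
```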
Quote from: dioxin on April 21, 2010, 11:53:17 PM
I find it faster to do this on a Phenom II:...
Paul.
Paul,
That's what I implemented in the attachment as CmpArray4. It is marginally faster than the others, even though I used movdqa.
Regards,
Jochen
This update: moving from proc to macro and AND'ing the values in the xmm regs first doubles the speed for me on my Core 2 Duo 2.0 GHz.
It's now doing 600,000,000 per second instead of 300,000,000. Huge boost.
On a Willamette P4, it's all pretty much a wash.
Empty
1592 Ticks (min), 0.053 cycles
FOO (1) Lingo
13016 Ticks (min), 13.016 cycles
13607 Ticks (avg), 13.607 cycles
20153 Ticks (rms), 20.154 cycles
FOO (2) Clive#1
12520 Ticks (min), 12.520 cycles
12568 Ticks (avg), 12.568 cycles
12591 Ticks (rms), 12.592 cycles
FOO (3)
12520 Ticks (min), 12.520 cycles
12539 Ticks (avg), 12.539 cycles
12539 Ticks (rms), 12.539 cycles
FOO (4)
12520 Ticks (min), 12.520 cycles
12542 Ticks (avg), 12.542 cycles
12542 Ticks (rms), 12.543 cycles
FOO (5) Clive#2
12520 Ticks (min), 12.520 cycles
12960 Ticks (avg), 12.960 cycles
18549 Ticks (rms), 18.550 cycles
FOO (6) JJ (lingo w/ret)
12516 Ticks (min), 12.516 cycles
12537 Ticks (avg), 12.537 cycles
12537 Ticks (rms), 12.537 cycles
FOO (7) Clive#2 w/ret
12500 Ticks (min), 12.500 cycles
12532 Ticks (avg), 12.532 cycles
12561 Ticks (rms), 12.561 cycles
FOO (8) johnsa#2
25524 Ticks (min), 25.524 cycles
27684 Ticks (avg), 27.684 cycles
53286 Ticks (rms), 53.286 cycles
FOO (9) dioxin
12496 Ticks (min), 12.496 cycles
13210 Ticks (avg), 13.210 cycles
24524 Ticks (rms), 24.525 cycles