Hey guys, I gotta write this at some point but I didn't want to deny you the chance to go to town on it. I couldn't find it with the search tool, and it's a bit of an easy, fun one :lol I suggest testing on 1280x1024 for timings; it looks like a perfect task for SSE (I think) :bg Is there a default testbed? If not, can we add a sticky one? I could maybe rip the guts out of one of JJ's previous ones (I would try to do it humanely :lol)
For speed, what matters is the bit depth. I prefer 24 or 32-bit.
Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once. I'm not sure on their relative timings.
SSE would definitely be faster if everything is aligned properly :P
Is SSE really that much faster?
What is the average timing difference between mov and movq?
Looks like BSWAP has longer latency and same throughput (depending on the CPU), so PSHUFB should probably be about 4x faster overall in well-optimized code (i.e. throughput-bound). :bg
Using only integer instructions and running on a P3 I can't get below about 4 cycles per pixel.
;=====================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;=====================================================================
.data
buffer dd 1000 dup(0)
.code
;=====================================================================
start:
;=====================================================================
lea edi, buffer
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
UF=4
mov ecx, 1000-UF ; start at the last full group of UF dwords so every access stays inside buffer
@@:
DISP=0
REPEAT UF
mov eax, [edi+ecx*4+DISP]
mov edx, eax
bswap eax
and edx, 0ff000000h
shr eax, 8
or eax, edx
mov [edi+ecx*4+DISP], eax
DISP=DISP+4
ENDM
sub ecx,UF
jns @B
counter_end
print ustr$(eax)," cycles",13,10
counter_begin 1000, HIGH_PRIORITY_CLASS
counter_end
print ustr$(eax)," cycles",13,10,13,10
inkey "Press any key to exit..."
exit
;=====================================================================
end start
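For anyone who wants to check the byte order without firing up MASM, the per-pixel work of the loop above can be modelled in C. This is only a sketch: `bswap32`, `swap_rb_bswap` and `swap_rb_buffer` are names made up here, and `bswap32` is a portable stand-in for the BSWAP instruction, not the instruction itself. It says nothing about timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for BSWAP: reverse the order of the four bytes. */
uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* One pixel, same steps as the loop body above. */
uint32_t swap_rb_bswap(uint32_t px)
{
    uint32_t alpha = px & 0xFF000000u;  /* and edx, 0ff000000h   */
    uint32_t rev   = bswap32(px) >> 8;  /* bswap eax / shr eax, 8 */
    return rev | alpha;                 /* or eax, edx            */
}

/* Whole buffer, one dword per pixel (hypothetical helper, not unrolled). */
void swap_rb_buffer(uint32_t *px, size_t count)
{
    assert(px != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        px[i] = swap_rb_bswap(px[i]);
}
```

With this model, the dword 11223344h becomes 11443322h: the top byte stays put and the lowest and third bytes trade places.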
It would be interesting to see how much faster a version based on PSHUFD would be.
:/ With that timing I'll have to create the BGRA data in a separate process, dependent on card features. It's always the simple things that are the most difficult :lol
1280x1024 RGBA-BGRA:
File Size 5Mb
9651293 cycles
0 cycles
Press any key to exit...
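To put the full-frame figure next to the "about 4 cycles per pixel" quoted earlier, the division can be sketched in C. `cycles_per_pixel` is a made-up helper, and the two numbers may well come from different machines and buffer sizes, so treat this as a rough comparison only.

```c
#include <assert.h>

/* Average cost per pixel for a frame of width x height pixels. */
double cycles_per_pixel(unsigned long long cycles, unsigned width, unsigned height)
{
    assert(width > 0 && height > 0);
    return (double)cycles / ((double)width * (double)height);
}
```

Calling `cycles_per_pixel(9651293ULL, 1280, 1024)` gives roughly 7.4 cycles per pixel for the 1280x1024 run above.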
Ty for your input Michael
Hi,
Looking at the code posted by MichaelW, I decided to
run an excursion as it looked strange to me. Four variants
were run on my PIII. The results surprised me a bit. SHR
is apparently much faster than ROR, which was the first
surprise. The extra memory access variant also was not
what I expected. Oh well, so much for expectations.
Regards,
Steve N.
mov eax, [edi+ecx*4+DISP]
mov edx, eax
bswap eax
and edx, 0ff000000h
shr eax, 8
or eax, edx
mov [edi+ecx*4+DISP], eax
4071 cycles
0 cycles
Press any key to exit...
mov eax, [edi+ecx*4+DISP]
; mov edx, eax
bswap eax
; and edx, 0ff000000h
; shr eax, 8
ROR eax, 8
; or eax, edx
mov [edi+ecx*4+DISP], eax
4301 cycles
0 cycles
Press any key to exit...
mov eax, [edi+ecx*4+DISP]
; mov edx, eax
bswap eax
; and edx, 0ff000000h
shr eax, 8
; ROR eax, 8
; or eax, edx
mov [edi+ecx*4+DISP], eax
3279 cycles
0 cycles
Press any key to exit...
; mov eax, [edi+ecx*4+DISP]
MOV AL, [edi+ecx*4+DISP]
MOV DL, [edi+ecx*4+DISP+2]
; mov edx, eax
; bswap eax
; and edx, 0ff000000h
; shr eax, 8
; ROR eax, 8
; or eax, edx
; mov [edi+ecx*4+DISP], eax
mov [edi+ecx*4+DISP+2], AL
mov [edi+ecx*4+DISP], DL
2785 cycles
MOVZX EAX, BYTE PTR [edi+ecx*4+DISP]
MOVZX EDX, BYTE PTR [edi+ecx*4+DISP+2]
MOV [edi+ecx*4+DISP+2], AL
MOV [edi+ecx*4+DISP], DL
2852 cycles
0 cycles
Press any key to exit...
Quote from: Neo on May 25, 2010, 03:19:10 AM
Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once. I'm not sure on their relative timings.
Probably a lot faster, but you impose a serious limit on the hardware: pshufb is SSSE3, sometimes called SSE4...
these are SSE2 instructions - maybe you can find something to do the job
pshufd - Shuffles 32bit values in a complex way.
pshufhw - Shuffles high 16bit values in a complex way.
pshuflw - Shuffles low 16bit values in a complex way.
unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128 bit sources into 1.
punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
punpckhqdq - Unpacks and interleaves top 64bit integers from 2 128bit sources into 1.
punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
punpcklqdq - Unpacks and interleaves bottom 64bit integers from 2 128bit sources into 1.
On old hardware I wonder if this approach is of any use.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL rgba :DWORD
mov rgba, 0099AAEEh
print uhex$(rgba),13,10
lea edx, rgba
movzx eax, BYTE PTR [edx]
movzx ecx, BYTE PTR [edx+2]
mov [edx], cl
mov [edx+2], al
print uhex$(rgba),13,10
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
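The two-reads, two-writes idea in the template above translates almost line for line into C. This is a minimal sketch, assuming 4 bytes per pixel with the alpha byte last in memory; `swap_rb_bytes` is a hypothetical name and nothing here has been timed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap byte 0 and byte 2 of every 4-byte pixel, leaving bytes 1 and 3 alone. */
void swap_rb_bytes(uint8_t *px, size_t pixel_count)
{
    assert(px != NULL || pixel_count == 0);
    for (size_t i = 0; i < pixel_count; ++i) {
        uint8_t *p = px + 4 * i;
        uint8_t first = p[0];   /* movzx eax, BYTE PTR [edx]   */
        uint8_t third = p[2];   /* movzx ecx, BYTE PTR [edx+2] */
        p[0] = third;           /* mov [edx], cl               */
        p[2] = first;           /* mov [edx+2], al             */
    }
}
```

For the test value 0099AAEEh (bytes EE AA 99 00 in memory) this leaves 99 AA EE 00, i.e. 00EEAA99h.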
SSE version attached. I have a feeling that it could be improved a lot... ::)
3831 cycles for RgbSwap MichaelW
2389 cycles for RgbSwap Hutch
2417 cycles for RgbSwap Hutch2
7280 cycles for RgbSwapSSE
Any ideas how to get rid of two of the pshufs? It looks as if it should be possible...
Quote:
@@: dec ecx
movq xmm1, qword ptr [edx+8*ecx]
pxor xmm0, xmm0 ; xmm0 may contain garbage
punpcklbw xmm0, xmm1 ; expand 8 bytes to 8 xmm0 words
pshufd xmm0, xmm0, 00011011b ; inspired by drizz (http://www.asmcommunity.net/board/index.php?topic=29743.0)
pshuflw xmm0, xmm0, 10110001b
pshufhw xmm0, xmm0, 10110001b ; all words swapped
pshufd xmm0, xmm0, 01001110b ; switch low dwords
psrlq xmm0, 24 ; shift right by three bytes
pxor xmm2, xmm2
pand xmm1, TheAnd
packsswb xmm0, xmm2 ; pack words to bytes
paddb xmm0, xmm1
movq qword ptr [edx+8*ecx], xmm0 ; back to mem, 8 bytes
test ecx, ecx
jne @B
jj,
there is a small bug in your algo: you are using packsswb, which should be packuswb (?)
Here my suggestion for an sse2 version:
align 16
RgbSwapSSE2 proc ; Src, n4Pixels
pop eax ; ret address
pop edx ; Src
pop ecx ; N*4 pixels
align 16
@@: dec ecx
movdqa xmm0,[edx]
movdqa xmm1,xmm0 ; movdqa xmm1,[edx]
psllw xmm0,8
psrlw xmm1,8
pshuflw xmm0,xmm0,10110001y
pshufhw xmm0,xmm0,10110001y
psrldq xmm0,1
pslldq xmm1,1
por xmm0,xmm1
movdqa [edx],xmm0
lea edx,[edx+16]
test ecx, ecx
jne @B
@@: jmp eax
RgbSwapSSE2 endp
core2duo:
3176 cycles for RgbSwap MichaelW
2238 cycles for RgbSwap Hutch
1470 cycles for RgbSwapSSE2 qWord
2217 cycles for RgbSwap Hutch2
4875 cycles for RgbSwapSSE
qWord
See if unrolling the algo gives it some more legs. This one just reads 2 pixels at a time, twice. The idea is to use a compatible algo for older hardware and SSE for later processors, and just do a processor detect to see what can run what.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL rgba[4]:DWORD
push ebx
push esi
push edi
lea esi, rgba
mov [esi], DWORD PTR 0099AAEEh
mov [esi+4], DWORD PTR 007744DDh
mov [esi+8], DWORD PTR 003399FFh
mov [esi+12], DWORD PTR 007744DDh
print uhex$([esi])," "
print uhex$([esi+4])," "
print uhex$([esi+8])," "
print uhex$([esi+12]),13,10
; --------------------------------------
movzx eax, BYTE PTR [esi]
movzx ebx, BYTE PTR [esi+2]
movzx ecx, BYTE PTR [esi+4]
movzx edx, BYTE PTR [esi+6]
mov [esi], bl
mov [esi+2], al
mov [esi+4], dl
mov [esi+6], cl
; --------------------------------------
movzx eax, BYTE PTR [esi+8]
movzx ebx, BYTE PTR [esi+10]
movzx ecx, BYTE PTR [esi+12]
movzx edx, BYTE PTR [esi+14]
mov [esi+8], bl
mov [esi+10], al
mov [esi+12], dl
mov [esi+14], cl
; --------------------------------------
print uhex$([esi])," "
print uhex$([esi+4])," "
print uhex$([esi+8])," "
print uhex$([esi+12]),13,10
pop edi
pop esi
pop ebx
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
How would multithreading go with this? Or would 4MB be too small to justify setting up another thread?
Quote from: qWord on May 26, 2010, 10:18:58 PM
there is a small bug in your algo: you are using packsswb, which should be packuswb.(?)
Not so relevant, it works because the bytes will never saturate the word; but it is utterly slow.
Your algo is cute. I took the liberty to comment it.
Quote:
align 16 ; qWord (http://www.masm32.com/board/index.php?topic=14058.msg111586#msg111586)
RgbSwapSSE2 proc ; Src, n4Pixels
pop eax ; ret address
pop edx ; Src
pop ecx ; Bytes
shr ecx, 4 ; bytes / 16: each loop iteration processes 16 bytes (4 dwords)
align 16
@@:
movdqa xmm0, [edx]
movdqa xmm1, xmm0 ; movdqa xmm1,[edx]
; 31726762 31524742 61726762 41524742 aka 1rgb 1RGB argb ARGB
psllw xmm0, 8 ; shift left by one byte, draw in zeros
; 72006200 52004200 72006200 52004200 aka r_b_ R_B_ r_b_ R_B_
psrlw xmm1, 8 ; shift right by one byte, draw in zeros
; 00310067 00310047 00610067 00410047 aka _1_g _1_G _a_g _A_G
pshuflw xmm0, xmm0, 10110001y ; reverse order of lower half words
pshufhw xmm0, xmm0, 10110001y ; reverse order of upper half words
; 62007200 42005200 62007200 42005200 aka b_r_ B_R_ b_r_ B_R_
psrldq xmm0, 1 ; shift right again
; 00620072 00420052 00620072 00420052 aka _b_r _B_R _b_r _B_R
pslldq xmm1, 1 ; shift left again
; 31006700 31004700 61006700 41004700 aka 1_g_ 1_G_ a_g_ A_G_
por xmm0, xmm1
; 31626772 31424752 61626772 41424752 aka 1bgr 1BGR abgr ABGR
movdqa [edx], xmm0
lea edx, [edx+16]
dec ecx
jg @B ; jump if greater than zero
jmp eax ; ret address still in eax
RgbSwapSSE2 endp
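The same sequence can be written with SSE2 intrinsics in C, which may be easier to experiment with. A hedged sketch only: `swap_rb_sse2` is a made-up name, it uses unaligned loads and stores where the routine above uses movdqa (which needs 16-byte alignment), and it falls back to a plain byte swap for leftover pixels or when the compiler does not target SSE2. Timings will depend on the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Swap bytes 0 and 2 of every 4-byte pixel, 4 pixels per SSE2 step. */
void swap_rb_sse2(uint8_t *px, size_t pixel_count)
{
    size_t i = 0;
    assert(px != NULL || pixel_count == 0);
#if defined(__SSE2__)
    for (; i + 4 <= pixel_count; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(px + 4 * i));
        __m128i rb = _mm_slli_epi16(v, 8);    /* psllw 8: r_b_             */
        __m128i ga = _mm_srli_epi16(v, 8);    /* psrlw 8: _a_g             */
        rb = _mm_shufflelo_epi16(rb, 0xB1);   /* pshuflw 10110001y         */
        rb = _mm_shufflehi_epi16(rb, 0xB1);   /* pshufhw 10110001y: b_r_   */
        rb = _mm_srli_si128(rb, 1);           /* psrldq 1: _b_r            */
        ga = _mm_slli_si128(ga, 1);           /* pslldq 1: a_g_            */
        _mm_storeu_si128((__m128i *)(px + 4 * i), _mm_or_si128(rb, ga));
    }
#endif
    for (; i < pixel_count; ++i) {            /* scalar tail / fallback    */
        uint8_t t = px[4 * i];
        px[4 * i] = px[4 * i + 2];
        px[4 * i + 2] = t;
    }
}
```

The whole-register byte shifts only ever carry a zero byte across a pixel boundary, which is why the final OR does not mix neighbouring pixels.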
Quote from: sinsi on May 27, 2010, 04:57:47 AM
How would multithreading go with this? Or would 4MB be too small to justify setting up another thread?
If you've already got threads set up for the app as a whole, there's not much downside to multithreading it unless it's pretty short (i.e. the down side would be the thread sync time and a mild nuisance to the caches). That said, multi-threading it is the easy part and can be done on top of single-thread optimizations.
btw, thanks guys for pointing out the obvious thing I completely missed, i.e. that it's not just a BSWAP to reorder the bytes, especially if you're using the alpha. :U PSHUFB is probably much faster then, especially if you unroll it (though on an i7, unrolling might not matter as much). If I weren't dead tired at the moment, I'd fire up my performance viewer and get the full scaling of it; it should only take a couple of minutes, but maybe tomorrow night.
Quote from: Neo on May 27, 2010, 08:39:36 AM
PSHUFB is probably much faster
Can't test it with my legacy CPUs :(
Intel: (http://www.intel.com/technology/itj/2007/v11i4/1-inside/5-vectorizer.htm)
This conversion between a little-endian and big-endian representation of 32-bit data
elements (4 bytes) can be vectorized effectively as follows.
@@: movdqa xmm1, XMMWORD PTR [_b+eax] ; load 16-bytes from b
pshufb xmm1, xmm0 ; shuffle 16-bytes as defined in xmm0
movdqa XMMWORD PTR [_a+eax], xmm1 ; store 16-bytes into a
add eax, 16
cmp eax, ebx
jb @B
Here, register xmm0 is pre-loaded with the appropriate 4x4 reshuffling pattern.
I decided it was worth the loss of sleep to run it in my editor to get some neat plots. :bg
Time in clock cycles up to 1600 pixels (the unlabelled one is the RgbSwapSSE2):
(http://ndickson.files.wordpress.com/2010/05/rgbaperf1600.png)
Time for larger numbers of pixels (only showing PSHUFB, though for now it's still less than the rest):
(switch upward occurs ~32KB, the per-core L1 cache size on this CPU)
(http://ndickson.files.wordpress.com/2010/05/rgbaperf16k.png)
(http://ndickson.files.wordpress.com/2010/05/rgbaperf160k.png)
Behaviour > 1MB (per-core L2 cache size) is dominated by memory access time:
(http://ndickson.files.wordpress.com/2010/05/rgbaperf1m.png)
Edit: Fixed misquoted size and clarified cache sizes.
Quote from: qWord on May 26, 2010, 10:18:58 PM
Here my suggestion for an sse2 version:
Celeron M:
3831 cycles for RgbSwap MichaelW
2409 cycles for RgbSwap Hutch
3711 cycles for RgbSwapSSE2 qWord
2415 cycles for RgbSwap Hutch2
7305 cycles for RgbSwapSSE
Strange behaviour when compared to P4 results.
JJ,
here is your last test on the 2 quads I work on.
Core2 3.0 gig Quad
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1
3099 cycles for RgbSwap MichaelW
2050 cycles for RgbSwap Hutch
1271 cycles for RgbSwapSSE2 qWord
2057 cycles for RgbSwap Hutch2
3549 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
i7 2.8 gig Quad
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1
2592 cycles for RgbSwap MichaelW
1934 cycles for RgbSwap Hutch
1014 cycles for RgbSwapSSE2 qWord
1778 cycles for RgbSwap Hutch2
2190 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
I've modified the sse2 version a bit: the shifts are replaced by pand/por and the loop has been unrolled (16 pixels per loop).
align 16
RgbSwapSSE2 proc ; Src, n16Pixels
pop eax ; ret address
pop edx ; Src
pop ecx ; N*16 pixels
_RGB SEGMENT page
rgb_simd_msk1 OWORD 000ff00ff00ff00ff00ff00ff00ff00ffh
rgb_simd_msk2 OWORD 0ff00ff00ff00ff00ff00ff00ff00ff00h
_RGB ENDS
align 16
@@: movdqa xmm0,[edx+0*16]
movdqa xmm1,xmm0
movdqa xmm2,[edx+1*16]
movdqa xmm3,xmm2
movdqa xmm4,[edx+2*16]
movdqa xmm5,xmm4
movdqa xmm6,[edx+3*16]
movdqa xmm7,xmm6
pand xmm0,rgb_simd_msk1
pand xmm1,rgb_simd_msk2
pshuflw xmm0,xmm0,10110001y
pshufhw xmm0,xmm0,10110001y
por xmm0,xmm1
pand xmm2,rgb_simd_msk1
pand xmm3,rgb_simd_msk2
pshuflw xmm2,xmm2,10110001y
pshufhw xmm2,xmm2,10110001y
por xmm2,xmm3
pand xmm4,rgb_simd_msk1
pand xmm5,rgb_simd_msk2
pshuflw xmm4,xmm4,10110001y
pshufhw xmm4,xmm4,10110001y
por xmm4,xmm5
pand xmm6,rgb_simd_msk1
pand xmm7,rgb_simd_msk2
pshuflw xmm6,xmm6,10110001y
pshufhw xmm6,xmm6,10110001y
por xmm6,xmm7
movdqa [edx+0*16],xmm0 ; the results live in xmm0/xmm2/xmm4/xmm6
movdqa [edx+1*16],xmm2 ; (xmm1/xmm3/xmm5/xmm7 only hold the masked copies)
movdqa [edx+2*16],xmm4
movdqa [edx+3*16],xmm6
lea edx,[edx+4*16]
dec ecx
jne @B
@@: jmp eax
RgbSwapSSE2 endp
c2d:
3167 cycles for RgbSwap MichaelW
2211 cycles for RgbSwap Hutch
957 cycles for RgbSwapSSE2 qWord
2229 cycles for RgbSwap Hutch2
4898 cycles for RgbSwapSSE
0 cycles
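The pand/por idea is easy to check on a single dword in C. A minimal sketch, assuming the two constants play the role of rgb_simd_msk1 and rgb_simd_msk2 restricted to one pixel; `swap_rb_masked` and `swap_rb_masked_buffer` are hypothetical names, and the 16-bit rotate stands in for what pshuflw/pshufhw do to each pair of words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One pixel: isolate the two bytes to swap, exchange them, merge back. */
uint32_t swap_rb_masked(uint32_t px)
{
    uint32_t rb = px & 0x00FF00FFu;   /* pand with rgb_simd_msk1: bytes 0 and 2 */
    uint32_t ga = px & 0xFF00FF00u;   /* pand with rgb_simd_msk2: bytes 1 and 3 */
    rb = (rb >> 16) | (rb << 16);     /* swap the two 16-bit words               */
    return rb | ga;                   /* por                                     */
}

/* Whole buffer, one dword per pixel. */
void swap_rb_masked_buffer(uint32_t *px, size_t count)
{
    assert(px != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        px[i] = swap_rb_masked(px[i]);
}
```

Because the bytes that stay in place are never shifted, this form needs no byte shifts at all, only the word swap.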
Yes, that is a lot faster with the SSE.
Core2 3.0 gig Quad
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1
3120 cycles for RgbSwap MichaelW
2032 cycles for RgbSwap Hutch
783 cycles for RgbSwapSSE2 qWord
2026 cycles for RgbSwap Hutch2
3518 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
interesting ... on my core2 duo an SSSE3 version (pshufb) is slower than the SSE2 version:
3277 cycles for RgbSwap MichaelW
2253 cycles for RgbSwap Hutch
997 cycles for RgbSwapSSE2 qWord
1197 cycles for RgbSwapSSSE3
2232 cycles for RgbSwap Hutch2
4796 cycles for RgbSwapSSE
0 cycles
Quote from: qWord on May 28, 2010, 12:33:31 AM
interesting ... on my core2 duo an SSSE3 version (pshufb) is slower than the SSE2 version:
Try unrolling the loop 4x and using a single register for addressing. That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.
:wink
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
3034 cycles for RgbSwap MichaelW
2023 cycles for RgbSwap Hutch
1281 cycles for RgbSwapSSE2 qWord
525 cycles for RgbSwapLingo
2057 cycles for RgbSwap Hutch2
3692 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
and code
align 16
Msk dq 0704050603000102h
dq 0F0C0D0E0B08090Ah
db 5 Dup(0cch)
RgbSwapLingo proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
add eax, 16
pshufb xmm1, xmm0
add ecx, -1
movdqa [eax-16], xmm1
jne @b
jmp edx
RgbSwapLingo endp
Unrolled version:
RgbSwapLingoUn proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
movdqa xmm2, [eax+16]
pshufb xmm1, xmm0
add eax, 32
pshufb xmm2, xmm0
add ecx, -2
movdqa [eax-32], xmm1
movdqa [eax-16], xmm2
jne @b
jmp edx
RgbSwapLingoUn endp
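For those of us without an SSSE3 CPU, what PSHUFB does with that mask can be modelled byte by byte in C. This is a sketch of the instruction's documented behaviour, not a fast routine: `pshufb_model`, `swap_rb_block16` and `kMsk` are made-up names, and `kMsk` simply lists the bytes of Msk in memory order (little-endian reading of the two dq values).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes of Msk as they sit in memory: each entry names the source byte to fetch. */
static const uint8_t kMsk[16] = { 2, 1, 0, 3,   6, 5, 4, 7,
                                 10, 9, 8, 11, 14, 13, 12, 15 };

/* Scalar model of PSHUFB on one 16-byte register.
   A control byte with bit 7 set yields 0, otherwise its low 4 bits pick a source byte. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t ctrl[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; ++i)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    memcpy(dst, tmp, sizeof tmp);   /* temp copy keeps dst == src safe */
}

/* Swap bytes 0 and 2 in each of the 4 pixels of a 16-byte block, in place. */
void swap_rb_block16(uint8_t block[16])
{
    assert(block != NULL);
    pshufb_model(block, block, kMsk);
}
```

Reading the mask this way also shows why one shuffle is enough: every output byte is fetched directly from its final source position, so no shifts or ORs are needed.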
Quote from: qWord on May 27, 2010, 11:49:41 PM
I've modified the sse2 version a bit: the shifts are replaced by pand/por and the loop has been unrolled
Celeron M:
3839 cycles for RgbSwap MichaelW
2388 cycles for RgbSwap Hutch
2082 cycles for RgbSwapSSE2 qWord
2413 cycles for RgbSwap Hutch2
7301 cycles for RgbSwapSSE
Prescott P4:
7305 cycles for RgbSwap MichaelW
6261 cycles for RgbSwap Hutch
1858 cycles for RgbSwapSSE2 qWord
8943 cycles for RgbSwap Hutch2
15262 cycles for RgbSwapSSE
:U
P.S.: Lingo's algo triggers a GPF, but this time it's my hardware's fault :bg
Lingo's test.
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
3041 cycles for RgbSwap MichaelW
2053 cycles for RgbSwap Hutch
1268 cycles for RgbSwapSSE2 qWord
527 cycles for RgbSwapLingo
2046 cycles for RgbSwap Hutch2
3521 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
Quote from: Neo on May 28, 2010, 05:14:03 AM
That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.
What program are you using?
I've modified the test bed a bit and added lingo's test:
12463 cycles for RgbSwap MichaelW
8852 cycles for RgbSwap Hutch
3414 cycles for RgbSwapSSE2 qWord
3144 cycles for RgbSwapLingo (SSSE3)
9036 cycles for RgbSwap Hutch2
18519 cycles for RgbSwapSSE
0 cycles
Quote from: hutch-- on May 26, 2010, 04:53:10 PM
On old hardware I wonder if this approach is of any use.
Hi,
Updated my data in Reply #8.
Cheers,
Steve N.
Possible variants. I feel handicapped because both my CPUs don't have pshufb...
RgbSwapLingo2 proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
lea ecx, [eax+4*ecx] ; create limit
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
add eax, 16
pshufb xmm1, xmm0
cmp eax, ecx
movdqa [eax-16], xmm1
jb @b ; unsigned compare, eax and ecx are addresses
jmp edx
RgbSwapLingo2 endp
RgbSwapLingo3 proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
lea ecx, [eax+4*ecx] ; create limit
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax] ; unrolled once
pshufb xmm1, xmm0
movdqa [eax], xmm1
movdqa xmm1, [eax+16]
pshufb xmm1, xmm0
movdqa [eax+16], xmm1
add eax, 32
cmp eax, ecx
jb @b ; unsigned compare, eax and ecx are addresses
jmp edx
RgbSwapLingo3 endp
"What programm are you using?
I've modified the test bed a bit and add lingo's test:"
I don't think unrolled variants are safe... anyway:
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1
12421 cycles for RgbSwap MichaelW
8217 cycles for RgbSwap Hutch
3256 cycles for RgbSwapSSE2 qWord
2068 cycles for RgbSwapLingo (SSSE3)
1556 cycles for RgbSwapLingoUnrolled
8215 cycles for RgbSwap Hutch2
14371 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
i7 quad 2.8 gig
10131 cycles for RgbSwap MichaelW
6638 cycles for RgbSwap Hutch
2681 cycles for RgbSwapSSE2 qWord
2667 cycles for RgbSwapLingo (SSSE3)
1259 cycles for RgbSwapLingoUnrolled
6645 cycles for RgbSwap Hutch2
8405 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
Core2 quad 3.0 gig
12433 cycles for RgbSwap MichaelW
8237 cycles for RgbSwap Hutch
3260 cycles for RgbSwapSSE2 qWord
2068 cycles for RgbSwapLingo (SSSE3)
1556 cycles for RgbSwapLingoUnrolled
8224 cycles for RgbSwap Hutch2
14363 cycles for RgbSwapSSE
0 cycles
Press any key to exit...
:P
Coming to you from my 2002 SHITBOX, this is the fastest of the legacy algos. I tweaked Michael's bswap algo and got it about 25% faster, but a bswap and a ror are enough to make it slower than 2 memory reads and 2 memory writes per pixel.
RgbSwapH2 proc rsSrc, rsBytes
push ebx
push esi
push edi
mov esi, rsBytes
mov edi, rsSrc
shr esi, 4 ; bytes / 16: each loop iteration swaps 4 pixels (16 bytes)
@@:
mov al, BYTE PTR [edi]
mov bl, BYTE PTR [edi+2]
mov [edi], bl
mov [edi+2], al
mov cl, BYTE PTR [edi+4]
mov dl, BYTE PTR [edi+2+4]
mov [edi+4], dl
mov [edi+2+4], cl
mov al, BYTE PTR [edi+8]
mov bl, BYTE PTR [edi+2+8]
mov [edi+8], bl
mov [edi+2+8], al
mov cl, BYTE PTR [edi+12]
mov dl, BYTE PTR [edi+2+12]
mov [edi+12], dl
mov [edi+2+12], cl
add edi, 16
dec esi
jne @B
pop edi
pop esi
pop ebx
ret
RgbSwapH2 endp
Quote from: qWord on May 28, 2010, 01:19:59 PM
Quote from: Neo on May 28, 2010, 05:14:03 AM
That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.
What program are you using?
I'm using Inventor IDE with the not-yet-released performance testing add-in, which I can't seem to get completely separated from the main app, so I keep delaying its release. Maybe I should just have it in the main app. It's really handy, but still has a few kinks to be worked out (e.g. the performance test settings don't get saved yet).