RGBA to BGRA (and back again)

Started by oex, May 24, 2010, 01:31:00 AM

sinsi

How would multithreading go with this? Or would 4MB be too small to justify setting up another thread?
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: qWord on May 26, 2010, 10:18:58 PM
there is a small bug in your algo: you are using packsswb, which should be packuswb.(?)
Not so relevant, it works because the bytes will never saturate the word; but it is utterly slow.
Your algo is cute. I took the liberty to comment it.

align 16   ; qWord
RgbSwapSSE2 proc   ; Src, n4Pixels
   pop eax      ; ret address
   pop edx      ; Src
   pop ecx      ; Bytes
   shr ecx, 2   ; we process 16 bytes, 4 dwords per loop
   align 16
@@:
   movdqa xmm0, [edx]
   movdqa xmm1, xmm0   ; movdqa xmm1,[edx]
   ; 31726762 31524742 61726762 41524742 aka 1rgb 1RGB argb ARGB
   psllw xmm0, 8   ; shift left by one byte, draw in zeros
   ; 72006200 52004200 72006200 52004200 aka r_b_ R_B_ r_b_ R_B_
   psrlw xmm1, 8   ; shift right by one byte, draw in zeros
   ; 00310067 00310047 00610067 00410047 aka _1_g _1_G _a_g _A_G
   pshuflw xmm0, xmm0, 10110001y   ; reverse order of lower half words
   pshufhw xmm0, xmm0, 10110001y   ; reverse order of upper half words
   ; 62007200 42005200 62007200 42005200 aka b_r_ B_R_ b_r_ B_R_

   psrldq xmm0, 1   ; shift the whole register right by one byte
   ; 00620072 00420052 00620072 00420052 aka _b_r _B_R _b_r _B_R
   pslldq xmm1, 1   ; shift left again
   ; 31006700 31004700 61006700 41004700 aka 1_g_ 1_G_ a_g_ A_G_
   por xmm0, xmm1
   ; 31626772 31424752 61626772 41424752 aka 1bgr 1BGR abgr ABGR
   movdqa [edx], xmm0
   lea edx, [edx+16]
   dec ecx
   jg @B   ; jump if greater than zero
   jmp eax   ; ret address still in eax
RgbSwapSSE2 endp
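
For anyone following the register comments above, the same steps can be modelled on a single pixel in plain C. This is only a scalar sketch, not part of the posted test bed: the function names are made up for illustration, and it assumes one pixel per little-endian dword, as in the hex dumps in the comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the shift/shuffle/or sequence for ONE dword.
   Hypothetical name; each line mirrors one SSE2 instruction. */
uint32_t swap_rb_shift_model(uint32_t p)
{
    uint32_t x0 = (p & 0x00FF00FFu) << 8;   /* psllw 8: low byte of each word moves up   */
    uint32_t x1 = (p >> 8) & 0x00FF00FFu;   /* psrlw 8: high byte of each word moves down */
    x0 = (x0 << 16) | (x0 >> 16);           /* pshuflw/pshufhw 10110001y: swap the words  */
    x0 >>= 8;                               /* psrldq 1 (the vacated bytes are zero)      */
    x1 <<= 8;                               /* pslldq 1 (the vacated bytes are zero)      */
    return x0 | x1;                         /* por                                        */
}

/* Plain byte-wise reference to compare any of the routines against. */
void rgb_swap_reference(uint8_t *px, size_t bytes)
{
    assert(bytes % 4 == 0);                 /* whole pixels only */
    for (size_t i = 0; i < bytes; i += 4) {
        uint8_t t = px[i];                  /* swap bytes 0 and 2, keep 1 and 3 */
        px[i] = px[i + 2];
        px[i + 2] = t;
    }
}
```

Feeding it the value from the comments, swap_rb_shift_model(0x41524742) returns 0x41424752, which is the ARGB to ABGR step shown in the last comment line.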

Neo

Quote from: sinsi on May 27, 2010, 04:57:47 AM
How would multithreading go with this? Or would 4MB be too small to justify setting up another thread?
If you've already got threads set up for the app as a whole, there's not much downside to multithreading it unless the job is pretty short (i.e. the downside would be the thread sync time and a mild nuisance to the caches).  That said, multithreading it is the easy part and can be done on top of single-thread optimizations.
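
One way the buffer could be divided among threads that already exist, sketched in C under stated assumptions: the buffer length is a multiple of 16 bytes, each worker runs one of the single-thread routines on its own range, and worker_range is a hypothetical helper, not something from this thread's test bed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes the byte range [*start, *start + *len)
   that worker idx (0-based) out of n workers should process.
   Ranges are whole 16-byte blocks so every start stays 16-byte aligned
   relative to the buffer base (needed for movdqa). */
void worker_range(size_t total_bytes, unsigned n, unsigned idx,
                  size_t *start, size_t *len)
{
    assert(n > 0 && idx < n && total_bytes % 16 == 0);
    size_t blocks = total_bytes / 16;       /* number of 16-byte blocks     */
    size_t base   = blocks / n;             /* blocks every worker gets     */
    size_t extra  = blocks % n;             /* first 'extra' workers get +1 */
    size_t first  = idx * base + (idx < extra ? idx : extra);
    size_t mine   = base + (idx < extra ? 1 : 0);
    *start = first * 16;
    *len   = mine * 16;
}
```

The ranges do not overlap and together cover the whole buffer, so no locking is needed while the workers run; only the final join costs sync time.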

btw, thanks guys for pointing out the obvious thing I completely missed, i.e. that it's not just a BSWAP to reorder the bytes, especially if you're using the alpha.  :U  PSHUFB is probably much faster then, especially if you unroll it (though on an i7, unrolling might not matter as much).  If I wasn't dead tired at the moment, I'd fire up my performance viewer and get the full scaling of it; it should only take a couple minutes, but maybe tomorrow night.

jj2007

Quote from: Neo on May 27, 2010, 08:39:36 AM
PSHUFB is probably much faster

Can't test it with my legacy CPUs :(

Intel:
This conversion between a little-endian and big-endian representation of 32-bit data
elements (4 bytes) can be vectorized effectively as follows.

@@: movdqa xmm1, XMMWORD PTR [_b+eax] ; load 16-bytes from b
pshufb xmm1, xmm0 ; shuffle 16-bytes as defined in xmm0
movdqa XMMWORD PTR [_a+eax], xmm1 ; store 16-bytes into a
add eax, 16
cmp eax, ebx
jb @B

Here, register xmm0 is pre-loaded with the appropriate 4x4 reshuffling pattern.
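
The pattern in xmm0 is just sixteen source-byte indices. A scalar C model of PSHUFB (a sketch only; the helper names are hypothetical) shows the difference between Intel's endian-reversal pattern and a pattern that swaps only R and B and leaves alpha in place:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFB on one 16-byte register:
   if bit 7 of the control byte is set the result byte is zero,
   otherwise the low 4 bits select a source byte. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t ctl[16])
{
    uint8_t tmp[16];                        /* allows dst == src */
    assert(dst && src && ctl);
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Endian reversal of each dword: 3,2,1,0, 7,6,5,4, ... */
void make_bswap32_pattern(uint8_t ctl[16])
{
    for (int i = 0; i < 16; i++)
        ctl[i] = (uint8_t)((i & ~3) | (3 - (i & 3)));
}

/* R/B swap with G and A left in place: 2,1,0,3, 6,5,4,7, ... */
void make_swap_rb_pattern(uint8_t ctl[16])
{
    static const uint8_t lane[4] = {2, 1, 0, 3};
    for (int i = 0; i < 16; i++)
        ctl[i] = (uint8_t)((i & ~3) | lane[i & 3]);
}
```

With the first pattern the alpha byte ends up in front of each pixel, which is why a plain byte reversal is not enough when alpha is in use.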

Neo

I decided it was worth the loss of sleep to run it in my editor to get some neat plots.  :bg

Time in clock cycles up to 1600 pixels (the unlabelled one is the RgbSwapSSE2):


Time for larger numbers of pixels (only showing PSHUFB, though for now it's still less than the rest):
(switch upward occurs ~32KB, the per-core L1 cache size on this CPU)



Behaviour > 1MB (per-core L2 cache size) is dominated by memory access time:


Edit: Fixed misquoted size and clarified cache sizes.

jj2007

Quote from: qWord on May 26, 2010, 10:18:58 PM
Here my suggestion for an sse2 version:

Celeron M:
3831 cycles for RgbSwap MichaelW
2409 cycles for RgbSwap Hutch
3711 cycles for RgbSwapSSE2 qWord
2415 cycles for RgbSwap Hutch2
7305 cycles for RgbSwapSSE


Strange behaviour when compared to P4 results.

hutch--

JJ,

here is your last test on the 2 quads I work on.


Core2 3.0 gig Quad

Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1

3099 cycles for RgbSwap MichaelW
2050 cycles for RgbSwap Hutch
1271 cycles for RgbSwapSSE2 qWord
2057 cycles for RgbSwap Hutch2
3549 cycles for RgbSwapSSE
0 cycles

Press any key to exit...

i7 2.8 gig Quad

Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1

2592 cycles for RgbSwap MichaelW
1934 cycles for RgbSwap Hutch
1014 cycles for RgbSwapSSE2 qWord
1778 cycles for RgbSwap Hutch2
2190 cycles for RgbSwapSSE
0 cycles

Press any key to exit...
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

qWord

I've modified the sse2 version a bit: the shifts are replaced by pand/por and the loop has been unrolled (16 pixels per loop).

align 16
RgbSwapSSE2 proc ; Src, n16Pixels
    pop eax ; ret address
    pop edx ; Src
    pop ecx ; N*16 pixels
   
    _RGB SEGMENT page
        rgb_simd_msk1 OWORD 000ff00ff00ff00ff00ff00ff00ff00ffh
        rgb_simd_msk2 OWORD 0ff00ff00ff00ff00ff00ff00ff00ff00h
    _RGB ENDS   
   
    align 16
@@: movdqa xmm0,[edx+0*16]
    movdqa xmm1,xmm0
    movdqa xmm2,[edx+1*16]
    movdqa xmm3,xmm2
    movdqa xmm4,[edx+2*16]
    movdqa xmm5,xmm4
    movdqa xmm6,[edx+3*16]
    movdqa xmm7,xmm6

    pand xmm0,rgb_simd_msk1
    pand xmm1,rgb_simd_msk2
    pshuflw xmm0,xmm0,10110001y
    pshufhw xmm0,xmm0,10110001y
    por xmm0,xmm1
    pand xmm2,rgb_simd_msk1
    pand xmm3,rgb_simd_msk2
    pshuflw xmm2,xmm2,10110001y
    pshufhw xmm2,xmm2,10110001y
    por xmm2,xmm3
    pand xmm4,rgb_simd_msk1
    pand xmm5,rgb_simd_msk2
    pshuflw xmm4,xmm4,10110001y
    pshufhw xmm4,xmm4,10110001y
    por xmm4,xmm5
    pand xmm6,rgb_simd_msk1
    pand xmm7,rgb_simd_msk2
    pshuflw xmm6,xmm6,10110001y
    pshufhw xmm6,xmm6,10110001y
    por xmm6,xmm7

    movdqa [edx+0*16],xmm0   ; the por results are in xmm0, xmm2, xmm4, xmm6
    movdqa [edx+1*16],xmm2
    movdqa [edx+2*16],xmm4
    movdqa [edx+3*16],xmm6
    lea edx,[edx+4*16]   
    dec ecx
    jne @B
@@: jmp eax
RgbSwapSSE2 endp
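
The pand/pshuflw/pshufhw/por sequence boils down to a per-pixel formula that is easy to check in C. A sketch under the assumption of one pixel per little-endian dword; the function names are hypothetical and the two constants are the per-dword equivalents of rgb_simd_msk1 and rgb_simd_msk2:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the mask-based swap for ONE dword. */
uint32_t swap_rb_mask_model(uint32_t p)
{
    uint32_t rb = p & 0x00FF00FFu;          /* pand with rgb_simd_msk1: bytes 0 and 2 */
    uint32_t ga = p & 0xFF00FF00u;          /* pand with rgb_simd_msk2: bytes 1 and 3 */
    rb = (rb << 16) | (rb >> 16);           /* pshuflw/pshufhw 10110001y: swap words  */
    return rb | ga;                         /* por                                    */
}

/* Applies the model to a whole buffer of pixels, in place. */
void swap_rb_buffer(uint32_t *px, size_t count)
{
    assert(px != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        px[i] = swap_rb_mask_model(px[i]);
}
```

Running swap_rb_buffer twice over the same buffer restores the original data, which is the "and back again" part of the topic title.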

c2d:
3167 cycles for RgbSwap MichaelW
2211 cycles for RgbSwap Hutch
957 cycles for RgbSwapSSE2 qWord
2229 cycles for RgbSwap Hutch2
4898 cycles for RgbSwapSSE
0 cycles
FPU in a trice: SmplMath
It's that simple!

hutch--

Yes, that is a lot faster with the SSE.

Core2 3.0 gig Quad

Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1

3120 cycles for RgbSwap MichaelW
2032 cycles for RgbSwap Hutch
783 cycles for RgbSwapSSE2 qWord
2026 cycles for RgbSwap Hutch2
3518 cycles for RgbSwapSSE
0 cycles

Press any key to exit...

qWord

interesting ... on my core2 duo an SSSE3 version (pshufb) is slower than the SSE2 version:
3277 cycles for RgbSwap MichaelW
2253 cycles for RgbSwap Hutch
997 cycles for RgbSwapSSE2 qWord
1197 cycles for RgbSwapSSSE3
2232 cycles for RgbSwap Hutch2
4796 cycles for RgbSwapSSE
0 cycles



Neo

Quote from: qWord on May 28, 2010, 12:33:31 AM
interesting ... on my core2 duo an SSSE3 version (pshufb) is slower than the SSE2 version:
Try unrolling the loop 4x and using a single register for addressing.  That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.

lingo

:wink
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1

3034 cycles for RgbSwap MichaelW
2023 cycles for RgbSwap Hutch
1281 cycles for RgbSwapSSE2 qWord
525  cycles for RgbSwapLingo
2057 cycles for RgbSwap Hutch2
3692 cycles for RgbSwapSSE
0 cycles

Press any key to exit...
and code
align 16
Msk dq 0704050603000102h
dq 0F0C0D0E0B08090Ah
db 5 Dup(0cch)
RgbSwapLingo proc ; lpSrc, bytes
pop    edx
pop    eax
pop    ecx ; N*4 pixels
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
add    eax, 16   
pshufb xmm1, xmm0
add    ecx, -1 
movdqa [eax-16],  xmm1 
jne    @b
jmp    edx
RgbSwapLingo endp

Unrolled version:

RgbSwapLingoUn proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
movdqa xmm2, [eax+16]
pshufb xmm1, xmm0
add eax, 32   
pshufb xmm2, xmm0
add ecx, -2   ; two 16-byte blocks per pass, so the block count must be even
movdqa [eax-32],  xmm1 
movdqa [eax-16],  xmm2 
jne @b
jmp edx
RgbSwapLingoUn endp

jj2007

Quote from: qWord on May 27, 2010, 11:49:41 PM
I've modified the sse2 version a bit: the shifts are replaced by pand/por and the loop has been unrolled

Celeron M:
3839 cycles for RgbSwap MichaelW
2388 cycles for RgbSwap Hutch
2082 cycles for RgbSwapSSE2 qWord
2413 cycles for RgbSwap Hutch2
7301 cycles for RgbSwapSSE


Prescott P4:
7305 cycles for RgbSwap MichaelW
6261 cycles for RgbSwap Hutch
1858 cycles for RgbSwapSSE2 qWord
8943 cycles for RgbSwap Hutch2
15262 cycles for RgbSwapSSE

:U

P.S.: Lingo's algo triggers a GPF, but this time it's my hardware's fault :bg

hutch--

Lingo's test.


Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1

3041 cycles for RgbSwap MichaelW
2053 cycles for RgbSwap Hutch
1268 cycles for RgbSwapSSE2 qWord
527  cycles for RgbSwapLingo
2046 cycles for RgbSwap Hutch2
3521 cycles for RgbSwapSSE
0 cycles

Press any key to exit...

qWord

Quote from: Neo on May 28, 2010, 05:14:03 AM
That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.
What program are you using?

I've modified the test bed a bit and added Lingo's test:

12463   cycles for RgbSwap MichaelW
8852    cycles for RgbSwap Hutch
3414    cycles for RgbSwapSSE2 qWord
3144    cycles for RgbSwapLingo (SSSE3)
9036    cycles for RgbSwap Hutch2
18519   cycles for RgbSwapSSE
0       cycles

