The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: oex on May 24, 2010, 01:31:00 AM

Title: RGBA to BGRA (and back again)
Post by: oex on May 24, 2010, 01:31:00 AM
Hey guys, I gotta write this at some point but I didn't want to deny you the chance to go to town on it.... I couldn't find it with the search tool and it's a bit of an easy fun one :lol.... I suggest testing on 1280x1024 for timings, a perfect SSE task (I think).... :bg.... Is there a default testbed? If not, can we add a sticky one? I could rip out the guts of one of JJ's previous ones maybe? (I would try and do it humanely :lol)
Title: Re: RGBA to BGRA (and back again)
Post by: Farabi on May 25, 2010, 02:10:53 AM
For speed what is important is the bit depth value. I prefer 24 or 32-bit.
Title: Re: RGBA to BGRA (and back again)
Post by: Neo on May 25, 2010, 03:19:10 AM
Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once.  I'm not sure on their relative timings.
Title: Re: RGBA to BGRA (and back again)
Post by: dedndave on May 25, 2010, 01:44:40 PM
SSE would definitely be faster if everything is aligned properly   :P
Title: Re: RGBA to BGRA (and back again)
Post by: Farabi on May 26, 2010, 03:06:23 AM
Is SSE really that much faster?
How do the average timings of mov and movq compare?
Title: Re: RGBA to BGRA (and back again)
Post by: Neo on May 26, 2010, 05:55:46 AM
Looks like BSWAP has longer latency and same throughput (depending on the CPU), so PSHUFB should probably be about 4x faster overall in well-optimized code (i.e. throughput-bound).  :bg
Title: Re: RGBA to BGRA (and back again)
Post by: MichaelW on May 26, 2010, 06:08:14 AM
Using only integer instructions and running on a P3 I can't get below about 4 cycles per pixel.

;=====================================================================
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
;=====================================================================
    .data
      buffer dd 1000 dup(0)
    .code
;=====================================================================
start:
;=====================================================================

    lea edi, buffer

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
        UF=4
        mov ecx, 1000-UF        ; index of the last full group of UF dwords, so no access runs past the buffer
      @@:
        DISP=0
        REPEAT UF
          mov eax, [edi+ecx*4+DISP]
          mov edx, eax
          bswap eax
          and edx, 0ff000000h
          shr eax, 8
          or eax, edx
          mov [edi+ecx*4+DISP], eax
          DISP=DISP+4
        ENDM
        sub ecx,UF
        jns @B
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
    counter_end
    print ustr$(eax)," cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
;=====================================================================
end start


It would be interesting to see how much faster a version based on PSHUFD would be.
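For anyone who wants to check the bswap sequence away from an assembler, here is a minimal C sketch of the same per-pixel arithmetic. The names `bswap32`, `swap_rb_bswap` and `swap_rb_buffer` are made up for illustration, and it assumes each pixel is loaded as a little-endian dword (0xAABBGGRR), as on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the x86 BSWAP instruction. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* One RGBA pixel loaded as a little-endian dword: 0xAABBGGRR -> 0xAARRGGBB. */
static uint32_t swap_rb_bswap(uint32_t px)
{
    uint32_t alpha = px & 0xFF000000u;   /* and edx, 0ff000000h  */
    uint32_t rest  = bswap32(px) >> 8;   /* bswap eax / shr eax, 8 */
    return rest | alpha;                 /* or eax, edx          */
}

/* Convert a whole buffer in place; count is the number of pixels. */
static void swap_rb_buffer(uint32_t *pixels, size_t count)
{
    size_t i;
    assert(pixels != NULL || count == 0);
    for (i = 0; i < count; ++i)
        pixels[i] = swap_rb_bswap(pixels[i]);
}
```

Running the conversion twice gives back the original buffer, which is why the same routine covers "and back again".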
Title: Re: RGBA to BGRA (and back again)
Post by: oex on May 26, 2010, 12:31:11 PM
:/ with that timing I'll have to create BGRA in a separate process dependent on card features.... It's always the simple things that are most difficult :lol

1280x1024 RGBA-BGRA:

File Size 5 MB

9651293 cycles
0 cycles

Press any key to exit...

Ty for your input Michael
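To put the 1280x1024 figure next to the per-pixel numbers quoted elsewhere in the thread, the arithmetic can be sketched in C as below. The helper names are hypothetical, and it assumes the 9651293 cycles cover one pass over the whole frame. That works out to roughly 7.4 cycles per pixel over a 5 MB buffer; the gap to the ~4 cycles per pixel measured on the 1000-pixel buffer may be down to the frame not fitting in cache, but that is a guess.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a 32-bit-per-pixel frame. */
static uint64_t frame_bytes(uint32_t width, uint32_t height)
{
    return (uint64_t)width * height * 4u;
}

/* Average cost per pixel for a measured cycle count. */
static double cycles_per_pixel(uint64_t cycles, uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return (double)cycles / ((double)width * (double)height);
}
```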
Title: Re: RGBA to BGRA (and back again)
Post by: FORTRANS on May 26, 2010, 12:48:54 PM
Hi,

   Looking at the code posted by MichaelW, I decided to
run an excursion as it looked strange to me.  Four variants
were run on my PIII.  The results surprised me a bit.  SHR
is apparently much faster than ROR, which was the first
surprise.  The extra memory access variant also was not
what I expected.  Oh well, so much for expectations.

Regards,

Steve N.


          mov eax, [edi+ecx*4+DISP]
          mov edx, eax
          bswap eax
          and edx, 0ff000000h
          shr eax, 8
          or eax, edx
          mov [edi+ecx*4+DISP], eax
4071 cycles
0 cycles

Press any key to exit...

          mov eax, [edi+ecx*4+DISP]
;          mov edx, eax
          bswap eax
;          and edx, 0ff000000h
;          shr eax, 8
          ROR eax, 8
;          or eax, edx
          mov [edi+ecx*4+DISP], eax
4301 cycles
0 cycles

Press any key to exit...

          mov eax, [edi+ecx*4+DISP]
;          mov edx, eax
          bswap eax
;          and edx, 0ff000000h
          shr eax, 8          ; note: alpha is shifted out and not restored in this variant
;          ROR eax, 8
;          or eax, edx
          mov [edi+ecx*4+DISP], eax
3279 cycles
0 cycles

Press any key to exit...

;          mov eax, [edi+ecx*4+DISP]
          MOV AL, [edi+ecx*4+DISP]
          MOV DL, [edi+ecx*4+DISP+2]
;          mov edx, eax
;          bswap eax
;          and edx, 0ff000000h
;          shr eax, 8
;          ROR eax, 8
;          or eax, edx
;          mov [edi+ecx*4+DISP], eax
          mov [edi+ecx*4+DISP+2], AL
          mov [edi+ecx*4+DISP], DL
2785 cycles

          MOVZX EAX, BYTE PTR [edi+ecx*4+DISP]
          MOVZX EDX, BYTE PTR [edi+ecx*4+DISP+2]
          MOV [edi+ecx*4+DISP+2], AL
          MOV [edi+ecx*4+DISP], DL
2852 cycles
0 cycles

Press any key to exit...
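The first three variants above are not all equivalent in what they write back, which matters when comparing their timings. A small C sketch (function names are illustrative only, pixels assumed to be little-endian dwords) makes the difference visible: the BSWAP+ROR form matches the mask form, while BSWAP+SHR alone leaves alpha at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for BSWAP. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Portable stand-in for ROR eax, 8. */
static uint32_t ror32_8(uint32_t x)
{
    return (x >> 8) | (x << 24);
}

/* Variant 1: bswap, shr 8, then OR the saved alpha back in. */
static uint32_t variant_mask(uint32_t px)
{
    return (bswap32(px) >> 8) | (px & 0xFF000000u);
}

/* Variant 2: bswap, then rotate alpha back to the top byte. */
static uint32_t variant_ror(uint32_t px)
{
    return ror32_8(bswap32(px));
}

/* Variant 3: bswap and shr 8 only; the alpha byte is lost. */
static uint32_t variant_shr_only(uint32_t px)
{
    return bswap32(px) >> 8;
}
```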
Title: Re: RGBA to BGRA (and back again)
Post by: jj2007 on May 26, 2010, 12:56:44 PM
Quote from: Neo on May 25, 2010, 03:19:10 AM
Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once.  I'm not sure on their relative timings.

Probably a lot faster but you impose a serious limit on the hardware: pshufb is SSSE3, sometimes called SSE4...
Title: Re: RGBA to BGRA (and back again)
Post by: dedndave on May 26, 2010, 01:16:45 PM
these are SSE2 instructions - maybe you can find something to do the job
pshufd - Shuffles 32bit values in a complex way.
pshufhw - Shuffles high 16bit values in a complex way.
pshuflw - Shuffles low 16bit values in a complex way.
unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128 bit sources into 1.
punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
punpckhqdq - Unpacks and interleaves top 64bit integers from 2 128bit sources into 1.
punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
punpcklqdq - Unpacks and interleaves bottom 64bit integers from 2 128bit sources into 1.

Title: Re: RGBA to BGRA (and back again)
Post by: hutch-- on May 26, 2010, 04:53:10 PM
On old hardware I wonder if this approach is of any use.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL rgba  :DWORD

    mov rgba, 0099AAEEh

    print uhex$(rgba),13,10

    lea edx, rgba

    movzx eax, BYTE PTR [edx]
    movzx ecx, BYTE PTR [edx+2]
    mov [edx], cl
    mov [edx+2], al

    print uhex$(rgba),13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
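The byte-level approach above translates almost line for line into C. This is only a sketch with a hypothetical function name; it swaps bytes 0 and 2 of every 4-byte pixel and leaves bytes 1 and 3 alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the first and third byte of each 4-byte pixel in place. */
static void swap_rb_bytes(uint8_t *p, size_t pixel_count)
{
    size_t i;
    assert(p != NULL || pixel_count == 0);
    for (i = 0; i < pixel_count; ++i) {
        uint8_t first = p[4 * i];        /* movzx eax, BYTE PTR [edx]   */
        p[4 * i]      = p[4 * i + 2];    /* mov [edx], cl               */
        p[4 * i + 2]  = first;           /* mov [edx+2], al             */
    }
}
```

With the test value from the listing, 0099AAEEh is stored as the bytes EE AA 99 00, and the swap leaves 99 AA EE 00 in memory, i.e. 00EEAA99h.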
Title: Re: RGBA to BGRA (and back again)
Post by: jj2007 on May 26, 2010, 07:42:29 PM
SSE version attached. I have a feeling that it could be improved a lot... ::)

3831 cycles for RgbSwap MichaelW
2389 cycles for RgbSwap Hutch
2417 cycles for RgbSwap Hutch2
7280 cycles for RgbSwapSSE


Any ideas how to get rid of two of the pshufs? It looks as if it should be possible...

@@:   dec ecx
   movq xmm1, qword ptr [edx+8*ecx]
   pxor xmm0, xmm0   ; xmm0 may contain garbage
   punpcklbw xmm0, xmm1   ; expand 8 bytes to 8 xmm0 words
   pshufd xmm0, xmm0, 00011011b   ; inspired by drizz (http://www.asmcommunity.net/board/index.php?topic=29743.0)
   pshuflw xmm0, xmm0, 10110001b
   pshufhw xmm0, xmm0, 10110001b   ; all words swapped
   pshufd xmm0, xmm0, 01001110b   ; switch low dwords
   psrlq xmm0, 24   ; shift right by three bytes
   pxor xmm2, xmm2
   pand xmm1, TheAnd
   packsswb xmm0, xmm2   ; pack words to bytes
   paddb xmm0, xmm1
   movq qword ptr [edx+8*ecx], xmm0   ; back to mem, 8 bytes
   test ecx, ecx
   jne @B
Title: Re: RGBA to BGRA (and back again)
Post by: qWord on May 26, 2010, 10:18:58 PM
jj,
there is a small bug in your algo: you are using packsswb, which should be packuswb.(?)
Here is my suggestion for an sse2 version:
align 16
RgbSwapSSE2 proc ; Src, n4Pixels
    pop eax ; ret address
    pop edx ; Src
    pop ecx ; N*4 pixels
    align 16
@@: dec ecx
    movdqa xmm0,[edx]
    movdqa xmm1,xmm0    ; movdqa xmm1,[edx]
    psllw xmm0,8
    psrlw xmm1,8
    pshuflw xmm0,xmm0,10110001y
    pshufhw xmm0,xmm0,10110001y
    psrldq xmm0,1
    pslldq xmm1,1
    por xmm0,xmm1
    movdqa [edx],xmm0
    lea edx,[edx+16]   
    test ecx, ecx
    jne @B
@@: jmp eax
RgbSwapSSE2 endp

core2duo:
3176 cycles for RgbSwap MichaelW
2238 cycles for RgbSwap Hutch
1470 cycles for RgbSwapSSE2 qWord
2217 cycles for RgbSwap Hutch2
4875 cycles for RgbSwapSSE


qWord
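On the packsswb/packuswb question raised at the top of this post, the difference is only in how each word is clamped when it is narrowed to a byte. The C sketch below models a single lane with made-up helper names; whether jj's routine ever feeds a word above 127 into the pack step is not something this sketch tries to settle.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of PACKSSWB: signed word -> signed byte, clamped to [-128, 127]. */
static int8_t pack_signed_lane(int16_t w)
{
    if (w > 127)  return 127;
    if (w < -128) return -128;
    return (int8_t)w;
}

/* One lane of PACKUSWB: signed word -> unsigned byte, clamped to [0, 255]. */
static uint8_t pack_unsigned_lane(int16_t w)
{
    if (w > 255) return 255;
    if (w < 0)   return 0;
    return (uint8_t)w;
}
```

A colour channel of 200 survives the unsigned pack unchanged but comes out as 127 from the signed one.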
Title: Re: RGBA to BGRA (and back again)
Post by: hutch-- on May 27, 2010, 04:44:00 AM
See if unrolling the algo gives it some more legs. This just does 2 swaps at a time, twice. The idea is to use a compatible algo for older hardware and SSE for later processors, with a processor detect to decide which one can run.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL rgba[4]:DWORD

    push ebx
    push esi
    push edi

    lea esi, rgba

    mov [esi],    DWORD PTR 0099AAEEh
    mov [esi+4],  DWORD PTR 007744DDh
    mov [esi+8],  DWORD PTR 003399FFh
    mov [esi+12], DWORD PTR 007744DDh

    print uhex$([esi])," "
    print uhex$([esi+4])," "
    print uhex$([esi+8])," "
    print uhex$([esi+12]),13,10

  ; --------------------------------------

    movzx eax, BYTE PTR [esi]
    movzx ebx, BYTE PTR [esi+2]
    movzx ecx, BYTE PTR [esi+4]
    movzx edx, BYTE PTR [esi+6]

    mov [esi], bl
    mov [esi+2], al
    mov [esi+4], dl
    mov [esi+6], cl

  ; --------------------------------------

    movzx eax, BYTE PTR [esi+8]
    movzx ebx, BYTE PTR [esi+10]
    movzx ecx, BYTE PTR [esi+12]
    movzx edx, BYTE PTR [esi+14]

    mov [esi+8], bl
    mov [esi+10], al
    mov [esi+12], dl
    mov [esi+14], cl

  ; --------------------------------------

    print uhex$([esi])," "
    print uhex$([esi+4])," "
    print uhex$([esi+8])," "
    print uhex$([esi+12]),13,10

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Title: Re: RGBA to BGRA (and back again)
Post by: sinsi on May 27, 2010, 04:57:47 AM
How would multithreading go with this? Or would 4MB be too small to justify setting up another thread?
Title: Re: RGBA to BGRA (and back again)
Post by: jj2007 on May 27, 2010, 08:28:00 AM
Quote from: qWord on May 26, 2010, 10:18:58 PM
there is a small bug in you algo: you are using packsswb, which should be packuswb.(?)
Not so relevant, it works because the bytes will never saturate the word; but it is utterly slow.
Your algo is cute. I took the liberty to comment it.

align 16   ; qWord (http://www.masm32.com/board/index.php?topic=14058.msg111586#msg111586)
RgbSwapSSE2 proc   ; Src, n4Pixels
   pop eax      ; ret address
   pop edx      ; Src
   pop ecx      ; Bytes
   shr ecx, 2   ; we process 16 bytes, 4 dwords per loop
   align 16
@@:
   movdqa xmm0, [edx]
   movdqa xmm1, xmm0   ; movdqa xmm1,[edx]
   ; 31726762 31524742 61726762 41524742 aka 1rgb 1RGB argb ARGB
   psllw xmm0, 8   ; shift left by one byte, draw in zeros
   ; 72006200 52004200 72006200 52004200 aka r_b_ R_B_ r_b_ R_B_
   psrlw xmm1, 8   ; shift right by one byte, draw in zeros
   ; 00310067 00310047 00610067 00410047 aka _1_g _1_G _a_g _A_G
   pshuflw xmm0, xmm0, 10110001y   ; reverse order of lower half words
   pshufhw xmm0, xmm0, 10110001y   ; reverse order of upper half words
   ; 62007200 42005200 62007200 42005200 aka b_r_ B_R_ b_r_ B_R_
   psrldq xmm0, 1   ; shift right again
   ; 00620072 00420052 00620072 00420052 aka _b_r _B_R _b_r _B_R
   pslldq xmm1, 1   ; shift left again
   ; 31006700 31004700 61006700 41004700 aka 1_g_ 1_G_ a_g_ A_G_
   por xmm0, xmm1
   ; 31626772 31424752 61626772 41424752 aka 1bgr 1BGR abgr ABGR
   movdqa [edx], xmm0
   lea edx, [edx+16]
   dec ecx
   jg @B   ; jump if greater than zero
   jmp eax   ; ret address still in eax
RgbSwapSSE2 endp
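The commented steps can be followed without SSE2 hardware by emulating the register as a 16-byte array in C. This is a model for checking the data flow, not a fast implementation; `get_word`, `put_word` and `swap_rb_4pixels` are names invented here, and the words are treated as little-endian 16-bit lanes as on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read and write one little-endian 16-bit lane of a 16-byte register image. */
static uint16_t get_word(const uint8_t *v, int i)
{
    return (uint16_t)(v[2 * i] | (v[2 * i + 1] << 8));
}

static void put_word(uint8_t *v, int i, uint16_t w)
{
    v[2 * i]     = (uint8_t)(w & 0xFFu);
    v[2 * i + 1] = (uint8_t)(w >> 8);
}

/* Swap R and B in four pixels (16 bytes) using the same steps as the asm. */
static void swap_rb_4pixels(uint8_t px[16])
{
    uint8_t a[16], b[16], t[16];
    int i;

    for (i = 0; i < 8; ++i) {
        uint16_t w = get_word(px, i);
        put_word(a, i, (uint16_t)(w << 8));   /* psllw xmm0, 8 */
        put_word(b, i, (uint16_t)(w >> 8));   /* psrlw xmm1, 8 */
    }

    /* pshuflw / pshufhw with 10110001b: swap the two words of each dword */
    memcpy(t, a, 16);
    for (i = 0; i < 8; ++i)
        put_word(a, i, get_word(t, i ^ 1));

    for (i = 0; i < 15; ++i) a[i] = a[i + 1]; /* psrldq xmm0, 1 */
    a[15] = 0;
    for (i = 15; i > 0; --i) b[i] = b[i - 1]; /* pslldq xmm1, 1 */
    b[0] = 0;

    for (i = 0; i < 16; ++i)
        px[i] = (uint8_t)(a[i] | b[i]);       /* por xmm0, xmm1 */
}
```

Feeding it the test bed's string "RGBArgbaRGB1rgb1" yields "BGRAbgraBGR1bgr1", matching the correctness output shown in later posts.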
Title: Re: RGBA to BGRA (and back again)
Post by: Neo on May 27, 2010, 08:39:36 AM
Quote from: sinsi on May 27, 2010, 04:57:47 AM
How would multithreading go with this? Or would 4MB be too small to justify setting up another thread?
If you've already got threads set up for the app as a whole, there's not much downside to multithreading it unless it's pretty short (i.e. the down side would be the thread sync time and a mild nuisance to the caches).  That said, multi-threading it is the easy part and can be done on top of single-thread optimizations.

btw, thanks guys for pointing out the obvious thing I completely missed, i.e. that it's not just a BSWAP to reorder the bytes, especially if you're using the alpha.  :U  PSHUFB is probably much faster then, especially if you unroll it (though on an i7, unrolling might not matter as much).  If I wasn't dead tired at the moment, I'd fire up my performance viewer and get the full scaling of it; it should only take a couple minutes, but maybe tomorrow night.
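If the work were split across threads as discussed, one simple scheme is to hand each thread a contiguous run of 16-byte blocks so every chunk stays aligned for movdqa. The sketch below shows only that partitioning step; `chunk_t` and `chunk_for_thread` are hypothetical names, and thread creation and synchronisation are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>

/* A contiguous run of 16-byte blocks assigned to one thread. */
typedef struct {
    size_t first_block;
    size_t block_count;
} chunk_t;

/* Split total_blocks as evenly as possible; the first (total % n) threads get one extra. */
static chunk_t chunk_for_thread(size_t total_blocks, size_t nthreads, size_t idx)
{
    chunk_t c;
    size_t base, extra;
    assert(nthreads > 0 && idx < nthreads);
    base  = total_blocks / nthreads;
    extra = total_blocks % nthreads;
    c.first_block = idx * base + (idx < extra ? idx : extra);
    c.block_count = base + (idx < extra ? 1u : 0u);
    return c;
}
```

For a 1280x1024 frame there are 327680 four-pixel blocks, so four threads would each take 81920 of them.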
Title: Re: RGBA to BGRA (and back again)
Post by: jj2007 on May 27, 2010, 09:46:46 AM
Quote from: Neo on May 27, 2010, 08:39:36 AM
PSHUFB is probably much faster

Can't test it with my legacy CPUs :(

Intel: (http://www.intel.com/technology/itj/2007/v11i4/1-inside/5-vectorizer.htm)
This conversion between a little-endian and big-endian representation of 32-bit data
elements (4 bytes) can be vectorized effectively as follows.

@@: movdqa xmm1, XMMWORD PTR [_b+eax] ; load 16-bytes from b
pshufb xmm1, xmm0 ; shuffle 16-bytes as defined in xmm0
movdqa XMMWORD PTR [_a+eax], xmm1 ; store 16-bytes into a
add eax, 16
cmp eax, ebx
jb @B

Here, register xmm0 is pre-loaded with the appropriate 4x4 reshuffling pattern.
Title: Re: RGBA to BGRA (and back again)
Post by: Neo on May 27, 2010, 10:14:08 AM
I decided it was worth the loss of sleep to run it in my editor to get some neat plots.  :bg

Time in clock cycles up to 1600 pixels (the unlabelled one is the RgbSwapSSE2):
(http://ndickson.files.wordpress.com/2010/05/rgbaperf1600.png)

Time for larger numbers of pixels (only showing PSHUFB, though for now it's still less than the rest):
(switch upward occurs ~32KB, the per-core L1 cache size on this CPU)
(http://ndickson.files.wordpress.com/2010/05/rgbaperf16k.png)
(http://ndickson.files.wordpress.com/2010/05/rgbaperf160k.png)

Behaviour > 1MB (per-core L2 cache size) is dominated by memory access time:
(http://ndickson.files.wordpress.com/2010/05/rgbaperf1m.png)

Edit: Fixed misquoted size and clarified cache sizes.
Title: Re: RGBA to BGRA (and back again)
Post by: jj2007 on May 27, 2010, 11:44:52 AM
Quote from: qWord on May 26, 2010, 10:18:58 PM
Here my suggestion for an sse2 version:

Celeron M:
3831 cycles for RgbSwap MichaelW
2409 cycles for RgbSwap Hutch
3711 cycles for RgbSwapSSE2 qWord
2415 cycles for RgbSwap Hutch2
7305 cycles for RgbSwapSSE


Strange behaviour when compared to P4 results.
Title: Re: RGBA to BGRA (and back again)
Post by: hutch-- on May 27, 2010, 01:21:09 PM
JJ,

here is your last test on the 2 quads I work on.


Core2 3.0 gig Quad

Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1

3099 cycles for RgbSwap MichaelW
2050 cycles for RgbSwap Hutch
1271 cycles for RgbSwapSSE2 qWord
2057 cycles for RgbSwap Hutch2
3549 cycles for RgbSwapSSE
0 cycles

Press any key to exit...

i7 2.8 gig Quad

Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1

2592 cycles for RgbSwap MichaelW
1934 cycles for RgbSwap Hutch
1014 cycles for RgbSwapSSE2 qWord
1778 cycles for RgbSwap Hutch2
2190 cycles for RgbSwapSSE
0 cycles

Press any key to exit...
Title: Re: RGBA to BGRA (and back again)
Post by: qWord on May 27, 2010, 11:49:41 PM
I've modified the sse2 version a bit: the shifts are replaced by pand/por and the loop has been unrolled (16 pixels per loop).

align 16
RgbSwapSSE2 proc ; Src, n16Pixels
    pop eax ; ret address
    pop edx ; Src
    pop ecx ; N*16 pixels
   
    _RGB SEGMENT page
        rgb_simd_msk1 OWORD 000ff00ff00ff00ff00ff00ff00ff00ffh
        rgb_simd_msk2 OWORD 0ff00ff00ff00ff00ff00ff00ff00ff00h
    _RGB ENDS   
   
    align 16
@@: movdqa xmm0,[edx+0*16]
    movdqa xmm1,xmm0
    movdqa xmm2,[edx+1*16]
    movdqa xmm3,xmm2
    movdqa xmm4,[edx+2*16]
    movdqa xmm5,xmm4
    movdqa xmm6,[edx+3*16]
    movdqa xmm7,xmm6

    pand xmm0,rgb_simd_msk1
    pand xmm1,rgb_simd_msk2
    pshuflw xmm0,xmm0,10110001y
    pshufhw xmm0,xmm0,10110001y
    por xmm0,xmm1
    pand xmm2,rgb_simd_msk1
    pand xmm3,rgb_simd_msk2
    pshuflw xmm2,xmm2,10110001y
    pshufhw xmm2,xmm2,10110001y
    por xmm2,xmm3
    pand xmm4,rgb_simd_msk1
    pand xmm5,rgb_simd_msk2
    pshuflw xmm4,xmm4,10110001y
    pshufhw xmm4,xmm4,10110001y
    por xmm4,xmm5
    pand xmm6,rgb_simd_msk1
    pand xmm7,rgb_simd_msk2
    pshuflw xmm6,xmm6,10110001y
    pshufhw xmm6,xmm6,10110001y
    por xmm6,xmm7

    movdqa [edx+0*16],xmm0
    movdqa [edx+1*16],xmm2   ; results are in xmm0/2/4/6; xmm1/3/5/7 are the masked temporaries
    movdqa [edx+2*16],xmm4
    movdqa [edx+3*16],xmm6
    lea edx,[edx+4*16]   
    dec ecx
    jne @B
@@: jmp eax
RgbSwapSSE2 endp

c2d:
3167 cycles for RgbSwap MichaelW
2211 cycles for RgbSwap Hutch
957 cycles for RgbSwapSSE2 qWord
2229 cycles for RgbSwap Hutch2
4898 cycles for RgbSwapSSE
0 cycles
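The pand/por idea in this version can be checked per pixel in plain C. This is a sketch under the assumption that a pixel is a little-endian dword; `swap_rb_mask` is an illustrative name, and the two constants are the 32-bit slices of rgb_simd_msk1 and rgb_simd_msk2.

```c
#include <assert.h>
#include <stdint.h>

/* Mask-based swap: keep G and A in place, exchange the R and B byte lanes. */
static uint32_t swap_rb_mask(uint32_t px)
{
    uint32_t rb = px & 0x00FF00FFu;   /* pand with rgb_simd_msk1: R and B */
    uint32_t ga = px & 0xFF00FF00u;   /* pand with rgb_simd_msk2: G and A */
    rb = (rb << 16) | (rb >> 16);     /* swap the two words, as pshuflw/pshufhw 10110001b do */
    return rb | ga;                   /* por */
}
```

Compared with the shift-based version, the word swap no longer needs the byte-wise register shifts afterwards, because the masked bytes already sit in the right half of each word.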
Title: Re: RGBA to BGRA (and back again)
Post by: hutch-- on May 27, 2010, 11:54:37 PM
Yes, that is a lot faster with the SSE.

Core2 3.0 gig Quad

Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1

3120 cycles for RgbSwap MichaelW
2032 cycles for RgbSwap Hutch
783 cycles for RgbSwapSSE2 qWord
2026 cycles for RgbSwap Hutch2
3518 cycles for RgbSwapSSE
0 cycles

Press any key to exit...
Title: Re: RGBA to BGRA (and back again)
Post by: qWord on May 28, 2010, 12:33:31 AM
interesting ... on my core2 duo an SSSE3 version (pshufb) is slower than the SSE2 version:
3277 cycles for RgbSwap MichaelW
2253 cycles for RgbSwap Hutch
997 cycles for RgbSwapSSE2 qWord
1197 cycles for RgbSwapSSSE3
2232 cycles for RgbSwap Hutch2
4796 cycles for RgbSwapSSE
0 cycles


Title: Re: RGBA to BGRA (and back again)
Post by: Neo on May 28, 2010, 05:14:03 AM
Quote from: qWord on May 28, 2010, 12:33:31 AM
interesting ... on my core2 duo an SSSE3 version (pshufb) is slower than the SSE2 version:
Try unrolling the loop 4x and using a single register for addressing.  That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.
Title: Re: RGBA to BGRA (and back again)
Post by: lingo on May 28, 2010, 05:22:44 AM
:wink
Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1

3034 cycles for RgbSwap MichaelW
2023 cycles for RgbSwap Hutch
1281 cycles for RgbSwapSSE2 qWord
525  cycles for RgbSwapLingo
2057 cycles for RgbSwap Hutch2
3692 cycles for RgbSwapSSE
0 cycles

Press any key to exit...
and code
align 16
Msk dq 0704050603000102h
dq 0F0C0D0E0B08090Ah
db 5 Dup(0cch)
RgbSwapLingo proc ; lpSrc, bytes
pop    edx
pop    eax
pop    ecx ; N*4 pixels
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
add    eax, 16   
pshufb xmm1, xmm0
add    ecx, -1 
movdqa [eax-16],  xmm1 
jne    @b
jmp    edx
RgbSwapLingo endp

Unrolled version:

RgbSwapLingoUn proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
movdqa xmm2, [eax+16]
pshufb xmm1, xmm0
add eax, 32   
pshufb xmm2, xmm0
add ecx, -2
movdqa [eax-32],  xmm1 
movdqa [eax-16],  xmm2 
jne @b
jmp edx
RgbSwapLingoUn endp
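To see why that mask does the job, PSHUFB can be modelled in a few lines of C. The control bytes below are the two Msk qwords written out in little-endian byte order; `pshufb_model` and `swap_rb_ctl` are names chosen for this sketch, which models the instruction's semantics rather than calling it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 0704050603000102h and 0F0C0D0E0B08090Ah, lowest byte first. */
static const uint8_t swap_rb_ctl[16] = {
    0x02, 0x01, 0x00, 0x03,  0x06, 0x05, 0x04, 0x07,
    0x0A, 0x09, 0x08, 0x0B,  0x0E, 0x0D, 0x0C, 0x0F
};

/* Model of PSHUFB: each result byte selects a source byte by index,
   or becomes zero when bit 7 of its control byte is set. */
static void pshufb_model(uint8_t data[16], const uint8_t ctl[16])
{
    uint8_t src[16];
    int i;
    memcpy(src, data, 16);
    for (i = 0; i < 16; ++i)
        data[i] = (uint8_t)((ctl[i] & 0x80u) ? 0u : src[ctl[i] & 0x0Fu]);
}
```

Each group of four control bytes reads 2, 1, 0, 3 relative to its pixel, so bytes 0 and 2 trade places while bytes 1 and 3 stay put.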
Title: Re: RGBA to BGRA (and back again)
Post by: jj2007 on May 28, 2010, 06:33:05 AM
Quote from: qWord on May 27, 2010, 11:49:41 PM
I've modified the sse2 version a bit: the shifts are replaced by pand/por and the loop has been unrolled

Celeron M:
3839 cycles for RgbSwap MichaelW
2388 cycles for RgbSwap Hutch
2082 cycles for RgbSwapSSE2 qWord
2413 cycles for RgbSwap Hutch2
7301 cycles for RgbSwapSSE


Prescott P4:
7305 cycles for RgbSwap MichaelW
6261 cycles for RgbSwap Hutch
1858 cycles for RgbSwapSSE2 qWord
8943 cycles for RgbSwap Hutch2
15262 cycles for RgbSwapSSE

:U

P.S.: Lingo's algo triggers a GPF, but this time it's my hardware's fault :bg
Title: Re: RGBA to BGRA (and back again)
Post by: hutch-- on May 28, 2010, 09:31:58 AM
Lingo's test.


Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1

3041 cycles for RgbSwap MichaelW
2053 cycles for RgbSwap Hutch
1268 cycles for RgbSwapSSE2 qWord
527  cycles for RgbSwapLingo
2046 cycles for RgbSwap Hutch2
3521 cycles for RgbSwapSSE
0 cycles

Press any key to exit...
Title: Re: RGBA to BGRA (and back again)
Post by: qWord on May 28, 2010, 01:19:59 PM
Quote from: Neo on May 28, 2010, 05:14:03 AM
That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.
What program are you using?

I've modified the test bed a bit and added lingo's test:

12463   cycles for RgbSwap MichaelW
8852    cycles for RgbSwap Hutch
3414    cycles for RgbSwapSSE2 qWord
3144    cycles for RgbSwapLingo (SSSE3)
9036    cycles for RgbSwap Hutch2
18519   cycles for RgbSwapSSE
0       cycles


Title: Re: RGBA to BGRA (and back again)
Post by: FORTRANS on May 28, 2010, 01:57:37 PM
Quote from: hutch-- on May 26, 2010, 04:53:10 PM
On old hardware I wonder if this approach is of any use.

Hi,

   Updated my data in Reply #8.

Cheers,

Steve N.
Title: Re: RGBA to BGRA (and back again)
Post by: jj2007 on May 28, 2010, 02:41:22 PM
Possible variants. I feel handicapped because both my CPUs don't have pshufb...

RgbSwapLingo2 proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
lea ecx, [eax+4*ecx] ; create limit
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax]
add eax, 16
pshufb xmm1, xmm0
cmp eax, ecx
movdqa [eax-16], xmm1
jb @b ; unsigned compare, since eax and ecx hold addresses
jmp edx
RgbSwapLingo2 endp

RgbSwapLingo3 proc ; lpSrc, bytes
pop edx
pop eax
pop ecx ; N*4 pixels
lea ecx, [eax+4*ecx] ; create limit
pshufd xmm0, oword ptr Msk, 0E4h
@@:
movdqa xmm1, [eax] ; unrolled once
pshufb xmm1, xmm0
movdqa [eax], xmm1
movdqa xmm1, [eax+16]
pshufb xmm1, xmm0
movdqa [eax+16], xmm1
add eax, 32
cmp eax, ecx
jb @b ; unsigned compare, since eax and ecx hold addresses
jmp edx
RgbSwapLingo3 endp
Title: Re: RGBA to BGRA (and back again)
Post by: lingo on May 29, 2010, 10:54:35 AM
"What program are you using?
I've modified the test bed a bit and added lingo's test:"


I don't think unrolled variants are safe...anyway:

Test for correctness:
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
RGBArgbaRGB1rgb1
BGRAbgraBGR1bgr1
BGRAbgraBGR1bgr1

12421   cycles for RgbSwap MichaelW
8217    cycles for RgbSwap Hutch
3256    cycles for RgbSwapSSE2 qWord
2068    cycles for RgbSwapLingo (SSSE3)
1556    cycles for RgbSwapLingoUnrolled
8215    cycles for RgbSwap Hutch2
14371   cycles for RgbSwapSSE
0       cycles

Press any key to exit...
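One possible reading of the safety concern: an unrolled loop that counts down in steps of two and exits only on exactly zero never sees zero when the count is odd. A common remedy is to run the unrolled body over the even part and finish with a scalar tail, sketched here in C at pixel granularity. The function names are hypothetical and this is not taken from any of the posted routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-pixel R/B swap on a little-endian dword. */
static uint32_t swap_rb(uint32_t px)
{
    return (px & 0xFF00FF00u) |
           ((px & 0x00FF0000u) >> 16) |
           ((px & 0x000000FFu) << 16);
}

/* Unrolled by two, with a tail step so odd counts neither overrun nor loop forever. */
static void swap_rb_unrolled2(uint32_t *px, size_t count)
{
    size_t i = 0;
    assert(px != NULL || count == 0);
    for (; i + 1 < count; i += 2) {
        px[i]     = swap_rb(px[i]);
        px[i + 1] = swap_rb(px[i + 1]);
    }
    if (i < count)                  /* leftover element when count is odd */
        px[i] = swap_rb(px[i]);
}
```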
Title: Re: RGBA to BGRA (and back again)
Post by: hutch-- on May 29, 2010, 11:31:21 AM

i7 quad 2.8 gig

10131   cycles for RgbSwap MichaelW
6638    cycles for RgbSwap Hutch
2681    cycles for RgbSwapSSE2 qWord
2667    cycles for RgbSwapLingo (SSSE3)
1259    cycles for RgbSwapLingoUnrolled
6645    cycles for RgbSwap Hutch2
8405    cycles for RgbSwapSSE
0       cycles

Press any key to exit...

Core2 quad 3.0 gig

12433   cycles for RgbSwap MichaelW
8237    cycles for RgbSwap Hutch
3260    cycles for RgbSwapSSE2 qWord
2068    cycles for RgbSwapLingo (SSSE3)
1556    cycles for RgbSwapLingoUnrolled
8224    cycles for RgbSwap Hutch2
14363   cycles for RgbSwapSSE
0       cycles

Press any key to exit...
Title: Re: RGBA to BGRA (and back again)
Post by: hutch-- on May 29, 2010, 04:07:42 PM
 :P

Coming to you from my 2002 SHITBOX, this is the fastest of the legacy algos. I tweaked Michael's bswap algo and got it about 25% faster, but a bswap and a ror are enough to make it slower than 2 memory reads and 2 memory writes per pixel.



RgbSwapH2 proc rsSrc, rsBytes

    push ebx
    push esi
    push edi

    mov esi, rsBytes
    mov edi, rsSrc
    shr esi, 4        ; rsBytes / 16: each pass below handles 4 pixels (16 bytes)
@@:
    mov al, BYTE PTR [edi]
    mov bl, BYTE PTR [edi+2]
    mov [edi], bl
    mov [edi+2], al
    mov cl, BYTE PTR [edi+4]
    mov dl, BYTE PTR [edi+2+4]
    mov [edi+4], dl
    mov [edi+2+4], cl
    mov al, BYTE PTR [edi+8]
    mov bl, BYTE PTR [edi+2+8]
    mov [edi+8], bl
    mov [edi+2+8], al
    mov cl, BYTE PTR [edi+12]
    mov dl, BYTE PTR [edi+2+12]
    mov [edi+12], dl
    mov [edi+2+12], cl
    add edi, 16
    dec esi
    jne @B
   
    pop edi
    pop esi
    pop ebx

    ret

RgbSwapH2 endp
Title: Re: RGBA to BGRA (and back again)
Post by: Neo on May 30, 2010, 08:52:36 AM
Quote from: qWord on May 28, 2010, 01:19:59 PM
Quote from: Neo on May 28, 2010, 05:14:03 AM
That's what I did for the data labelled "PSHUFB" on the plot above, also on a core 2 duo.
What program are you using?
I'm using Inventor IDE with the not-yet-released performance testing add-in, which I can't seem to get completely separated from the main app, so I keep delaying its release.  Maybe I should just have it in the main app.  It's really handy, but still has a few kinks to be worked out (e.g. the performance test settings don't get saved yet).