RGBA to BGRA (and back again)

oex · May 24, 2010, 01:31:00 AM

Hey guys, I've got to write this at some point, but I didn't want to deny you the chance to go to town on it. I couldn't find it with the search tool, and it's a bit of an easy, fun one :lol. I suggest testing on 1280x1024 for timings; it's a perfect task for SSE (I think) :bg. Is there a default testbed? If not, can we add a sticky one? I could rip out the guts of one of JJ's previous ones, maybe? (I would try to do it humanely :lol)

Farabi · May 25, 2010, 02:10:53 AM

For speed, what matters is the bit depth. I prefer 24-bit or 32-bit.

Neo · May 25, 2010, 03:19:10 AM

Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once. I'm not sure on their relative timings.

dedndave · May 25, 2010, 01:44:40 PM

SSE would definitely be faster if everything is aligned properly :P

Farabi · May 26, 2010, 03:06:23 AM

Is SSE really that much faster?
How do the average timings of mov and movq compare?

Neo · May 26, 2010, 05:55:46 AM

Looks like BSWAP has longer latency and same throughput (depending on the CPU), so PSHUFB should probably be about 4x faster overall in well-optimized code (i.e. throughput-bound). :bg
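
For reference, the PSHUFB idea comes down to one 16-byte control mask that swaps bytes 0 and 2 of every pixel. A minimal portable C sketch that emulates the byte shuffle in software follows. The names `kRgbaToBgra`, `shuffle16` and `rgba_to_bgra_shuffle` are made up for illustration, a real version would use the instruction itself, and this emulation ignores PSHUFB's zeroing behaviour for control bytes with the high bit set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Control mask: destination byte i takes source byte mask[i].
   Swapping indices 0 and 2 in each group of four turns RGBA into BGRA
   (and BGRA back into RGBA). */
static const uint8_t kRgbaToBgra[16] = {
    2, 1, 0, 3,   6, 5, 4, 7,   10, 9, 8, 11,   14, 13, 12, 15
};

/* Software stand-in for one PSHUFB on a 16-byte block (4 pixels).
   Only the low 4 bits of each control byte are used here. */
static void shuffle16(uint8_t block[16], const uint8_t mask[16])
{
    uint8_t tmp[16];
    for (size_t i = 0; i < 16; ++i)
        tmp[i] = block[mask[i] & 0x0F];
    for (size_t i = 0; i < 16; ++i)
        block[i] = tmp[i];
}

/* Convert n_pixels pixels in place; n_pixels must be a multiple of 4. */
static void rgba_to_bgra_shuffle(uint8_t *pixels, size_t n_pixels)
{
    assert(n_pixels % 4 == 0);
    for (size_t i = 0; i < n_pixels; i += 4)
        shuffle16(pixels + i * 4, kRgbaToBgra);
}
```

Applying the same mask twice restores the original order, which covers the "and back again" part of the topic.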

MichaelW · May 26, 2010, 06:08:14 AM

Using only integer instructions and running on a P3 I can't get below about 4 cycles per pixel.

;=====================================================================
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
;=====================================================================
    .data
      buffer dd 1000 dup(0)
    .code
;=====================================================================
start:
;=====================================================================

    lea edi, buffer

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
        mov ecx, 1000-4           ; UF=4: start at the last full group so [ecx*4+DISP] stays inside buffer
      @@:
        UF=4
        DISP=0
        REPEAT UF
          mov eax, [edi+ecx*4+DISP]
          mov edx, eax
          bswap eax
          and edx, 0ff000000h
          shr eax, 8
          or eax, edx
          mov [edi+ecx*4+DISP], eax
          DISP=DISP+4
        ENDM
        sub ecx,UF
        jns @B
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
    counter_end
    print ustr$(eax)," cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
;=====================================================================
end start

It would be interesting to see how much faster a version based on PSHUFD would be.
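
For anyone following along without MASM, the inner loop body can be written out in C. This is only a sketch of the same steps, assuming each pixel is loaded as a little-endian dword (so the top byte is alpha); `bswap32`, `swap_red_blue` and `swap_red_blue_buffer` are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for BSWAP: reverse the four bytes of a dword. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* One pixel, same steps as the loop body above. */
static uint32_t swap_red_blue(uint32_t pixel)
{
    uint32_t alpha = pixel & 0xFF000000u;   /* and edx, 0ff000000h  */
    uint32_t rest  = bswap32(pixel) >> 8;   /* bswap eax / shr eax, 8 */
    return rest | alpha;                    /* or eax, edx          */
}

/* Whole buffer, one dword per pixel, converted in place. */
static void swap_red_blue_buffer(uint32_t *pixels, size_t count)
{
    assert(pixels != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        pixels[i] = swap_red_blue(pixels[i]);
}
```

For example, `swap_red_blue(0x11223344)` gives `0x11443322`: the top byte is kept and bytes 0 and 2 trade places.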

oex · May 26, 2010, 12:31:11 PM

:/ With that timing I'll have to create BGRA in a separate process, dependent on card features. It's always the simple things that are most difficult :lol

1280x1024 RGBA-BGRA:

File size: 5 MB

9651293 cycles
0 cycles

Press any key to exit...

Thanks for your input, Michael.

FORTRANS · May 26, 2010, 12:48:54 PM

Hi,

Looking at the code posted by MichaelW, I decided to
run an excursion as it looked strange to me. Four variants
were run on my PIII. The results surprised me a bit. SHR
is apparently much faster than ROR, which was the first
surprise. The extra memory access variant also was not
what I expected. Oh well, so much for expectations.

Regards,

Steve N.

mov eax, [edi+ecx*4+DISP]
mov edx, eax
bswap eax
and edx, 0ff000000h
shr eax, 8
or eax, edx
mov [edi+ecx*4+DISP], eax
4071 cycles
0 cycles

Press any key to exit...

mov eax, [edi+ecx*4+DISP]
; mov edx, eax
bswap eax
; and edx, 0ff000000h
; shr eax, 8
ROR eax, 8
; or eax, edx
mov [edi+ecx*4+DISP], eax
4301 cycles
0 cycles

Press any key to exit...

mov eax, [edi+ecx*4+DISP]
; mov edx, eax
bswap eax
; and edx, 0ff000000h
shr eax, 8
; ROR eax, 8
; or eax, edx
mov [edi+ecx*4+DISP], eax
3279 cycles
0 cycles

Press any key to exit...

; mov eax, [edi+ecx*4+DISP]
MOV AL, [edi+ecx*4+DISP]
MOV DL, [edi+ecx*4+DISP+2]
; mov edx, eax
; bswap eax
; and edx, 0ff000000h
; shr eax, 8
; ROR eax, 8
; or eax, edx
; mov [edi+ecx*4+DISP], eax
mov [edi+ecx*4+DISP+2], AL
mov [edi+ecx*4+DISP], DL
2785 cycles

MOVZX EAX, BYTE PTR [edi+ecx*4+DISP]
MOVZX EDX, BYTE PTR [edi+ecx*4+DISP+2]
MOV [edi+ecx*4+DISP+2], AL
MOV [edi+ecx*4+DISP], DL
2852 cycles
0 cycles

Press any key to exit...
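
A caveat when reading the timings above: the SHR-only variant (3279 cycles) is not doing the same job as the first two. BSWAP followed by ROR 8 rotates the alpha byte back to the top, while BSWAP followed by SHR 8 shifts it out and leaves alpha at zero. A small C sketch of the difference, assuming pixels loaded as little-endian dwords; the helper names are made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for BSWAP. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Portable stand-in for ROR by n bits. */
static uint32_t ror32(uint32_t v, unsigned n)
{
    assert(n > 0 && n < 32);   /* keeps both shifts well defined in C */
    return (v >> n) | (v << (32 - n));
}

/* bswap + ror 8: red and blue swapped, alpha preserved. */
static uint32_t variant_ror(uint32_t pixel)
{
    return ror32(bswap32(pixel), 8);
}

/* bswap + shr 8: red and blue swapped, alpha cleared to zero. */
static uint32_t variant_shr(uint32_t pixel)
{
    return bswap32(pixel) >> 8;
}
```

With the pixel `0x11223344`, `variant_ror` returns `0x11443322` and `variant_shr` returns `0x00443322`, so the SHR-only version is only a fair substitute when alpha is unused or always zero.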

jj2007 · May 26, 2010, 12:56:44 PM

Quote from: Neo on May 25, 2010, 03:19:10 AM
Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once. I'm not sure on their relative timings.

Probably a lot faster, but you impose a serious limit on the hardware: pshufb is SSSE3 (Supplemental SSE3), which is sometimes confused with SSE4 but is a separate extension...

dedndave · May 26, 2010, 01:16:45 PM

these are SSE2 instructions - maybe you can find something to do the job

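pshufd - Shuffles the four 32bit values in any order selected by an 8bit immediate.
pshufhw - Shuffles the four high 16bit values in any order selected by an 8bit immediate.
pshuflw - Shuffles the four low 16bit values in any order selected by an 8bit immediate.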
unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128 bit sources into 1.
punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
punpckhqdq - Unpacks and interleaves top 64bit integers from 2 128bit sources into 1.
punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
punpcklqdq - Unpacks and interleaves bottom 64bit integers from 2 128bit sources into 1.

hutch-- · May 26, 2010, 04:53:10 PM

On old hardware I wonder if this approach is of any use.

IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL rgba  :DWORD

    mov rgba, 0099AAEEh

    print uhex$(rgba),13,10

    lea edx, rgba

    movzx eax, BYTE PTR [edx]
    movzx ecx, BYTE PTR [edx+2]
    mov [edx], cl
    mov [edx+2], al

    print uhex$(rgba),13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
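
The byte-swap approach above extends to a whole buffer in the obvious way. A hedged C sketch of that loop, working on raw bytes so it does not depend on how a dword is laid out; `swap_bytes_0_and_2` is a name invented here, and the comments map each line back to the instructions in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap byte 0 and byte 2 of every 4-byte pixel, in place. */
static void swap_bytes_0_and_2(uint8_t *pixels, size_t n_pixels)
{
    assert(pixels != NULL || n_pixels == 0);
    for (size_t i = 0; i < n_pixels; ++i) {
        uint8_t *p = pixels + i * 4;
        uint8_t first = p[0];   /* movzx eax, BYTE PTR [edx]   */
        uint8_t third = p[2];   /* movzx ecx, BYTE PTR [edx+2] */
        p[0] = third;           /* mov [edx], cl               */
        p[2] = first;           /* mov [edx+2], al             */
    }
}
```

The test value `0099AAEEh` sits in memory as the bytes EE AA 99 00, so after the swap they read 99 AA EE 00, which is the dword `00EEAA99h`.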

jj2007 · May 26, 2010, 07:42:29 PM

SSE version attached. I have a feeling that it could be improved a lot... ::)

3831 cycles for RgbSwap MichaelW
2389 cycles for RgbSwap Hutch
2417 cycles for RgbSwap Hutch2
7280 cycles for RgbSwapSSE

Any ideas how to get rid of two of the pshufs? It looks as if it should be possible...

@@:   dec ecx
   movq xmm1, qword ptr [edx+8*ecx]
   pxor xmm0, xmm0   ; xmm0 may contain garbage
   punpcklbw xmm0, xmm1   ; expand 8 bytes to 8 xmm0 words
   pshufd xmm0, xmm0, 00011011b   ; inspired by drizz
   pshuflw xmm0, xmm0, 10110001b
   pshufhw xmm0, xmm0, 10110001b   ; all words swapped
   pshufd xmm0, xmm0, 01001110b   ; switch low dwords
   psrlq xmm0, 24   ; shift right by three bytes
   pxor xmm2, xmm2
   pand xmm1, TheAnd
   packsswb xmm0, xmm2   ; pack words to bytes
   paddb xmm0, xmm1
   movq qword ptr [edx+8*ecx], xmm0   ; back to mem, 8 bytes
   test ecx, ecx
   jne @B
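
One thing worth checking in the pack step of the loop above is the saturation mode. PACKSSWB clamps each word to the signed range -128..127, while PACKUSWB clamps to 0..255, so if the words being packed hold plain 0..255 channel values (as they appear to after the shift), anything above 127 would come out as 127 with the signed form. A per-lane C sketch of the two behaviours; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* What PACKSSWB does to one 16-bit lane: signed saturation to -128..127. */
static int8_t pack_signed(int16_t w)
{
    if (w > 127)  return 127;
    if (w < -128) return -128;
    return (int8_t)w;
}

/* What PACKUSWB does to one 16-bit lane: unsigned saturation to 0..255.
   The input word is still interpreted as signed. */
static uint8_t pack_unsigned(int16_t w)
{
    if (w > 255) return 255;
    if (w < 0)   return 0;
    return (uint8_t)w;
}

/* Spot checks: a channel value of 200 survives only the unsigned pack. */
static void check_pack_examples(void)
{
    assert(pack_signed(200) == 127);
    assert(pack_unsigned(200) == 200);
    assert(pack_signed(100) == 100 && pack_unsigned(100) == 100);
}
```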

qWord · May 26, 2010, 10:18:58 PM

jj,
there is a small bug in your algo: you are using packsswb, which should be packuswb (?).
Here is my suggestion for an SSE2 version:

align 16
RgbSwapSSE2 proc ; Src, n4Pixels
    pop eax ; ret address
    pop edx ; Src
    pop ecx ; N*4 pixels
    align 16
@@: dec ecx 
    movdqa xmm0,[edx]
    movdqa xmm1,xmm0    ; movdqa xmm1,[edx]
    psllw xmm0,8
    psrlw xmm1,8
    pshuflw xmm0,xmm0,10110001y
    pshufhw xmm0,xmm0,10110001y
    psrldq xmm0,1
    pslldq xmm1,1
    por xmm0,xmm1
    movdqa [edx],xmm0
    lea edx,[edx+16]    
    test ecx, ecx
    jne @B
@@: jmp eax
RgbSwapSSE2 endp

Core 2 Duo:

3176 cycles for RgbSwap MichaelW
2238 cycles for RgbSwap Hutch
1470 cycles for RgbSwapSSE2 qWord
2217 cycles for RgbSwap Hutch2
4875 cycles for RgbSwapSSE

qWord
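
The SSE2 routine above keeps bytes 1 and 3 of each pixel (G and A) where they are and makes bytes 0 and 2 (R and B) trade places, four pixels at a time. The same mask-and-swap idea on a single dword, as a rough scalar C sketch; it assumes the pixel is loaded as a little-endian dword, and `swap_rb_masked` and `swap_rb_buffer` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Keep G and A in place, exchange R and B. */
static uint32_t swap_rb_masked(uint32_t pixel)
{
    uint32_t ga = pixel & 0xFF00FF00u;     /* bytes 1 and 3 stay put        */
    uint32_t rb = pixel & 0x00FF00FFu;     /* bytes 0 and 2 trade places    */
    return ga | (rb << 16) | (rb >> 16);   /* rotate the R/B pair by 16 bits */
}

/* Whole buffer, one dword per pixel, converted in place. */
static void swap_rb_buffer(uint32_t *pixels, size_t count)
{
    assert(pixels != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        pixels[i] = swap_rb_masked(pixels[i]);
}
```

Running the conversion twice returns the original pixels, so one routine serves both directions.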

hutch-- · May 27, 2010, 04:44:00 AM

See if unrolling the algo gives it some more legs. This just does 2 swaps at a time, twice. The idea is to use a compatible algo for older hardware and SSE for later processors, and just do a processor detect to see what can run what.

IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL rgba[4]:DWORD

    push ebx
    push esi
    push edi

    lea esi, rgba

    mov [esi],    DWORD PTR 0099AAEEh
    mov [esi+4],  DWORD PTR 007744DDh
    mov [esi+8],  DWORD PTR 003399FFh
    mov [esi+12], DWORD PTR 007744DDh

    print uhex$([esi])," "
    print uhex$([esi+4])," "
    print uhex$([esi+8])," "
    print uhex$([esi+12]),13,10

  ; --------------------------------------

    movzx eax, BYTE PTR [esi]
    movzx ebx, BYTE PTR [esi+2]
    movzx ecx, BYTE PTR [esi+4]
    movzx edx, BYTE PTR [esi+6]

    mov [esi], bl
    mov [esi+2], al
    mov [esi+4], dl
    mov [esi+6], cl

  ; --------------------------------------

    movzx eax, BYTE PTR [esi+8]
    movzx ebx, BYTE PTR [esi+10]
    movzx ecx, BYTE PTR [esi+12]
    movzx edx, BYTE PTR [esi+14]

    mov [esi+8], bl
    mov [esi+10], al
    mov [esi+12], dl
    mov [esi+14], cl

  ; --------------------------------------

    print uhex$([esi])," "
    print uhex$([esi+4])," "
    print uhex$([esi+8])," "
    print uhex$([esi+12]),13,10

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
