RGBA to BGRA (and back again)

Started by oex, May 24, 2010, 01:31:00 AM

oex

Hey guys, I've got to write this at some point, but I didn't want to deny you the chance to go to town on it.... I couldn't find it with the search tool, and it's a bit of an easy, fun one :lol.... I suggest testing on a 1280x1024 image for timings; it looks like a perfect task for SSE (I think).... :bg.... Is there a default testbed? If not, can we add a sticky one? I could rip the guts out of one of JJ's previous ones, maybe? (I would try to do it humanely :lol)
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv

Farabi

For speed, what matters is the bit depth. I prefer 24 or 32-bit.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

Neo

Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once.  I'm not sure on their relative timings.

dedndave

SSE would definitely be faster if everything is aligned properly   :P

Farabi

Is SSE really that much faster?
How do the average timings of mov and movq compare?

Neo

Looks like BSWAP has longer latency and same throughput (depending on the CPU), so PSHUFB should probably be about 4x faster overall in well-optimized code (i.e. throughput-bound).  :bg
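
For anyone who wants to try the PSHUFB idea, here is a minimal sketch rather than a tuned routine. It assumes an SSSE3-capable CPU and an assembler that accepts pshufb and oword ptr (the stock MASM32 assembler may not). The names pixels and swapmask and the sample values are made up for illustration.

```asm
; Sketch only: needs an SSSE3 CPU and an assembler that knows pshufb
    include \masm32\include\masm32rt.inc
    .686
    .xmm

    .data
      ; four made-up sample pixels, one dword each
      pixels   dd 11223344h, 55667788h, 99AABBCCh, 0DDEEFF00h
      ; pshufb control: result byte i is taken from source byte swapmask[i],
      ; so 2,1,0,3 swaps bytes 0 and 2 of a dword and keeps bytes 1 and 3
      swapmask db 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15

    .code
start:
    movdqu xmm1, oword ptr swapmask   ; load the shuffle control once
    movdqu xmm0, oword ptr pixels     ; unaligned load of four pixels
    pshufb xmm0, xmm1                 ; swap bytes 0 and 2 in all four dwords
    movdqu oword ptr pixels, xmm0     ; write the four pixels back

    print uhex$(pixels),13,10         ; 11223344h becomes 11443322h
    print uhex$(pixels+4),13,10
    print uhex$(pixels+8),13,10
    print uhex$(pixels+12),13,10

    inkey "Press any key to exit..."
    exit
end start
```

In a real loop the mask would stay in a register, and movdqa could replace movdqu if the buffer is known to be 16-byte aligned.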

MichaelW

Using only integer instructions and running on a P3 I can't get below about 4 cycles per pixel.

;=====================================================================
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
;=====================================================================
    .data
      buffer dd 1000 dup(0)
    .code
;=====================================================================
start:
;=====================================================================

    lea edi, buffer

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
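        UF=4                      ; unroll factor, defined before first use
        mov ecx, 1000-UF          ; start at the last full group so every access stays inside buffer
      @@:
        DISP=0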
        REPEAT UF
          mov eax, [edi+ecx*4+DISP]
          mov edx, eax
          bswap eax
          and edx, 0ff000000h
          shr eax, 8
          or eax, edx
          mov [edi+ecx*4+DISP], eax
          DISP=DISP+4
        ENDM
        sub ecx,UF
        jns @B
    counter_end
    print ustr$(eax)," cycles",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
    counter_end
    print ustr$(eax)," cycles",13,10,13,10

    inkey "Press any key to exit..."
    exit
;=====================================================================
end start


It would be interesting to see how much faster a version based on PSHUFD would be.
eschew obfuscation

oex

:/ With that timing I'll have to create BGRA in a separate process, dependent on card features.... It's always the simple things that are most difficult :lol

1280x1024 RGBA-BGRA:

File size: 5 MB

9651293 cycles
0 cycles

Press any key to exit...

Ty for your input Michael

FORTRANS

Hi,

   Looking at the code posted by MichaelW, I decided to
run an experiment, as it looked strange to me.  Four variants
were run on my PIII.  The results surprised me a bit.  SHR
is apparently much faster than ROR, which was the first
surprise.  The variant with the extra memory accesses also
was not what I expected.  Oh well, so much for expectations.

Regards,

Steve N.


          mov eax, [edi+ecx*4+DISP]
          mov edx, eax
          bswap eax
          and edx, 0ff000000h
          shr eax, 8
          or eax, edx
          mov [edi+ecx*4+DISP], eax
4071 cycles
0 cycles

Press any key to exit...

          mov eax, [edi+ecx*4+DISP]
;          mov edx, eax
          bswap eax
;          and edx, 0ff000000h
;          shr eax, 8
          ROR eax, 8
;          or eax, edx
          mov [edi+ecx*4+DISP], eax
4301 cycles
0 cycles

Press any key to exit...

          mov eax, [edi+ecx*4+DISP]
;          mov edx, eax
          bswap eax
;          and edx, 0ff000000h
          shr eax, 8
;          ROR eax, 8
;          or eax, edx
          mov [edi+ecx*4+DISP], eax
3279 cycles
0 cycles

Press any key to exit...

;          mov eax, [edi+ecx*4+DISP]
          MOV AL, [edi+ecx*4+DISP]
          MOV DL, [edi+ecx*4+DISP+2]
;          mov edx, eax
;          bswap eax
;          and edx, 0ff000000h
;          shr eax, 8
;          ROR eax, 8
;          or eax, edx
;          mov [edi+ecx*4+DISP], eax
          mov [edi+ecx*4+DISP+2], AL
          mov [edi+ecx*4+DISP], DL
2785 cycles

          MOVZX EAX, BYTE PTR [edi+ecx*4+DISP]
          MOVZX EDX, BYTE PTR [edi+ecx*4+DISP+2]
          MOV [edi+ecx*4+DISP+2], AL
          MOV [edi+ecx*4+DISP], DL
2852 cycles
0 cycles

Press any key to exit...
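
One thing worth checking when comparing these variants is whether they all produce the same pixel. The sketch below runs the first three on one made-up test value (the names pixel, r1, r2 and r3 are hypothetical). BSWAP followed by ROR 8 gives the same result as the and/shr/or sequence, while BSWAP followed by SHR 8 alone clears the alpha byte, so its timing is only comparable if alpha does not matter.

```asm
    include \masm32\include\masm32rt.inc

    .data
      pixel dd 0CC112233h       ; made-up test pixel, top byte (alpha) = CCh
      r1    dd 0
      r2    dd 0
      r3    dd 0

    .code
start:
    ; variant 1: bswap, then put the alpha byte back with and/shr/or
    mov eax, pixel
    mov edx, eax
    bswap eax
    and edx, 0FF000000h
    shr eax, 8
    or eax, edx
    mov r1, eax

    ; variant 2: bswap + ror, alpha rotates back into the top byte
    mov eax, pixel
    bswap eax
    ror eax, 8
    mov r2, eax

    ; variant 3: bswap + shr only, alpha is shifted out
    mov eax, pixel
    bswap eax
    shr eax, 8
    mov r3, eax

    print uhex$(r1)," bswap/and/shr/or",13,10     ; CC332211
    print uhex$(r2)," bswap/ror",13,10            ; CC332211
    print uhex$(r3)," bswap/shr only",13,10       ; 00332211

    inkey "Press any key to exit..."
    exit
end start
```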

jj2007

Quote from: Neo on May 25, 2010, 03:19:10 AM
Actually, it sounds like a perfect task for the BSWAP instruction, but you might be able to use PSHUFB to do 4 at once.  I'm not sure on their relative timings.

Probably a lot faster, but you impose a serious limit on the hardware: pshufb is SSSE3, which is sometimes mislabelled as SSE4...

dedndave

these are SSE2 instructions - maybe you can find something to do the job
pshufd - Shuffles 32bit values in a complex way.
pshufhw - Shuffles high 16bit values in a complex way.
pshuflw - Shuffles low 16bit values in a complex way.
unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128 bit sources into 1.
punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
punpckhqdq - Unpacks and interleaves top 64bit integers from 2 128bit sources into 1.
punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
punpcklqdq - Unpacks and interleaves bottom 64bit integers from 2 128bit sources into 1.


hutch--

On old hardware I wonder if this approach is of any use.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL rgba  :DWORD

    mov rgba, 0099AAEEh

    print uhex$(rgba),13,10

    lea edx, rgba

    movzx eax, BYTE PTR [edx]
    movzx ecx, BYTE PTR [edx+2]
    mov [edx], cl
    mov [edx+2], al

    print uhex$(rgba),13,10

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

SSE version attached. I have a feeling that it could be improved a lot... ::)

3831 cycles for RgbSwap MichaelW
2389 cycles for RgbSwap Hutch
2417 cycles for RgbSwap Hutch2
7280 cycles for RgbSwapSSE


Any ideas how to get rid of two of the pshufs? It looks as if it should be possible...

@@:   dec ecx
   movq xmm1, qword ptr [edx+8*ecx]
   pxor xmm0, xmm0   ; xmm0 may contain garbage
   punpcklbw xmm0, xmm1   ; expand 8 bytes to 8 xmm0 words
   pshufd xmm0, xmm0, 00011011b   ; inspired by drizz
   pshuflw xmm0, xmm0, 10110001b
   pshufhw xmm0, xmm0, 10110001b   ; all words swapped
   pshufd xmm0, xmm0, 01001110b   ; switch low dwords
   psrlq xmm0, 24   ; shift right by three bytes
   pxor xmm2, xmm2
   pand xmm1, TheAnd
   packsswb xmm0, xmm2   ; pack words to bytes
   paddb xmm0, xmm1
   movq qword ptr [edx+8*ecx], xmm0   ; back to mem, 8 bytes
   test ecx, ecx
   jne @B

qWord

jj,
there is a small bug in your algo: you are using packsswb, which should be packuswb.(?)
Here is my suggestion for an SSE2 version:
align 16
RgbSwapSSE2 proc ; Src, n4Pixels
    pop eax ; ret address
    pop edx ; Src
    pop ecx ; N*4 pixels
    align 16
@@: dec ecx
    movdqa xmm0,[edx]
    movdqa xmm1,xmm0    ; movdqa xmm1,[edx]
    psllw xmm0,8
    psrlw xmm1,8
    pshuflw xmm0,xmm0,10110001y
    pshufhw xmm0,xmm0,10110001y
    psrldq xmm0,1
    pslldq xmm1,1
    por xmm0,xmm1
    movdqa [edx],xmm0
    lea edx,[edx+16]   
    test ecx, ecx
    jne @B
@@: jmp eax
RgbSwapSSE2 endp

core2duo:
3176 cycles for RgbSwap MichaelW
2238 cycles for RgbSwap Hutch
1470 cycles for RgbSwapSSE2 qWord
2217 cycles for RgbSwap Hutch2
4875 cycles for RgbSwapSSE
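
On the packsswb point above, here is a small sketch of the difference, assuming an assembler that accepts SSE2 mnemonics (the names signedPack and unsignedPack and the test value are made up). packsswb saturates to the signed byte range, so any colour component above 7Fh is clamped to 7Fh, while packuswb keeps values up to FFh.

```asm
    include \masm32\include\masm32rt.inc
    .686
    .xmm

    .data
      signedPack   dd 0
      unsignedPack dd 0

    .code
start:
    mov eax, 00EE0099h          ; two words, 0099h and 00EEh, both above 7Fh
    movd xmm0, eax              ; upper bits of xmm0 are zeroed
    movdqa xmm1, xmm0           ; second copy for the unsigned pack
    pxor xmm2, xmm2             ; zero source for the upper eight bytes

    packsswb xmm0, xmm2         ; signed saturation clamps both words to 7Fh
    packuswb xmm1, xmm2         ; unsigned saturation keeps 99h and EEh

    movd signedPack, xmm0
    movd unsignedPack, xmm1

    print uhex$(signedPack)," packsswb",13,10      ; 00007F7F
    print uhex$(unsignedPack)," packuswb",13,10    ; 0000EE99

    inkey "Press any key to exit..."
    exit
end start
```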


qWord
FPU in a trice: SmplMath
It's that simple!

hutch--

See if unrolling the algo gives it some more legs. This one just does 2 swaps at a time, twice. The idea is to use a compatible algo for older hardware and SSE for later processors, and just do a processor detect to see what can run what.


IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL rgba[4]:DWORD

    push ebx
    push esi
    push edi

    lea esi, rgba

    mov [esi],    DWORD PTR 0099AAEEh
    mov [esi+4],  DWORD PTR 007744DDh
    mov [esi+8],  DWORD PTR 003399FFh
    mov [esi+12], DWORD PTR 007744DDh

    print uhex$([esi])," "
    print uhex$([esi+4])," "
    print uhex$([esi+8])," "
    print uhex$([esi+12]),13,10

  ; --------------------------------------

    movzx eax, BYTE PTR [esi]
    movzx ebx, BYTE PTR [esi+2]
    movzx ecx, BYTE PTR [esi+4]
    movzx edx, BYTE PTR [esi+6]

    mov [esi], bl
    mov [esi+2], al
    mov [esi+4], dl
    mov [esi+6], cl

  ; --------------------------------------

    movzx eax, BYTE PTR [esi+8]
    movzx ebx, BYTE PTR [esi+10]
    movzx ecx, BYTE PTR [esi+12]
    movzx edx, BYTE PTR [esi+14]

    mov [esi+8], bl
    mov [esi+10], al
    mov [esi+12], dl
    mov [esi+14], cl

  ; --------------------------------------

    print uhex$([esi])," "
    print uhex$([esi+4])," "
    print uhex$([esi+8])," "
    print uhex$([esi+12]),13,10

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start
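
The processor detect mentioned above could look something like the sketch below. It is only an outline, assuming a CPU that supports CPUID and an OS that saves the SSE state; the messages stand in for whatever routine selection the real code would do.

```asm
    include \masm32\include\masm32rt.inc
    .586                         ; cpuid needs at least .586

    .code
start:
    mov eax, 1                   ; CPUID leaf 1 returns the feature flags
    cpuid                        ; overwrites eax, ebx, ecx and edx
    mov esi, edx                 ; keep the EDX flags
    mov edi, ecx                 ; keep the ECX flags

    .if edi & 00000200h          ; ECX bit 9: SSSE3
        print "SSSE3 found: a pshufb version can run",13,10
    .elseif esi & 04000000h      ; EDX bit 26: SSE2
        print "SSE2 found: an SSE2 version can run",13,10
    .else
        print "No SSE2: use the integer byte-swap version",13,10
    .endif

    inkey "Press any key to exit..."
    exit
end start
```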