Graphics a la FPU

Started by donkey, January 09, 2005, 04:02:45 AM


raymond

Here are the results and observations of more tests to verify the accuracy and relative speeds of the FPU 32-bit floats and CPU 32-bit fixed-point math algos while retaining the full precision of each for the YUV values.

The entire range of RGB values from 000000 to FFFFFF was verified by first converting the RGB to YUV, converting back from the stored YUV to RGB, and comparing the returned RGB to its original one. Both algos were 100% accurate.
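The round-trip check described above can be sketched in C. This is only an illustration of the test procedure, not the assembly that was timed: the struct and function names are invented here, the formulas are the ones quoted later in the thread (Y = 0.299R + 0.587G + 0.114B, U = 0.492(B - Y), V = 0.877(R - Y)), and the colour layout is assumed to be 00BBGGRR.

```c
#include <stdint.h>

/* Hypothetical container for the stored 32-bit float YUV values. */
typedef struct { float y, u, v; } Yuv;

/* Colour is assumed to be packed as 00BBGGRR. */
static Yuv rgb_to_yuv(uint32_t rgb) {
    float r = (float)(rgb & 0xFF);
    float g = (float)((rgb >> 8) & 0xFF);
    float b = (float)((rgb >> 16) & 0xFF);
    Yuv out;
    out.y = 0.299f * r + 0.587f * g + 0.114f * b;
    out.u = 0.492f * (b - out.y);
    out.v = 0.877f * (r - out.y);
    return out;
}

/* Round to nearest and saturate to 0..255. */
static uint32_t round_clamp(float x) {
    if (x < 0.0f) return 0;
    if (x > 255.0f) return 255;
    return (uint32_t)(x + 0.5f);
}

static uint32_t yuv_to_rgb(Yuv in) {
    float r = in.v / 0.877f + in.y;
    float b = in.u / 0.492f + in.y;
    float g = (in.y - 0.299f * r - 0.114f * b) / 0.587f;
    return (round_clamp(b) << 16) | (round_clamp(g) << 8) | round_clamp(r);
}

/* Walk all 16,777,216 colours and count those that do not survive the trip. */
static uint32_t count_roundtrip_mismatches(void) {
    uint32_t bad = 0;
    for (uint32_t rgb = 0; rgb <= 0xFFFFFFu; rgb++) {
        if (yuv_to_rgb(rgb_to_yuv(rgb)) != rgb) bad++;
    }
    return bad;
}
```

Calling count_roundtrip_mismatches() from a small main and printing the result reproduces the accuracy part of the test; the timings will of course differ from the assembly versions.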

The time required on my P3-550 to perform those 16,777,216 iterations was:
7.3 seconds for the CPU-based algo
8.5 seconds for the FPU-based algo

Removing the "finit" instruction from each of the two procedures in the FPU algo (and using it only once before starting the timing) resulted in reducing the measured time to: 6.9 seconds.
(This would necessitate that the programmer can manage the content of the FPU registers to prevent "FPU stack overflow")

The above time was further reduced to 5.3 seconds by changing the "Precision Control" bits of the FPU Control Word to REAL4 precision instead of the default REAL10 precision when the FPU is initialized. (Doing this after the finit instruction when it is retained in each procedure had no significant effect on the time required.)

Raymond
When you assume something, you risk being wrong half the time
http://www.ray.masmcode.com

dioxin

Raymond,
the non-FPU version I posted earlier is also 100% accurate and for the whole range 0-FFFFFF it runs in 0.65s on an Athlon XP2600+ (1900MHz).

For a closer comparison with yours, it runs in 3.03 secs on a K6-III/400MHz.

Paul.

raymond

dioxin,

I agree that your code would be 100% accurate for converting RGBs to YUVs and back to RGBs. The main speed improvement I can see is performing a multiplication with a declared reciprocal instead of a division. I will test that variation in my algo. I don't think that the other major differences should have that much of an effect on timing, most of the cycles being taken by multiplications.

The only problem I would have with your version is how well it would perform when you start modifying the stored YUV values with only 9 bits available for the integer portion, one of those bits being required for the sign. Any increase of the absolute value of the integer above 255 would change the sign and then be disastrous on the resulting RGB.
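To make the range limit concrete, here is a small C sketch of a signed 9.23 fixed-point format (9 integer bits including the sign, 23 fraction bits, stored in 32 bits). The helper name is made up for illustration, and the format is assumed to be plain two's complement.

```c
#include <stdint.h>

#define Q23_ONE 8388608.0 /* 2^23: the value 1.0 in 9.23 fixed point */

/* Returns 1 if v can be stored in a signed 32-bit 9.23 value, else 0.
   The representable range works out to -256.0 <= v < 256.0. */
static int fits_q9_23(double v) {
    double scaled = v * Q23_ONE;
    return scaled >= (double)INT32_MIN && scaled <= (double)INT32_MAX;
}
```

Under these assumptions 255.999 still fits, but 256.0 scales to 2^31, which no longer fits in a signed 32-bit register and would show up as a negative number.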

Raymond

raymond

Replacing the divisions with multiplications (of reciprocals) yielded the following relative improvements over previously reported results (still with 100% accuracy).

CPU algo: from 7.3 down to 4.1 seconds
FPU algo: from 5.3 down to 3.8 seconds

Adapting dioxin's code provided only a marginal improvement over my CPU algo but does not perform any checking for underflow nor overflow.

It would seem that the FPU would be the route to take. It should be simpler and less prone to errors (and more portable) to work with floats when modifying the YUVs.

Raymond

dioxin

Raymond,
Quote
but does not perform any checking for underflow nor overflow.
No checking needs to be done since no overflow is possible when starting with RGB values.

Paul.

raymond

Quote from: dioxin on January 11, 2005, 05:46:26 PM
Raymond,
Quote
but does not perform any checking for underflow nor overflow.
No checking needs to be done since no overflow is possible when starting with RGB values.

Paul.

I totally agree with you on that point. However, RGBs are converted to YUV to perform some operations on those values before converting the modified YUVs back to RGBs. Those modifications could result in an overflow/underflow situation. The YUV2RGB procedure MUST check that possibility to avoid erroneous results.

As an example, increasing the brightness of a color where the RED is already at its maximum would increase the value of each of the color components, including that of the RED, which would then overflow. If the increase is small, the level of the RED could go from FF to 02 which is certainly not what you should expect from an increase of brightness.
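The difference between wrapping and saturating can be shown with a short C sketch. The function names are invented for illustration; mask_u8 mimics what an "and value,0FFh" does to an out-of-range result, while clamp_u8 is the kind of check being argued for.

```c
#include <stdint.h>

/* Wrap-around: keep only the low 8 bits, as "and value,0FFh" would. */
static uint8_t mask_u8(int32_t v) {
    return (uint8_t)((uint32_t)v & 0xFFu);
}

/* Saturation: negative results become 0, results above 255 become 255. */
static uint8_t clamp_u8(int32_t v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}
```

With a RED of FF pushed up to 102h by a brightness increase, mask_u8 gives 02 while clamp_u8 keeps FF. In the other direction, a value of -2 (FFFFFFFE) masks to FE but clamps to 0.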

Raymond


dioxin

Raymond,
   I see what you're saying. In the past I've only ever used YUV as a broadcast standard, never to process data. Video processing was always done in RGB, and then, at the end, converted to YUV ready for transmission.

   I can see that my method doesn't check for over/underflow when converting (presumably invalid) data from YUV to RGB but shouldn't it be the job of the YUV processing to make sure the output is valid?


   Having said that, I think the main problem here is that there has been no standard specified for how YUV data should be stored.
   RGB appears to be in the form 00BBGGRR but everyone here has chosen their own way to represent YUV. Perhaps Donkey has a "standard" in mind for how YUV should be stored? Perhaps there is a real standard way to store YUV that everyone (except me) knows? If there is to be signal processing of the YUV data then we need a standard way to store it.


   On the business of CPU vs FPU.
   I reckon there is scope to use both to give the best result.
   The RGB->YUV conversion is probably too simple to use both but the YUV-> RGB conversion is more complex and may allow the FPU to do useful stuff while the CPU is also doing useful stuff.

   I've already modified my code to speed it up a bit. The RGB to YUV conversion can be done as follows:


mov ebx,col        ;get the RGB colour
movzx edi,bh       ;edi=green
movzx esi,bl       ;esi=red
shr ebx,16         ;ebx=blue

imul edi,&h4B22D0      ;GREEN*0.587
imul edx,ebx,&hE978D   ;BLUE*0.114
imul ecx,esi,&h2645A1  ;RED*0.299

add edi,edx            ;accumulate in edi
add edi,ecx            ;accumulate in edi. edi now contains Y

mov y,edi      ;store Y


mov eax,esi    ;red
shl eax,23     ;line up with Y
sub eax,edi    ;(R-Y)
imul v0877&    ;0.877*(R-Y)

shl eax,1      ;double result to correct for v0877& being half size to prevent overflow
rcl edx,1

mov V,edx      ;V done


mov eax,ebx    ;blue
shl eax,23     ;line up with Y
sub eax,edi    ;(B-Y)
imul v0492&    ;0.492*(B-Y)
mov U,edx      ;U done
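For readers who do not use PowerBASIC inline assembly, the Y part of the listing above can be sketched in C. The three multipliers are the ones from the listing (each coefficient scaled by 2^23); the function names and the rounding helper are illustrative additions. The U and V parts are left out because the values of v0877& and v0492& are not given in the post.

```c
#include <stdint.h>

/* Y in fixed point with 23 fraction bits, from a 00BBGGRR colour.
   The constants are 0.587, 0.114 and 0.299 scaled by 2^23, as in the listing. */
static uint32_t luma_q23(uint32_t rgb) {
    uint32_t r = rgb & 0xFF;
    uint32_t g = (rgb >> 8) & 0xFF;
    uint32_t b = (rgb >> 16) & 0xFF;
    return g * 0x4B22D0u + b * 0xE978Du + r * 0x2645A1u;
}

/* Round a non-negative value with 23 fraction bits to the nearest integer. */
static uint32_t q23_to_int(uint32_t q) {
    return (q + (1u << 22)) >> 23;
}
```

The largest possible sum (white, all three components at 255) stays below 2^31, so the accumulation cannot overflow for valid RGB input.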



   I'm reluctant to do any further work if the format of the YUV values is in question since a lot of the code is very specific to the way the data is stored.


Paul.

donkey

Hi Guys,

I'm really sorry that I haven't gotten back to this; my work week sucks up much of my time, and I plan to tackle the problem again this weekend with some approximations that I have been trying to work out. My preferred format is unimportant, though I have chosen in my test to use 3 QWORD floats (REAL8). Because this is an internal-only routine, it is open to change.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

dioxin

Donkey,
Quote
3 QWORD floats
   Aww, bugger. That messes up my method!


Paul

raymond

My tests seemed to indicate that 32-bit floats may be accurate enough if stored in that format (vs storing only their rounded integer value). Using 64-bit floats may be overkill.

Raymond

donkey

Hi Raymond,

Yes, I tested my routine with 3 32-bit floats and it works, so I think 3 x 32-bit floats is the structure I will be using. I am not sure that it is possible to reliably convert back and forth across the full range without the FPU, so for now I am looking at optimizing the FPU routine. As I have little experience with the FPU, it is a bit difficult for me to know where the bottlenecks are and what can be done more effectively. Certainly the finit is a problem, but the routine is unstable without it.

YUV STRUCT
Y DD ?
U DD ?
V DD ?
ENDS

YUV2RGB FRAME pYUV
uses esi
LOCAL RED :D
LOCAL GREEN :D
LOCAL BLUE :D
LOCAL garbage :D

CONST SECTION
n2p032 DD 2.032
n1p703 DD 1.703
n1p14 DD 1.14
np509 DD 0.509
np194 DD 0.194

CODE SECTION

mov esi,[pYUV]

finit

fld D[esi+YUV.V]
fld D[n1p14]
fmul
fld D[esi+YUV.Y]
fadd ST0,ST1
fist D[RED]
fxch ST0,ST1
fstp D[garbage]

fld D[esi+YUV.U]
fld D[n2p032]
fmul
fld D[esi+YUV.Y]
fadd ST0,ST1
fist D[BLUE]
fxch ST0,ST1
fstp D[garbage]

fld D[esi+YUV.Y]
fld D[n1p703]
fmul

; Bring RED to the batters box...
fxch ST0,ST2
fld D[np509]
fmul

; Bring BLUE to the batters box...
fxch ST0,ST1
fld D[np194]
fmul

; Bring Y to the batters box
fxch ST0,ST2
fsub ST0,ST1
fsub ST0,ST2
fistp D[GREEN]

and D[GREEN],0FFh
and D[RED],0FFh
and D[BLUE],0FFh

mov eax,[BLUE]
shl eax,8
or eax,[GREEN]
shl eax,8
or eax,[RED]

RET
ENDF

RGB2YUV FRAME clrRGB, pYUV
uses esi
LOCAL RED :D
LOCAL GREEN :D
LOCAL BLUE :D

CONST SECTION
n877 DD 0.877
n492 DD 0.492
n114 DD 0.114
n299 DD 0.299
n587 DD 0.587

CODE SECTION

/*
Y = 0.299 R + 0.587 G + 0.114 B
U = 0.492 (B - Y)
V = 0.877 (R - Y)
*/
finit

mov esi, [pYUV]
mov eax,[clrRGB]
and eax,0FFh
mov [RED],eax

mov eax,[clrRGB]
shr eax,8
and eax,0FFh
mov [GREEN],eax

mov eax,[clrRGB]
shr eax,16
and eax,0FFh
mov [BLUE],eax

; ######### Y
fild D[RED]
fld D[n299]
fmul
fild D[BLUE]
fld D[n114]
fmul
fild D[GREEN]
fld D[n587]
fmul
fadd ST0,ST1
fadd ST0,ST2
fst D[esi+YUV.Y]

; ######### U
fild D[BLUE]
fsub ST0,ST1
fld D[n492]
fmul
fstp D[esi+YUV.U]

; ######### V
fild D[RED]
fsub ST0,ST1
fld D[n877]
fmul
fstp D[esi+YUV.V]

RET
ENDF

dioxin

Donkey,
  you can speed up the splitting of 00BBGGRR into the colours using something like:

mov ebx,col        ;get the RGB colour
movzx edi,bh       ;edi=green
movzx esi,bl       ;esi=red
shr ebx,16         ;ebx=blue
mov red,esi        ;store in memory where the FPU can get at them
mov green,edi
mov blue,ebx


Quote
Certainly the finit is a problem, but the routine is unstable without it.
It looks like you're overflowing the FPU stack. You load lots onto it but rarely seem to pop anything off it.
Don't forget, an instruction like FMUL leaves one operand on the stack and overwrites the other with the result. It doesn't pop the stack unless you explicitly tell it to using FMULP.
If you make sure the stack is sorted then the FINIT problem will be solved.

Paul.

edit: It looks like FMUL with no parameters might assemble as FMULP, which does pop the stack, but the gist of my comment is still valid: you need to explicitly pop the stack using the P version of the instructions (e.g. FADDP not FADD, FSUBP not FSUB) unless you want to keep the old value on the stack for later.

dioxin

Donkey,
   to try to show the problem more clearly, the FPU part of your RGB->YUV routine does this:
   

; ######### Y                   st0             st1             st2             st3             st4
        fild D[RED]             R
        fld D[n299]             n299            R
        fmul                    .299R
        fild D[BLUE]            B               .299R
        fld D[n114]             n114            B               .299R
        fmul                    .114B           .299R
        fild D[GREEN]           G               .114B           .299R
        fld D[n587]             n587            G               .114B           .299R
        fmul                    .587G           .114B           .299R
(1)     fadd ST0,ST1            .587G+.114B     .114B           .299R
(2)     fadd ST0,ST2            Y               .114B           .299R
        fst D[esi+YUV.Y]        Y               .114B           .299R

; ######### U
        fild D[BLUE]            B               Y               .114B           .299R
        fsub ST0,ST1            B-Y             Y               .114B           .299R
        fld D[n492]             n492            B-Y             Y               .114B           .299R
        fmul                    U               Y               .114B           .299R
        fstp D[esi+YUV.U]       Y               .114B           .299R

; ######### V
        fild D[RED]             R               Y               .114B           .299R
(3)     fsub ST0,ST1            R-Y             Y               .114B           .299R
        fld D[n877]             n877            R-Y             Y               .114B           .299R
        fmul                    V               Y               .114B           .299R
        fstp D[esi+YUV.V]       Y               .114B           .299R


   
   It looks like there are 3 remaining items on the stack when there should be none.

   at (1) you should have used faddp st1,st0 to remove the .114B from the stack.
   This adds ST0 to ST1, then pops the stack, leaving the result in ST0.

   at (2) you should have used faddp st1,st0 to remove the .299R from the stack.
   This adds ST0 to ST1, then pops the stack, leaving the result in ST0.

   at (3) you should have used fsubrp to remove Y from the stack.
   This subtracts ST1 from ST0, then pops the stack, leaving the result in ST0.

   You have similar problems in the YUV->RGB code.

   If you sort out these then it shouldn't be necessary to finit before each conversion.
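The bookkeeping above can be checked with a toy model in C that tracks only how many of the 8 FPU registers are in use. It is a sketch under two assumptions: a bare fmul assembles as the popping form (as the stack listing above also assumes), and the second op string follows the popping advice in the form raymond's later revision uses. All names are invented for illustration.

```c
#include <stddef.h>

enum { X87_REGS = 8 }; /* the x87 register stack holds 8 values */

/* One character per FPU instruction:
   'P' = pushes (fld, fild), 'p' = pops (fstp, faddp, fsubp, bare fmul),
   '.' = stack depth unchanged (fadd, fsub, fst, fmul with a memory operand). */
static const char ORIGINAL_OPS[] = "PPpPPpPPp..." "P.Ppp" "P.Ppp";
static const char REVISED_OPS[]  = "PPP.P.P.pp." ".p.p" ".p";

/* Runs one call's worth of instructions.
   Returns 1 as soon as a push finds all 8 registers already in use. */
static int run_call(int *depth, const char *ops) {
    for (size_t i = 0; ops[i] != '\0'; i++) {
        if (ops[i] == 'P') {
            if (*depth == X87_REGS) return 1;
            (*depth)++;
        } else if (ops[i] == 'p' && *depth > 0) {
            (*depth)--;
        }
    }
    return 0;
}

/* Calls the routine repeatedly without finit in between.
   Returns the 1-based call number that overflows, or 0 if none does. */
static int calls_until_overflow(const char *ops, int max_calls) {
    int depth = 0;
    for (int call = 1; call <= max_calls; call++) {
        if (run_call(&depth, ops)) return call;
    }
    return 0;
}
```

In this model the original sequence leaves 3 values behind per call and overflows during the third call, which would explain why the routine only appears stable with finit. The balanced sequence never overflows.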


Quote
I am not sure that it is possible to reliably convert back and forth across the full range without the FPU

   That's not right. I did it in my earlier posting.
   Keep in mind that the ALU is 32-bit. I used 9.23 as the fixed-point integer format, although often only 31 bits were used to prevent overflows.
   The FPU using SINGLEs (32-bit FPU values) has only 24-bit precision; the other bits are sign and exponent.
   So it's LESS accurate to use FP SINGLEs than it is to use fixed-point 32-bit integers.
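The precision argument can be made concrete with a short C sketch comparing the spacing between adjacent representable values. The helper names are invented for illustration, and float is assumed to be an IEEE 754 single (24-bit significand).

```c
#include <float.h>

/* Spacing between adjacent floats around x, for normal values:
   the largest power of two not exceeding |x|, times FLT_EPSILON (2^-23). */
static float float_step_at(float x) {
    float mag = x < 0.0f ? -x : x;
    float p = 1.0f;
    while (p * 2.0f <= mag) p *= 2.0f;
    while (p > mag && p > FLT_MIN) p *= 0.5f;
    return p * FLT_EPSILON;
}

/* Spacing of a 9.23 fixed-point value: always 2^-23, whatever the magnitude. */
static double q9_23_step(void) {
    return 1.0 / 8388608.0;
}
```

Around 200 the float spacing is 2^-16, much coarser than the fixed 2^-23, which is the point being made for values near the top of the 0-255 range. For magnitudes below 1 the float spacing becomes the finer of the two, so the comparison depends on where the values sit.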


Paul.

raymond

donkey

Working with values already on the FPU is always faster than loading them from memory every time you need them. It only requires good management of the FPU registers. Good practice is to keep track of the content of each FPU register after each FPU instruction (and to know what effect those instructions will have on the registers).

Unless you are using REAL10 (80-bit) floats or QWORD integers from memory, it is generally not necessary to load a memory variable before using it with the content of ST0 (such as adding, multiplying, etc.).

In the modified following code:
- instructions which have not been changed are left with the original indent.
- instructions which have been modified have an extra 3 space indent
- new instructions have only 3 spaces less indent
- instructions which have been deleted are preceded with  ;;;
- comments have been added for the content of FPU registers

Your reciprocal constants have been declared with maximum precision for REAL4 floats. Otherwise, your conversion would not be accurate.

Your logic for computing the GREEN has been corrected.

Code has been added to correct the computed RGBs for underflow/overflow. Simply ANDing with FF is wrong. If you slightly decrease a value of 0, it would become negative, such as FFFFFFFE for -2. If you only AND it with FF, it would result in an intensity of FE, which is certainly not what you would expect. Similarly, slightly increasing a maximum value of FF could give 102, which would result in only 02 when ANDed with FF; that color component gets almost eliminated, again not what you want.

I tried to continue with your syntax except for the use of the @@: label which I don't know if you can use.

Raymond
YUV STRUCT
      Y     DD    ?
      U     DD    ?
      V     DD    ?
ENDS

YUV2RGB FRAME pYUV
      uses esi
      LOCAL RED   :D
      LOCAL GREEN :D
      LOCAL BLUE  :D
;;;      LOCAL garbage :D

CONST SECTION
n2p032 DD 2.0325203    ;1/0.492
n1p703 DD 1.7035775    ;1/0.587
n1p14       DD 1.1402509    ;1/0.877
   n114 DD 0.114
   n299 DD 0.299
;;; np509       DD 0.509
;;; np194       DD 0.194

CODE SECTION

mov esi,[pYUV]

;;; finit
   fld D[esi+YUV.Y]     ;load first, used several times
fld D[esi+YUV.V]  ;V   Y
;;; fld D[n1p14]
  fmul D[n1p14]   ;(V/.877)   Y
;;; fld D[esi+YUV.Y]
fadd ST0,ST1      ;(V/.877+Y)   Y

fist D[RED]       ;store rounded integer, keep value on FPU for reuse
;;; fxch ST0,ST1
;;; fstp D[garbage]

fld D[esi+YUV.U]  ;U   (RED)   Y
;;; fld D[n2p032]
   fmul D[n2p032] ;(U/.492)   (RED)   Y

;----------------------
;cleanup RED while FPU busy doing multiplication

   mov  eax, D[RED]
   or   eax,eax         ;test for negative
   jns  @F
   xor  eax,eax         ;replace with 0 if negative (underflow)
@@:
   cmp  eax,255
   jbe  @F
   mov  eax,255         ;replace with maximum if overflow
@@:
   mov  D[RED],eax
;----------------------

;;; fld D[esi+YUV.Y]
   fadd ST0,ST2   ;(U/.492+Y)   (RED)   Y

fist D[BLUE]      ;store rounded integer, keep value on FPU for reuse
;;; fxch ST0,ST1
;;; fstp D[garbage]

;;; fld D[esi+YUV.Y]
;;; fld D[n1p703]
;;; fmul

;;; ; Bring RED to the batters box...
   ;BLUE is currently in ST0
;;; fxch ST0,ST2
;;; fld D[np509]
   fmul D[n114]   ;(BLUE*0.114)   (RED)   Y

;----------------------
;cleanup BLUE while FPU busy doing multiplication

   mov  eax, D[BLUE]
   or   eax,eax
   jns  @F
   xor  eax,eax
@@:
   cmp  eax,255
   jbe  @F
   mov  eax,255
@@:
   mov  D[BLUE],eax
;---------------------

   fsubp ST2,ST0        ;(RED)   (Y-BLUE*0.114)

;;; ; Bring BLUE to the batters box...
   ;RED is now currently in ST0
;;; fxch ST0,ST1
;;; fld D[np194]
   fmul D[n299]   ;(RED*0.299)   (Y-BLUE*0.114)
   fsubp ST1,ST0        ;(Y-BLUE*0.114-RED*0.299)
   fmul D[n1p703]       ;(Y-BLUE*0.114-RED*0.299)/0.587 = GREEN

;;; ; Bring Y to the batters box
;;; fxch ST0,ST2
;;; fsub ST0,ST1
;;; fsub ST0,ST2
fistp D[GREEN]    ;ALL registers on the FPU are now free

;----------------------
;cleanup GREEN

   mov  eax, D[GREEN]
   or   eax,eax
   jns  @F
   xor  eax,eax
@@:
   cmp  eax,255
   jbe  @F
   mov  eax,255
@@:
   mov  D[GREEN],eax
;---------------------

;;; and D[GREEN],0FFh
;;; and D[RED],0FFh
;;; and D[BLUE],0FFh

mov eax,[BLUE]
shl eax,8
or eax,[GREEN]
shl eax,8
or eax,[RED]

RET
ENDF

RGB2YUV FRAME clrRGB, pYUV
uses esi
LOCAL RED :D
LOCAL GREEN :D
LOCAL BLUE :D

CONST SECTION
n877 DD 0.877
n492 DD 0.492
n114 DD 0.114
n299 DD 0.299
n587 DD 0.587

CODE SECTION

/*
Y = 0.299 R + 0.587 G + 0.114 B
U = 0.492 (B - Y)
V = 0.877 (R - Y)
*/
;;; finit

mov esi, [pYUV]
;;; mov eax,[clrRGB]
;;; and eax,0FFh
;;; mov [RED],eax

;;; mov eax,[clrRGB]
;;; shr eax,8
;;; and eax,0FFh
;;; mov [GREEN],eax

mov eax,[clrRGB]

;per dioxin suggestion
   movzx ecx,al
   mov [RED],ecx
   movzx ecx,ah
   mov [GREEN],ecx

shr eax,16
and eax,0FFh
mov [BLUE],eax

; ######### Y
fild D[RED]       ;RED
   fild D[BLUE]           ;BLUE   RED
   fld  ST1               ;RED   BLUE   RED
;;; fld D[n299]
   fmul D[n299]   ;(RED*.299)   BLUE   RED
fld  ST1          ;BLUE   (RED*.299)   BLUE   RED
;;; fld D[n114]
   fmul D[n114]   ;(BLUE*.114)   (RED*.299)   BLUE   RED
fild D[GREEN]     ;GREEN   (BLUE*.114)   (RED*.299)   BLUE   RED
;;; fld D[n587]
   fmul D[n587]   ;(GREEN*.587)   (BLUE*.114)   (RED*.299)   BLUE   RED
   faddp ST1,ST0  ;(GREEN*.587+BLUE*.114)   (RED*.299)   BLUE   RED
   faddp ST1,ST0  ;Y   BLUE   RED
fst D[esi+YUV.Y]  ;Y   BLUE   RED

; ######### U
;;; fild D[BLUE]
   fsub ST2,ST0   ;Y   BLUE   (R-Y)
   fsubp ST1,ST0          ;(B-Y)   (R-Y)
;;; fld D[n492]
   fmul D[n492]   ;((B-Y)*.492)   (R-Y)
fstp D[esi+YUV.U] ;(R-Y)

; ######### V
;;; fild D[RED]
;;; fsub D[RED]   ;(R-Y)
;;; fld D[n877]
   fmul D[n877]   ;((R-Y)*.877)
fstp D[esi+YUV.V] ;ALL registers on the FPU are now free

RET
ENDF


daydreamer

Just for fun, I am coding an SSE version,
but I have to get my shuffles right first before I post the finished code. Here is what I have so far:

ARGB    dw 00FFh,08040h ;ARGB format
ALIGN 16
CNSTRGB    REAL4 0.0,0.299,0.587,0.114
CBYRY   REAL4 0.492,0.877,0.492,0.877

msk    dq 000000FF000000FFh
.CODE
RGBtoYUV  PROC
    ;load constants before loop
    MOVAPS XMM7,[CNSTRGB]
    MOVAPS XMM6,[CBYRY]
    MOVQ MM2,[msk]
    PXOR MM1,MM1
    PINSRW MM0,[ARGB],0 ;load 16 bits at a time from ARGB
    PINSRW MM0,[ARGB+1],2
    PINSRW MM1,[ARGB+2],0
    PAND MM0,MM2
    PAND MM1,MM2 ;mask out upper half of 16bit numbers retrieved with PINSRW
   
    CVTPI2PS XMM0,MM0 ;mov and converts two dw->2floats A and R
    MOVLHPS XMM1,XMM0 ;movs to upper half two floats
    CVTPI2PS XMM1,MM1 ;mov and convert second two G and B
    MOVAPS XMM0,XMM1 ;copy in order to save B and R
       MULPS XMM0,XMM7
       MOVHLPS XMM2,XMM0
       ADDPS XMM0,XMM2 ;A + xG, xR+xB
               
       ;shufps ??? shuffle to line up for next add
       ADDSS XMM0,XMM2 ;final add
       MOVLHPS XMM0,XMM2 ;make two copies of Y
       ;pshufd instr to line up BY,RY
       SUBPS XMM0,XMM2
       MULPS XMM0,XMM6 ;multiply with last two constants
       ;xmm2 contains Y, xmm0 contains U and V     

    EMMS ;after looping thru all pixels in a picture
    ret
RGBtoYUV ENDP