Graphics a la FPU

raymond · January 10, 2005, 09:11:51 PM

Here are the results and observations of more tests to verify the accuracy and relative speeds of the FPU 32-bit floats and CPU 32-bit fixed-point math algos while retaining the full precision of each for the YUV values.

The entire range of RGB values from 000000 to FFFFFF was verified by first converting the RGB to YUV, converting back from the stored YUV to RGB, and comparing the returned RGB to its original one. Both algos were 100% accurate.
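A minimal sketch of that kind of round-trip check, written in Python rather than assembly. It uses Python's 64-bit floats, so it checks the method and not the REAL4 or fixed-point routines that were timed. The function names and the `step` parameter are made up for illustration; the coefficients are the ones quoted later in this thread, and the 00BBGGRR layout is assumed.

```python
def rgb_to_yuv(r, g, b):
    """Forward conversion with the coefficients quoted in this thread."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Inverse conversion: unrounded YUV in, rounded integer RGB out."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return round(r), round(g), round(b)


def count_mismatches(step=1):
    """Round-trip every `step`-th colour from 000000 to FFFFFF."""
    mismatches = 0
    for colour in range(0, 0x1000000, step):
        r = colour & 0xFF            # 00BBGGRR layout (assumed)
        g = (colour >> 8) & 0xFF
        b = (colour >> 16) & 0xFF
        if yuv_to_rgb(*rgb_to_yuv(r, g, b)) != (r, g, b):
            mismatches += 1
    return mismatches
```

Calling `count_mismatches()` with the default step walks all 16,777,216 colours, which is slow in Python; a larger step gives a quick spot check.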

The time required on my P3-550 to perform those 16,777,216 iterations was:
7.3 seconds for the CPU-based algo
8.5 seconds for the FPU-based algo

Removing the "finit" instruction from each of the two procedures in the FPU algo (and using it only once before starting the timing) resulted in reducing the measured time to: 6.9 seconds.
(This requires the programmer to manage the contents of the FPU registers to prevent an "FPU stack overflow".)

The above time was further reduced to 5.3 seconds by changing the "Precision Control" bits of the FPU Control Word to REAL4 precision, instead of the default REAL10 precision, when the FPU is initialized. (Doing this after the finit instruction, when finit is retained in each procedure, had no significant effect on the time required.)
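For reference, the Precision Control field is bits 8 and 9 of the x87 Control Word, and finit loads the word with 037Fh (extended precision). In assembly the change is normally made by storing the word with fstcw, editing it, and reloading it with fldcw. The bit arithmetic alone can be sketched as follows; the helper name `with_precision` is hypothetical.

```python
FINIT_CONTROL_WORD = 0x037F   # control word loaded by finit
PC_MASK = 0x0300              # Precision Control field, bits 8 and 9

PC_FIELD = {
    "single": 0x0000,     # 24-bit significand (REAL4)
    "double": 0x0200,     # 53-bit significand (REAL8)
    "extended": 0x0300,   # 64-bit significand (REAL10), the finit default
}


def with_precision(control_word, precision):
    """Return control_word with only its Precision Control field replaced."""
    return (control_word & ~PC_MASK) | PC_FIELD[precision]
```

Under these definitions, `with_precision(FINIT_CONTROL_WORD, "single")` gives 007Fh, the value that would be reloaded to get REAL4 precision with every other setting left as finit set it.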

Raymond

dioxin · January 10, 2005, 09:29:48 PM

Raymond,
the non-FPU version I posted earlier is also 100% accurate and for the whole range 0-FFFFFF it runs in 0.65s on an Athlon XP2600+ (1900MHz).

For a closer comparison with yours, it runs in 3.03 seconds on a K6-III/400MHz.

Paul.

raymond · January 11, 2005, 02:05:29 AM

dioxin,

I agree that your code would be 100% accurate for converting RGBs to YUVs and back to RGBs. The main speed improvement I can see is performing a multiplication with a declared reciprocal instead of a division. I will test that variation in my algo. I don't think that the other major differences should have that much of an effect on timing, most of the cycles being taken by multiplications.

The only problem I would have with your version is how well it would perform when you start modifying the stored YUV values with only 9 bits available for the integer portion, one of those bits being required for the sign. Any increase of the absolute value of the integer above 255 would change the sign and then be disastrous on the resulting RGB.
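The sign flip can be illustrated with a small Python model of a signed 32-bit value holding 9 integer bits (sign included) and 23 fraction bits. This is only a sketch of the arithmetic, not dioxin's code, and the helper names are invented for the example.

```python
FRACTION_BITS = 23


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def to_fixed(x):
    """Encode x as 9.23 fixed point in a signed 32-bit integer."""
    return to_signed32(int(round(x * (1 << FRACTION_BITS))))


def from_fixed(raw):
    """Decode a 9.23 fixed-point integer back to a float."""
    return raw / (1 << FRACTION_BITS)


def add_fixed(a, b):
    """32-bit add with wraparound, as a plain add instruction behaves."""
    return to_signed32(a + b)
```

With these helpers, adding 10 to a stored 250 gives `from_fixed(add_fixed(to_fixed(250.0), to_fixed(10.0)))`, which comes out as -252.0 rather than 260.0, because the result no longer fits below the sign bit.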

Raymond

raymond · January 11, 2005, 05:41:51 PM

Replacing the divisions with multiplications (of reciprocals) yielded the following relative improvements over previously reported results (still with 100% accuracy).

CPU algo: from 7.3 down to 4.1 seconds
FPU algo: from 5.3 down to 3.8 seconds
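The substitution itself is simple: the reciprocal is computed once and stored as a constant, and each division becomes a multiplication. A Python sketch under the thread's coefficients (64-bit floats, illustrative function names, no timing implied):

```python
INV_0877 = 1.0 / 0.877   # declared once, like a constant in the data section
INV_0492 = 1.0 / 0.492


def encode(r, g, b):
    """RGB to unrounded YUV with the coefficients used in this thread."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, 0.492 * (b - y), 0.877 * (r - y)


def red_blue_by_division(y, u, v):
    """Recover rounded red and blue using divisions."""
    return round(y + v / 0.877), round(y + u / 0.492)


def red_blue_by_reciprocal(y, u, v):
    """Recover rounded red and blue using the stored reciprocals."""
    return round(y + v * INV_0877), round(y + u * INV_0492)
```

Both variants should round to the same red and blue for any YUV produced by `encode`, since the two results differ only far below the 0.5 rounding threshold.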

Adapting dioxin's code provided only a marginal improvement over my CPU algo but does not perform any checking for underflow nor overflow.

It would seem that the FPU would be the route to take. It should be simpler and less prone to errors (and more portable) to work with floats when modifying the YUVs.

Raymond

dioxin · January 11, 2005, 05:46:26 PM

Raymond,

Quote
but does not perform any checking for underflow nor overflow.

No checking needs to be done since no overflow is possible when starting with RGB values.

Paul.

raymond · January 12, 2005, 01:20:48 AM

Quote from: dioxin on January 11, 2005, 05:46:26 PM
Raymond,
Quote
but does not perform any checking for underflow nor overflow.
No checking needs to be done since no overflow is possible when starting with RGB values.

Paul.

I totally agree with you on that point. However, RGBs are converted to YUV to perform some operations on those values before converting the modified YUVs back to RGBs. Those modifications could result in an overflow/underflow situation. The YUV2RGB procedure MUST check that possibility to avoid erroneous results.

As an example, increasing the brightness of a color where the RED is already at its maximum would increase the value of each of the color components, including that of the RED, which would then overflow. If the increase is small, the level of the RED could go from FF to 02 which is certainly not what you should expect from an increase of brightness.
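The difference between keeping only the low byte and checking the range can be sketched in a few lines of Python (function names chosen for the example only):

```python
def mask_to_byte(value):
    """Keep only the low 8 bits, which is what storing a single byte does."""
    return value & 0xFF


def clamp_to_byte(value):
    """Saturate to the valid 0..255 range of a colour component."""
    return max(0, min(255, value))
```

For a red of FF raised by 3, `mask_to_byte(0xFF + 3)` gives 02 while `clamp_to_byte(0xFF + 3)` stays at FF. In the other direction, a component pushed down to -2 masks to FE but clamps to 0.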

Raymond

dioxin · January 12, 2005, 11:53:25 PM

Raymond,
   I see what you're saying. In the past I've only ever used YUV as a broadcast standard, never to process data. Video processing was always done in RGB, and then, at the end, converted to YUV ready for transmission.

   I can see that my method doesn't check for over/underflow when converting (presumably invalid) data from YUV to RGB but shouldn't it be the job of the YUV processing to make sure the output is valid?

   Having said that, I think the main problem here is that there has been no standard specified for how YUV data should be stored.
   RGB appears to be in the form 00BBGGRR but everyone here has chosen their own way to represent YUV. Perhaps Donkey has a "standard" in mind for how YUV should be stored? Perhaps there is a real standard way to store YUV that everyone (except me) knows? If there is to be signal processing of the YUV data then we need a standard way to store it.

   On the business of CPU vs FPU.
   I reckon there is scope to use both to give the best result.
   The RGB->YUV conversion is probably too simple to use both but the YUV-> RGB conversion is more complex and may allow the FPU to do useful stuff while the CPU is also doing useful stuff.

   I've already modified my code to speed it up a bit. The RGB to YUV conversion can be done as follows:

Code:


mov ebx,col        ;get the RGB colour
movzx edi,bh       ;edi=green
movzx esi,bl       ;esi=red
shr ebx,16         ;ebx=blue

imul edi,&h4B22D0      ;GREEN*0.587
imul edx,ebx,&hE978D   ;BLUE*0.114
imul ecx,esi,&h2645A1  ;RED*0.299

add edi,edx            ;accumulate in edi
add edi,ecx            ;accumulate in edi. edi now contains Y

mov y,edi      ;store Y


mov eax,esi    ;red
shl eax,23     ;line up with Y
sub eax,edi    ;(R-Y)
imul v0877&    ;0.877*(R-Y)

shl eax,1      ;double result to correct for v0877& being half size to prevent overflow
rcl edx,1

mov V,edx      ;V done


mov eax,ebx    ;blue
shl eax,23     ;line up with Y
sub eax,edi    ;(B-Y)
imul v0492&    ;0.492*(B-Y)
mov U,edx      ;U done
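As a cross-check of the three multipliers, here is a Python model of the Y part only. It assumes the constants are the coefficients scaled by 2^23 and truncated, which is what their values suggest; the function names are hypothetical, and the U and V steps are left out because the values of v0877& and v0492& are not shown.

```python
FRACTION_BITS = 23
K_RED = 0x2645A1     # 0.299 * 2**23, truncated
K_GREEN = 0x4B22D0   # 0.587 * 2**23, truncated
K_BLUE = 0x0E978D    # 0.114 * 2**23, truncated


def luma_fixed(colour):
    """Y of a 00BBGGRR colour as a fixed-point integer with 23 fraction bits."""
    r = colour & 0xFF
    g = (colour >> 8) & 0xFF
    b = colour >> 16
    return r * K_RED + g * K_GREEN + b * K_BLUE


def luma_rounded(colour):
    """Y rounded to the nearest integer."""
    return (luma_fixed(colour) + (1 << (FRACTION_BITS - 1))) >> FRACTION_BITS
```

The three constants add up to 2^23 - 2, so white (FFFFFF) yields a Y just under 255.0 that still fits in 31 bits and rounds to 255.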

I'm reluctant to do any further work if the format of the YUV values is in question since a lot of the code is very specific to the way the data is stored.

Paul.

donkey · January 12, 2005, 11:59:40 PM

Hi Guys,

I'm really sorry that I haven't gotten back to this; my work week sucks up much of my time, and I plan to tackle the problem again this weekend with some approximations that I have been trying to work out. My preferred format is unimportant, though in my test I have chosen to use 3 QWORD floats (REAL8). Because this is an internal-only routine, it is open to change.

dioxin · January 13, 2005, 12:23:52 AM

Donkey,

Quote
3 QWORD floats

Aww, bugger. That messes up my method!

Paul

raymond · January 13, 2005, 04:40:52 AM

My tests seemed to indicate that 32-bit floats may be accurate enough if the YUV values are stored in that format (vs storing only their rounded integer value). Using 64-bit floats may be overkill.

Raymond

donkey · January 13, 2005, 08:05:24 AM

Hi Raymond,

Yes, I tested my routine with three 32-bit floats and it works, so I think 3 x 32-bit floats is the structure I will be using. I am not sure that it is possible to reliably convert back and forth across the full range without the FPU, so for now I am looking at optimizing the FPU routine. As I have little experience with the FPU, it is a bit difficult for me to know where the bottlenecks are and what can be done more effectively. Certainly the finit is a problem, but the routine is unstable without it.

Code:

YUV STRUCT
	Y	DD	?
	U	DD	?
	V	DD	?
ENDS

YUV2RGB FRAME pYUV
	uses esi
	LOCAL RED	:D
	LOCAL GREEN	:D
	LOCAL BLUE	:D
	LOCAL garbage	:D

	CONST SECTION
		n2p032	DD	2.032
		n1p703	DD	1.703
		n1p14	DD	1.14
		np509	DD	0.509
		np194	DD	0.194

	CODE SECTION

	mov esi,[pYUV]
	
	finit

	fld D[esi+YUV.V]
	fld D[n1p14]
	fmul
	fld D[esi+YUV.Y]
	fadd ST0,ST1
	fist D[RED]
	fxch ST0,ST1
	fstp D[garbage]
	
	fld D[esi+YUV.U]
	fld D[n2p032]
	fmul
	fld D[esi+YUV.Y]
	fadd ST0,ST1
	fist D[BLUE]
	fxch ST0,ST1
	fstp D[garbage]

	fld D[esi+YUV.Y]
	fld D[n1p703]
	fmul

	; Bring RED to the batters box...
	fxch ST0,ST2
	fld D[np509]
	fmul

	; Bring BLUE to the batters box...
	fxch ST0,ST1
	fld D[np194]
	fmul
	
	; Bring Y to the batters box
	fxch ST0,ST2
	fsub ST0,ST1
	fsub ST0,ST2
	fistp D[GREEN]

	and D[GREEN],0FFh
	and D[RED],0FFh
	and D[BLUE],0FFh

	mov eax,[BLUE]
	shl eax,8
	or eax,[GREEN]
	shl eax,8
	or eax,[RED]

	RET
ENDF

RGB2YUV FRAME clrRGB, pYUV
	uses esi
	LOCAL RED	:D
	LOCAL GREEN	:D
	LOCAL BLUE	:D

	CONST SECTION
		n877	DD	0.877
		n492	DD	0.492
		n114	DD	0.114
		n299	DD	0.299
		n587	DD	0.587

	CODE SECTION

	/*
	Y = 0.299 R + 0.587 G + 0.114 B
	U = 0.492 (B - Y)
	V = 0.877 (R - Y)
	*/
	finit

	mov esi, [pYUV]
	mov eax,[clrRGB]
	and eax,0FFh
	mov [RED],eax

	mov eax,[clrRGB]
	shr eax,8
	and eax,0FFh
	mov [GREEN],eax
	
	mov eax,[clrRGB]
	shr eax,16
	and eax,0FFh
	mov [BLUE],eax

	; ######### Y
	fild D[RED]
	fld D[n299]
	fmul
	fild D[BLUE]
	fld D[n114]
	fmul
	fild D[GREEN]
	fld D[n587]
	fmul
	fadd ST0,ST1
	fadd ST0,ST2
	fst D[esi+YUV.Y]

	; ######### U
	fild D[BLUE]
	fsub ST0,ST1
	fld D[n492]
	fmul
	fstp D[esi+YUV.U]

	; ######### V
	fild D[RED]
	fsub ST0,ST1
	fld D[n877]
	fmul
	fstp D[esi+YUV.V]

	RET
ENDF

dioxin · January 13, 2005, 02:57:34 PM

Donkey,
you can speed up the splitting of 00BBGGRR into the colours using something like:

Code:


mov ebx,col        ;get the RGB colour
movzx edi,bh       ;edi=green
movzx esi,bl       ;esi=red
shr ebx,16         ;ebx=blue
mov red,esi        ;store in memory where the FPU can get at them
mov green,edi
mov blue,ebx

Quote
Certainly the finit is a problem, but the routine is unstable without it.

It looks like you're overflowing the FPU stack. You load lots onto it but rarely seem to pop anything off it.
Don't forget, an instruction like FMUL leaves one operand on the stack and overwrites the other with the result. It doesn't pop the stack unless you explicitly tell it to using FMULP.
If you make sure the stack is sorted then the FINIT problem will be solved.

Paul.

edit: It looks like FMUL with no parameters might compile as FMULP, which does pop the stack, but the gist of my comment is still valid. You need to explicitly pop the stack using the P version of the instructions (e.g. FADDP not FADD, FSUBP not FSUB) unless you want to keep the old value on the stack for later.

dioxin · January 13, 2005, 04:49:11 PM

Donkey,
to try to show the problem more clearly, the FPU part of your RGB->YUV routine does this:

Code:


	; ######### Y           st0             st1             st2             st3             st4
	fild D[RED]             R
	fld D[n299]             n299            R
	fmul                    .299R
	fild D[BLUE]            B               .299R
	fld D[n114]             n114            B               .299R
	fmul                    .114B           .299R
	fild D[GREEN]           G               .114B           .299R
	fld D[n587]             n587            G               .114B           .299R
	fmul                    .587G           .114B           .299R
 (1)     fadd ST0,ST1            .587G+.114B     .114B           .299R
 (2)     fadd ST0,ST2            Y               .114B           .299R
	fst D[esi+YUV.Y]        Y               .114B           .299R

	; ######### U
	fild D[BLUE]            B               Y               .114B           .299R
	fsub ST0,ST1            B-Y             Y               .114B           .299R
	fld D[n492]             n492            B-Y             Y               .114B           .299R
	fmul                    U               Y               .114B           .299R
	fstp D[esi+YUV.U]       Y               .114B           .299R

	; ######### V
	fild D[RED]             R               Y               .114B           .299R
 (3)     fsub ST0,ST1            R-Y             Y               .114B           .299R
	fld D[n877]             n877            R-Y             Y               .114B           .299R
	fmul                    V               Y               .114B           .299R
	fstp D[esi+YUV.V]       Y               .114B           .299R

   It looks like there are 3 remaining items on the stack when there should be none.

   at (1) you should have used faddp st1,st0 to remove the .114B from the stack.
   This adds ST0 to ST1, then pops the stack, leaving the result in ST0.

   at (2) you should have used faddp st1,st0 to remove the .299R from the stack.
   This adds ST0 to ST1, then pops the stack, leaving the result in ST0.

   at (3) you should have used fsubrp st1,st0 to remove Y from the stack.
   This subtracts ST1 from ST0, then pops the stack, leaving the result in ST0.

   You have similar problems in the YUV->RGB code.

   If you sort out these then it shouldn't be necessary to finit before each conversion.
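To see why leftovers matter, the stack depth can be modelled in a few lines of Python. This is only a bookkeeping sketch: it counts registers in use per mnemonic, treats a bare fmul as popping (as in the trace above), and the names `run` and `calls_before_overflow` are invented for the example.

```python
X87_REGISTERS = 8

# Net change in stack depth per mnemonic. A bare "fmul" is counted as a pop,
# following the trace above, where it behaves like fmulp st1,st0.
DEPTH_CHANGE = {
    "fld": +1, "fild": +1,
    "fadd": 0, "fsub": 0, "fst": 0,
    "fmul": -1, "faddp": -1, "fsubrp": -1, "fstp": -1,
}

# FPU part of the RGB->YUV routine as posted, and with fixes (1), (2), (3).
ORIGINAL = ("fild fld fmul fild fld fmul fild fld fmul fadd fadd fst "
            "fild fsub fld fmul fstp fild fsub fld fmul fstp").split()
POPPING = ("fild fld fmul fild fld fmul fild fld fmul faddp faddp fst "
           "fild fsub fld fmul fstp fild fsubrp fld fmul fstp").split()


def run(program, depth=0):
    """Return (final depth, peak depth) after one pass through program."""
    peak = depth
    for mnemonic in program:
        depth += DEPTH_CHANGE[mnemonic]
        peak = max(peak, depth)
    return depth, peak


def calls_before_overflow(program, limit=1000):
    """Count back-to-back calls that finish before a load needs a 9th register."""
    depth = 0
    for call in range(limit):
        depth, peak = run(program, depth)
        if peak > X87_REGISTERS:
            return call
    return limit
```

In this model the posted sequence leaves 3 values behind per call and peaks at 5, so two calls complete and the third one runs out of registers without a finit in between. The popping version ends each call with an empty stack and can be called indefinitely.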

Quote
I am not sure that it is possible to reliably convert back and forth across the full range without the FPU

   That's not right. I did it in my earlier posting.
   Keep in mind, the ALU is 32 bit, I used 9.23 as the fixed point integer format, although only 31 bits were often used to prevent overflows.
   The FPU using SINGLEs (32-bit FPU values) only has 24 bit precision, the other bits are exponent.
   So, it's LESS accurate to use FP SINGLEs than it is to use fixed-point 32-bit integers.
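The resolution argument can be put in numbers with a short Python sketch. It compares the constant step of a 23-bit fraction with the spacing of adjacent REAL4 values near a given magnitude; `float32_step` is a hypothetical helper that assumes a nonzero, normal value.

```python
import math

FIXED_STEP = 2.0 ** -23   # spacing of 9.23 fixed point, the same everywhere


def float32_step(x):
    """Spacing between adjacent REAL4 values around x (24-bit significand)."""
    _, exponent = math.frexp(abs(x))   # abs(x) = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 24)
```

For a value around 200, `float32_step(200.0)` is 2^-16, which is 128 times coarser than the fixed-point step. Below 1.0 the comparison reverses and the REAL4 spacing becomes finer than 2^-23, so the advantage of 9.23 applies to the upper part of the 0 to 255 range.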

Paul.

raymond · January 13, 2005, 09:39:36 PM

donkey

Working with values already on the FPU is always faster than loading them from memory every time you need them. It only requires good management of the FPU registers. Good practice is to keep track of the content of each FPU register after each FPU instruction (and to know what effect each instruction has on the registers).

Unless you are using REAL10 (80-bit) floats or QWORD integers from memory, it is generally not necessary to load a memory variable before using it with the content of ST0 (such as adding, multiplying, etc.).

In the following modified code:
- instructions which have not been changed are left with the original indent
- instructions which have been modified have an extra 3-space indent
- new instructions are indented by 3 spaces only
- instructions which have been deleted are preceded with ;;;
- comments have been added showing the content of the FPU registers

Your reciprocal constants have been declared with maximum precision for REAL4 floats. Otherwise, your conversion would not be accurate.

Your logic for computing the GREEN has been corrected.

Code has been added to correct the computed RGBs for underflow/overflow. Simply ANDing with FF is wrong. If you slightly decrease a value of 0, it becomes negative, such as FFFFFFFE for -2. If you only AND it with FF, the result is an intensity of FE, which is certainly not what you would expect. Similarly, slightly increasing a maximum value of FF could give 102, which leaves only 02 when ANDed with FF; that color component is all but eliminated, again not what you want.

I tried to continue with your syntax except for the use of the @@: label which I don't know if you can use.

Raymond

Code:

YUV STRUCT
      Y     DD    ?
      U     DD    ?
      V     DD    ?
ENDS

YUV2RGB FRAME pYUV
      uses esi
      LOCAL RED   :D
      LOCAL GREEN :D
      LOCAL BLUE  :D
;;;      LOCAL garbage :D

	CONST SECTION
		n2p032	DD	2.0325203    ;1/0.492
		n1p703	DD	1.7035775    ;1/0.587
		n1p14	      DD	1.1402509    ;1/0.877
	   n114	DD	0.114
	   n299	DD	0.299
;;;		np509	      DD	0.509
;;;		np194	      DD	0.194

	CODE SECTION

	mov esi,[pYUV]
	
;;;	finit
   fld D[esi+YUV.Y]     ;load first, used several times
	fld D[esi+YUV.V]  ;V   Y
;;;	fld D[n1p14]
	  fmul D[n1p14]   ;(V/.877)   Y
;;;	fld D[esi+YUV.Y]
	fadd ST0,ST1      ;(V/.877+Y)   Y

	fist D[RED]       ;store rounded integer, keep value on FPU for reuse
;;;	fxch ST0,ST1
;;;	fstp D[garbage]
	
	fld D[esi+YUV.U]  ;U   (RED)   Y
;;;	fld D[n2p032]
	   fmul D[n2p032] ;(U/.492)   (RED)   Y

;----------------------
;cleanup RED while FPU busy doing multiplication

   mov  eax, D[RED]
   or   eax,eax         ;test for negative
   jns  @F
   xor  eax,eax         ;replace with 0 if negative (underflow)
 @@:
   cmp  eax,255
   jbe  @F
   mov  eax,255         ;replace with maximum if overflow
 @@:
   mov  D[RED],eax
;----------------------

;;;	fld D[esi+YUV.Y]
	   fadd ST0,ST2   ;(U/.492+Y)   (RED)   Y

	fist D[BLUE]      ;store rounded integer, keep value on FPU for reuse
;;;	fxch ST0,ST1
;;;	fstp D[garbage]

;;;	fld D[esi+YUV.Y]
;;;	fld D[n1p703]
;;;	fmul

;;;	; Bring RED to the batters box...
   ;BLUE is currently in ST0
;;;	fxch ST0,ST2
;;;	fld D[np509]
	   fmul D[n114]   ;(BLUE*0.114)   (RED)   Y

;----------------------
;cleanup BLUE while FPU busy doing multiplication

   mov  eax, D[BLUE]
   or   eax,eax
   jns  @F
   xor  eax,eax
 @@:
   cmp  eax,255
   jbe  @F
   mov  eax,255
 @@:
   mov  D[BLUE],eax
;---------------------

   fsubp ST2,ST0        ;(RED)   (Y-BLUE*0.114)

;;;	; Bring BLUE to the batters box...
   ;RED is now currently in ST0
;;;	fxch ST0,ST1
;;;	fld D[np194]
	   fmul D[n299]   ;(RED*0.299)   (Y-BLUE*0.114)
   fsubp ST1,ST0        ;(Y-BLUE*0.114-RED*0.299)
   fmul D[n1p703]       ;(Y-BLUE*0.114-RED*0.299)/0.587 = GREEN
	
;;;	; Bring Y to the batters box
;;;	fxch ST0,ST2
;;;	fsub ST0,ST1
;;;	fsub ST0,ST2
	fistp D[GREEN]    ;ALL registers on the FPU are now free

;----------------------
;cleanup GREEN

   mov  eax, D[GREEN]
   or   eax,eax
   jns  @F
   xor  eax,eax
 @@:
   cmp  eax,255
   jbe  @F
   mov  eax,255
 @@:
   mov  D[GREEN],eax
;---------------------

;;;	and D[GREEN],0FFh
;;;	and D[RED],0FFh
;;;	and D[BLUE],0FFh

	mov eax,[BLUE]
	shl eax,8
	or eax,[GREEN]
	shl eax,8
	or eax,[RED]

	RET
ENDF

RGB2YUV FRAME clrRGB, pYUV
	uses esi
	LOCAL RED	:D
	LOCAL GREEN	:D
	LOCAL BLUE	:D

	CONST SECTION
		n877	DD	0.877
		n492	DD	0.492
		n114	DD	0.114
		n299	DD	0.299
		n587	DD	0.587

	CODE SECTION

	/*
	Y = 0.299 R + 0.587 G + 0.114 B
	U = 0.492 (B - Y)
	V = 0.877 (R - Y)
	*/
;;;	finit

	mov esi, [pYUV]
;;;	mov eax,[clrRGB]
;;;	and eax,0FFh
;;;	mov [RED],eax

;;;	mov eax,[clrRGB]
;;;	shr eax,8
;;;	and eax,0FFh
;;;	mov [GREEN],eax
	
	mov eax,[clrRGB]

;per dioxin suggestion
   movzx ecx,al
   mov [RED],ecx
   movzx ecx,ah
   mov [GREEN],ecx

	shr eax,16
	and eax,0FFh
	mov [BLUE],eax

	; ######### Y
	fild D[RED]       ;RED
   fild D[BLUE]           ;BLUE   RED
   fld  ST1               ;RED   BLUE   RED
;;;	fld D[n299]
	   fmul D[n299]   ;(RED*.299)   BLUE   RED
	fld  ST1          ;BLUE   (RED*.299)   BLUE   RED
;;;	fld D[n114]
	   fmul D[n114]   ;(BLUE*.114)   (RED*.299)   BLUE   RED
	fild D[GREEN]     ;GREEN   (BLUE*.114)   (RED*.299)   BLUE   RED
;;;	fld D[n587]
	   fmul D[n587]   ;(GREEN*.587)   (BLUE*.114)   (RED*.299)   BLUE   RED
	   faddp ST1,ST0  ;(GREEN*.587+BLUE*.114)   (RED*.299)   BLUE   RED
	   faddp ST1,ST0  ;Y   BLUE   RED
	fst D[esi+YUV.Y]  ;Y   BLUE   RED

	; ######### U
;;;	fild D[BLUE]
	   fsub ST2,ST0   ;Y   BLUE   (R-Y)
   fsubp ST1,ST0          ;(B-Y)   (R-Y)
;;;	fld D[n492]
	   fmul D[n492]   ;((B-Y)*.492)   (R-Y)
	fstp D[esi+YUV.U] ;(R-Y)

	; ######### V
;;;	fild D[RED]
;;;	fsub D[RED]   ;(R-Y)
;;;	fld D[n877]
	   fmul D[n877]   ;((R-Y)*.877)
	fstp D[esi+YUV.V] ;ALL registers on the FPU are now free

	RET
ENDF

daydreamer · January 14, 2005, 01:52:17 AM

Just for fun, I am coding an SSE version,
but I have to get my shuffles right before I post the finished code.

Code:


ARGB    dw 00FFh,08040h ;ARGB format
ALIGN 16
CNSTRGB    REAL4 0.0,0.299,0.587,0.114
CBYRY   REAL4 0.492,0.877,0.492,0.877

msk    dq 000000FF000000FFh
.CODE
RGBtoYUV  PROC
    ;load constants before loop
    MOVAPS XMM7,[CNSTRGB]
    MOVAPS XMM6,[CBYRY]
    MOVQ MM2,[msk]
    PXOR MM1,MM1
    PINSRW MM0,[ARGB],0 ;load 16 bits at a time from ARGB
    PINSRW MM0,[ARGB+1],2
    PINSRW MM1,[ARGB+2],0
    PAND MM0,MM2
    PAND MM1,MM2 ;mask out upper half of 16bit numbers retrieved with PINSRW
    
    CVTPI2PS XMM0,MM0 ;mov and converts two dw->2floats A and R
    MOVLHPS XMM1,XMM0 ;movs to upper half two floats
    CVTPI2PS XMM1,MM1 ;mov and convert second two G and B
    MOVAPS XMM0,XMM1 ;copy in order to save B and R
       MULPS XMM0,XMM7
       MOVHLPS XMM2,XMM0
       ADDPS XMM0,XMM2 ;A + xG, xR+xB
                
       ;shufps ??? shuffle to line up for next add
       ADDSS XMM0,XMM2 ;final add
       MOVLHPS XMM0,XMM2 ;make two copies of Y
       ;pshufd instr to line up BY,RY
       SUBPS XMM0,XMM2
       MULPS XMM0,XMM6 ;multiply with last two constants
       ;xmm2 contains Y, xmm0 contains U and V      

    EMMS ;after looping thru all pixels in a picture
    ret
RGBtoYUV ENDP
