
Intro and coding problem/question

Started by robione, December 13, 2009, 08:18:32 PM


robione

Moving the 'dec ecx' instruction from the beginning of the loop to just below the movups squeaked out some more speed: ~11,900 iterations/minute before, ~12,700 after.

dedndave

in practice, it makes little difference, but if you want the equation with more resolution.....

Y = 0.587G + 0.299R + 0.114B

in fact, i have seen programs that use only the green component for grey - lol
the resulting grey scale images are obviously not as good, but i was surprised to see how good they were   :P

drizz

Working on 32 bpp data makes things a lot easier!

PS #define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10) approximation seems like a better option for grayscaling.



// by working on RGBQUAD instead of RGBTRIPLE this can be modified to process 4 RGBQUADs with SSE.

void GreyScale24bpp(BYTE *src, BYTE *dest, DWORD pixels)
{
//#define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10)
__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
__asm
{
mov eax,src
mov edx,dest
mov ecx,pixels
movq mm2,qword ptr rgbmult
L1: punpcklbw mm0,[eax]
psrlw mm0,8
pmaddwd mm0,mm2
punpckldq mm1,mm0
paddd mm1,mm0
psrlq mm1,32 + 10
punpcklbw mm1,mm1
punpcklbw mm1,mm1
movd [edx],mm1
add eax,3
add edx,3
sub ecx,1
jnz L1
emms // clear the MMX state before any later x87 floating-point code runs
}
}



The truth cannot be learned ... it can only be recognized.

dedndave

Drizz - where'd you get the odd-ball luminance ratios ?

drizz

dedndave,

Quote: Y = 0.587G + 0.299R + 0.114B

=>

G*587/1000 + R*299/1000 + B*114/1000

=> approx

G*601/1024 + R*306/1024 + B*117/1024


587 * (1024/1000) = 601.088 ~~ 601
299 * (1024/1000) = 306.176 ~~ 306
114 * (1024/1000) = 116.736 ~~ 117

=> finally

Y = ( G*601 + R*306 + B*117 ) >> 10
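
The scaling above can be checked with a few lines of plain C. This is only a sketch: the function names `scale_to_1024` and `gray_fixed` are made up for illustration and do not come from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a luma weight given in thousandths (587 for 0.587) to a 10-bit
   fixed-point coefficient, rounding to nearest:
   587 * 1024 / 1000 = 601.088 -> 601. */
static uint32_t scale_to_1024(uint32_t thousandths)
{
    assert(thousandths <= 1000u);
    return (thousandths * 1024u + 500u) / 1000u;
}

/* Integer luma with the coefficients from the post. The three
   coefficients sum to exactly 1024, so the shift by 10 stands in for
   the divide and white (255,255,255) still maps to 255. */
static uint8_t gray_fixed(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((b * 117u + g * 601u + r * 306u) >> 10);
}
```

Because the result is truncated rather than rounded, it can come out one level lower than a rounded floating-point calculation.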



dedndave

ahhhhhhhhhhhh - i see - they are multiplied by 1.024 !! - lol
that does make the divide faster   :P

drizz

test 32bpp version:
void GreyScale32bpp(BYTE *src, BYTE *dest, DWORD pixels)
{
//#define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10)
__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
__asm
{
mov eax,src
mov edx,dest
sub eax,edx
mov ecx,pixels
pcmpeqb xmm5,xmm5
movdqa xmm7,xmmword ptr rgbmult
psrld xmm5,8
pxor xmm6,xmm6
//movdqa - eax,edx must be aligned @ 16
// _mm_malloc ( ,16)
L1: movdqa xmm0,[eax+edx]//RGBQUAD RGBQUAD RGBQUAD RGBQUAD
movdqa xmm1,xmm0
punpckhqdq xmm1,xmm1
punpcklbw xmm0,xmm6
punpcklbw xmm1,xmm6
pmaddwd xmm0,xmm7
pmaddwd xmm1,xmm7
movdqa xmm2,xmm0
movdqa xmm3,xmm1
psrlq xmm2,32
psrlq xmm3,32
paddd xmm0,xmm2
paddd xmm1,xmm3
pshufd xmm0,xmm0,10001000b
pshufd xmm1,xmm1,10001000b
punpcklqdq xmm0,xmm1
psrld xmm0,10
packssdw xmm0,xmm0
packuswb xmm0,xmm0 // unsigned saturation; packsswb would clamp grey values above 127 to 127
punpcklbw xmm0,xmm0
punpcklwd xmm0,xmm0
pand xmm0,xmm5
movdqa [edx],xmm0//RGBQUAD RGBQUAD RGBQUAD RGBQUAD
add edx,16
sub ecx,4
jnz L1
}
}

robione

It's amazing, this is like the third time I've been writing a reply and got a response while writing LOL. So this is kind of split into two sections: first Drizz's greyscale converter, then questions I had about your MMX Sobel code.

I guess this is a lesson about not ignoring older technology. In my SSE code I can't start off with an unpack because the src is not guaranteed to be aligned properly. The source buffer is allocated in the CreateDIBSection() call I need to make before the BitBlt()s, and I have no idea what Windows is doing. The other dumb thing about SSE/SSE2/SSE3/SSSE3 is that there is no integer multiplication/division (OK, well, there are pmulhw and pmullw; I need to read up on them). So I was curious to see what your code could do for time: ~8,900. I told myself I needed to unroll this to be fair. 2 pixels/loop: ~11,800. 4 pixels/loop: ~13,400. MMX beats SSE3.

How long have you been doing this, Drizz? My impression is that this is like second nature to you.... It takes me forever to code something simple in assembly. Then to add tricks on top... like realizing you can multiply the float coefficients by 1024 and then shift right by 10.... that would never occur to me.... not yet anyway :).

**********************************************************************
Response after DednDave's comment about green
**********************************************************************

Yeah, green I don't have a problem seeing. It's such a large component of luminosity. But let's say a red edge on a brown background.... it's almost impossible to see. I think I fried my brain on this.... well, between this and physics. I'm recalling my original thinking now. I am by default creating 3+ times the work by doing all color channels and there is no way around that. I can only save time by possibly parallelizing the computation on one row. (I need to write this stuff down as I had completely different thoughts in my last post.)

Drizz I'd like to benchmark what you did on my machine but I'm having some problems understanding some lines as I have to convert this to compile on Visual C++ 98. I'm having problems mainly before the first label. As I've been writing I've been really looking at your code and I'm answering some of my questions but I still have a few:

I also don't really get what the purpose of rep stosd is. You're writing the -1 in EAX to ES, ECX number of times? How are you later accessing the -1's in ES? Why are there -1's in ES? Also I was curious why the following occurs:


mov esi,src
mov edi,dest
sub esi,edi
//loop
lea eax,[esi+edi]


Isn't eax essentially pointing to edi? Ok.. the debugger says it points back to the original src.... Ok I think I understand, sort of. It's done so you can skip the 'inc esi' at the end of the loop? But this gets me curious about something else. Doesn't 'lea eax,[esi]' 'lea eax,[esi+edi]' 'lea eax,[esi+edi*3-3]' expand into multiple instructions within the architecture? Like 'loop' pretty much expands into  'dec ecx; cmp ecx,0; jnz label'. That would explain the behavior I had earlier encountered in my Sobel code... as the instruction count shrunk but the time was similar.
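
The pointer trick asked about here can be sketched in plain C. After `sub esi,edi`, esi holds the distance from dest to src, so `[esi+edi]` is the source address and only edi has to be advanced. The function name `copy_with_delta` is hypothetical, and in C the pointer subtraction is only well defined when both regions sit in the same array, which this sketch assumes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes while advancing a single pointer. `delta` plays the
   role of esi after `sub esi,edi`. Assumes src and dest point into
   the same array so that `src - dest` is well defined. */
static void copy_with_delta(const uint8_t *src, uint8_t *dest, size_t n)
{
    assert(src != NULL && dest != NULL);
    ptrdiff_t delta = src - dest;     /* sub esi,edi */
    uint8_t *end = dest + n;
    while (dest != end) {
        *dest = dest[delta];          /* [esi+edi] addresses the source byte */
        ++dest;                       /* the only pointer update per step */
    }
}
```

For example, with an 8-byte buffer whose first half holds the data, `copy_with_delta(buf, buf + 4, 4)` copies the first half into the second.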

robione

I can't type fast enough LOL. I wanted to try to code it  :(

FORTRANS

Hi,

Quote: Yeah green I don't have a problem seeing. It's such a large component of luminosity. But
lets say a red edge on a brown background.... it's almost impossible to see. I think I fried my brain
on this.... well between this and physics. I'm recalling my original thinking now. I am by default
creating 3+ times the work by doing all color channels and there is no way around that.

   By using a gray image, you are implicitly doing a conversion
from one color space (RGB) to another (YIQ or YCbCr)
that uses a luminance (Y) and two chromaticity components.
If an edge is not showing in RGB or grayscale, perhaps it
will show up in another component in a different color space.

   There are a bunch of color spaces defined for different uses.
RGB, CMY, and CMYK are easy to use.  Hue, Saturation, Value
(HSV) and Hue, Luminosity, Saturation (HLS) are supposed to
be more intuitive for human interaction.  NTSC defined YIQ and
JPEG uses YCbCr to reduce bandwidth or compress the data stream.
Humans are less sensitive to changes in color (chromaticity) than
changes in brightness (intensity or luminosity) and thus those
components can use reduced precision.

   There are color spaces that try to emulate the human visual
sensitivity as well, (L*a*b*, L*u*v*, XYZ C.I.E standards {I think}).
My reference is "Digital Image Processing" by William K. Pratt for
these last mentioned.  He has conversion matrices for those and
some of the others.


Y = 0.299*R + 0.587*G + 0.114*B
I = 0.596*R - 0.274*G - 0.322*B
Q = 0.211*R - 0.523*G + 0.312*B


   If you can see changes that your Sobel operation is not finding,
perhaps a different color component would help.  Pratt also
discusses a number of different edge detector algorithms.
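
As a rough illustration, the Y/I/Q equations above translate directly into C. The function name `rgb_to_yiq` and the array layout (R, G, B in; Y, I, Q out) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one R,G,B triple to Y,I,Q using the coefficients quoted
   above. The I and Q rows each sum to zero, so any grey input
   (R = G = B) gives I = Q = 0 up to rounding error. */
static void rgb_to_yiq(const uint8_t rgb[3], double yiq[3])
{
    assert(rgb != NULL && yiq != NULL);
    double r = rgb[0], g = rgb[1], b = rgb[2];
    yiq[0] = 0.299 * r + 0.587 * g + 0.114 * b;   /* Y: luminance   */
    yiq[1] = 0.596 * r - 0.274 * g - 0.322 * b;   /* I: chroma axis */
    yiq[2] = 0.211 * r - 0.523 * g + 0.312 * b;   /* Q: chroma axis */
}
```

Running an edge detector on the I or Q plane instead of Y would be one way to test whether a red edge on a brown background shows up better there.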

Regards,

Steve N.

dedndave

here are some "sepia tone" conversion numbers as well

outputRed = (inputRed * .393) + (inputGreen *.769) + (inputBlue * .189)

outputGreen = (inputRed * .349) + (inputGreen *.686) + (inputBlue * .168)

outputBlue = (inputRed * .272) + (inputGreen *.534) + (inputBlue * .131)

EDIT - if the total for a channel adds up to more than 255, it is clamped to 255
i am not sure i agree with these numbers, but this is what is supposedly recommended by MS
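
A minimal C sketch of the sepia formulas above, including the clamp to 255 from the EDIT note. The names `clamp255` and `sepia` are made up for this example, and the result is truncated rather than rounded, which is a choice made here and not something the post specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a weighted sum to the byte range. All weights are positive,
   so only the upper bound needs checking. */
static uint8_t clamp255(double v)
{
    return (uint8_t)(v > 255.0 ? 255.0 : v);
}

/* in and out are R,G,B triples. */
static void sepia(const uint8_t in[3], uint8_t out[3])
{
    assert(in != NULL && out != NULL);
    double r = in[0], g = in[1], b = in[2];
    out[0] = clamp255(r * 0.393 + g * 0.769 + b * 0.189);
    out[1] = clamp255(r * 0.349 + g * 0.686 + b * 0.168);
    out[2] = clamp255(r * 0.272 + g * 0.534 + b * 0.131);
}
```

The red and green weights each sum to more than 1.0, so bright pixels saturate in those channels. White, for instance, comes out as (255, 255, 238).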

[image: greyscale version]

[image: sepia tone version]


sprint

Ahnn Dave...from where the hell did u get that car :dazzled:....man hope u will gift me one on my birthday...!!! a real one ..not a framed pic..i mean it :wink

dedndave

not my car Asif - lol
although, Zara used to have a nice collection of cars and motorcycles   :bg
i wish she had kept her Jag - although, i couldn't afford to change the brakes on that thing - lol
an old beat up pickup truck is more my speed   :bg



robione

I think the greyscale converter has all the possible speed squeezed out of it now. It started out doing ~2,400 iterations/min in C; I got it up to ~11,000 myself, and with the help here it is now at ~22,000. It's insanity!!! (... or assembly :) ) I changed Drizz's code around a little bit as it wasn't coming out quite right. It's been distilled down to this:


void GreyScale(BYTE *src, BYTE *dest, int nPixels) {
__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
//while(time_f-time_i < 60000) {
/**/__asm {
mov esi,src
mov edi,dest
mov ecx,nPixels
movdqa xmm7,rgbmult
pxor xmm6,xmm6

L1:
movdqu xmm0,[esi]
sub ecx,4
add esi,16
movdqa xmm1,xmm0
punpckhbw xmm0,xmm6
punpcklbw xmm1,xmm6
pmaddwd xmm0,xmm7
pmaddwd xmm1,xmm7
pshufd xmm2,xmm0,0x31
pshufd xmm3,xmm1,0x31
paddd xmm0,xmm2
paddd xmm1,xmm3
pshufd xmm0,xmm0,10001000b
pshufd xmm1,xmm1,10001000b
punpcklqdq xmm1,xmm0
psrld xmm1,10
packuswb xmm1,xmm1
packuswb xmm1,xmm1
movss [edi],xmm1

add edi,4
cmp ecx,0
jnz L1
}
//count++;
//time_f = ::timeGetTime();
//}
}
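
For checking the output of a SIMD routine like this one, a scalar reference in plain C can help. The sketch below assumes what the assembly implies: 4-byte source pixels in B, G, R, reserved order (RGBQUAD), and one grey byte written per pixel, since the final movss stores 4 bytes for every 4 pixels. The name `greyscale_ref` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: 32bpp BGRX in, 8-bit grey out, using the same
   10-bit fixed-point coefficients as rgbmult. */
static void greyscale_ref(const uint8_t *src, uint8_t *dest, size_t n_pixels)
{
    assert(n_pixels == 0 || (src != NULL && dest != NULL));
    for (size_t i = 0; i < n_pixels; ++i) {
        const uint8_t *p = src + 4 * i;          /* p[0]=B, p[1]=G, p[2]=R */
        dest[i] = (uint8_t)((p[0] * 117u + p[1] * 601u + p[2] * 306u) >> 10);
    }
}
```

Running both versions over the same buffer and comparing the results byte for byte would catch packing or ordering mistakes in the shuffle sequence.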


A note on color spaces. I think the most intuitive to use would be HSB, but the goal of this little project is realtime edge detection, segmentation, feature extraction, etc. In the end I'd rather not convert anything if I can help it. I'm not sure if Sobel or Laplace is the right algorithm to use either. I'm just trying a bunch of stuff out. I am kinda curious about seeing how the YIQ comes out. If memory serves me, it's a bit easier on the CPU than the RGB->HSB formula Wikipedia has. After my final on the 23rd I'll have a bunch more time to tinker with this stuff :)

DednDave is that the Ford GT40? I like that car.

jj2007

Quote from: robione on December 19, 2009, 07:39:33 AM
I think the greyscale converter has all the possible speed squeezed out of it now.

movdqu xmm0,[esi] is relatively slow; on a P4, lddqu is faster.
movdqa is even faster but requires 16-byte alignment. It looks as if you could provide that, but test yourself.
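
Whether a buffer qualifies for movdqa can be checked at run time before picking a code path. The helpers below are a sketch with made-up names (`is_aligned16`, `bytes_to_align16`). They say nothing about how CreateDIBSection() aligns its bits, which would have to be tested on the actual buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when p sits on a 16-byte boundary, as movdqa requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Number of leading bytes to handle separately (for example with
   scalar code) before p reaches the next 16-byte boundary. */
static size_t bytes_to_align16(const void *p)
{
    size_t pad = (size_t)((16u - ((uintptr_t)p & 15u)) & 15u);
    assert(is_aligned16((const unsigned char *)p + pad));
    return pad;
}
```

If both src and dest pass `is_aligned16`, the aligned load and store can be used. Otherwise the unaligned forms, or a short scalar prologue of `bytes_to_align16` bytes, are the fallback.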