Print Page - Intro and coding problem/question

Title: Intro and coding problem/question
Post by: robione on December 13, 2009, 08:18:32 PM

Hi. I learned x86 many years ago on my own. "Learned" is probably the wrong word LOL.... I bought Borland C++ 3.0 when I was 14 and struggled with it. C was way easier to me. I tried to tinker with assembly over the years but things didn't fall into place in my mind. Kinda funny how that works cause many years later in college I took an assembler course and barely did any work cause things just seemed to make sense. So... many years later :) .... at the present I'm still pretty much a nooblet. I don't know much beyond moves, shifts, logical ops, jumps, and arithmetic instructions.

I started to tinker with computer vision a few weeks back. Just really for the fun of it and I wrote a greyscale converter, and Sobel/Laplace edge detectors in C++. Then came the need for speed :)..... so I optimized my C code. It was good but not quite fast enough.... as I'm continually doing a fullscreen windows BitBlt() and then doing my processing from that image. So I started to convert my algorithms from C to assembly.... I think I'm slower then the release builds... not by much but I'm way faster then the debug builds.

So on my quest for more speed I looked into SSE!!! And that made my greyscale converter 4x as fast. (Although I take 7 instructions just to rearrange the registers from xxxAxxxBxxxCxxxD to xxxx|xxxx|xxxx|ABCD :/ ) I'm not too sure if I can take advantage of it for my edge detectors though. But my problem that I'm trying to tackle now is converting my Sobel detector into plain old x86 assembly.... and have it work. The C++ release build does about 3500 iterations in 1 minute correctly.... my failed code can do about 3350 in a minute. It's failed cause it doesn't properly determine the edges

I was hoping if you guys wouldn't mind helping me understand why it's not working and perhaps give me some pointers.

Code Select

/*iSum[0] = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
			(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
			(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
*//**
		mov		dl,byte ptr[esi+1]
		mov		bl,byte ptr[esi+eax+1]
		mov		cl,byte ptr[esi+eax-1]
		shl		edx,1
		neg		eax
		sub		ebx,ecx
		add		ebx,edx
		xor		edx,edx
		mov		dl,byte ptr[esi-1]
		mov		cl,byte ptr[esi+eax-1]
		shl		edx,1
		sub		ebx,ecx
		sub		ebx,edx
		mov		cl,byte ptr[esi+eax+1]
		add		ebx,ecx

esi = byPreCalcSrc
eax = width

And the final line ebx = iSum[0]

This is in a nested loop with the loop counters saved to the stack. And ebx,ecx,edx are cleared before this part. The code above is where 97+% of the time is spent. It continues after this to calculate iSum[1] which is pretty similar to iSum[0] and then finishes up some misc stuff before repeating. I originally had width in edx but I got an error with that. I also get the same error doing something like mov al,[esi-eax+1] then on the next line mov al,[esi+eax+1]. Actually 'esi-eax' gives me the error. I assume my 'neg eax' is probably the cause of my current errors. How would I approach converting this over?

If you're wondering why I'd like to convert this if the C code is faster, it's because I'd like to bring this over to SSE, skip the greyscale conversion and implement Sobel on each color channel. I can probably figure out the SSE if I can get this working. Thanks guys for your help.

Title: Re: Intro and coding problem/question
Post by: jj2007 on December 13, 2009, 10:24:22 PM

You seem not to clear your registers before the byte size moves. Try

Code Select

movzx edx, byte ptr[esi+1]
instead of

Code Select

mov dl, byte ptr[esi+1]

Title: Re: Intro and coding problem/question
Post by: drizz on December 14, 2009, 12:09:05 AM

It would help if you post the whole procedure.

Code Select

	; remove clearing of ebx ecx edx
	sub esi,eax
	movzx ebx,byte ptr[esi+1]
	movzx ecx,byte ptr[esi+eax*2+1]
	add ebx,ecx;(*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc + width + 1))
	movzx edx,byte ptr[esi-1]
	movzx ecx,byte ptr[esi+eax*2-1]
	sub ebx,edx; - (*(byPreCalcSrc - width - 1))
	sub ebx,ecx; - (*(byPreCalcSrc + width - 1))
	movzx ecx,byte ptr[esi+eax+1]
	movzx edx,byte ptr[esi+eax-1]
	sub ecx,edx;(*(byPreCalcSrc + 1)) - (*(byPreCalcSrc - 1));
	lea ebx,[ebx+ecx*2];ebx += ecx<<1
	add esi,eax; restore esi

Title: Re: Intro and coding problem/question
Post by: robione on December 14, 2009, 12:24:39 AM

Oooo I like that instruction.... that's what I meant about being a noob lol... the only move instruction I knew is 'mov'.... I cleared them with xor's before and as needed as I progressed through the algo. I 'll change my code. I think it'll run faster but I don't think it'll solve my problem. Also my compiler will give me an error moving a byte to a 32 register..... ahhh but I see that's the whole purpose for this command :D

I kinda learned most of my assembler by looking at the disassembled stuff in my debugger. I never came across movzx before. Well that just sped up my code a bunch as I no longer need the xor's. I'm a hair faster then the release build :D. I think the problem is that IDK how to address the memory properly that I need to.

The algo is essentially an approximation of the 1st derivative of the change in color. It's two 3x3 masks run over every pixel. One being [-1,0,1][-2,0,2][-1,0,1] The second is pretty much the transposition of that matrix. In my C code I basically look back one row, look at the current row I'm on, and look one row ahead. I thought I could do the same in assembly? Apparently my copmiler doesn't like something like:

Code Select


	mov		eax,width
	mov		esi,src
	movzx	ebx,byte ptr[esi-eax+1]
	movzx	edx,byte ptr[esi+eax-1]

That gives me the following error: D:\Visual Studio Projects\edge\edge.cpp(464) : error C2425: '-' : non-constant expression in 'second operand'

So I tried doing a 'neg eax' beforehand. You can't actually subtract from the base address when using the []'s can you? That's what I was trying to accomplish with the neg. I have a sneaky suspicion I'm adding a huge number to the base address in esi instead of subtracting 1024. Can I do something like 'mov edx,byte ptr [esi+eax+eax+eax+1]'? Does all the '+' and '*' add a performance cost?

Thanks

Title: Re: Intro and coding problem/question
Post by: hutch-- on December 14, 2009, 02:07:37 AM

Rob,

> movzx ebx,byte ptr[esi-eax+1]

The complex addressing modes do not allow the subtraction of the index, its still a BASE address plus the INDEX then any displacement after as a constant. The real problem is there is no opcode to do this task.

Code it as "movzx ebx, [esi+eax1]" and modify the EAX register if you wish to go below the base address in ESI.

Title: Re: Intro and coding problem/question
Post by: robione on December 14, 2009, 02:24:16 AM

TYVM guys!!!!! I got it working with your help. I did make a mistake outside of what I posted but that's fixed. The times seem to be a bit more variable now after changing it to just look forward instead of backward and forward but it did squeak above 3700 iterations a minute :D (It also got down to 3650.... the back/forward looking version was a little more consistent around 3670. Kinda surprises me since I "removed" instructions. Does LEA "add" them in?)

Here's the working code:

Code Select


void Sobel(BYTE *src, BYTE *dest, int width, int height) {
/**/DWORD time_i,time_f;
::timeBeginPeriod(1);
time_f = time_i = ::timeGetTime();

while(time_f-time_i < 60000) {
/**/
	__asm {
		mov		esi,src
		mov		edi,dest
		mov		ebx,width
		mov		ecx,height
		inc		esi
		lea		edi,[edi+ebx+1]

		mov		eax,ebx
		sub		ebx,2
		sub		ecx,2

rows:
		dec		ecx
		push	ecx
		mov		ecx,ebx
cols:
		dec		ecx
		push	ebx
		push	ecx		

		movzx	edx,byte ptr[esi+eax*2+1]
		movzx	ebx,byte ptr[esi+1]
		movzx	ecx,byte ptr[esi-1]
		sub		ebx,ecx
		add		ebx,edx
		movzx	ecx,byte ptr[esi+eax-1]
		movzx	edx,byte ptr[esi+eax*2-1]
		sub		ebx,edx
		movzx	edx,byte ptr[esi+eax+1]
		sub		edx,ecx
		lea		ebx,[ebx+edx*2]

		push	ebx

		movzx	edx,byte ptr[esi]
		movzx	ebx,byte ptr[esi+1]
		movzx	ecx,byte ptr[esi+eax*2]
		sub		edx,ecx
		lea		ebx,[ebx+edx*2]
		movzx	edx,byte ptr[esi-1]
		movzx	ecx,byte ptr[esi+eax*2-1]
		add		ebx,edx
		sub		ebx,ecx
		movzx	ecx,byte ptr[esi+eax*2+1]
		sub		ebx,ecx

		pop		edx

		cmp		edx,0
		jge		Ex_nlt0
		neg		edx
Ex_nlt0:
		cmp		ebx,0
		jge		Ey_nlt0
		neg		ebx
Ey_nlt0:
		add		edx,ebx

		cmp		edx,0xff
		jle		ngt255
		mov		edx,0xff
ngt255:
//		cmp		eax,0
//		jge		nlt0
//		xor		eax,eax
//nlt0:
//For the above to be true a very large sum must be present so that SumX + SumY > 2^31
		mov		byte ptr[edi],dl

		pop		ecx
		pop		ebx
		inc		esi
		inc		edi

		cmp		ecx,0
		jnz		cols

		pop		ecx
		add		esi,2
		add		edi,2
		cmp		ecx,0
		jnz		rows
	}
/**/
count++;
time_f = ::timeGetTime();
}
::timeEndPeriod(1);
/**/
}

Title: Re: Intro and coding problem/question
Post by: drizz on December 14, 2009, 12:10:12 PM

Can you post the c++ version too?

Title: Re: Intro and coding problem/question
Post by: robione on December 14, 2009, 05:35:40 PM

This is what I have left of it commented out at the bottom of my .cpp file. Hopefully the missing variable declarations are easy enough to figure out :) Pretty much just add the declarations and replace the assembly code and it should be good to go. This assumes it's greyscale. Working on a BitBlitter (that probably can't do anything..... dam Windows.... I only say that in the nicest way :) ), then converting this to SSE... well I'll try anyway :)

Code Select


/**
	for(y = 1,byPreCalcDest = dest + width + 1,byPreCalcSrc = src + width + 1;
		y<ylim;
		++y,byPreCalcDest+=2,byPreCalcSrc+=2) {

		for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {
			iSumX = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
				(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
				(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
			iSumY = (*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc - width - 1)) -
				(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
				(*(byPreCalcSrc - width) << 1) - (*(byPreCalcSrc + width) << 1);


			iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);
			
			if(iSumY > 255) iSumY = 255;
//			if(iSumY < 0) iSumY = 0;

			*(byPreCalcDest) = (unsigned char)iSumY;
		}
	}
*/

Oopsie.... xlim and ylim are the height and width minus one.

Title: Re: Intro and coding problem/question
Post by: drizz on December 14, 2009, 06:57:28 PM

You should first try to optimize the c++ version.
1) There is no need to read the same memory twice (good compilers will optimize this of course) (see: TRY1)
2) I'm not sure if reading a dword will give any significant performance increase since it adds some overhead (see: TRY2)
but I'm sure you can get increase in speed by reducing the number of memory reads for more than a dword.
You can try unrolling by 2 just by just by doing (tri_p>>=8;; tri_m>>=8;) and copy-pasting;
This will lead you to the SSE version.

PS this is all by heart, i didn't try to compile anything

Code Select

		for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {
			
			// TRY 1
			// {
			iTmp = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
				(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1));
				
			iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2  + iTmp;
			iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) * 2  + iTmp;
			// }
			
			// TRY 2
			// {
			// read dword :
			// low byte = *(byPreCalcSrc +- width - 1)
			// 2nd byte = *(byPreCalcSrc +- width)
			// 3rd byte = *(byPreCalcSrc +- width + 1)
			DWORD tri_p=*((DWORD *)(byPreCalcSrc + width - 1));
			DWORD tri_m=*((DWORD *)(byPreCalcSrc - width - 1));
			
			
			iSumY = (((tri_m&0xFF00)+(tri_p&0xFF00))>>7); // 7 ==> >>8 <<1
			iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2;
			
			iTmp = (((tri_m&0xFF0000) + (tri_p&0xFF0000)) >>16)
				-  (((tri_m&0xFF) + (tri_p&0xFF)));
			
			iSumX += iTmp;
			iSumY += iTmp;
			// }

			iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);
			
			if(iSumY > 255) iSumY = 255;
//			if(iSumY < 0) iSumY = 0;

			*(byPreCalcDest) = (unsigned char)iSumY;
		}

Title: Re: Intro and coding problem/question
Post by: drizz on December 14, 2009, 07:10:50 PM

sorry iTmp calc was wrong, should be something like:

Code Select

DWORD B1=(*(byPreCalcSrc - width + 1));
DWORD B2=(*(byPreCalcSrc - width - 1));
DWORD B3=(*(byPreCalcSrc + width + 1));
DWORD B4=(*(byPreCalcSrc + width - 1));

iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) *2;
iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) *2;

// ==>

iSumX += B1 - B2 + B3 - B4;
iSumY += B1 + B2 - B3 - B4;
	

// ==>

iSumX += (B1 - B4) - (B2 - B3);
iSumY += (B1 - B4) + (B2 - B3);
	
// ==>

V1 = B1 - B4;
V2 = B2 - B3;
iSumX += V1 - V2
iSumY += V1 + V2

// this should be in asm:
lea eax,[ebx+ecx];V1 + V2
sub ebx,ecx;V1 - V2

Title: Re: Intro and coding problem/question
Post by: robione on December 15, 2009, 09:33:44 PM

If I catch your meaning right your last post would replace TRY2 in the previous post? The hex stuff is a little hard for me to work with without a pen and paper lol. You shoulda seen the original version. I never benchmarked it but I know it was slow just by looking at it. It was the epitome of unoptimized code. I thought I did an okay job....... Jeez Drizz. You have any idea how much faster it is now? It went from ~3600 iterations/minute to ~4400, a 25+% improvement.

Thank You!!!!

Title: Re: Intro and coding problem/question
Post by: drizz on December 16, 2009, 01:48:04 AM

Since you didn't provide a full example to play with i did a little search and found this page with c source http://www.pages.drexel.edu/~weg22/edge.html assimilated it and tried to optimize. Finally i wrote a sse2 version that seems to run 2x slower !! :) yay

my guess is even if I improve the sse2 version, x2 seems like a lot to catch up. :(

Title: Re: Intro and coding problem/question
Post by: dedndave on December 16, 2009, 01:29:48 PM

that is interesting, Drizz
i am not up to speed on sse yet
but, i figure that if you can't do it, noone can - lol
sse isn't for everything, i guess
if you can't "pipeline" the process, you don't get the advantage from sse ???
i mention this, because i am trying to visualize how to apply "Henry Ford" to ling long kai fang :P

Title: Re: Intro and coding problem/question
Post by: drizz on December 16, 2009, 03:27:26 PM

The MMX version seems to be much faster than SSE2 :)

Code Select

.686
.model flat,c
.const
.mmx
align 16
AlignedGXGY label word
dw -1,-2,-1,00, 1, 0,-1,00
dw 0, 0, 0,00, 2, 0,-2,00
dw 1, 2, 1,00, 1, 0,-1,00

.code

option prologue:none
option epilogue:none

;; extern "C" void __cdecl SobelMMX(PBYTE dest, PBYTE src, int iWidth, int iHeight);

SobelMMX proc c dest,src,iWidth,iHeight
	locals equ 4*4
	_dest equ <dword ptr [esp+1*4+locals]>
	_src equ <dword ptr [esp+2*4+locals]>
	_iWidth equ <dword ptr [esp+3*4+locals]>
	_iHeight equ <dword ptr [esp+4*4+locals]>
	push edi
	push esi
	push ebx
	push ebp
	mov ebp,_iHeight
	mov ebx,_iWidth
	mov esi,_src
	mov edi,_dest
	mov eax,ebx
	imul ebp
	mov ecx,eax
	mov edx,eax
	mov eax,-1
	shr ecx,2
	and edx,3
	rep stosd
	mov ecx,edx
	rep stosb
	mov edi,_dest
	mov ecx,_iWidth
	add edi,ebx
	add edi,1
	sub esi,edi
	pcmpeqw mm3,mm3
	psrlw mm3,8 
	sub ebp,2
	jle endYloop
startYloop:	
		mov ebx,ecx
		sub ebx,2
		jle endXloop
startXloop:	
		lea eax,[esi+edi]
		pxor mm7,mm7
		movd mm0,[eax+000*0]
		movd mm5,[eax+ecx*1]
		movd mm2,[eax+ecx*2]
		punpcklbw mm0,mm7;// bytes -> words
		punpcklbw mm5,mm7;// bytes -> words
		punpcklbw mm2,mm7;// bytes -> words
		movq mm4,mm0
		;//movq mm5,mm1
		movq mm6,mm2
		pmullw mm0,qword ptr AlignedGXGY[00];// GX GY 1//low:sumX :: high:sumY
		;//pmullw mm1,qword ptr AlignedGXGY[16];// zeros
		pmullw mm2,qword ptr AlignedGXGY[32];// GX GY 3//low:sumX :: high:sumY
		pmullw mm4,qword ptr AlignedGXGY[08];// GX GY 1//low:sumX :: high:sumY
		pmullw mm5,qword ptr AlignedGXGY[24];// GX GY 2//low:sumX :: high:sumY
		pmullw mm6,qword ptr AlignedGXGY[40];// GX GY 3//low:sumX :: high:sumY
		paddsw mm4,mm5
		paddsw mm0,mm2;//low:sumX :: high:sumY
		paddsw mm4,mm6;//low:sumX :: high:sumY
		punpckldq mm1,mm0
		punpckldq mm5,mm4
		punpckhdq mm0,mm4
		punpckhdq mm1,mm5
		paddsw mm0,mm1
		movq mm2,mm0
		psrld mm2,16
		paddsw mm0,mm2
		pcmpeqw mm5,mm5
		pcmpgtw mm7,mm0
		pxor mm0,mm7
		psubw mm0,mm7
		punpckldq mm2,mm0
		paddw mm0,mm2
		movq mm2,mm0
		pcmpgtw mm2,mm3
		por mm0,mm2
		psubw mm5,mm0
		psrlq mm5,32
		movd edx,mm5
		add edi,1
		sub ebx,1;//X++
		mov [edi-1],dl
		jnz startXloop
		;//}
endXloop:
	add edi,2
	sub ebp,1;//Y++
	jnz startYloop
	;//}
endYloop:
	pop ebp
	pop ebx
	pop esi
	pop edi
	retn
	
SobelMMX endp

end

The problem is transforming 3x3 byte matrix into 1byte output. The 3-byte reading doesn't fit 16-byte register scheme well, so you must do multiple and remainder but that only complicates stuff, not to mention a lot of byte shifting...

Title: Re: Intro and coding problem/question
Post by: robione on December 17, 2009, 04:26:10 PM

Drizz that site you saw is where I first started. I wasn't really sure if Sobel could be sped up (with parallelism) as it was presented on that site, being done on a greyscale image and all. I wanted to do it on color images so I was hoping SSE could help out with that, as the same ops would have to be done on all the color channels once they were separated. My thoughts were that R and B edges have the potential to not be properly isolated as an edge as they have low intensities. And this turns out to be true. Running this app while I run a game I like to play called Guildwars really illustrates this as several details disappear. Running the color version I coded all these details are obvious.

I take it you saw the laplace algorithm then too? I coded that up as well when I did Sobel. I load up a 'summator' variable at the beginning of each row and compute the first pixel. From there I basically add in the values leaving, subtract the 5 coming in, and adjust the center pixel accordingly. My original thinking was that I probably could get Laplace to be faster then Sobel..... I think it was at a comparable speed LOL until I this thread started :).

Sadly I have benchmarked all steps in my current program loop (BitBlt, RGB->Gy, Sobel, OpenGL Draws), one that is out of my control turns out to be the slowest link in the chain... BitBlt. Sometimes I really don't like Windows. I don't understand how they can take something that should run 30k+ times a minute and squeak out 1,140 on my machine. I got my RGB->Gy done pretty quick but I'm eventually taking it out as I hope the Sobel/Laplace SSE color version will be comparable to the time RGB->Gy and greyscale Sobel/Laplace take now.

If you're at all interested in the RGB->Gy code:

Code Select

void GreyScale(BYTE *src, BYTE *dest, int nPixels) {
	__declspec(align(16)) const float fvR2Gy[4] = {0.11f,0.11f,0.11f,0.11f};
	__declspec(align(16)) const float fvG2Gy[4] = {0.59f,0.59f,0.59f,0.59f};
	__declspec(align(16)) const float fvB2Gy[4] = {0.3f,0.3f,0.3f,0.3f};
	__declspec(align(16)) const UINT uiRMask[4] = {0x000000ff,0x000000ff,0x000000ff,0x000000ff};
	__declspec(align(16)) const UINT uiBMask[4] = {0x00ff0000,0x00ff0000,0x00ff0000,0x00ff0000};
	__declspec(align(16)) const UINT uiGMask[4] = {0x0000ff00,0x0000ff00,0x0000ff00,0x0000ff00};
	
	__asm {
		mov		ecx,nPixels
		mov		eax,0
		mov		edx,0
		shr		ecx,2
		mov		esi,src
		mov		edi,dest

cvt_rgba_g:
		dec		ecx
		movups	xmm0,[esi+eax]
		movaps	xmm1,xmm0
		movaps	xmm2,xmm0
		pand	xmm0,uiBMask
		pand	xmm1,uiGMask
		pand	xmm2,uiRMask
		psrld	xmm0,16
		psrld	xmm1,8
		cvtdq2ps	xmm0,xmm0
		cvtdq2ps	xmm1,xmm1
		cvtdq2ps	xmm2,xmm2
		mulps	xmm0,fvR2Gy
		mulps	xmm1,fvG2Gy
		mulps	xmm2,fvB2Gy
		addps	xmm1,xmm0
		addps	xmm2,xmm1
		cvtps2dq	xmm2,xmm2
		add		eax,16
		packuswb	xmm2,xmm2
		packuswb	xmm2,xmm2
		
		movss	[edi+edx],xmm2
		add		edx,4

		cmp		ecx,0
		jnz cvt_rgba_g
	}
}

This assumes the image width is divisible by four. Since I'm working only with full size screenies I find this holds true. (Actually I find them being divisible by 16 holds true so far)

Title: Re: Intro and coding problem/question
Post by: robione on December 17, 2009, 04:44:57 PM

If you take the 'dec ecx' instruction at the beginning of the loop and put it underneath the movups I squeaked out some more speed. Before ~11,900 iterations/minute, after ~12,700.

Title: Re: Intro and coding problem/question
Post by: dedndave on December 17, 2009, 05:07:51 PM

in practice, it makes little difference, but if you want the equation with more resolution.....

Y = 0.587G + 0.299R + 0.114B

in fact, i have seen programs that use only the green component for grey - lol
the resulting grey scale images are obviously not as good, but i was surprised to see how good they were :P

Title: Re: Intro and coding problem/question
Post by: drizz on December 17, 2009, 10:30:30 PM

Working on 32 bpp data makes things a lot easier!

PS #define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10) approximation seems like a better option for grayscaling.

Code Select



// by working on RGBQUAD instead of RGBTRIPLE this can be modified to process 4 RGBQUADs with SSE.

void GreyScale24bpp(BYTE *src, BYTE *dest, DWORD pixels)
{
	//#define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10)
	__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
	__asm
	{
		mov eax,src
		mov edx,dest
		mov ecx,pixels
		movq mm2,qword ptr rgbmult
	L1:	punpcklbw mm0,[eax]
		psrlw mm0,8
		pmaddwd mm0,mm2
		punpckldq mm1,mm0
		paddd mm1,mm0
		psrlq mm1,32 + 10
		punpcklbw mm1,mm1
		punpcklbw mm1,mm1
		movd [edx],mm1
		add eax,3
		add edx,3
		sub ecx,1
		jnz L1
	}
}

Title: Re: Intro and coding problem/question
Post by: dedndave on December 17, 2009, 11:52:00 PM

Drizz - where'd you get the odd-ball luminance ratios ?

Title: Re: Intro and coding problem/question
Post by: drizz on December 18, 2009, 12:08:48 AM

dedndave,

QuoteY = 0.587G + 0.299R + 0.114B

=>

G*587/1000 + R*299/1000 + B*114/1000

=> approx

G*601/1024 + R*306/1024 + B*117/1024

587 * (1024/1000) = 601,088 ~~ 601
299 * (1024/1000) = 306,176 ~~ 306
114 * (1024/1000) = 116,736 ~~ 117

=> finally

Y = ( G*601 + R*306 + B*117 ) >> 10

Title: Re: Intro and coding problem/question
Post by: dedndave on December 18, 2009, 12:27:41 AM

ahhhhhhhhhhhh - i see - they are multiplied by 1.024 !! - lol
that does make the divide faster :P

Title: Re: Intro and coding problem/question
Post by: drizz on December 18, 2009, 01:24:52 AM

test 32bpp version:

Code Select

void GreyScale32bpp(BYTE *src, BYTE *dest, DWORD pixels)
{
	//#define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10)
	__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
	__asm
	{
		mov eax,src
		mov edx,dest
		sub eax,edx
		mov ecx,pixels
		pcmpeqb xmm5,xmm5
		movdqa xmm7,xmmword ptr rgbmult
		psrld xmm5,8
		pxor xmm6,xmm6
		//movdqa - eax,edx must be aligned @ 16
		// _mm_malloc ( ,16)
	L1:	movdqa xmm0,[eax+edx]//RGBQUAD RGBQUAD RGBQUAD RGBQUAD
		movdqa xmm1,xmm0
		punpckhqdq xmm1,xmm1
		punpcklbw xmm0,xmm6
		punpcklbw xmm1,xmm6
		pmaddwd xmm0,xmm7
		pmaddwd xmm1,xmm7
		movdqa xmm2,xmm0
		movdqa xmm3,xmm1
		psrlq xmm2,32
		psrlq xmm3,32
		paddd xmm0,xmm2
		paddd xmm1,xmm3
		pshufd xmm0,xmm0,10001000b
		pshufd xmm1,xmm1,10001000b
		punpcklqdq xmm0,xmm1
		psrld xmm0,10
		packssdw xmm0,xmm0
		packsswb xmm0,xmm0
		punpcklbw xmm0,xmm0
		punpcklwd xmm0,xmm0
		pand xmm0,xmm5
		movdqa [edx],xmm0//RGBQUAD RGBQUAD RGBQUAD RGBQUAD
		add edx,16
		sub ecx,4
		jnz L1
	}
}

Title: Re: Intro and coding problem/question
Post by: robione on December 18, 2009, 02:34:03 AM

It's amazing this is like the third time I was writing a reply and got a response while writing LOL. So this is kinda split in two sections. First Drizz's greyscale converter and then questions I had about your MMX Sobel code.

I guess this is a lesson about not ignoring older technology. In my SSE code I cant start off with an unpack because the src is not guaranteed to be aligned properly. The source buffer is allocated in the CreateDIBSection() call I need to make before BitBlt()'s and I have no idea what windows is doing. And the other dumb thing about SSE/2/3/3.1 (SSSE3) is that there is no integer multiplication/division (ok well there is pmulhw and pmullw. I need to read up on them). So I was curious to see what your code could do for time. About ~8900. I told myself I needed to unroll this to be fair. 2 pixels/loop .... ~ 11,800. 4 pixels/loop .... ~13,400. MMX beats SSE3

How long have you been doing this Drizz? My impression is this is like second nature to you.... It takes me forever to code something simple in asssembly. Then to add tricks on top... like realizing to multiply the float conversions by 1024 then shift left 10.... never happen to me.... not yet anyway :).

**********************************************************************
Response after DednDave's comment about green
**********************************************************************

Yeah green I don't have a problem seeing. It's such a large component of luminosity. But lets say a red edge on a brown background.... it's almost impossible to see. I think I fried my brain on this.... well between this and physics. I'm recalling my original thinking now. I am by default creating 3+ times the work by doing all color channels and there is no way around that. I can only save time by possibly parallelizing the computation on one row. (I need to write this stuff down as I had completely different thoughts in my last post.)

Drizz I'd like to benchmark what you did on my machine but I'm having some problems understanding some lines as I have to convert this to compile on Visual C++ 98. I'm having problems mainly before the first label. As I've been writing I've been really looking at your code and I'm answering some of my questions but I still have a few:

I also dont really get what the purpose of rep stosd is? You're writing -1 in EAX to ES, ECX number of times? How are you later accessing the -1's in ES? Why are there -1's in ES? Also I was curious why the following occurs:

Code Select


	mov esi,src
	mov edi,dest
	sub esi,edi
//loop
	lea eax,[esi+edi]

Isn't eax essentially pointing to edi? Ok.. the debugger says it points back to the original src.... Ok I think I understand, sort of. It's done so you can skip the 'inc esi' at the end of the loop? But this gets me curious about something else. Doesn't 'lea eax,[esi]' 'lea eax,[esi+edi]' 'lea eax,[esi+edi*3-3]' expand into multiple instructions within the architecture? Like 'loop' pretty much expands into 'dec ecx; cmp ecx,0; jnz label'. That would explain the behavior I had earlier encountered in my Sobel code... as the instruction count shrunk but the time was similar.

Title: Re: Intro and coding problem/question
Post by: robione on December 18, 2009, 02:36:08 AM

I can't type fast enough LOL. I wanted to try to code it :(

Title: Re: Intro and coding problem/question
Post by: FORTRANS on December 18, 2009, 02:17:37 PM

Hi,

QuoteYeah green I don't have a problem seeing. It's such a large component of luminosity. But
lets say a red edge on a brown background.... it's almost impossible to see. I think I fried my brain
on this.... well between this and physics. I'm recalling my original thinking now. I am by default
creating 3+ times the work by doing all color channels and there is no way around that.

By using a gray image, you are implicitly doing a conversion
from one color space (RGB) to another (YIQ or YCbCr)
that uses a luminance (Y) and two chromaticity components.
If an edge is not showing in RGB or grayscale, perhaps it
will show up in another component in a different color space.

There are a bunch of color spaces defined for different uses.
RGB, CMY, and CMYK are easy to use. Hue, Saturation, Value
(HSV) and Hue, Luminosity, Saturation (HLS) are supposed to
be more intuitive for human interaction. NTSB defined YIQ and
JPEG uses YCbCr to reduce bandwith or compress the data stream.
Humans are less sensitive to changes in color (chromaticity) than
changes in brightness (intensity or luminosity) and thus those
components can use reduced precision.

There are color spaces that try to emulate the human visual
sensitivity as well, (L*a*b*, L*u*v*, XYZ C.I.E standards {I think}).
My reference is "Digital Image Processing" by William K. Pratt for
these last mentioned. He has conversion matrices for those and
some of the others.

Code Select


 Y = 0.299*R + 0.587*G + 0.114*B
 I = 0.596*R - 0.274*G - 0.322*B
 Q = 0.211*R - 0.523*G + 0.312*B

If you can see changes that your Sobel operation is not finding,
perhaps a different color component would help. Pratt also
discusses a number of different edge detector algorithms as well.

Regards,

Steve N.

Title: Re: Intro and coding problem/question
Post by: dedndave on December 18, 2009, 02:24:41 PM

here are some "sepia tone" conversion numbers as well

outputRed = (inputRed * .393) + (inputGreen *.769) + (inputBlue * .189)

outputGreen = (inputRed * .349) + (inputGreen *.686) + (inputBlue * .168)

outputBlue = (inputRed * .272) + (inputGreen *.534) + (inputBlue * .131)

EDIT - if the total adds up to more than 255, the 255 limit is set
i am not sure i agree with these numbers, but this is what is supposedly recommended by MS

greyscale

(http://i.techrepublic.com.com/gallery/160538-299-136.jpg)

sepia tone

(http://i.techrepublic.com.com/gallery/160539-299-136.jpg)

Title: Re: Intro and coding problem/question
Post by: sprint on December 18, 2009, 06:03:24 PM

Ahnn Dave...from where the hell did u get that car :dazzled:....man hope u will gift me one on my birthday...!!! a real one ..not a framed pic..i mean it :wink

Title: Re: Intro and coding problem/question
Post by: dedndave on December 19, 2009, 05:02:16 AM

not my car Asif - lol
although, Zara used to have a nice collection of cars and motorcycles :bg
i wish she had kept her Jag - although, i couldn't afford to change the brakes on that thing - lol
an old beat up pickup truck is more my speed :bg

(http://img685.imageshack.us/img685/6008/zeesj.jpg)

Title: Re: Intro and coding problem/question
Post by: robione on December 19, 2009, 07:39:33 AM

I think the gryscale converter has all the speed squeezed out of it possible now. What started out doing ~2400 iterations/min in C, that I got up to ~11000 myself... and with the help here is now ~22000. It's insanity!!! (... or assembly :) ) I changed Drizz's code around a little bit as it wasn't coming out quite right. It's been distilled down to this:

Code Select


void GreyScale(BYTE *src, BYTE *dest, int nPixels) {
	__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
//while(time_f-time_i < 60000) {
/**/__asm {
		mov esi,src
		mov edi,dest
		mov ecx,nPixels
		movdqa xmm7,rgbmult
		pxor xmm6,xmm6

L1:
		movdqu xmm0,[esi]
		sub ecx,4
		add esi,16
		movdqa xmm1,xmm0
		punpckhbw xmm0,xmm6
		punpcklbw xmm1,xmm6
		pmaddwd xmm0,xmm7
		pmaddwd xmm1,xmm7
		pshufd xmm2,xmm0,0x31
		pshufd xmm3,xmm1,0x31
		paddd xmm0,xmm2
		paddd xmm1,xmm3
		pshufd xmm0,xmm0,10001000b
		pshufd xmm1,xmm1,10001000b
		punpcklqdq xmm1,xmm0
		psrld xmm1,10
		packuswb	xmm1,xmm1
		packuswb	xmm1,xmm1
		movss [edi],xmm1

		add edi,4
		cmp ecx,0
		jnz L1
	}
/**/count++;
time_f = ::timeGetTime();
}/**/
}

A note on color spaces. I think the most intuitive to use would be HSB but the goal of this little project is realtime edge detection, segmentation, feature extraction, etc. In the end I'd rather not convert anything if I can help it. I'm not sure if Sobel or Laplace is the right algorithm to use either. I'm just trying a bunch of stuff out. I am kinda curious about seeing how the YIQ comes out. If memory serves me it's a bit easier on the CPU then the RGB->HSB formula wikipedia has. After my final the 23rd I'll have a bunch more time to tinker with this stuff :)

DednDave is that the Ford GT40? I like that car.

Title: Re: Intro and coding problem/question
Post by: jj2007 on December 19, 2009, 08:03:45 AM

Quote from: robione on December 19, 2009, 07:39:33 AM
I think the gryscale converter has all the speed squeezed out of it possible now.

movdqu xmm0,[esi] is relatively slow; on a P4, lddqu is faster.
movdqa is even faster but requires 16-byte alignment. It looks as if you could provide that, but test yourself.

Title: Re: Intro and coding problem/question
Post by: FORTRANS on December 19, 2009, 02:25:14 PM

Quote from: robione on December 19, 2009, 07:39:33 AM
A note on color spaces. I think the most intuitive to use would be HSB

HSV?

Quotebut the goal of this little project is realtime edge detection, segmentation,
feature extraction, etc. In the end I'd rather not convert anything if I can help it.

Reviewing Pratt, and looking at all the example pictures, straight
RGB is probably usable without conversion to another color space
for most images. The idea behind converting is to emphasize some
characteristic of the image that is of interest.

QuoteI'm not sure if Sobel or Laplace is the right algorithm to use either.

Again, looking at the example pictures and scanning the text,
Sobel should be good as a baseline edge detector. If you find
that it fails to perform adequately on your images, then try a
different algorithm. It seems to be the best compromise between
simplicity and effectiveness.

Trying my own code, Sobel works well except on noisy images.
(For my excursions, that meant "do not apply a high pass filter".)

QuoteI'm just trying a bunch of stuff out. I am kinda curious about seeing how the YIQ
comes out. If memory serves me it's a bit easier on the CPU then the RGB->HSB formula
wikipedia has.

Yes, it is simpler to calculate. And the gray scale from individual
RGB components is a good first test. The RGB channels are highly
correlated in most images.

QuoteAfter my final the 23rd I'll have a bunch more time to tinker with this stuff :)

Well, if you come to some firm conclusions, let us know.

Regards,

Steve N.

Title: Re: Intro and coding problem/question
Post by: drizz on December 19, 2009, 06:25:05 PM

Quote from: robione on December 18, 2009, 02:34:03 AMI also dont really get what the purpose of rep stosd is? You're writing -1 in EAX to ES, ECX number of times? How are you later accessing the -1's in ES? Why are there -1's in ES?

It's this code that I moved outside the loop

Code Select

	/* image boundaries */
	     if(Y==0 || Y==originalImage.rows-1)
		  SUM = 0;
	     else if(X==0 || X==originalImage.cols-1)
		  SUM = 0;

Since in my code "dest" and "src" have the same size edges are filled with 0xFF.

QuoteIt is important to notice that pixels in the first and last rows, as well as the first and last columns cannot be manipulated by a 3x3 mask. This is because when placing the center of the mask over a pixel in the first row (for example), the mask will be outside the image boundaries.

using memset (rep stosd/stosb) is faster than having a loop that changes only first/last row and first/last columns

This can of course be avoided by having "dest" size of "(width-2)*(height-2)" but to simplify things i didn't do it.

Quote from: robione on December 18, 2009, 02:34:03 AM
Also I was curious why the following occurs:

Code Select Expand
mov esi,src mov edi,dest sub esi,edi //loop lea eax,[esi+edi]

Isn't eax essentially pointing to edi? Ok.. the debugger says it points back to the original src.... Ok I think I understand, sort of. It's done so you can skip the 'inc esi' at the end of the loop?

Yes, it's a trick i learned from c++compiler, there are many similar tricks that can be used to optimize loop code.

The one i did:

P3 = P1 - P2

mov ..,[P3+P2] ;; mem evaluates to P1
mov [P2],...
inc P2; by increasing P2 I increase both effective addresses

Remember: it's all integer arithmetic before accessing memory.

Quote from: robione on December 18, 2009, 02:34:03 AMBut this gets me curious about something else. Doesn't 'lea eax,[esi]' 'lea eax,[esi+edi]' 'lea eax,[esi+edi*3-3]' expand into multiple instructions within the architecture? Like 'loop' pretty much expands into 'dec ecx; cmp ecx,0; jnz label'. That would explain the behavior I had earlier encountered in my Sobel code... as the instruction count shrunk but the time was similar.

"lea" stands for "load effective address". This means that it only calculates address without accessing memory. You have to check the optimization manual(s) for instructions to avoid when writing optimized code.

here are some links:
http://www.agner.org/optimize/#manuals
http://www.mark.masmcode.com/
http://graphics.stanford.edu/~seander/bithacks.html
http://www.iti.cs.tu-bs.de/soft/www.goof.com/pcg/doc/opt-pairing.html
http://www.hackersdelight.org/
+ intel, amd opt. manuals which you surely have.

Quote from: robione on December 18, 2009, 02:34:03 AM
How long have you been doing this Drizz? My impression is this is like second nature to you.... It takes me forever to code something simple in asssembly. Then to add tricks on top... like realizing to multiply the float conversions by 1024 then shift left 10.... never happen to me.... not yet anyway :).

For me writing optimized code is fun. You can learn all the tricks too if you wanted. Also don't credit me for RGB2GRAY macro, I picked it up from some page, but writing code without floats was the first thing I thought of.

have fun :wink

Title: Re: Intro and coding problem/question
Post by: sprint on December 19, 2009, 07:06:36 PM

ahhh Dave...thats a sincere confession :U...now look at this attachment...u will find plenty of them in a country like india....what should i call it !!! :bg

sorry for the attachement...couldn't find a way to embed the picture along with this post..... :eek

i guess this is the best way of traveling ..... :lol

Title: Re: Intro and coding problem/question
Post by: UtillMasm on December 20, 2009, 02:47:52 AM

:green2
oh my fo, crazy indian!

Title: Re: Intro and coding problem/question
Post by: dedndave on December 20, 2009, 02:50:59 AM

Quoteoh my fo, crazy indian!

ROFL !!!

(http://www.artafterdark.com/hidden2/1947Indian_lg1.jpg)

Sorry Rob - we have completely hi-jacked your thread (http://www.laserenthusiast.com/forums/images/smilies/th_threadjacked.gif)

Title: Re: Intro and coding problem/question
Post by: robione on December 21, 2009, 06:35:07 AM

Quote from: drizz on December 19, 2009, 06:25:05 PM
here are some links:
http://www.agner.org/optimize/#manuals
http://www.mark.masmcode.com/
http://graphics.stanford.edu/~seander/bithacks.html
http://www.iti.cs.tu-bs.de/soft/www.goof.com/pcg/doc/opt-pairing.html
http://www.hackersdelight.org/
+ intel, amd opt. manuals which you surely have.

Funny you should mention the opt. manuals.... I just downloaded them Friday. Interesting reads. Thx for the links. I'll definitely be checking them out (as I procrastinate for this physics final :) )

Quote from: dedndave on December 20, 2009, 02:50:59 AM
Sorry Rob - we have completely hi-jacked your thread (http://www.laserenthusiast.com/forums/images/smilies/th_threadjacked.gif)

For a sec I was trying to figure out who you were talking to :). I haven't given out my names on forums yet. This one seems to be more of a tight community than others I'm part of. My last name is Robidoux... that + Starwars is where the Robione comes from :)

Quote from: jj2007 on December 19, 2009, 08:03:45 AM
Quote from: robione on December 19, 2009, 07:39:33 AM
I think the gryscale converter has all the speed squeezed out of it possible now.
movdqu xmm0,[esi] is relatively slow; on a P4, lddqu is faster...

I've learned to stop saying such things LOL. I end up biting my tongue. TYVM, I just need to upgrade all my compilers. As far as C++ goes I'm not sure if VC++ Express 2008 had anything lacking from VS6 Enterprise Ed. That's the only thing that's kept me from changing compilers on my netbook. Unfort, I'm limited to SSE2 instructions with my VS6 setup. I'll keep that in mind until then :)

I finished the SSE color Sobel but am having some problems. [Edit: just about everything from here down deleted.... which was alot if you didn't see it before. I basically ran one loop too long, forgot to use width in one spot (was using width-8) and forgot to switch glDrawPixels from GL_LUMINENCE to GL_RGBA. Now I just need to figure out why everything is dull but otherwise all is good.]

The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: robione on December 13, 2009, 08:18:32 PM