Intro and coding problem/question

robione · December 13, 2009, 08:18:32 PM

Hi. I learned x86 many years ago on my own. "Learned" is probably the wrong word LOL.... I bought Borland C++ 3.0 when I was 14 and struggled with it. C was way easier to me. I tried to tinker with assembly over the years but things didn't fall into place in my mind. Kinda funny how that works cause many years later in college I took an assembler course and barely did any work cause things just seemed to make sense. So... many years later :) .... at the present I'm still pretty much a nooblet. I don't know much beyond moves, shifts, logical ops, jumps, and arithmetic instructions.

I started to tinker with computer vision a few weeks back. Just really for the fun of it and I wrote a greyscale converter, and Sobel/Laplace edge detectors in C++. Then came the need for speed :)..... so I optimized my C code. It was good but not quite fast enough.... as I'm continually doing a fullscreen windows BitBlt() and then doing my processing from that image. So I started to convert my algorithms from C to assembly.... I think I'm slower then the release builds... not by much but I'm way faster then the debug builds.

So on my quest for more speed I looked into SSE!!! And that made my greyscale converter 4x as fast. (Although I take 7 instructions just to rearrange the registers from xxxAxxxBxxxCxxxD to xxxx|xxxx|xxxx|ABCD :/ ) I'm not too sure if I can take advantage of it for my edge detectors though. But my problem that I'm trying to tackle now is converting my Sobel detector into plain old x86 assembly.... and have it work. The C++ release build does about 3500 iterations in 1 minute correctly.... my failed code can do about 3350 in a minute. It's failed cause it doesn't properly determine the edges

I was hoping if you guys wouldn't mind helping me understand why it's not working and perhaps give me some pointers.

Code Select

/*iSum[0] = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
			(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
			(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
*//**
		mov		dl,byte ptr[esi+1]
		mov		bl,byte ptr[esi+eax+1]
		mov		cl,byte ptr[esi+eax-1]
		shl		edx,1
		neg		eax
		sub		ebx,ecx
		add		ebx,edx
		xor		edx,edx
		mov		dl,byte ptr[esi-1]
		mov		cl,byte ptr[esi+eax-1]
		shl		edx,1
		sub		ebx,ecx
		sub		ebx,edx
		mov		cl,byte ptr[esi+eax+1]
		add		ebx,ecx

esi = byPreCalcSrc
eax = width

And the final line ebx = iSum[0]

This is in a nested loop with the loop counters saved to the stack. And ebx,ecx,edx are cleared before this part. The code above is where 97+% of the time is spent. It continues after this to calculate iSum[1] which is pretty similar to iSum[0] and then finishes up some misc stuff before repeating. I originally had width in edx but I got an error with that. I also get the same error doing something like mov al,[esi-eax+1] then on the next line mov al,[esi+eax+1]. Actually 'esi-eax' gives me the error. I assume my 'neg eax' is probably the cause of my current errors. How would I approach converting this over?

If you're wondering why I'd like to convert this if the C code is faster, it's because I'd like to bring this over to SSE, skip the greyscale conversion and implement Sobel on each color channel. I can probably figure out the SSE if I can get this working. Thanks guys for your help.

jj2007 · December 13, 2009, 10:24:22 PM

You seem not to clear your registers before the byte size moves. Try

Code Select

movzx edx, byte ptr[esi+1]
instead of

Code Select

mov dl, byte ptr[esi+1]

drizz · December 14, 2009, 12:09:05 AM

It would help if you post the whole procedure.

Code Select

	; remove clearing of ebx ecx edx
	sub esi,eax
	movzx ebx,byte ptr[esi+1]
	movzx ecx,byte ptr[esi+eax*2+1]
	add ebx,ecx;(*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc + width + 1))
	movzx edx,byte ptr[esi-1]
	movzx ecx,byte ptr[esi+eax*2-1]
	sub ebx,edx; - (*(byPreCalcSrc - width - 1))
	sub ebx,ecx; - (*(byPreCalcSrc + width - 1))
	movzx ecx,byte ptr[esi+eax+1]
	movzx edx,byte ptr[esi+eax-1]
	sub ecx,edx;(*(byPreCalcSrc + 1)) - (*(byPreCalcSrc - 1));
	lea ebx,[ebx+ecx*2];ebx += ecx<<1
	add esi,eax; restore esi

robione · December 14, 2009, 12:24:39 AM

Oooo I like that instruction.... that's what I meant about being a noob lol... the only move instruction I knew is 'mov'.... I cleared them with xor's before and as needed as I progressed through the algo. I 'll change my code. I think it'll run faster but I don't think it'll solve my problem. Also my compiler will give me an error moving a byte to a 32 register..... ahhh but I see that's the whole purpose for this command :D

I kinda learned most of my assembler by looking at the disassembled stuff in my debugger. I never came across movzx before. Well that just sped up my code a bunch as I no longer need the xor's. I'm a hair faster then the release build :D. I think the problem is that IDK how to address the memory properly that I need to.

The algo is essentially an approximation of the 1st derivative of the change in color. It's two 3x3 masks run over every pixel. One being [-1,0,1][-2,0,2][-1,0,1] The second is pretty much the transposition of that matrix. In my C code I basically look back one row, look at the current row I'm on, and look one row ahead. I thought I could do the same in assembly? Apparently my copmiler doesn't like something like:

Code Select


	mov		eax,width
	mov		esi,src
	movzx	ebx,byte ptr[esi-eax+1]
	movzx	edx,byte ptr[esi+eax-1]

That gives me the following error: D:\Visual Studio Projects\edge\edge.cpp(464) : error C2425: '-' : non-constant expression in 'second operand'

So I tried doing a 'neg eax' beforehand. You can't actually subtract from the base address when using the []'s can you? That's what I was trying to accomplish with the neg. I have a sneaky suspicion I'm adding a huge number to the base address in esi instead of subtracting 1024. Can I do something like 'mov edx,byte ptr [esi+eax+eax+eax+1]'? Does all the '+' and '*' add a performance cost?

Thanks

hutch-- · December 14, 2009, 02:07:37 AM

Rob,

> movzx ebx,byte ptr[esi-eax+1]

The complex addressing modes do not allow the subtraction of the index, its still a BASE address plus the INDEX then any displacement after as a constant. The real problem is there is no opcode to do this task.

Code it as "movzx ebx, [esi+eax1]" and modify the EAX register if you wish to go below the base address in ESI.

robione · December 14, 2009, 02:24:16 AM

TYVM guys!!!!! I got it working with your help. I did make a mistake outside of what I posted but that's fixed. The times seem to be a bit more variable now after changing it to just look forward instead of backward and forward but it did squeak above 3700 iterations a minute :D (It also got down to 3650.... the back/forward looking version was a little more consistent around 3670. Kinda surprises me since I "removed" instructions. Does LEA "add" them in?)

Here's the working code:

Code Select


void Sobel(BYTE *src, BYTE *dest, int width, int height) {
/**/DWORD time_i,time_f;
::timeBeginPeriod(1);
time_f = time_i = ::timeGetTime();

while(time_f-time_i < 60000) {
/**/
	__asm {
		mov		esi,src
		mov		edi,dest
		mov		ebx,width
		mov		ecx,height
		inc		esi
		lea		edi,[edi+ebx+1]

		mov		eax,ebx
		sub		ebx,2
		sub		ecx,2

rows:
		dec		ecx
		push	ecx
		mov		ecx,ebx
cols:
		dec		ecx
		push	ebx
		push	ecx		

		movzx	edx,byte ptr[esi+eax*2+1]
		movzx	ebx,byte ptr[esi+1]
		movzx	ecx,byte ptr[esi-1]
		sub		ebx,ecx
		add		ebx,edx
		movzx	ecx,byte ptr[esi+eax-1]
		movzx	edx,byte ptr[esi+eax*2-1]
		sub		ebx,edx
		movzx	edx,byte ptr[esi+eax+1]
		sub		edx,ecx
		lea		ebx,[ebx+edx*2]

		push	ebx

		movzx	edx,byte ptr[esi]
		movzx	ebx,byte ptr[esi+1]
		movzx	ecx,byte ptr[esi+eax*2]
		sub		edx,ecx
		lea		ebx,[ebx+edx*2]
		movzx	edx,byte ptr[esi-1]
		movzx	ecx,byte ptr[esi+eax*2-1]
		add		ebx,edx
		sub		ebx,ecx
		movzx	ecx,byte ptr[esi+eax*2+1]
		sub		ebx,ecx

		pop		edx

		cmp		edx,0
		jge		Ex_nlt0
		neg		edx
Ex_nlt0:
		cmp		ebx,0
		jge		Ey_nlt0
		neg		ebx
Ey_nlt0:
		add		edx,ebx

		cmp		edx,0xff
		jle		ngt255
		mov		edx,0xff
ngt255:
//		cmp		eax,0
//		jge		nlt0
//		xor		eax,eax
//nlt0:
//For the above to be true a very large sum must be present so that SumX + SumY > 2^31
		mov		byte ptr[edi],dl

		pop		ecx
		pop		ebx
		inc		esi
		inc		edi

		cmp		ecx,0
		jnz		cols

		pop		ecx
		add		esi,2
		add		edi,2
		cmp		ecx,0
		jnz		rows
	}
/**/
count++;
time_f = ::timeGetTime();
}
::timeEndPeriod(1);
/**/
}

drizz · December 14, 2009, 12:10:12 PM

Can you post the c++ version too?

robione · December 14, 2009, 05:35:40 PM

This is what I have left of it commented out at the bottom of my .cpp file. Hopefully the missing variable declarations are easy enough to figure out :) Pretty much just add the declarations and replace the assembly code and it should be good to go. This assumes it's greyscale. Working on a BitBlitter (that probably can't do anything..... dam Windows.... I only say that in the nicest way :) ), then converting this to SSE... well I'll try anyway :)

Code Select


/**
	for(y = 1,byPreCalcDest = dest + width + 1,byPreCalcSrc = src + width + 1;
		y<ylim;
		++y,byPreCalcDest+=2,byPreCalcSrc+=2) {

		for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {
			iSumX = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
				(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
				(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
			iSumY = (*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc - width - 1)) -
				(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
				(*(byPreCalcSrc - width) << 1) - (*(byPreCalcSrc + width) << 1);


			iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);
			
			if(iSumY > 255) iSumY = 255;
//			if(iSumY < 0) iSumY = 0;

			*(byPreCalcDest) = (unsigned char)iSumY;
		}
	}
*/

Oopsie.... xlim and ylim are the height and width minus one.

drizz · December 14, 2009, 06:57:28 PM

You should first try to optimize the c++ version.
1) There is no need to read the same memory twice (good compilers will optimize this of course) (see: TRY1)
2) I'm not sure if reading a dword will give any significant performance increase since it adds some overhead (see: TRY2)
but I'm sure you can get increase in speed by reducing the number of memory reads for more than a dword.
You can try unrolling by 2 just by just by doing (tri_p>>=8;; tri_m>>=8;) and copy-pasting;
This will lead you to the SSE version.

PS this is all by heart, i didn't try to compile anything

Code Select

		for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {
			
			// TRY 1
			// {
			iTmp = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
				(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1));
				
			iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2  + iTmp;
			iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) * 2  + iTmp;
			// }
			
			// TRY 2
			// {
			// read dword :
			// low byte = *(byPreCalcSrc +- width - 1)
			// 2nd byte = *(byPreCalcSrc +- width)
			// 3rd byte = *(byPreCalcSrc +- width + 1)
			DWORD tri_p=*((DWORD *)(byPreCalcSrc + width - 1));
			DWORD tri_m=*((DWORD *)(byPreCalcSrc - width - 1));
			
			
			iSumY = (((tri_m&0xFF00)+(tri_p&0xFF00))>>7); // 7 ==> >>8 <<1
			iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2;
			
			iTmp = (((tri_m&0xFF0000) + (tri_p&0xFF0000)) >>16)
				-  (((tri_m&0xFF) + (tri_p&0xFF)));
			
			iSumX += iTmp;
			iSumY += iTmp;
			// }

			iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);
			
			if(iSumY > 255) iSumY = 255;
//			if(iSumY < 0) iSumY = 0;

			*(byPreCalcDest) = (unsigned char)iSumY;
		}

drizz · December 14, 2009, 07:10:50 PM

sorry iTmp calc was wrong, should be something like:

Code Select

DWORD B1=(*(byPreCalcSrc - width + 1));
DWORD B2=(*(byPreCalcSrc - width - 1));
DWORD B3=(*(byPreCalcSrc + width + 1));
DWORD B4=(*(byPreCalcSrc + width - 1));

iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) *2;
iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) *2;

// ==>

iSumX += B1 - B2 + B3 - B4;
iSumY += B1 + B2 - B3 - B4;
	

// ==>

iSumX += (B1 - B4) - (B2 - B3);
iSumY += (B1 - B4) + (B2 - B3);
	
// ==>

V1 = B1 - B4;
V2 = B2 - B3;
iSumX += V1 - V2
iSumY += V1 + V2

// this should be in asm:
lea eax,[ebx+ecx];V1 + V2
sub ebx,ecx;V1 - V2

robione · December 15, 2009, 09:33:44 PM

If I catch your meaning right your last post would replace TRY2 in the previous post? The hex stuff is a little hard for me to work with without a pen and paper lol. You shoulda seen the original version. I never benchmarked it but I know it was slow just by looking at it. It was the epitome of unoptimized code. I thought I did an okay job....... Jeez Drizz. You have any idea how much faster it is now? It went from ~3600 iterations/minute to ~4400, a 25+% improvement.

Thank You!!!!

drizz · December 16, 2009, 01:48:04 AM

Since you didn't provide a full example to play with i did a little search and found this page with c source http://www.pages.drexel.edu/~weg22/edge.html assimilated it and tried to optimize. Finally i wrote a sse2 version that seems to run 2x slower !! :) yay

my guess is even if I improve the sse2 version, x2 seems like a lot to catch up. :(

dedndave · December 16, 2009, 01:29:48 PM

that is interesting, Drizz
i am not up to speed on sse yet
but, i figure that if you can't do it, noone can - lol
sse isn't for everything, i guess
if you can't "pipeline" the process, you don't get the advantage from sse ???
i mention this, because i am trying to visualize how to apply "Henry Ford" to ling long kai fang :P

drizz · December 16, 2009, 03:27:26 PM

The MMX version seems to be much faster than SSE2 :)

Code Select

.686
.model flat,c
.const
.mmx
align 16
AlignedGXGY label word
dw -1,-2,-1,00, 1, 0,-1,00
dw 0, 0, 0,00, 2, 0,-2,00
dw 1, 2, 1,00, 1, 0,-1,00

.code

option prologue:none
option epilogue:none

;; extern "C" void __cdecl SobelMMX(PBYTE dest, PBYTE src, int iWidth, int iHeight);

SobelMMX proc c dest,src,iWidth,iHeight
	locals equ 4*4
	_dest equ <dword ptr [esp+1*4+locals]>
	_src equ <dword ptr [esp+2*4+locals]>
	_iWidth equ <dword ptr [esp+3*4+locals]>
	_iHeight equ <dword ptr [esp+4*4+locals]>
	push edi
	push esi
	push ebx
	push ebp
	mov ebp,_iHeight
	mov ebx,_iWidth
	mov esi,_src
	mov edi,_dest
	mov eax,ebx
	imul ebp
	mov ecx,eax
	mov edx,eax
	mov eax,-1
	shr ecx,2
	and edx,3
	rep stosd
	mov ecx,edx
	rep stosb
	mov edi,_dest
	mov ecx,_iWidth
	add edi,ebx
	add edi,1
	sub esi,edi
	pcmpeqw mm3,mm3
	psrlw mm3,8 
	sub ebp,2
	jle endYloop
startYloop:	
		mov ebx,ecx
		sub ebx,2
		jle endXloop
startXloop:	
		lea eax,[esi+edi]
		pxor mm7,mm7
		movd mm0,[eax+000*0]
		movd mm5,[eax+ecx*1]
		movd mm2,[eax+ecx*2]
		punpcklbw mm0,mm7;// bytes -> words
		punpcklbw mm5,mm7;// bytes -> words
		punpcklbw mm2,mm7;// bytes -> words
		movq mm4,mm0
		;//movq mm5,mm1
		movq mm6,mm2
		pmullw mm0,qword ptr AlignedGXGY[00];// GX GY 1//low:sumX :: high:sumY
		;//pmullw mm1,qword ptr AlignedGXGY[16];// zeros
		pmullw mm2,qword ptr AlignedGXGY[32];// GX GY 3//low:sumX :: high:sumY
		pmullw mm4,qword ptr AlignedGXGY[08];// GX GY 1//low:sumX :: high:sumY
		pmullw mm5,qword ptr AlignedGXGY[24];// GX GY 2//low:sumX :: high:sumY
		pmullw mm6,qword ptr AlignedGXGY[40];// GX GY 3//low:sumX :: high:sumY
		paddsw mm4,mm5
		paddsw mm0,mm2;//low:sumX :: high:sumY
		paddsw mm4,mm6;//low:sumX :: high:sumY
		punpckldq mm1,mm0
		punpckldq mm5,mm4
		punpckhdq mm0,mm4
		punpckhdq mm1,mm5
		paddsw mm0,mm1
		movq mm2,mm0
		psrld mm2,16
		paddsw mm0,mm2
		pcmpeqw mm5,mm5
		pcmpgtw mm7,mm0
		pxor mm0,mm7
		psubw mm0,mm7
		punpckldq mm2,mm0
		paddw mm0,mm2
		movq mm2,mm0
		pcmpgtw mm2,mm3
		por mm0,mm2
		psubw mm5,mm0
		psrlq mm5,32
		movd edx,mm5
		add edi,1
		sub ebx,1;//X++
		mov [edi-1],dl
		jnz startXloop
		;//}
endXloop:
	add edi,2
	sub ebp,1;//Y++
	jnz startYloop
	;//}
endYloop:
	pop ebp
	pop ebx
	pop esi
	pop edi
	retn
	
SobelMMX endp

end

The problem is transforming 3x3 byte matrix into 1byte output. The 3-byte reading doesn't fit 16-byte register scheme well, so you must do multiple and remainder but that only complicates stuff, not to mention a lot of byte shifting...

robione · December 17, 2009, 04:26:10 PM

Drizz that site you saw is where I first started. I wasn't really sure if Sobel could be sped up (with parallelism) as it was presented on that site, being done on a greyscale image and all. I wanted to do it on color images so I was hoping SSE could help out with that, as the same ops would have to be done on all the color channels once they were separated. My thoughts were that R and B edges have the potential to not be properly isolated as an edge as they have low intensities. And this turns out to be true. Running this app while I run a game I like to play called Guildwars really illustrates this as several details disappear. Running the color version I coded all these details are obvious.

I take it you saw the laplace algorithm then too? I coded that up as well when I did Sobel. I load up a 'summator' variable at the beginning of each row and compute the first pixel. From there I basically add in the values leaving, subtract the 5 coming in, and adjust the center pixel accordingly. My original thinking was that I probably could get Laplace to be faster then Sobel..... I think it was at a comparable speed LOL until I this thread started :).

Sadly I have benchmarked all steps in my current program loop (BitBlt, RGB->Gy, Sobel, OpenGL Draws), one that is out of my control turns out to be the slowest link in the chain... BitBlt. Sometimes I really don't like Windows. I don't understand how they can take something that should run 30k+ times a minute and squeak out 1,140 on my machine. I got my RGB->Gy done pretty quick but I'm eventually taking it out as I hope the Sobel/Laplace SSE color version will be comparable to the time RGB->Gy and greyscale Sobel/Laplace take now.

If you're at all interested in the RGB->Gy code:

Code Select

void GreyScale(BYTE *src, BYTE *dest, int nPixels) {
	__declspec(align(16)) const float fvR2Gy[4] = {0.11f,0.11f,0.11f,0.11f};
	__declspec(align(16)) const float fvG2Gy[4] = {0.59f,0.59f,0.59f,0.59f};
	__declspec(align(16)) const float fvB2Gy[4] = {0.3f,0.3f,0.3f,0.3f};
	__declspec(align(16)) const UINT uiRMask[4] = {0x000000ff,0x000000ff,0x000000ff,0x000000ff};
	__declspec(align(16)) const UINT uiBMask[4] = {0x00ff0000,0x00ff0000,0x00ff0000,0x00ff0000};
	__declspec(align(16)) const UINT uiGMask[4] = {0x0000ff00,0x0000ff00,0x0000ff00,0x0000ff00};
	
	__asm {
		mov		ecx,nPixels
		mov		eax,0
		mov		edx,0
		shr		ecx,2
		mov		esi,src
		mov		edi,dest

cvt_rgba_g:
		dec		ecx
		movups	xmm0,[esi+eax]
		movaps	xmm1,xmm0
		movaps	xmm2,xmm0
		pand	xmm0,uiBMask
		pand	xmm1,uiGMask
		pand	xmm2,uiRMask
		psrld	xmm0,16
		psrld	xmm1,8
		cvtdq2ps	xmm0,xmm0
		cvtdq2ps	xmm1,xmm1
		cvtdq2ps	xmm2,xmm2
		mulps	xmm0,fvR2Gy
		mulps	xmm1,fvG2Gy
		mulps	xmm2,fvB2Gy
		addps	xmm1,xmm0
		addps	xmm2,xmm1
		cvtps2dq	xmm2,xmm2
		add		eax,16
		packuswb	xmm2,xmm2
		packuswb	xmm2,xmm2
		
		movss	[edi+edx],xmm2
		add		edx,4

		cmp		ecx,0
		jnz cvt_rgba_g
	}
}

This assumes the image width is divisible by four. Since I'm working only with full size screenies I find this holds true. (Actually I find them being divisible by 16 holds true so far)

News:

Intro and coding problem/question

robione

robione

robione

robione

robione

robione