News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Intro and coding problem/question

Started by robione, December 13, 2009, 08:18:32 PM

Previous topic - Next topic

robione

Hi. I learned x86 many years ago on my own. "Learned" is probably the wrong word LOL.... I bought Borland C++ 3.0 when I was 14 and struggled with it. C was way easier to me. I tried to tinker with assembly over the years but things didn't fall into place in my mind. Kinda funny how that works cause many years later in college I took an assembler course and barely did any work cause things just seemed to make sense. So... many years later :) .... at the present I'm still pretty much a nooblet. I don't know much beyond moves, shifts, logical ops, jumps, and arithmetic instructions.

I started to tinker with computer vision a few weeks back. Just really for the fun of it and I wrote a greyscale converter, and Sobel/Laplace edge detectors in C++. Then came the need for speed :)..... so I optimized my C code. It was good but not quite fast enough.... as I'm continually doing a fullscreen windows BitBlt() and then doing my processing from that image. So I started to convert my algorithms from C to assembly.... I think I'm slower then the release builds... not by much but I'm way faster then the debug builds.

So on my quest for more speed I looked into SSE!!! And that made my greyscale converter 4x as fast. (Although I take 7 instructions just to rearrange the registers from xxxAxxxBxxxCxxxD to xxxx|xxxx|xxxx|ABCD :/ ) I'm not too sure if I can take advantage of it for my edge detectors though. But my problem that I'm trying to tackle now is converting my Sobel detector into plain old x86 assembly.... and have it work. The C++ release build does about 3500 iterations in 1 minute correctly.... my failed code can do about 3350 in a minute. It's failed cause it doesn't properly determine the edges

I was hoping if you guys wouldn't mind helping me understand why it's not working and perhaps give me some pointers.

/*iSum[0] = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
*//**
mov dl,byte ptr[esi+1]
mov bl,byte ptr[esi+eax+1]
mov cl,byte ptr[esi+eax-1]
shl edx,1
neg eax
sub ebx,ecx
add ebx,edx
xor edx,edx
mov dl,byte ptr[esi-1]
mov cl,byte ptr[esi+eax-1]
shl edx,1
sub ebx,ecx
sub ebx,edx
mov cl,byte ptr[esi+eax+1]
add ebx,ecx


esi = byPreCalcSrc
eax = width

And the final line ebx = iSum[0]

This is in a nested loop with the loop counters saved to the stack. And ebx,ecx,edx are cleared before this part. The code above is where 97+% of the time is spent. It continues after this to calculate iSum[1] which is pretty similar to iSum[0] and then finishes up some misc stuff before repeating. I originally had width in edx but I got an error with that. I also get the same error doing something like mov al,[esi-eax+1] then on the next line mov al,[esi+eax+1]. Actually 'esi-eax' gives me the error. I assume my 'neg eax' is probably the cause of my current errors. How would I approach converting this over?

If you're wondering why I'd like to convert this if the C code is faster, it's because I'd like to bring this over to SSE, skip the greyscale conversion and implement Sobel on each color channel. I can probably figure out the SSE if I can get this working. Thanks guys for your help.

jj2007

You seem not to clear your registers before the byte size moves. Try
movzx edx, byte ptr[esi+1]
instead of
mov dl, byte ptr[esi+1]

drizz

It would help if you post the whole procedure.
; remove clearing of ebx ecx edx
sub esi,eax
movzx ebx,byte ptr[esi+1]
movzx ecx,byte ptr[esi+eax*2+1]
add ebx,ecx;(*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc + width + 1))
movzx edx,byte ptr[esi-1]
movzx ecx,byte ptr[esi+eax*2-1]
sub ebx,edx; - (*(byPreCalcSrc - width - 1))
sub ebx,ecx; - (*(byPreCalcSrc + width - 1))
movzx ecx,byte ptr[esi+eax+1]
movzx edx,byte ptr[esi+eax-1]
sub ecx,edx;(*(byPreCalcSrc + 1)) - (*(byPreCalcSrc - 1));
lea ebx,[ebx+ecx*2];ebx += ecx<<1
add esi,eax; restore esi
The truth cannot be learned ... it can only be recognized.

robione

Oooo I like that instruction.... that's what I meant about being a noob lol... the only move instruction I knew is 'mov'.... I cleared them with xor's before and as needed as I progressed through the algo. I 'll change my code. I think it'll run faster but I don't think it'll solve my problem. Also my compiler will give me an error moving a byte to a 32 register..... ahhh but I see that's the whole purpose for this command :D

I kinda learned most of my assembler by looking at the disassembled stuff in my debugger. I never came across movzx before. Well that just sped up my code a bunch as I no longer need the xor's. I'm a hair faster then the release build :D. I think the problem is that IDK how to address the memory properly that I need to.

The algo is essentially an approximation of the 1st derivative of the change in color. It's two 3x3 masks run over every pixel. One being [-1,0,1][-2,0,2][-1,0,1] The second is pretty much the transposition of that matrix. In my C code I basically look back one row, look at the current row I'm on, and look one row ahead. I thought I could do the same in assembly? Apparently my copmiler doesn't like something like:


mov eax,width
mov esi,src
movzx ebx,byte ptr[esi-eax+1]
movzx edx,byte ptr[esi+eax-1]


That gives me the following error: D:\Visual Studio Projects\edge\edge.cpp(464) : error C2425: '-' : non-constant expression in 'second operand'

So I tried doing a 'neg eax' beforehand. You can't actually subtract  from the base address when using the []'s can you? That's what I was trying to accomplish with the neg. I have a sneaky suspicion I'm adding a huge number to the base address in esi instead of subtracting 1024. Can I do something like 'mov edx,byte ptr [esi+eax+eax+eax+1]'? Does all the '+' and '*' add a performance cost?

Thanks

hutch--

Rob,

> movzx ebx,byte ptr[esi-eax+1]

The complex addressing modes do not allow the subtraction of the index, its still a BASE address plus the INDEX then any displacement after as a constant. The real problem is there is no opcode to do this task.

Code it as "movzx ebx, [esi+eax1]" and modify the EAX register if you wish to go below the base address in ESI.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

robione

TYVM guys!!!!! I got it working with your help. I did make a mistake outside of what I posted but that's fixed. The times seem to be a bit more variable now after changing it to just look forward instead of backward and forward but it did squeak above 3700 iterations a minute :D (It also got down to 3650.... the back/forward looking version was a little more consistent around 3670. Kinda surprises me since I "removed" instructions. Does LEA "add" them in?)

Here's the working code:

void Sobel(BYTE *src, BYTE *dest, int width, int height) {
/**/DWORD time_i,time_f;
::timeBeginPeriod(1);
time_f = time_i = ::timeGetTime();

while(time_f-time_i < 60000) {
/**/
__asm {
mov esi,src
mov edi,dest
mov ebx,width
mov ecx,height
inc esi
lea edi,[edi+ebx+1]

mov eax,ebx
sub ebx,2
sub ecx,2

rows:
dec ecx
push ecx
mov ecx,ebx
cols:
dec ecx
push ebx
push ecx

movzx edx,byte ptr[esi+eax*2+1]
movzx ebx,byte ptr[esi+1]
movzx ecx,byte ptr[esi-1]
sub ebx,ecx
add ebx,edx
movzx ecx,byte ptr[esi+eax-1]
movzx edx,byte ptr[esi+eax*2-1]
sub ebx,edx
movzx edx,byte ptr[esi+eax+1]
sub edx,ecx
lea ebx,[ebx+edx*2]

push ebx

movzx edx,byte ptr[esi]
movzx ebx,byte ptr[esi+1]
movzx ecx,byte ptr[esi+eax*2]
sub edx,ecx
lea ebx,[ebx+edx*2]
movzx edx,byte ptr[esi-1]
movzx ecx,byte ptr[esi+eax*2-1]
add ebx,edx
sub ebx,ecx
movzx ecx,byte ptr[esi+eax*2+1]
sub ebx,ecx

pop edx

cmp edx,0
jge Ex_nlt0
neg edx
Ex_nlt0:
cmp ebx,0
jge Ey_nlt0
neg ebx
Ey_nlt0:
add edx,ebx

cmp edx,0xff
jle ngt255
mov edx,0xff
ngt255:
// cmp eax,0
// jge nlt0
// xor eax,eax
//nlt0:
//For the above to be true a very large sum must be present so that SumX + SumY > 2^31
mov byte ptr[edi],dl

pop ecx
pop ebx
inc esi
inc edi

cmp ecx,0
jnz cols

pop ecx
add esi,2
add edi,2
cmp ecx,0
jnz rows
}
/**/
count++;
time_f = ::timeGetTime();
}
::timeEndPeriod(1);
/**/
}

drizz

The truth cannot be learned ... it can only be recognized.

robione

This is what I have left of it commented out at the bottom of my .cpp file. Hopefully the missing variable declarations are easy enough to figure out :) Pretty much just add the declarations and replace the assembly code and it should be good to go. This assumes it's greyscale. Working on a BitBlitter (that probably can't do anything..... dam Windows.... I only say that in the nicest way :) ), then converting this to SSE... well I'll try anyway :)



/**
for(y = 1,byPreCalcDest = dest + width + 1,byPreCalcSrc = src + width + 1;
y<ylim;
++y,byPreCalcDest+=2,byPreCalcSrc+=2) {

for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {
iSumX = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
iSumY = (*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc - width - 1)) -
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
(*(byPreCalcSrc - width) << 1) - (*(byPreCalcSrc + width) << 1);


iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);

if(iSumY > 255) iSumY = 255;
// if(iSumY < 0) iSumY = 0;

*(byPreCalcDest) = (unsigned char)iSumY;
}
}
*/


Oopsie.... xlim and ylim are the height and width minus one.

drizz

You should first try to optimize the c++ version.
1) There is no need to read the same memory twice (good compilers will optimize this of course) (see: TRY1)
2) I'm not sure if reading a dword will give any significant performance increase since it adds some overhead (see: TRY2)
but I'm sure you can get increase in speed by reducing the number of memory reads for more than a dword.
You can try unrolling by 2 just by just by doing (tri_p>>=8;;  tri_m>>=8;) and copy-pasting;
This will lead you to the SSE version.

PS this is all by heart, i didn't try to compile anything
for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {

// TRY 1
// {
iTmp = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1));

iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2  + iTmp;
iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) * 2  + iTmp;
// }

// TRY 2
// {
// read dword :
// low byte = *(byPreCalcSrc +- width - 1)
// 2nd byte = *(byPreCalcSrc +- width)
// 3rd byte = *(byPreCalcSrc +- width + 1)
DWORD tri_p=*((DWORD *)(byPreCalcSrc + width - 1));
DWORD tri_m=*((DWORD *)(byPreCalcSrc - width - 1));


iSumY = (((tri_m&0xFF00)+(tri_p&0xFF00))>>7); // 7 ==> >>8 <<1
iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2;

iTmp = (((tri_m&0xFF0000) + (tri_p&0xFF0000)) >>16)
-  (((tri_m&0xFF) + (tri_p&0xFF)));

iSumX += iTmp;
iSumY += iTmp;
// }

iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);

if(iSumY > 255) iSumY = 255;
// if(iSumY < 0) iSumY = 0;

*(byPreCalcDest) = (unsigned char)iSumY;
}
The truth cannot be learned ... it can only be recognized.

drizz

sorry iTmp calc was wrong, should be something like:
DWORD B1=(*(byPreCalcSrc - width + 1));
DWORD B2=(*(byPreCalcSrc - width - 1));
DWORD B3=(*(byPreCalcSrc + width + 1));
DWORD B4=(*(byPreCalcSrc + width - 1));

iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) *2;
iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) *2;

// ==>

iSumX += B1 - B2 + B3 - B4;
iSumY += B1 + B2 - B3 - B4;


// ==>

iSumX += (B1 - B4) - (B2 - B3);
iSumY += (B1 - B4) + (B2 - B3);

// ==>

V1 = B1 - B4;
V2 = B2 - B3;
iSumX += V1 - V2
iSumY += V1 + V2

// this should be in asm:
lea eax,[ebx+ecx];V1 + V2
sub ebx,ecx;V1 - V2

The truth cannot be learned ... it can only be recognized.

robione

If I catch your meaning right your last post would replace TRY2 in the previous post? The hex stuff is a little hard for me to work with without a pen and paper lol. You shoulda seen the original version. I never benchmarked it but I know it was slow just by looking at it. It was the epitome of unoptimized code. I thought I did an okay job....... Jeez Drizz. You have any idea how much faster it is now? It went from ~3600 iterations/minute to ~4400, a 25+% improvement.

Thank You!!!!

drizz

Since you didn't provide a full example to play with i did a little search and found this page with c source http://www.pages.drexel.edu/~weg22/edge.html assimilated it and tried to optimize. Finally i wrote a sse2 version that seems to run 2x slower !! :) yay

my guess is even if I improve the sse2 version, x2 seems like a lot to catch up. :(
The truth cannot be learned ... it can only be recognized.

dedndave

that is interesting, Drizz
i am not up to speed on sse yet
but, i figure that if you can't do it, noone can - lol
sse isn't for everything, i guess
if you can't "pipeline" the process, you don't get the advantage from sse ???
i mention this, because i am trying to visualize how to apply "Henry Ford" to ling long kai fang   :P

drizz

The MMX version seems to be much faster than SSE2 :)
.686
.model flat,c
.const
.mmx
align 16
AlignedGXGY label word
dw -1,-2,-1,00, 1, 0,-1,00
dw 0, 0, 0,00, 2, 0,-2,00
dw 1, 2, 1,00, 1, 0,-1,00

.code

option prologue:none
option epilogue:none

;; extern "C" void __cdecl SobelMMX(PBYTE dest, PBYTE src, int iWidth, int iHeight);

SobelMMX proc c dest,src,iWidth,iHeight
locals equ 4*4
_dest equ <dword ptr [esp+1*4+locals]>
_src equ <dword ptr [esp+2*4+locals]>
_iWidth equ <dword ptr [esp+3*4+locals]>
_iHeight equ <dword ptr [esp+4*4+locals]>
push edi
push esi
push ebx
push ebp
mov ebp,_iHeight
mov ebx,_iWidth
mov esi,_src
mov edi,_dest
mov eax,ebx
imul ebp
mov ecx,eax
mov edx,eax
mov eax,-1
shr ecx,2
and edx,3
rep stosd
mov ecx,edx
rep stosb
mov edi,_dest
mov ecx,_iWidth
add edi,ebx
add edi,1
sub esi,edi
pcmpeqw mm3,mm3
psrlw mm3,8
sub ebp,2
jle endYloop
startYloop:
mov ebx,ecx
sub ebx,2
jle endXloop
startXloop:
lea eax,[esi+edi]
pxor mm7,mm7
movd mm0,[eax+000*0]
movd mm5,[eax+ecx*1]
movd mm2,[eax+ecx*2]
punpcklbw mm0,mm7;// bytes -> words
punpcklbw mm5,mm7;// bytes -> words
punpcklbw mm2,mm7;// bytes -> words
movq mm4,mm0
;//movq mm5,mm1
movq mm6,mm2
pmullw mm0,qword ptr AlignedGXGY[00];// GX GY 1//low:sumX :: high:sumY
;//pmullw mm1,qword ptr AlignedGXGY[16];// zeros
pmullw mm2,qword ptr AlignedGXGY[32];// GX GY 3//low:sumX :: high:sumY
pmullw mm4,qword ptr AlignedGXGY[08];// GX GY 1//low:sumX :: high:sumY
pmullw mm5,qword ptr AlignedGXGY[24];// GX GY 2//low:sumX :: high:sumY
pmullw mm6,qword ptr AlignedGXGY[40];// GX GY 3//low:sumX :: high:sumY
paddsw mm4,mm5
paddsw mm0,mm2;//low:sumX :: high:sumY
paddsw mm4,mm6;//low:sumX :: high:sumY
punpckldq mm1,mm0
punpckldq mm5,mm4
punpckhdq mm0,mm4
punpckhdq mm1,mm5
paddsw mm0,mm1
movq mm2,mm0
psrld mm2,16
paddsw mm0,mm2
pcmpeqw mm5,mm5
pcmpgtw mm7,mm0
pxor mm0,mm7
psubw mm0,mm7
punpckldq mm2,mm0
paddw mm0,mm2
movq mm2,mm0
pcmpgtw mm2,mm3
por mm0,mm2
psubw mm5,mm0
psrlq mm5,32
movd edx,mm5
add edi,1
sub ebx,1;//X++
mov [edi-1],dl
jnz startXloop
;//}
endXloop:
add edi,2
sub ebp,1;//Y++
jnz startYloop
;//}
endYloop:
pop ebp
pop ebx
pop esi
pop edi
retn

SobelMMX endp

end


The problem is transforming 3x3 byte matrix into 1byte output. The 3-byte reading doesn't fit 16-byte register scheme well, so you must do multiple and remainder but that only complicates stuff, not to mention a lot of byte shifting...
The truth cannot be learned ... it can only be recognized.

robione

Drizz that site you saw is where I first started. I wasn't really sure if Sobel could be sped up (with parallelism) as it was presented on that site, being done on a greyscale image and all. I wanted to do it on color images so I was hoping SSE could help out with that, as the same ops would have to be done on all the color channels once they were separated. My thoughts were that R and B edges have the potential to not be properly isolated as an edge as they have low intensities. And this turns out to be true. Running this app while I run a game I like to play called Guildwars really illustrates this as several details disappear. Running the color version I coded all these details are obvious.

I take it you saw the laplace algorithm then too? I coded that up as well when I did Sobel. I load up a 'summator' variable at the beginning of each row and compute the first pixel. From there I basically add in the values leaving, subtract the 5 coming in, and adjust the center pixel accordingly. My original thinking was that I probably could get Laplace to be faster then Sobel..... I think it was at a comparable speed LOL until I this thread started :).

Sadly I have benchmarked all steps in my current program loop (BitBlt, RGB->Gy, Sobel, OpenGL Draws), one that is out of my control turns out to be the slowest link in the chain... BitBlt. Sometimes I really don't like Windows. I don't understand how they can take something that should run 30k+ times a minute and squeak out 1,140 on my machine. I got my RGB->Gy done pretty quick but I'm eventually taking it out as I hope the Sobel/Laplace SSE color version will be comparable to the time RGB->Gy and greyscale Sobel/Laplace take now.

If you're at all interested in the RGB->Gy code:
void GreyScale(BYTE *src, BYTE *dest, int nPixels) {
__declspec(align(16)) const float fvR2Gy[4] = {0.11f,0.11f,0.11f,0.11f};
__declspec(align(16)) const float fvG2Gy[4] = {0.59f,0.59f,0.59f,0.59f};
__declspec(align(16)) const float fvB2Gy[4] = {0.3f,0.3f,0.3f,0.3f};
__declspec(align(16)) const UINT uiRMask[4] = {0x000000ff,0x000000ff,0x000000ff,0x000000ff};
__declspec(align(16)) const UINT uiBMask[4] = {0x00ff0000,0x00ff0000,0x00ff0000,0x00ff0000};
__declspec(align(16)) const UINT uiGMask[4] = {0x0000ff00,0x0000ff00,0x0000ff00,0x0000ff00};

__asm {
mov ecx,nPixels
mov eax,0
mov edx,0
shr ecx,2
mov esi,src
mov edi,dest

cvt_rgba_g:
dec ecx
movups xmm0,[esi+eax]
movaps xmm1,xmm0
movaps xmm2,xmm0
pand xmm0,uiBMask
pand xmm1,uiGMask
pand xmm2,uiRMask
psrld xmm0,16
psrld xmm1,8
cvtdq2ps xmm0,xmm0
cvtdq2ps xmm1,xmm1
cvtdq2ps xmm2,xmm2
mulps xmm0,fvR2Gy
mulps xmm1,fvG2Gy
mulps xmm2,fvB2Gy
addps xmm1,xmm0
addps xmm2,xmm1
cvtps2dq xmm2,xmm2
add eax,16
packuswb xmm2,xmm2
packuswb xmm2,xmm2

movss [edi+edx],xmm2
add edx,4

cmp ecx,0
jnz cvt_rgba_g
}
}


This assumes the image width is divisible by four. Since I'm working only with full size screenies I find this holds true. (Actually I find them being divisible by 16 holds true so far)