Hi. I learned x86 many years ago on my own. "Learned" is probably the wrong word LOL.... I bought Borland C++ 3.0 when I was 14 and struggled with it. C was way easier to me. I tried to tinker with assembly over the years but things didn't fall into place in my mind. Kinda funny how that works cause many years later in college I took an assembler course and barely did any work cause things just seemed to make sense. So... many years later :) .... at the present I'm still pretty much a nooblet. I don't know much beyond moves, shifts, logical ops, jumps, and arithmetic instructions.
I started to tinker with computer vision a few weeks back. Just really for the fun of it and I wrote a greyscale converter, and Sobel/Laplace edge detectors in C++. Then came the need for speed :)..... so I optimized my C code. It was good but not quite fast enough.... as I'm continually doing a fullscreen windows BitBlt() and then doing my processing from that image. So I started to convert my algorithms from C to assembly.... I think I'm slower then the release builds... not by much but I'm way faster then the debug builds.
So on my quest for more speed I looked into SSE!!! And that made my greyscale converter 4x as fast. (Although I take 7 instructions just to rearrange the registers from xxxAxxxBxxxCxxxD to xxxx|xxxx|xxxx|ABCD :/ ) I'm not too sure if I can take advantage of it for my edge detectors though. But my problem that I'm trying to tackle now is converting my Sobel detector into plain old x86 assembly.... and have it work. The C++ release build does about 3500 iterations in 1 minute correctly.... my failed code can do about 3350 in a minute. It's failed cause it doesn't properly determine the edges
I was hoping if you guys wouldn't mind helping me understand why it's not working and perhaps give me some pointers.
/*iSum[0] = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
*//**
mov dl,byte ptr[esi+1]
mov bl,byte ptr[esi+eax+1]
mov cl,byte ptr[esi+eax-1]
shl edx,1
neg eax
sub ebx,ecx
add ebx,edx
xor edx,edx
mov dl,byte ptr[esi-1]
mov cl,byte ptr[esi+eax-1]
shl edx,1
sub ebx,ecx
sub ebx,edx
mov cl,byte ptr[esi+eax+1]
add ebx,ecx
esi = byPreCalcSrc
eax = width
And the final line ebx = iSum[0]
This is in a nested loop with the loop counters saved to the stack. And ebx,ecx,edx are cleared before this part. The code above is where 97+% of the time is spent. It continues after this to calculate iSum[1] which is pretty similar to iSum[0] and then finishes up some misc stuff before repeating. I originally had width in edx but I got an error with that. I also get the same error doing something like mov al,[esi-eax+1] then on the next line mov al,[esi+eax+1]. Actually 'esi-eax' gives me the error. I assume my 'neg eax' is probably the cause of my current errors. How would I approach converting this over?
If you're wondering why I'd like to convert this if the C code is faster, it's because I'd like to bring this over to SSE, skip the greyscale conversion and implement Sobel on each color channel. I can probably figure out the SSE if I can get this working. Thanks guys for your help.
You seem not to clear your registers before the byte size moves. Try
movzx edx, byte ptr[esi+1]
instead of
mov dl, byte ptr[esi+1]
It would help if you post the whole procedure.
; remove clearing of ebx ecx edx
sub esi,eax
movzx ebx,byte ptr[esi+1]
movzx ecx,byte ptr[esi+eax*2+1]
add ebx,ecx;(*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc + width + 1))
movzx edx,byte ptr[esi-1]
movzx ecx,byte ptr[esi+eax*2-1]
sub ebx,edx; - (*(byPreCalcSrc - width - 1))
sub ebx,ecx; - (*(byPreCalcSrc + width - 1))
movzx ecx,byte ptr[esi+eax+1]
movzx edx,byte ptr[esi+eax-1]
sub ecx,edx;(*(byPreCalcSrc + 1)) - (*(byPreCalcSrc - 1));
lea ebx,[ebx+ecx*2];ebx += ecx<<1
add esi,eax; restore esi
Oooo I like that instruction.... that's what I meant about being a noob lol... the only move instruction I knew is 'mov'.... I cleared them with xor's before and as needed as I progressed through the algo. I 'll change my code. I think it'll run faster but I don't think it'll solve my problem. Also my compiler will give me an error moving a byte to a 32 register..... ahhh but I see that's the whole purpose for this command :D
I kinda learned most of my assembler by looking at the disassembled stuff in my debugger. I never came across movzx before. Well that just sped up my code a bunch as I no longer need the xor's. I'm a hair faster then the release build :D. I think the problem is that IDK how to address the memory properly that I need to.
The algo is essentially an approximation of the 1st derivative of the change in color. It's two 3x3 masks run over every pixel. One being [-1,0,1][-2,0,2][-1,0,1] The second is pretty much the transposition of that matrix. In my C code I basically look back one row, look at the current row I'm on, and look one row ahead. I thought I could do the same in assembly? Apparently my copmiler doesn't like something like:
mov eax,width
mov esi,src
movzx ebx,byte ptr[esi-eax+1]
movzx edx,byte ptr[esi+eax-1]
That gives me the following error: D:\Visual Studio Projects\edge\edge.cpp(464) : error C2425: '-' : non-constant expression in 'second operand'
So I tried doing a 'neg eax' beforehand. You can't actually subtract from the base address when using the []'s can you? That's what I was trying to accomplish with the neg. I have a sneaky suspicion I'm adding a huge number to the base address in esi instead of subtracting 1024. Can I do something like 'mov edx,byte ptr [esi+eax+eax+eax+1]'? Does all the '+' and '*' add a performance cost?
Thanks
Rob,
> movzx ebx,byte ptr[esi-eax+1]
The complex addressing modes do not allow the subtraction of the index, its still a BASE address plus the INDEX then any displacement after as a constant. The real problem is there is no opcode to do this task.
Code it as "movzx ebx, [esi+eax1]" and modify the EAX register if you wish to go below the base address in ESI.
TYVM guys!!!!! I got it working with your help. I did make a mistake outside of what I posted but that's fixed. The times seem to be a bit more variable now after changing it to just look forward instead of backward and forward but it did squeak above 3700 iterations a minute :D (It also got down to 3650.... the back/forward looking version was a little more consistent around 3670. Kinda surprises me since I "removed" instructions. Does LEA "add" them in?)
Here's the working code:
void Sobel(BYTE *src, BYTE *dest, int width, int height) {
/**/DWORD time_i,time_f;
::timeBeginPeriod(1);
time_f = time_i = ::timeGetTime();
while(time_f-time_i < 60000) {
/**/
__asm {
mov esi,src
mov edi,dest
mov ebx,width
mov ecx,height
inc esi
lea edi,[edi+ebx+1]
mov eax,ebx
sub ebx,2
sub ecx,2
rows:
dec ecx
push ecx
mov ecx,ebx
cols:
dec ecx
push ebx
push ecx
movzx edx,byte ptr[esi+eax*2+1]
movzx ebx,byte ptr[esi+1]
movzx ecx,byte ptr[esi-1]
sub ebx,ecx
add ebx,edx
movzx ecx,byte ptr[esi+eax-1]
movzx edx,byte ptr[esi+eax*2-1]
sub ebx,edx
movzx edx,byte ptr[esi+eax+1]
sub edx,ecx
lea ebx,[ebx+edx*2]
push ebx
movzx edx,byte ptr[esi]
movzx ebx,byte ptr[esi+1]
movzx ecx,byte ptr[esi+eax*2]
sub edx,ecx
lea ebx,[ebx+edx*2]
movzx edx,byte ptr[esi-1]
movzx ecx,byte ptr[esi+eax*2-1]
add ebx,edx
sub ebx,ecx
movzx ecx,byte ptr[esi+eax*2+1]
sub ebx,ecx
pop edx
cmp edx,0
jge Ex_nlt0
neg edx
Ex_nlt0:
cmp ebx,0
jge Ey_nlt0
neg ebx
Ey_nlt0:
add edx,ebx
cmp edx,0xff
jle ngt255
mov edx,0xff
ngt255:
// cmp eax,0
// jge nlt0
// xor eax,eax
//nlt0:
//For the above to be true a very large sum must be present so that SumX + SumY > 2^31
mov byte ptr[edi],dl
pop ecx
pop ebx
inc esi
inc edi
cmp ecx,0
jnz cols
pop ecx
add esi,2
add edi,2
cmp ecx,0
jnz rows
}
/**/
count++;
time_f = ::timeGetTime();
}
::timeEndPeriod(1);
/**/
}
Can you post the c++ version too?
This is what I have left of it commented out at the bottom of my .cpp file. Hopefully the missing variable declarations are easy enough to figure out :) Pretty much just add the declarations and replace the assembly code and it should be good to go. This assumes it's greyscale. Working on a BitBlitter (that probably can't do anything..... dam Windows.... I only say that in the nicest way :) ), then converting this to SSE... well I'll try anyway :)
/**
for(y = 1,byPreCalcDest = dest + width + 1,byPreCalcSrc = src + width + 1;
y<ylim;
++y,byPreCalcDest+=2,byPreCalcSrc+=2) {
for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {
iSumX = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
(*(byPreCalcSrc + 1) << 1) - (*(byPreCalcSrc - 1) << 1);
iSumY = (*(byPreCalcSrc - width + 1)) + (*(byPreCalcSrc - width - 1)) -
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1)) +
(*(byPreCalcSrc - width) << 1) - (*(byPreCalcSrc + width) << 1);
iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);
if(iSumY > 255) iSumY = 255;
// if(iSumY < 0) iSumY = 0;
*(byPreCalcDest) = (unsigned char)iSumY;
}
}
*/
Oopsie.... xlim and ylim are the height and width minus one.
You should first try to optimize the c++ version.
1) There is no need to read the same memory twice (good compilers will optimize this of course) (see: TRY1)
2) I'm not sure if reading a dword will give any significant performance increase since it adds some overhead (see: TRY2)
but I'm sure you can get increase in speed by reducing the number of memory reads for more than a dword.
You can try unrolling by 2 just by just by doing (tri_p>>=8;; tri_m>>=8;) and copy-pasting;
This will lead you to the SSE version.
PS this is all by heart, i didn't try to compile anything
for(x = 1;x<xlim;byPreCalcSrc++,byPreCalcDest++,x++) {
// TRY 1
// {
iTmp = (*(byPreCalcSrc - width + 1)) - (*(byPreCalcSrc - width - 1)) +
(*(byPreCalcSrc + width + 1)) - (*(byPreCalcSrc + width - 1));
iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2 + iTmp;
iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) * 2 + iTmp;
// }
// TRY 2
// {
// read dword :
// low byte = *(byPreCalcSrc +- width - 1)
// 2nd byte = *(byPreCalcSrc +- width)
// 3rd byte = *(byPreCalcSrc +- width + 1)
DWORD tri_p=*((DWORD *)(byPreCalcSrc + width - 1));
DWORD tri_m=*((DWORD *)(byPreCalcSrc - width - 1));
iSumY = (((tri_m&0xFF00)+(tri_p&0xFF00))>>7); // 7 ==> >>8 <<1
iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) * 2;
iTmp = (((tri_m&0xFF0000) + (tri_p&0xFF0000)) >>16)
- (((tri_m&0xFF) + (tri_p&0xFF)));
iSumX += iTmp;
iSumY += iTmp;
// }
iSumY = (iSumX<0?-iSumX:iSumX) + (iSumY<0?-iSumY:iSumY);
if(iSumY > 255) iSumY = 255;
// if(iSumY < 0) iSumY = 0;
*(byPreCalcDest) = (unsigned char)iSumY;
}
sorry iTmp calc was wrong, should be something like:
DWORD B1=(*(byPreCalcSrc - width + 1));
DWORD B2=(*(byPreCalcSrc - width - 1));
DWORD B3=(*(byPreCalcSrc + width + 1));
DWORD B4=(*(byPreCalcSrc + width - 1));
iSumX = (*(byPreCalcSrc + 1) - *(byPreCalcSrc - 1)) *2;
iSumY = (*(byPreCalcSrc - width) - *(byPreCalcSrc + width)) *2;
// ==>
iSumX += B1 - B2 + B3 - B4;
iSumY += B1 + B2 - B3 - B4;
// ==>
iSumX += (B1 - B4) - (B2 - B3);
iSumY += (B1 - B4) + (B2 - B3);
// ==>
V1 = B1 - B4;
V2 = B2 - B3;
iSumX += V1 - V2
iSumY += V1 + V2
// this should be in asm:
lea eax,[ebx+ecx];V1 + V2
sub ebx,ecx;V1 - V2
If I catch your meaning right your last post would replace TRY2 in the previous post? The hex stuff is a little hard for me to work with without a pen and paper lol. You shoulda seen the original version. I never benchmarked it but I know it was slow just by looking at it. It was the epitome of unoptimized code. I thought I did an okay job....... Jeez Drizz. You have any idea how much faster it is now? It went from ~3600 iterations/minute to ~4400, a 25+% improvement.
Thank You!!!!
Since you didn't provide a full example to play with i did a little search and found this page with c source http://www.pages.drexel.edu/~weg22/edge.html assimilated it and tried to optimize. Finally i wrote a sse2 version that seems to run 2x slower !! :) yay
my guess is even if I improve the sse2 version, x2 seems like a lot to catch up. :(
that is interesting, Drizz
i am not up to speed on sse yet
but, i figure that if you can't do it, noone can - lol
sse isn't for everything, i guess
if you can't "pipeline" the process, you don't get the advantage from sse ???
i mention this, because i am trying to visualize how to apply "Henry Ford" to ling long kai fang :P
The MMX version seems to be much faster than SSE2 :)
.686
.model flat,c
.const
.mmx
align 16
AlignedGXGY label word
dw -1,-2,-1,00, 1, 0,-1,00
dw 0, 0, 0,00, 2, 0,-2,00
dw 1, 2, 1,00, 1, 0,-1,00
.code
option prologue:none
option epilogue:none
;; extern "C" void __cdecl SobelMMX(PBYTE dest, PBYTE src, int iWidth, int iHeight);
SobelMMX proc c dest,src,iWidth,iHeight
locals equ 4*4
_dest equ <dword ptr [esp+1*4+locals]>
_src equ <dword ptr [esp+2*4+locals]>
_iWidth equ <dword ptr [esp+3*4+locals]>
_iHeight equ <dword ptr [esp+4*4+locals]>
push edi
push esi
push ebx
push ebp
mov ebp,_iHeight
mov ebx,_iWidth
mov esi,_src
mov edi,_dest
mov eax,ebx
imul ebp
mov ecx,eax
mov edx,eax
mov eax,-1
shr ecx,2
and edx,3
rep stosd
mov ecx,edx
rep stosb
mov edi,_dest
mov ecx,_iWidth
add edi,ebx
add edi,1
sub esi,edi
pcmpeqw mm3,mm3
psrlw mm3,8
sub ebp,2
jle endYloop
startYloop:
mov ebx,ecx
sub ebx,2
jle endXloop
startXloop:
lea eax,[esi+edi]
pxor mm7,mm7
movd mm0,[eax+000*0]
movd mm5,[eax+ecx*1]
movd mm2,[eax+ecx*2]
punpcklbw mm0,mm7;// bytes -> words
punpcklbw mm5,mm7;// bytes -> words
punpcklbw mm2,mm7;// bytes -> words
movq mm4,mm0
;//movq mm5,mm1
movq mm6,mm2
pmullw mm0,qword ptr AlignedGXGY[00];// GX GY 1//low:sumX :: high:sumY
;//pmullw mm1,qword ptr AlignedGXGY[16];// zeros
pmullw mm2,qword ptr AlignedGXGY[32];// GX GY 3//low:sumX :: high:sumY
pmullw mm4,qword ptr AlignedGXGY[08];// GX GY 1//low:sumX :: high:sumY
pmullw mm5,qword ptr AlignedGXGY[24];// GX GY 2//low:sumX :: high:sumY
pmullw mm6,qword ptr AlignedGXGY[40];// GX GY 3//low:sumX :: high:sumY
paddsw mm4,mm5
paddsw mm0,mm2;//low:sumX :: high:sumY
paddsw mm4,mm6;//low:sumX :: high:sumY
punpckldq mm1,mm0
punpckldq mm5,mm4
punpckhdq mm0,mm4
punpckhdq mm1,mm5
paddsw mm0,mm1
movq mm2,mm0
psrld mm2,16
paddsw mm0,mm2
pcmpeqw mm5,mm5
pcmpgtw mm7,mm0
pxor mm0,mm7
psubw mm0,mm7
punpckldq mm2,mm0
paddw mm0,mm2
movq mm2,mm0
pcmpgtw mm2,mm3
por mm0,mm2
psubw mm5,mm0
psrlq mm5,32
movd edx,mm5
add edi,1
sub ebx,1;//X++
mov [edi-1],dl
jnz startXloop
;//}
endXloop:
add edi,2
sub ebp,1;//Y++
jnz startYloop
;//}
endYloop:
pop ebp
pop ebx
pop esi
pop edi
retn
SobelMMX endp
end
The problem is transforming 3x3 byte matrix into 1byte output. The 3-byte reading doesn't fit 16-byte register scheme well, so you must do multiple and remainder but that only complicates stuff, not to mention a lot of byte shifting...
Drizz that site you saw is where I first started. I wasn't really sure if Sobel could be sped up (with parallelism) as it was presented on that site, being done on a greyscale image and all. I wanted to do it on color images so I was hoping SSE could help out with that, as the same ops would have to be done on all the color channels once they were separated. My thoughts were that R and B edges have the potential to not be properly isolated as an edge as they have low intensities. And this turns out to be true. Running this app while I run a game I like to play called Guildwars really illustrates this as several details disappear. Running the color version I coded all these details are obvious.
I take it you saw the laplace algorithm then too? I coded that up as well when I did Sobel. I load up a 'summator' variable at the beginning of each row and compute the first pixel. From there I basically add in the values leaving, subtract the 5 coming in, and adjust the center pixel accordingly. My original thinking was that I probably could get Laplace to be faster then Sobel..... I think it was at a comparable speed LOL until I this thread started :).
Sadly I have benchmarked all steps in my current program loop (BitBlt, RGB->Gy, Sobel, OpenGL Draws), one that is out of my control turns out to be the slowest link in the chain... BitBlt. Sometimes I really don't like Windows. I don't understand how they can take something that should run 30k+ times a minute and squeak out 1,140 on my machine. I got my RGB->Gy done pretty quick but I'm eventually taking it out as I hope the Sobel/Laplace SSE color version will be comparable to the time RGB->Gy and greyscale Sobel/Laplace take now.
If you're at all interested in the RGB->Gy code:
void GreyScale(BYTE *src, BYTE *dest, int nPixels) {
__declspec(align(16)) const float fvR2Gy[4] = {0.11f,0.11f,0.11f,0.11f};
__declspec(align(16)) const float fvG2Gy[4] = {0.59f,0.59f,0.59f,0.59f};
__declspec(align(16)) const float fvB2Gy[4] = {0.3f,0.3f,0.3f,0.3f};
__declspec(align(16)) const UINT uiRMask[4] = {0x000000ff,0x000000ff,0x000000ff,0x000000ff};
__declspec(align(16)) const UINT uiBMask[4] = {0x00ff0000,0x00ff0000,0x00ff0000,0x00ff0000};
__declspec(align(16)) const UINT uiGMask[4] = {0x0000ff00,0x0000ff00,0x0000ff00,0x0000ff00};
__asm {
mov ecx,nPixels
mov eax,0
mov edx,0
shr ecx,2
mov esi,src
mov edi,dest
cvt_rgba_g:
dec ecx
movups xmm0,[esi+eax]
movaps xmm1,xmm0
movaps xmm2,xmm0
pand xmm0,uiBMask
pand xmm1,uiGMask
pand xmm2,uiRMask
psrld xmm0,16
psrld xmm1,8
cvtdq2ps xmm0,xmm0
cvtdq2ps xmm1,xmm1
cvtdq2ps xmm2,xmm2
mulps xmm0,fvR2Gy
mulps xmm1,fvG2Gy
mulps xmm2,fvB2Gy
addps xmm1,xmm0
addps xmm2,xmm1
cvtps2dq xmm2,xmm2
add eax,16
packuswb xmm2,xmm2
packuswb xmm2,xmm2
movss [edi+edx],xmm2
add edx,4
cmp ecx,0
jnz cvt_rgba_g
}
}
This assumes the image width is divisible by four. Since I'm working only with full size screenies I find this holds true. (Actually I find them being divisible by 16 holds true so far)
If you take the 'dec ecx' instruction at the beginning of the loop and put it underneath the movups I squeaked out some more speed. Before ~11,900 iterations/minute, after ~12,700.
in practice, it makes little difference, but if you want the equation with more resolution.....
Y = 0.587G + 0.299R + 0.114B
in fact, i have seen programs that use only the green component for grey - lol
the resulting grey scale images are obviously not as good, but i was surprised to see how good they were :P
Working on 32 bpp data makes things a lot easier!
PS #define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10) approximation seems like a better option for grayscaling.
// by working on RGBQUAD instead of RGBTRIPLE this can be modified to process 4 RGBQUADs with SSE.
void GreyScale24bpp(BYTE *src, BYTE *dest, DWORD pixels)
{
//#define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10)
__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
__asm
{
mov eax,src
mov edx,dest
mov ecx,pixels
movq mm2,qword ptr rgbmult
L1: punpcklbw mm0,[eax]
psrlw mm0,8
pmaddwd mm0,mm2
punpckldq mm1,mm0
paddd mm1,mm0
psrlq mm1,32 + 10
punpcklbw mm1,mm1
punpcklbw mm1,mm1
movd [edx],mm1
add eax,3
add edx,3
sub ecx,1
jnz L1
}
}
Drizz - where'd you get the odd-ball luminance ratios ?
dedndave,
QuoteY = 0.587G + 0.299R + 0.114B
=>
G*587/1000 + R*299/1000 + B*114/1000
=> approx
G*601/1024 + R*306/1024 + B*117/1024
587 * (1024/1000) = 601,088 ~~ 601
299 * (1024/1000) = 306,176 ~~ 306
114 * (1024/1000) = 116,736 ~~ 117
=> finally
Y = ( G*601 + R*306 + B*117 ) >> 10
ahhhhhhhhhhhh - i see - they are multiplied by 1.024 !! - lol
that does make the divide faster :P
test 32bpp version:
void GreyScale32bpp(BYTE *src, BYTE *dest, DWORD pixels)
{
//#define RGB2GRAY(r,g,b) (((b)*117 + (g)*601 + (r)*306) >> 10)
__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
__asm
{
mov eax,src
mov edx,dest
sub eax,edx
mov ecx,pixels
pcmpeqb xmm5,xmm5
movdqa xmm7,xmmword ptr rgbmult
psrld xmm5,8
pxor xmm6,xmm6
//movdqa - eax,edx must be aligned @ 16
// _mm_malloc ( ,16)
L1: movdqa xmm0,[eax+edx]//RGBQUAD RGBQUAD RGBQUAD RGBQUAD
movdqa xmm1,xmm0
punpckhqdq xmm1,xmm1
punpcklbw xmm0,xmm6
punpcklbw xmm1,xmm6
pmaddwd xmm0,xmm7
pmaddwd xmm1,xmm7
movdqa xmm2,xmm0
movdqa xmm3,xmm1
psrlq xmm2,32
psrlq xmm3,32
paddd xmm0,xmm2
paddd xmm1,xmm3
pshufd xmm0,xmm0,10001000b
pshufd xmm1,xmm1,10001000b
punpcklqdq xmm0,xmm1
psrld xmm0,10
packssdw xmm0,xmm0
packsswb xmm0,xmm0
punpcklbw xmm0,xmm0
punpcklwd xmm0,xmm0
pand xmm0,xmm5
movdqa [edx],xmm0//RGBQUAD RGBQUAD RGBQUAD RGBQUAD
add edx,16
sub ecx,4
jnz L1
}
}
It's amazing this is like the third time I was writing a reply and got a response while writing LOL. So this is kinda split in two sections. First Drizz's greyscale converter and then questions I had about your MMX Sobel code.
I guess this is a lesson about not ignoring older technology. In my SSE code I cant start off with an unpack because the src is not guaranteed to be aligned properly. The source buffer is allocated in the CreateDIBSection() call I need to make before BitBlt()'s and I have no idea what windows is doing. And the other dumb thing about SSE/2/3/3.1 (SSSE3) is that there is no integer multiplication/division (ok well there is pmulhw and pmullw. I need to read up on them). So I was curious to see what your code could do for time. About ~8900. I told myself I needed to unroll this to be fair. 2 pixels/loop .... ~ 11,800. 4 pixels/loop .... ~13,400. MMX beats SSE3
How long have you been doing this Drizz? My impression is this is like second nature to you.... It takes me forever to code something simple in asssembly. Then to add tricks on top... like realizing to multiply the float conversions by 1024 then shift left 10.... never happen to me.... not yet anyway :).
**********************************************************************
Response after DednDave's comment about green
**********************************************************************
Yeah green I don't have a problem seeing. It's such a large component of luminosity. But lets say a red edge on a brown background.... it's almost impossible to see. I think I fried my brain on this.... well between this and physics. I'm recalling my original thinking now. I am by default creating 3+ times the work by doing all color channels and there is no way around that. I can only save time by possibly parallelizing the computation on one row. (I need to write this stuff down as I had completely different thoughts in my last post.)
Drizz I'd like to benchmark what you did on my machine but I'm having some problems understanding some lines as I have to convert this to compile on Visual C++ 98. I'm having problems mainly before the first label. As I've been writing I've been really looking at your code and I'm answering some of my questions but I still have a few:
I also dont really get what the purpose of rep stosd is? You're writing -1 in EAX to ES, ECX number of times? How are you later accessing the -1's in ES? Why are there -1's in ES? Also I was curious why the following occurs:
mov esi,src
mov edi,dest
sub esi,edi
//loop
lea eax,[esi+edi]
Isn't eax essentially pointing to edi? Ok.. the debugger says it points back to the original src.... Ok I think I understand, sort of. It's done so you can skip the 'inc esi' at the end of the loop? But this gets me curious about something else. Doesn't 'lea eax,[esi]' 'lea eax,[esi+edi]' 'lea eax,[esi+edi*3-3]' expand into multiple instructions within the architecture? Like 'loop' pretty much expands into 'dec ecx; cmp ecx,0; jnz label'. That would explain the behavior I had earlier encountered in my Sobel code... as the instruction count shrunk but the time was similar.
I can't type fast enough LOL. I wanted to try to code it :(
Hi,
QuoteYeah green I don't have a problem seeing. It's such a large component of luminosity. But
lets say a red edge on a brown background.... it's almost impossible to see. I think I fried my brain
on this.... well between this and physics. I'm recalling my original thinking now. I am by default
creating 3+ times the work by doing all color channels and there is no way around that.
By using a gray image, you are implicitly doing a conversion
from one color space (RGB) to another (YIQ or YCbCr)
that uses a luminance (Y) and two chromaticity components.
If an edge is not showing in RGB or grayscale, perhaps it
will show up in another component in a different color space.
There are a bunch of color spaces defined for different uses.
RGB, CMY, and CMYK are easy to use. Hue, Saturation, Value
(HSV) and Hue, Luminosity, Saturation (HLS) are supposed to
be more intuitive for human interaction. NTSB defined YIQ and
JPEG uses YCbCr to reduce bandwith or compress the data stream.
Humans are less sensitive to changes in color (chromaticity) than
changes in brightness (intensity or luminosity) and thus those
components can use reduced precision.
There are color spaces that try to emulate the human visual
sensitivity as well, (L*a*b*, L*u*v*, XYZ C.I.E standards {I think}).
My reference is "Digital Image Processing" by William K. Pratt for
these last mentioned. He has conversion matrices for those and
some of the others.
Y = 0.299*R + 0.587*G + 0.114*B
I = 0.596*R - 0.274*G - 0.322*B
Q = 0.211*R - 0.523*G + 0.312*B
If you can see changes that your Sobel operation is not finding,
perhaps a different color component would help. Pratt also
discusses a number of different edge detector algorithms as well.
Regards,
Steve N.
here are some "sepia tone" conversion numbers as well
outputRed = (inputRed * .393) + (inputGreen *.769) + (inputBlue * .189)
outputGreen = (inputRed * .349) + (inputGreen *.686) + (inputBlue * .168)
outputBlue = (inputRed * .272) + (inputGreen *.534) + (inputBlue * .131)
EDIT - if the total adds up to more than 255, the 255 limit is set
i am not sure i agree with these numbers, but this is what is supposedly recommended by MS
greyscale
(http://i.techrepublic.com.com/gallery/160538-299-136.jpg)
sepia tone
(http://i.techrepublic.com.com/gallery/160539-299-136.jpg)
Ahnn Dave...from where the hell did u get that car :dazzled:....man hope u will gift me one on my birthday...!!! a real one ..not a framed pic..i mean it :wink
not my car Asif - lol
although, Zara used to have a nice collection of cars and motorcycles :bg
i wish she had kept her Jag - although, i couldn't afford to change the brakes on that thing - lol
an old beat up pickup truck is more my speed :bg
(http://img685.imageshack.us/img685/6008/zeesj.jpg)
I think the gryscale converter has all the speed squeezed out of it possible now. What started out doing ~2400 iterations/min in C, that I got up to ~11000 myself... and with the help here is now ~22000. It's insanity!!! (... or assembly :) ) I changed Drizz's code around a little bit as it wasn't coming out quite right. It's been distilled down to this:
void GreyScale(BYTE *src, BYTE *dest, int nPixels) {
__declspec(align(16)) const __int16 rgbmult[8] = {117,601,306,0,117,601,306,0};
//while(time_f-time_i < 60000) {
/**/__asm {
mov esi,src
mov edi,dest
mov ecx,nPixels
movdqa xmm7,rgbmult
pxor xmm6,xmm6
L1:
movdqu xmm0,[esi]
sub ecx,4
add esi,16
movdqa xmm1,xmm0
punpckhbw xmm0,xmm6
punpcklbw xmm1,xmm6
pmaddwd xmm0,xmm7
pmaddwd xmm1,xmm7
pshufd xmm2,xmm0,0x31
pshufd xmm3,xmm1,0x31
paddd xmm0,xmm2
paddd xmm1,xmm3
pshufd xmm0,xmm0,10001000b
pshufd xmm1,xmm1,10001000b
punpcklqdq xmm1,xmm0
psrld xmm1,10
packuswb xmm1,xmm1
packuswb xmm1,xmm1
movss [edi],xmm1
add edi,4
cmp ecx,0
jnz L1
}
/**/count++;
time_f = ::timeGetTime();
}/**/
}
A note on color spaces. I think the most intuitive to use would be HSB but the goal of this little project is realtime edge detection, segmentation, feature extraction, etc. In the end I'd rather not convert anything if I can help it. I'm not sure if Sobel or Laplace is the right algorithm to use either. I'm just trying a bunch of stuff out. I am kinda curious about seeing how the YIQ comes out. If memory serves me it's a bit easier on the CPU then the RGB->HSB formula wikipedia has. After my final the 23rd I'll have a bunch more time to tinker with this stuff :)
DednDave is that the Ford GT40? I like that car.
Quote from: robione on December 19, 2009, 07:39:33 AM
I think the gryscale converter has all the speed squeezed out of it possible now.
movdqu xmm0,[esi] is relatively slow; on a P4, lddqu is faster.
movdqa is even faster but requires 16-byte alignment. It looks as if you could provide that, but test yourself.
Quote from: robione on December 19, 2009, 07:39:33 AM
A note on color spaces. I think the most intuitive to use would be HSB
HSV?
Quotebut the goal of this little project is realtime edge detection, segmentation,
feature extraction, etc. In the end I'd rather not convert anything if I can help it.
Reviewing Pratt, and looking at all the example pictures, straight
RGB is probably usable without conversion to another color space
for most images. The idea behind converting is to emphasize some
characteristic of the image that is of interest.
QuoteI'm not sure if Sobel or Laplace is the right algorithm to use either.
Again, looking at the example pictures and scanning the text,
Sobel should be good as a baseline edge detector. If you find
that it fails to perform adequately on your images, then try a
different algorithm. It seems to be the best compromise between
simplicity and effectiveness.
Trying my own code, Sobel works well except on noisy images.
(For my excursions, that meant "do not apply a high pass filter".)
QuoteI'm just trying a bunch of stuff out. I am kinda curious about seeing how the YIQ
comes out. If memory serves me it's a bit easier on the CPU then the RGB->HSB formula
wikipedia has.
Yes, it is simpler to calculate. And the gray scale from individual
RGB components is a good first test. The RGB channels are highly
correlated in most images.
QuoteAfter my final the 23rd I'll have a bunch more time to tinker with this stuff :)
Well, if you come to some firm conclusions, let us know.
Regards,
Steve N.
Quote from: robione on December 18, 2009, 02:34:03 AMI also dont really get what the purpose of rep stosd is? You're writing -1 in EAX to ES, ECX number of times? How are you later accessing the -1's in ES? Why are there -1's in ES?
It's this code that I moved outside the loop
/* image boundaries */
if(Y==0 || Y==originalImage.rows-1)
SUM = 0;
else if(X==0 || X==originalImage.cols-1)
SUM = 0;
Since in my code "dest" and "src" have the same size edges are filled with 0xFF.
QuoteIt is important to notice that pixels in the first and last rows, as well as the first and last columns cannot be manipulated by a 3x3 mask. This is because when placing the center of the mask over a pixel in the first row (for example), the mask will be outside the image boundaries.
using memset (rep stosd/stosb) is faster than having a loop that changes only first/last row and first/last columns
This can of course be avoided by having "dest" size of "(width-2)*(height-2)" but to simplify things i didn't do it.
Quote from: robione on December 18, 2009, 02:34:03 AM
Also I was curious why the following occurs:
mov esi,src
mov edi,dest
sub esi,edi
//loop
lea eax,[esi+edi]
Isn't eax essentially pointing to edi? Ok.. the debugger says it points back to the original src.... Ok I think I understand, sort of. It's done so you can skip the 'inc esi' at the end of the loop?
Yes, it's a trick i learned from c++compiler, there are many similar tricks that can be used to optimize loop code.
The one i did:
P3 = P1 - P2
mov ..,[P3+P2] ;; mem evaluates to P1
mov [P2],...
inc P2; by increasing P2 I increase both effective addresses
Remember: it's all integer arithmetic before accessing memory.
Quote from: robione on December 18, 2009, 02:34:03 AMBut this gets me curious about something else. Doesn't 'lea eax,[esi]' 'lea eax,[esi+edi]' 'lea eax,[esi+edi*3-3]' expand into multiple instructions within the architecture? Like 'loop' pretty much expands into 'dec ecx; cmp ecx,0; jnz label'. That would explain the behavior I had earlier encountered in my Sobel code... as the instruction count shrunk but the time was similar.
"lea" stands for "load effective address". This means that it only calculates address without accessing memory. You have to check the optimization manual(s) for instructions to avoid when writing optimized code.
here are some links:
http://www.agner.org/optimize/#manuals
http://www.mark.masmcode.com/
http://graphics.stanford.edu/~seander/bithacks.html
http://www.iti.cs.tu-bs.de/soft/www.goof.com/pcg/doc/opt-pairing.html
http://www.hackersdelight.org/
+ intel, amd opt. manuals which you surely have.
Quote from: robione on December 18, 2009, 02:34:03 AM
How long have you been doing this Drizz? My impression is this is like second nature to you.... It takes me forever to code something simple in asssembly. Then to add tricks on top... like realizing to multiply the float conversions by 1024 then shift left 10.... never happen to me.... not yet anyway :).
For me writing optimized code is fun. You can learn all the tricks too if you wanted. Also don't credit me for RGB2GRAY macro, I picked it up from some page, but writing code without floats was the first thing I thought of.
have fun :wink
ahhh Dave...thats a sincere confession :U...now look at this attachment...u will find plenty of them in a country like india....what should i call it !!! :bg
sorry for the attachement...couldn't find a way to embed the picture along with this post..... :eek
i guess this is the best way of traveling ..... :lol
:green2
oh my fo, crazy indian!
Quoteoh my fo, crazy indian!
ROFL !!!(http://www.artafterdark.com/hidden2/1947Indian_lg1.jpg)
Sorry Rob - we have completely hi-jacked your thread (http://www.laserenthusiast.com/forums/images/smilies/th_threadjacked.gif)
Quote from: drizz on December 19, 2009, 06:25:05 PM
here are some links:
http://www.agner.org/optimize/#manuals
http://www.mark.masmcode.com/
http://graphics.stanford.edu/~seander/bithacks.html
http://www.iti.cs.tu-bs.de/soft/www.goof.com/pcg/doc/opt-pairing.html
http://www.hackersdelight.org/
+ intel, amd opt. manuals which you surely have.
Funny you should mention the opt. manuals.... I just downloaded them Friday. Interesting reads. Thx for the links. I'll definitely be checking them out (as I procrastinate for this physics final :) )
Quote from: dedndave on December 20, 2009, 02:50:59 AM
Sorry Rob - we have completely hi-jacked your thread (http://www.laserenthusiast.com/forums/images/smilies/th_threadjacked.gif)
For a sec I was trying to figure out who you were talking to :). I haven't given out my names on forums yet. This one seems to be more of a tight community than others I'm part of. My last name is Robidoux... that + Starwars is where the Robione comes from :)
Quote from: jj2007 on December 19, 2009, 08:03:45 AM
Quote from: robione on December 19, 2009, 07:39:33 AM
I think the gryscale converter has all the speed squeezed out of it possible now.
movdqu xmm0,[esi] is relatively slow; on a P4, lddqu is faster...
I've learned to stop saying such things LOL. I end up biting my tongue. TYVM, I just need to upgrade all my compilers. As far as C++ goes I'm not sure if VC++ Express 2008 had anything lacking from VS6 Enterprise Ed. That's the only thing that's kept me from changing compilers on my netbook. Unfort, I'm limited to SSE2 instructions with my VS6 setup. I'll keep that in mind until then :)
I finished the SSE color Sobel but am having some problems. [Edit: just about everything from here down deleted.... which was alot if you didn't see it before. I basically ran one loop too long, forgot to use width in one spot (was using width-8) and forgot to switch glDrawPixels from GL_LUMINENCE to GL_RGBA. Now I just need to figure out why everything is dull but otherwise all is good.]