News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Clipping rectangles with use of MMX instruction set

Started by jarekfall, May 20, 2011, 07:21:35 PM

Previous topic - Next topic

jarekfall

Hello!

I'm new in this forum, so I would like to introduce myself. I'm an assembly language and C++ programmer. I implement my projects by using a mixed-language programming technique, because I like to create in an assembly language an optimized code. I also like to optimize used algorithms.

I made sure whether my topic is not present in this forum and I haven't found any similar topics.

I like the idea of sharing a knowledge so I would like to introduce to you my two optimized functions. Function that clips rectangles is one of the most often used function in almost any program or a 2D game. In case of a program it's used to clip hundreds of windows on a screen, in case of a 2D game it clips hundreds of sprites thus such function should be optimized as much as possible.

First I will describe the first function from CRect class:

CRect & __thiscall ClipUnmappedSourceRectangleMMX(int x, int y, CRect &rkSourceRect) const;

Clips a rectangle specified in an object pointed by rkSourceRect so that no edge of this rectangle sticks out beyond edges of a destination rectangle contained in an object which called this member function. The source rectangle is temporarily mapped onto destination rectangle coordinate system to position specified by (x,y) coordinates. Left top corner of the rectangle is put on point specified by (x,y) coordinates in destination rectangle coordinate system. Coordinates of clipped source rectangle position are specified in source coordinate system. If source rectangle is found entirely beyond destination rectangle then source rectangle dimensions are zero, however coordinates of location of its left top corner keep steady. Returns a rkSourceRect reference. The function makes use of MMX instruction set.

Function's special features:
- single stream during processing of executable code (lack of any jump instructions), conditional instructions are carried out numerically,
- thanks to the use of instructions provided by the MMX extension working on two packed doublewords, the count of all arithmetical and logical operations is twice as small compared to function not using MMX instructions,
- maximum use of MMX unit registers (minimization of count of executed operations referencing RAM memory),
- minimization of count of variables,
- use of instructions taking the lowest count of processor clock cycles.

Following figure depicts the process of clipping a rectangle:


I placed images of source code for better clearness of text.

CRect class definition:




The second function is very similar to the first function but the top left corner of destination rectangle is always placed at (0,0) point. This function is a member function of CBitmap32 class and the CBitmap32 class doesn't store a destination rectangle object, it contains only bitmap's dimensions, so that the destination rectangle can be easily created.

CBitmap32 class definition:




On my website you will find a document titled "Comparison of different processor instruction execution times" which can help you in writing an optimized code.

I'm interested in your comments and suggestions, especially if you are able to optimize this functions by using CPU and MMX instruction set. Feel free to use the functions in your projects.

If somebody would be able to create such function with usage of SSE instruction set and would like to share it with us then we could compare function speed on our PCs.

jj2007

Hi Jarek,

Welcome to the Forum - and congrats for the strong first impression :U

Ultimately, we had 90% Colosseum and 10% coding posts, so some of us will be really happy to see more from you.
A question on your website: How do you explain that idiv ebx takes 61 cycles on the AMD but 12,500 on the P4? My first guess would be that I misunderstood that table...

Welcome again,
Jochen

qWord

hi and welcome,

- why are you using MMX - SSE2 is independent from the fpu
- I'm correct, your function calculates the intersection of two rectangles?
- why is clipping so time critical in your application? - IMO the drawing it self is the time-consuming part.

qWord
FPU in a trice: SmplMath
It's that simple!

jarekfall

#3
Hi Jochen and qWord!

Thank you for your replies and welcome.

Yes I am going to introduce to you several other of my functions which also extensively make use of MMX instruction set and are also optimized. Among others a function drawing and blending images encoded in an ARGB format on images encoded in XRGB format (X means unimportant), all values are natural numbers, it doesn't contain any division instruction working on integers, it performs 4 multiplication operations per 2 pixels in each loop cycle, it's designed to work with DirectDraw surfaces after locking them and obtaining a pointer to video memory buffer so it considers a Pitch parameter, and so on... I have some kind of optimization obsession but I hope I won't be the only person who has such obsession. :green
One of my projects named "Sprite Creator" which I released in 2007 lacks of clipping rectangles thus I work on these functions and modify my function performing alpha blending. Currently I'm very busy with work on my thesis, so it will pass some time before I will present next functions.

Quote from: jj2007How do you explain that idiv ebx takes 61 cycles on the AMD but 12,500 on the P4? My first guess would be that I misunderstood that table...

Please compare resolutions of processor clocks. The P4 Prescott 3.20 GHz has a clock that can count up to 3192200000 per second no matter whether HT is enabled or disabled, while AMDK6-2 428 MHz has very low resolution equal to 7159090 cycles per second. In such case the P4 is able to measure shorter time periods. Let's imagine following situation:
Person A and B count for 10 seconds, but person A is more accurate while person B is lazy and it counts four times slower.
Seconds:  1 2 3 4 5 6 7 8 9 10
Person A: 1 2 3 4 5 6 7 8 9 10
Person B: 1 1 1 1 2 2 2 2 3 3  and thus the difference in measured count of processor clock cycles. :wink

Quote from: qWord- why are you using MMX - SSE2 is independent from the fpu
- I'm correct, your function calculates the intersection of two rectangles?
- why is clipping so time critical in your application? - IMO the drawing it self is the time-consuming part.

The MMX instruction set has been applied in Pentium MMX in 1997 and is present in both Intel's and AMD's processors while SSE is present only in modern Intel's processors. Don't get me wrong, there is nothing bad in using SSE instruction set, but I currently prefer the MMX extension especially that I've mastered writing code for this unit.

Yes, but also preserves the position of the source rectangle in its coordinate system.

I explained this in the introduction, besides EVERY function that is used in real-time processing in time-consuming loops should be optimized and besides it's a great fun for optimization maniacs. :green

I would almost forget, feel free to incorporate my functions into MASM32 project of course if you will recognize them as useful.

drizz

Hello

since the function consists of just min max operations, let's look at your options:

  • 1) jg jl - normal instructions with branching
       1a) branchless min/max
          #define doz(x,y) ((x)-(y)&~((x)-(y)^((x)^(y))&((x)-(y)^(x)))>>31) 
          #define max(x,y) ((y)+doz(x,y)) 
          #define min(x,y) ((x)-doz(x,y))
  • 2) CMOVcc - pentium pro - no branching
  • 3) mmx - your code since there is no min/max instruction
  • 4) sse1
       4a) (if possible) replace int with short and use pminsw pmaxsw
       4b) same as 3) but with xmm registers
  • 5) sse2 convert to float and use maxps/maxpd and convert back
  • 6) sse4.1 use pmaxsd pminsd   
Please provide original c++ code and mmx versin from the pictures.
No one will write down the code from the pictures.

There is nothing stoping you from using functions based on current cpu features
(on startup)
funcptr_t func;
   if cpuid & mmx
      func = &funcMMX
   else if cpuid & sse
      func = &funcSSE
   else if cpuid & sse2
      func = &funcSSE2
   else if cpuid & sse41
      func = &funcSSE41
   ...
      
call func()

The truth cannot be learned ... it can only be recognized.

jarekfall

Hello drizz!

Thank you for your suggestions. There is one thing that stops me, it's lack of spare time for experimenting with function variations.

I added a link to a zip archive containing the source code. The link is present in my signature.

drizz

Can you also post the ClipUnmappedSourceRectangle method c++ code just to have a starting point. (i.e. the algorithm)
Yes you have it well commented but it's not the same without looking at the "unoptimized" version (c++, for both class methods).



The truth cannot be learned ... it can only be recognized.

baltoro

Jarekfall,
Great initial post. This subject (the mixed C++ and assembly programming) is rarely discussed here on the forum. I wish I could contribute, but, the MMX instruction set is beyond my capabilities (I'm one of the 90% that Jochen refers to :eek).
But, for those readers who don't understand the Visual Studio name mangling scheme (and. yes, I am one of those), I found these two resources that explain it quite well:   


...And, thanks for the link to your website,...the information is useful,...I  wondered if that was even possible (coding member functions of classes declared in C++, with assembly language; it never would have occurred to me that you could simply use the mangled name in the implementation code, and that this would work consistently in all Visual Studio versions).
Baltoro

jarekfall

Drizz, I included the CPP file containing the C++ code of the ClipUnmappedSourceRectangleMMX function in the same zip archive, please use the same link. I changed the UInt32 type to int so please replace all the files which you've preserved.

Hello baltoro!

Yes, you can implement any type of C++ class member functions (methods) in pure assembly language, apart from functions of inline type. Names stored in files of LIB and OBJ type have global scope. The only drawback of this is that the implemented in assembly language functions must be used only with that C++ compiler which uses the same calling conventions as those which have been used in those implemented functions.

drizz

Here are some ideas.

Note: this is all partially pseudo code, I didn't actually test or execute anything, wrote it by heart using just text editor

C++ analysis
void CRect::ClipUnmappedSourceRectangle(int x, int y, CRect &rSourceRect) const
{
int tx = x + (rSourceRect.m_Right - rSourceRect.m_Left);
int ty = y + (rSourceRect.m_Bottom - rSourceRect.m_Top);

tx = min(tx, this->m_Right);
ty = min(ty, this->m_Bottom);

tx -= max(x, this->m_Left);
ty -= max(y, this->m_Top);

rSourceRect.m_Right = rSourceRect.m_Left;
rSourceRect.m_Bottom = rSourceRect.m_Top;

if( tx > 0 && ty > 0 )
{
rSourceRect.m_Right += tx;
rSourceRect.m_Bottom += ty;
}
}


possible asm implementations:
__asm {
// int tx = x + (rSourceRect.m_Right - rSourceRect.m_Left);
// int ty = y + (rSourceRect.m_Bottom - rSourceRect.m_Top);
mov esi,rSourceRect
mov edi,this
mov ebx,x
mov ecx,y
mov eax,[esi].CRect.m_Right
mov edx,[esi].CRect.m_Bottom
sub eax,[esi].CRect.m_Left
sub edx,[esi].CRect.m_Top
add eax,ebx; x
add edx,ecx; y

// tx = min(tx, this->m_Right);
// ty = min(ty, this->m_Bottom);
cmp eax,[edi].CRect.m_Right
cmovl eax,[edi].CRect.m_Right
cmp edx,[edi].CRect.m_Bottom
cmovl edx,[edi].CRect.m_Bottom

// tx -= max(x, this->m_Left);
// ty -= max(y, this->m_Top);
cmp ebx,[edi].CRect.m_Left
cmovg ebx,[edi].CRect.m_Left
cmp ecx,[edi].CRect.m_Top
cmovg ecx,[edi].CRect.m_Top
sub eax,ebx
sub edx,ecx

// rSourceRect.m_Right = rSourceRect.m_Left;
// rSourceRect.m_Bottom = rSourceRect.m_Top;
mov ebx,[esi].CRect.m_Left
mov ecx,[esi].CRect.m_Top
mov [esi].CRect.m_Right,ebx
mov [esi].CRect.m_Bottom,ecx

// if( tx > 0 && ty > 0 )
// {
// rSourceRect.m_Right += tx;
// rSourceRect.m_Bottom += ty;
// }
#if 0
test eax,eax
js @F
test edx,edx
js @F
add [esi].CRect.m_Right,eax
add [esi].CRect.m_Bottom,edx
@@:
#else

cmp eax,80000000h
sbb ebx,ebx
cmp edx,80000000h
sbb ecx,ecx
and ebx,eax
and ecx,edx
add [esi].CRect.m_Right,ebx
add [esi].CRect.m_Bottom,ecx

#endif

}

// this is almost 1:1 to your code

__asm {
// int tx = x + (rSourceRect.m_Right - rSourceRect.m_Left);
// int ty = y + (rSourceRect.m_Bottom - rSourceRect.m_Top);
movq      mm4, {rSourceRect.m_Right,rSourceRect.m_Bottom}
movq      mm5, {rSourceRect.m_Left,rSourceRect.m_Top}
movq      mm6, {x,y}
movq  {tx,ty}, mm4
psubd {tx,ty}, mm5
paddd {tx,ty}, mm6

// tx = min(tx, this->m_Right);
// ty = min(ty, this->m_Bottom);
movq      mm2, {this->m_Right,this->m_Bottom}
movq      mm3, {tx,ty}
pcmpgtd   mm3, mm2
pand  {tx,ty}, mm3
pandn     mm3, mm2
por   {tx,ty}, mm3

// tx -= max(x, this->m_Left);
// ty -= max(y, this->m_Top);
movq      mm2, {this->m_Left,this->m_Top}
movq      mm3, mm2
pcmpgtd   mm6, mm2
pand      mm6, mm3
pandn     mm3, mm2
por       mm6, mm3

psubd {tx,ty}, mm6

// rSourceRect.m_Right = rSourceRect.m_Left;
// rSourceRect.m_Bottom = rSourceRect.m_Top;
movq      mm4, mm5

// if( tx > 0 && ty > 0 )
// {
// rSourceRect.m_Right += tx;
// rSourceRect.m_Bottom += ty;
// }

pxor      mm7, mm7
pcmpgtd   mm7, mm4
pandn     mm7, {tx,ty}
paddd     mm4, mm7

movq     {rSourceRect.m_Right,rSourceRect.m_Bottom}, mm4
movq         {rSourceRect.m_Left,rSourceRect.m_Top}, mm5
}



__asm {
// int tx = x + (rSourceRect.m_Right - rSourceRect.m_Left);
// int ty = y + (rSourceRect.m_Bottom - rSourceRect.m_Top);
movq      xmm4, {rSourceRect.m_Right,rSourceRect.m_Bottom}
movq      xmm5, {rSourceRect.m_Left,rSourceRect.m_Top}
movq      xmm6, {x,y}
movdqa {tx,ty}, xmm4
psubd {tx,ty}, xmm5
paddd {tx,ty}, xmm6

// tx = min(tx, this->m_Right);
// ty = min(ty, this->m_Bottom);
movq      xmm2, {this->m_Right,this->m_Bottom}
movdqa    xmm3, {tx,ty}
cvtdq2ps  xmm3, xmm3
cvtdq2ps  xmm2, xmm2
pminps    xmm2, xmm3
cvtps2dq  xmm3, xmm3

// tx -= max(x, this->m_Left);
// ty -= max(y, this->m_Top);
movq      xmm2, {this->m_Left,this->m_Top}
cvtdq2ps  xmm3, xmm3
cvtdq2ps  xmm2, xmm2
pmaxps    xmm2, xmm3
cvtps2dq  xmm3, xmm3

psubd {tx,ty}, xmm6

// rSourceRect.m_Right = rSourceRect.m_Left;
// rSourceRect.m_Bottom = rSourceRect.m_Top;
movdqa      xmm4, xmm5

// if( tx > 0 && ty > 0 )
// {
// rSourceRect.m_Right += tx;
// rSourceRect.m_Bottom += ty;
// }

pxor      xmm7, xmm7
pcmpgtd   xmm7, {tx,ty}
pandn     xmm7, {tx,ty}
paddd     xmm4, xmm7
            punpcklqdq xmm5,xmm4

movdqu   {rSourceRect}, xmm5
}


// i don't have sse41 so this is just guesswork

__asm {
// int tx = x + (rSourceRect.m_Right - rSourceRect.m_Left);
// int ty = y + (rSourceRect.m_Bottom - rSourceRect.m_Top);
movq      xmm4, {rSourceRect.m_Right,rSourceRect.m_Bottom}
movq      xmm5, {rSourceRect.m_Left,rSourceRect.m_Top}
movq      xmm6, {x,y}
movdqa {tx,ty}, xmm4
psubd {tx,ty}, xmm5
paddd {tx,ty}, xmm6

// tx = min(tx, this->m_Right);
// ty = min(ty, this->m_Bottom);
pminsd    {tx,ty},{this->m_Right,this->m_Bottom}

// tx -= max(x, this->m_Left);
// ty -= max(y, this->m_Top);
pmaxsd    xmm6,  {this->m_Left,this->m_Top}

psubd {tx,ty}, xmm6

// rSourceRect.m_Right = rSourceRect.m_Left;
// rSourceRect.m_Bottom = rSourceRect.m_Top;
movdqa      xmm4, xmm5

// if( tx > 0 && ty > 0 )
// {
// rSourceRect.m_Right += tx;
// rSourceRect.m_Bottom += ty;
// }

pxor      xmm7, xmm7
pmaxsd    xmm7, {tx,ty}
paddd     xmm4, xmm7
            punpcklqdq xmm5,xmm4

movdqu   {rSourceRect}, xmm5
}


The truth cannot be learned ... it can only be recognized.

jarekfall

Thank you very much drizz for your work. :bg I will try to analyze your examples in my spare time. I will start from the example designed for MMX unit.

However I spotted in the example for MMX unit that your destination operands in arithmetic and logical MMX instructions are memory operands which is not allowed in MMX instruction set. The destination operand for all the MMX instructions (except the data transfer instructions MOVQ, MOVD) can reside only in an MMX register. That's why my code does not contain such instructions. Could you correct your example?

In my function the code refers to memory operands 7 times while in your MMX example it refers 15 times which on my AMD K6-2 will be a big drawback. Modern PC's have fast DDR's so in this case it will be lesser drawback. Please compare on my website the movq regMMX, qword ptr [esi] instruction execution time for both processors.

drizz

I did write a note in red at the start of my previous post...

The point is that your MMX code is as good as it gets - more or less.
You could test how CMOVcc and SSE41 version perform next.

__declspec(naked) void CRect::ClipUnmappedSourceRectangleMMXdrizz(int x, int y, CRect &rSourceRect) const
{
__asm
{
;//ecx = this pointer
;//The CRect class doesn't contain virtual functions (methods) thus the dword ptr [ecx+00h]
;//field doesn't contain a pointer to an array of pointers to virtual functions.
;//dword ptr [esp+0Ch] = reference (pointer) to a source rectangle object SrcRect
;//dword ptr [esp+08h] = the y coordinate
;//dword ptr [esp+04h] = the x coordinate
;//dword ptr [esp+00h] = function's return address
mov eax, [esp+0Ch]
pxor mm7,mm7
movq mm6,[esp+4];// {x,y}
movq mm5,[eax+0];// {rSourceRect.m_Left,rSourceRect.m_Top}
movq mm0,[eax+8];// {rSourceRect.m_Right,rSourceRect.m_Bottom}
movq mm4,[ecx+0];//{this->m_Left,this->m_Top}
movq mm1,[ecx+8];//{this->m_Right,this->m_Bottom}
;// int tx = x + (rSourceRect.m_Right - rSourceRect.m_Left);
;// int ty = y + (rSourceRect.m_Bottom - rSourceRect.m_Top);
;// mm0 = {tx,ty}
psubd mm0,mm5
paddd mm0,mm6
;// tx = min(tx, this->m_Right);
;// ty = min(ty, this->m_Bottom);
movq mm2,mm0
movq mm3,mm0
psubd mm3,mm1
pcmpgtd mm2,mm1
pand mm2,mm3
psubd mm0,mm2
;// tx -= max(x, this->m_Left);
;// ty -= max(y, this->m_Top);
movq mm2,mm6
psubd mm6,mm4
pcmpgtd mm2,mm4
pand mm6,mm2
paddd mm6,mm4

psubd mm0,mm6
;// rSourceRect.m_Right = rSourceRect.m_Left;
;// rSourceRect.m_Bottom = rSourceRect.m_Top;
;// movq mm4,mm5
;// if( tx > 0 && ty > 0 )
;// {
;// rSourceRect.m_Right += tx;
;// rSourceRect.m_Bottom += ty;
;// }
pcmpgtd mm7,mm0
pandn mm7,mm0
paddd mm7,mm5
;// movq [eax+0],mm5
movq [eax+8],mm7
emms
ret 3*4
}
}
The truth cannot be learned ... it can only be recognized.

jarekfall

Thanks drizz, unfortunately my P4 Prescott doesn't support SSE4.1 instruction set.

I found a bug in my functions, I forgot to calculate an offset of clipped source rectangle relative to its (x,y) position and add it to to source rectangle's left top corner (x1,y1) coordinates to obtain coordinates of left top corner of clipped source rectangle in source coordinate system. After executing these operations, coordinates of right bottom corner of clipped source rectangle can be calculated by adding to its (x1',y1') point its current dimensions respectively.

I corrected my source code in my first posted message and also in source files available for download. The link to the source files is placed in my signature.

Drizz, I'm currently very busy with work on my thesis so I plan to analyze your functions in summer holidays.