News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Lightning-fast bitmap rotation

Started by AlchoholicSnake, January 22, 2006, 08:13:13 PM

Previous topic - Next topic

AlchoholicSnake

Hello!

I have been working on a piece of software for the last couple of months, which allthough it doesnt aim mostly at graphics or such, it does require ultra fast rotation of bitmaps. Getting this to work O.K. for a good (not GREAT) pc is no problem, but still it tends to eat up a sickening amount of the CPU, even though I've come to a point where I just can't seem to squeeze any more power out of the code I put together.

So I thought perhaps some ASM programmers a bit more experienced with assembly optimization than I am could give me an idea or two about how to get this bugger moving a bit faster. :green

Here is the code, which is C-style C++ with Inline ASM.


#define ROTATEAASCALE 1
#define FPBITS 10
#define FPRES (1<<FPBITS)
void __stdcall RotateBuf(float deg,DWORD*src,DWORD*dst,int xs,int ys,int xs2,int ys2,int x_center,int y_center,int x_translate,int y_translate){
if(!dst||!src)return;
// calculate the matrix
int m[4]={
int((cosf(-deg)*(1<<ROTATEAASCALE))*FPRES),int((sinf(-deg)*(1<<ROTATEAASCALE))*FPRES),
int((-sinf(-deg)*(1<<ROTATEAASCALE))*FPRES),int((cosf(-deg)*(1<<ROTATEAASCALE))*FPRES)
};
int fpx_center=(-(x_center))<<FPBITS;
int fpy_center=(-(y_center))<<FPBITS;
DWORD*d=dst;
// transform the inverse corner rotation
int x_cornertransformed=((m[0]*fpx_center)>>FPBITS)+((m[1]*fpy_center)>>FPBITS)-((fpx_center+(x_translate<<FPBITS))<<ROTATEAASCALE);
int y_cornertransformed=((m[2]*fpx_center)>>FPBITS)+((m[3]*fpy_center)>>FPBITS)-((fpy_center+(y_translate<<FPBITS))<<ROTATEAASCALE);
__asm{
push ebx
push esi
push edi
push y_cornertransformed // push the inverse rotated corners to the stack
push x_cornertransformed
mov ecx,ys
mov edi,d
ROTATE_YLOOP:
push ecx
push dword ptr[esp+8] // copy the inverse rotated corners for looping
push dword ptr[esp+8]
mov ecx,xs
ROTATE_XLOOP:
// transform the fixed-point positions to screen coordinates ...
mov ebx,dword ptr[esp]
mov eax,dword ptr[esp+4]
sar ebx,FPBITS
sar eax,FPBITS
mov edx,dword ptr[m]
add dword ptr[esp],edx
mov edx,dword ptr[m+8]
add dword ptr[esp+4],edx
// ... and check if they are inside the source image ...
xor edx,edx
cmp ebx,edx
jl ROTATE_DTRANSPARENT
cmp eax,edx
jl ROTATE_DTRANSPARENT
cmp ebx,xs2
jge ROTATE_DTRANSPARENT
cmp eax,ys2
jge ROTATE_DTRANSPARENT
// ... and if they are, draw the source image color of the transformed point.
mul xs2
lea esi,[eax+ebx]
shl esi,2
add esi,src
mov eax,[esi]
mov [edi],eax
add edi,4
dec ecx
jz ROTATE_ENDXLOOP
jmp ROTATE_XLOOP
ROTATE_DTRANSPARENT:
// ... but if they arent, draw the transparent keycolor.
mov eax,transcol
mov [edi],eax
add edi,4
dec ecx
jz ROTATE_ENDXLOOP
jmp ROTATE_XLOOP
ROTATE_ENDXLOOP:
add esp,8
pop ecx
mov edx,dword ptr[m+4]
add dword ptr[esp],edx
mov edx,dword ptr[m+12]
add dword ptr[esp+4],edx
dec ecx
jz ROTATE_ENDYLOOP
jmp ROTATE_YLOOP
ROTATE_ENDYLOOP:
add esp,8
pop edi
pop esi
pop ebx
}
}


I guess that most of this code should be pretty self-explanatory and straight forward. I even added comments. :green

ROTATEAASCALE is a constant which defines how many times the source image is scaled of the original. The reason why I do this is because if I rescale the source images and smooth them, then I get a nice smooth rotated image, without any extra smoothing code inside the rotation loop.

The first thing i guess you may want to point out is that i should some how get rid of the MUL inside the inner-loop, by for example aligning all my images to a size that allows me to bitshift instead, but the problem with that approach is that the images become too big. The unscaled size of the biggest rotated image is 272*272 (please dont mention redesigning the graphics for 256*256*, I have my reasons for these sizes).

May there perhaps be something to gain from not using GDI and DIBSections for this, by for example switching to DirectDraw (preferably alot if I have should switch to DirectDraw as I've used GDI ALOT in this application...)?

Now, my personal observations so far are that the larger the source or destination images are, the slower it runs, which leads me to the conclusion that memory access is jumping too much which in some way ruins the speed, and I am pretty sure there has to be ways to improve this but still keep the big smooth rotated pictures looking good.
I thought for a moment that the cause of the slowdown could be the fact that i was using DIBsection buffers where perhaps the reads could be slower, but after testing it on memory allocated using malloc I doubt that could be the problem.

Is there something that can be done to improve this alot or is perhaps the only way to go using the GPU?
(Or maybe I just have to let it look darn ugly and run fast...)

Any suggestions?

Human

well your code suxx:P
first slowing down all is mul instruction, you can do shl instead, well if you cant because image x size isnt power of 2, well then you can copy whole picture to buffer larger and divisible by 2 and rotate width only, so like 1025 you use 2048 and use shl x,11 and do that only 1025 times, other way you can look on google for some scene effects like zoomrotator and there you will find solution.
also as i can see you use shifted step values, to calculate x and y, good but for god sake why all in memory?
you can use for it only registers, just think how to solve it and reorder, even before calculate doing few more mov eax,reg is faster than reading from memory,also if there is no way to use only regs for all you can use memory only for adding
try something like it
add ebx,[stepx]
add edx,[stepy]
mov eax,ebx
mov ebp,edx
shr eax,16
shr ebp,16
shl ebp,11 so y*2048
add ebp,eax
add ebp,[src]
mov eax,[ebp]
you can try also to improve pairing
but its your goal

hutch--

Human,

> well your code suxx:P

Tread carefully here, the Laboratory is for people to post code and ideas, not to tell people thier code "suxx".
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

AlchoholicSnake

#3

[removed]

Quote from: Human on January 23, 2006, 12:31:45 AM
first slowing down all is mul instruction, you can do shl instead, well if you cant because image x size isnt power of 2, well then you can copy whole picture to buffer larger and divisible by 2 and rotate width only, so like 1025 you use 2048 and use shl x,11 and do that only 1025 times,

Allready done this before. Reason why this wasnt included in the source i posted was because i removed it due to the memory problems i ran into when running at a ROTATEAASCALE of 2. Also tried it with constant shifts in tests which proved that it is not thing that is slowing it down as far as tests prove. Will fix it though.

Quote from: Human on January 23, 2006, 12:31:45 AM
other way you can look on google for some scene effects like zoomrotator and there you will find solution.

I'll take a look at that, thanks.

Quote from: Human on January 23, 2006, 12:31:45 AM
also as i can see you use shifted step values, to calculate x and y, good but for god sake why all in memory?

Well we all have our reasons, dont we. :bdg
I will most definetly fix this though, thanks.

Quote from: Human on January 23, 2006, 12:31:45 AM
you can try also to improve pairing
but its your goal

Thanks for the help. :8)

As for the occational weird funny spots, that's what lack of sleep can do to people.

gabor

Hi!

About the MUL anomaly I recommend you to check this thread out:
http://www.masmforum.com/simple/index.php?topic=3704.0
It is about fast multiplication on a P4, there are ideas valid for other CPUs as well.
I can remember a method for the mentioned problem. The image is appended rows and colums to have dimensions that can be dissolved into the sum of powers of 2. Your image size is ok, since 272=256+16. This is important for the shl solution. Like this:


        ; an old way of fast multiplication (used in 320*200 gfx :))) )
        ; eax has the value to be mul-ed
        shl eax,4        ;eax*16
        mov ebx,eax
        shl eax,4        ;(eax*16)*16=eax*256
        add eax,ebx

This is not an optimized code indeed! At first there are register stalls, and I did not care of pairing or slow-fast instruction coupling either. Just wanted to show how the basic idea works.  :P

It's true that in the inner loops try to avoid pushs and pops and memory variables.

Another thread that can be interesting is about memory copy.
http://www.masmforum.com/simple/index.php?topic=1637.0
And yet another about copy:
http://www.masmforum.com/simple/index.php?topic=3685.0
I mentioned these because there are tipps and tricks about how to accelerate the memory access. (Most interesting to me was the utilization of the data cache :bg )

Finally with D3D or OpenGL this can be solved simply, if you know and understand how to use these libraries.
About OpenGL the forum has a very good subforum lead by hitchhikr an asm-OpenGL master  :toothy He is very helpful too!

I hope I gave you some hints!

Greets, Gábor

Mark_Larson

Quote from: Human on January 23, 2006, 12:31:45 AM
first slowing down all is mul instruction, you can do shl instead, well if you cant because image x size isnt power of 2, well then you can copy whole picture to buffer larger and divisible by 2 and rotate width only, so like 1025 you use 2048 and use shl x,11 and do that only 1025 times, other way you can look on google for some scene effects like zoomrotator and there you will find solution.

turns out 272 in binary only has 2 bits set, so you can simply do two shifts and add the results together to simulate a MUL by 272.


272 = 100010000b

;value = value to mutliply by 272
mov eax,[value]
mov edx,eax
shl eax,4
shl edx,8
add eax,edx


The code is by no means optimized.  I am just demonstrating how it works.

Mark

EDIT: Sorry Gabor, just realized you were saying the same thing that I just posted.
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

AlchoholicSnake

Quote from: gabor on January 23, 2006, 08:20:21 AM
About the MUL anomaly I recommend you to check this thread out:
http://www.masmforum.com/simple/index.php?topic=3704.0
It is about fast multiplication on a P4, there are ideas valid for other CPUs as well.

Nice! Definetly worth checking out, allthough there is a problem with these kinds of approaches for my needs...

Quote from: gabor on January 23, 2006, 08:20:21 AM
I can remember a method for the mentioned problem. The image is appended rows and colums to have dimensions that can be dissolved into the sum of powers of 2. Your image size is ok, since 272=256+16. This is important for the shl solution. Like this:

Very good. Only problem, as I was pointing to above, is that as my application uses skinning, there is unfortunately not a good way for me to be sure what sizes to optimize for, so I guess I'll still have to stick with aligning the image sizes to powers of 2 so I can do a single shift, perhaps instead of checking inside the inner loop I'll make multiple versions for the most common numbers of shifts though...

Quote from: gabor on January 23, 2006, 08:20:21 AM
It's true that in the inner loops try to avoid pushs and pops and memory variables.

Yes, as my guess is that memory access is the thing slowing it down the most, I can only imagine how much worse the stack access makes it, so getting rid of all unnecessary memory access will be the primary goal. Thanks.
Another question though...
I thought about using MMX for the position vectors instead, so I can eliminate all the stack access in the main loop and at the same time add the vectors with fewer instructions. Is there any slowdown when transferring data from MMX registers to the standard CPU registers like EAX, EBX, ECX or EDX? Think I'll need to transferr them to 32-bit registers to check that the position is inside the source image size but if there is another way that would also be great.

Quote from: gabor on January 23, 2006, 08:20:21 AM
Another thread that can be interesting is about memory copy.
http://www.masmforum.com/simple/index.php?topic=1637.0
And yet another about copy:
http://www.masmforum.com/simple/index.php?topic=3685.0
I mentioned these because there are tipps and tricks about how to accelerate the memory access. (Most interesting to me was the utilization of the data cache :bg )

I will check that out, thanks!  :green This will probably be very useful as memory access seems to be the part that's slowing the code down the most...

Quote from: gabor on January 23, 2006, 08:20:21 AM
Finally with D3D or OpenGL this can be solved simply, if you know and understand how to use these libraries.
About OpenGL the forum has a very good subforum lead by hitchhikr an asm-OpenGL master  :toothy He is very helpful too!

Not a master at them, but I've been using OpenGL for small game projects for quite a while so I know the basics yes. Would take some time to rewrite the whole graphics part of my application to use it instead of just GDI though, but I guess it will probably be worth it.

Thanks for the help! :8)

zcoder

I have not posted any code cuz I see you are working in C, which
I don't have anything to help you interface to C programs.

But months ago, I wrote a whole LIB that would flip, retate left, rotate right,
shrink, inlarge, mirror, and do Alfa lending and color stuff ect to a bitmap.
you can open GIF's, JPG's and BMP's and work with them in ram
then you has any choice to save it to BMP or display it using GDI

All the functions do not use GDI they use only asm functions I wrote to
manipulate the bitmaps bits right in ram.
The LIB was written in pure assembly and I only have an INC file for
the lib.

otherwise if I knew how to make a H file I would, and I would have to make a doc on how to
invoke the functions in it.

I wrote this lib to help me when I was working in DirectX, I wanted to make
games faster by not using DX GDI calls and just do it in ram and then copy
them to the screen pointer returned when you lock and unlock the surface.

Not sure this would help.

Zcoder....
Back in 1979, My computer ran so fine.
And there was no such thing,
As a Microsoft Crashed Machine.
http://zcoder.110mb.com
http://www.dietzel.com/partner/idevaffiliate.php?id=345_6  Free Domain Names

AlchoholicSnake

Quote from: zcoder on February 06, 2006, 07:57:55 AM
I have not posted any code cuz I see you are working in C, which
I don't have anything to help you interface to C programs.

But months ago, I wrote a whole LIB that would flip, retate left, rotate right,
shrink, inlarge, mirror, and do Alfa lending and color stuff ect to a bitmap.
you can open GIF's, JPG's and BMP's and work with them in ram
then you has any choice to save it to BMP or display it using GDI
I could allways translate the INC, but I dont really like working with seperate librarys when coding (feels more comfortable for me to have all the code in front of me to get a better overview), but it would truly be interesting if you could perhaps show some of the code like the inner-loop of the rotation or such, but it is ofcourse up to you whether you want to share it.

Quote from: zcoder on February 06, 2006, 07:57:55 AM
I wrote this lib to help me when I was working in DirectX, I wanted to make
games faster by not using DX GDI calls and just do it in ram and then copy
them to the screen pointer returned when you lock and unlock the surface.
Was it a big improvement? Since I am currently working purely with GDI, I mostly use BitBlt, TransparentBlt and StretchBlt for drawing surfaces to device contexts, and just out of curiousity, does anybody know whether perhaps GDI loads the data to hardware buffer when bitmaps are selected into memory device contexts? In that case, a BitBlt should be faster than doing it all in regular ram if the data is not altered often... Somehow I still got a feeling of doubt that they did that tho, so perhaps I should try doing it in ram. :U

Quote from: AlchoholicSnake on January 24, 2006, 09:26:03 PM
I thought about using MMX for the position vectors instead, so I can eliminate all the stack access in the main loop and at the same time add the vectors with fewer instructions. Is there any slowdown when transferring data from MMX registers to the standard CPU registers like EAX, EBX, ECX or EDX? Think I'll need to transferr them to 32-bit registers to check that the position is inside the source image size but if there is another way that would also be great.
Just thought I'd answer this myself so noone else has to as I found the answer. Moves between MMX and regular CPU registers are SLOW. I actually had no idea it was such a bad idea, and somehow it confuses me, as the only way (correct me if I'm wrong here) to do a conditional jump based on results in MMX registers is to move the data to regular CPU registers and do a CMP, TEST or whatever... Guess I'll stick to the regular registers. :toothy

zcoder

AlchoholicSnake,
Lets see, where should I start? I have the BITMAP rotation functions
witten and ALOT of other stuff, but my procs assume that the image
was opened with a function that put everything into a STRUC so if
you can modify them your way, just let me know what parts of image
manipulation code you would like. rotation is one I take it, there is also
I DX rendering engine I wrote that will DO clipping and allow you to
put an image at any X,Y pos, and allow tranparent blitting as well as
how bright you want the image all without using GDI.

just let me know what you really need as I am not keeping the code
a secret, just be aware that if you want to use them as they are OK
but if not, you will have to gleem out what you need and modify them.

Zcoder....
Back in 1979, My computer ran so fine.
And there was no such thing,
As a Microsoft Crashed Machine.
http://zcoder.110mb.com
http://www.dietzel.com/partner/idevaffiliate.php?id=345_6  Free Domain Names

daydreamer

Quote from: Human on January 23, 2006, 12:31:45 AM
well your code suxx:P
first slowing down all is mul instruction, you can do shl instead, well if you cant because image x size isnt power of 2, well then you can copy whole picture to buffer larger and divisible by 2 and rotate width only, so like 1025 you use 2048 and use shl x,11 and do that only 1025 times, other way you can look on google for some scene effects like zoomrotator and there you will find solution.
also as i can see you use shifted step values, to calculate x and y, good but for god sake why all in memory?
you can use for it only registers, just think how to solve it and reorder, even before calculate doing few more mov eax,reg is faster than reading from memory,also if there is no way to use only regs for all you can use memory only for adding
try something like it
add ebx,[stepx]
add edx,[stepy]
mov eax,ebx
mov ebp,edx
shr eax,16
shr ebp,16
shl ebp,11 so y*2048
add ebp,eax
add ebp,[src]
mov eax,[ebp]
you can try also to improve pairing
but its your goal
if I use this kinda algo ,but instead
fadd x1,xdelta
fadd x2,xdelta
fadd y1,ydelta
fadd y2,ydelta
fadd z1,zdelta
fadd z2,zdelta
or
addps x1,y1,z1

how fast and isnt there a risk of loosing precision in a long loop?
I am talking about instead of rotate a terrain made of trianglestrips, which causes read data, rotate/transform/translate/writetransformed data somewhere else, generate onthefly which only causes a few fadds and writedata

I must point out I am still within the subject, because my method can be used to rotate bitmap, by apply bitmap as texture for trianglestrips, with the builtin texturefilter caps in hardware working for you at the same time
think just a few texturefilterchanges and you have for 'free' mipmap filtering, average on closest 4 texels and antialias /trilinear filter on rendering


edit...
question on the above, which is only fadds and memorywrite, could I get it even better cachewise, when doing initial calculation of these deltas to start write some crap to that memory area first, after that calculate deltas followed by fadd loops?

zooba

The FPU keeps all it's values in 80-bit floating point form so as long as you don't swap them out to 32-bit until you need to you'll be fine :U

BTW, if you want to delve into FPU instructions it might be wise to have a look at Ray's FP Tutorial or something on SSE instructions (I haven't found any good starting points for this, sorry)