Hi!
Being new to the forum, I first want to say hi, folks! It's nice that you give us beginners such a good place to ask questions!
On to my current problem: I want to write fast (masm32) MMX inline assembly for the following C code (within Visual C++):
-----------------------
(CurrentBfr << (32 - BitsLeft)) >> (32 - N);
-----------------------
(all variables are unsigned ints)
So I thought psrlq and psllq would be sufficient for a start:
----------------------------------------------------------------------
unsigned int CBfr=CurrentBfr, BLeft=BitsLeft;
__asm{
pxor mm0, mm0
mov eax, BLeft
movd mm0, CBfr
mov ebx, 32
sub ebx, eax
mov eax, 32
mov ecx, N
sub eax, ecx
psllq mm0, bl
psrlq mm0, al
movd CBfr, mm0
}__asm emms;
------------------------------------------------------------
This doesn't work at all. Maybe you can see immediately what's wrong here, but I have no idea (except that different truncation may be the root of some problems).
Maybe you can help me with some additional questions too:
- As you can see, I copy the global variables CurrentBfr and BitsLeft to local ones, since I get errors when accessing the globals from within the function's inline assembly. Is there another possibility (I'm using Visual Studio with the Processor Pack)?
- When do I need to pxor/xor a register? A mov eax, 1 writes the 1 into eax in every case, doesn't it?
- If I use edx (instead of eax again) for the second 32, why do I get an access violation?
I hope somebody takes the time to help me out a bit here. Thanks a lot in advance!
Cheers, Hannes
You can just use normal instructions; you don't even need MMX for your code. To maximise efficiency and apply MMX correctly, you should be processing 8 bytes / 4 words / 2 dwords at a time.
If not, stick to normal instructions.
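As a rough illustration of the "normal instructions" route, here is a minimal C sketch of the expression from the first post. The function name take_bits and the assumed range 1 <= N <= BitsLeft <= 32 are not from the thread; the range is assumed so that neither shift count reaches 32, which C leaves undefined. A compiler would typically turn this into a plain shl/shr pair.

```c
#include <assert.h>
#include <stdint.h>

/* take_bits is a hypothetical name, not from the thread.
   Assumes 1 <= n <= bits_left <= 32 so that neither shift count reaches 32,
   which would be undefined behaviour in C. */
static uint32_t take_bits(uint32_t current_bfr, unsigned bits_left, unsigned n)
{
    assert(n >= 1 && n <= bits_left && bits_left <= 32);
    /* Drop the stale bits above bits_left, then keep the top n of the rest. */
    return (current_bfr << (32u - bits_left)) >> (32u - n);
}
```

For instance, take_bits(0x12345678, 16, 8) yields 0x56: the left shift by 16 gives 0x56780000, and the right shift by 24 keeps the top byte.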
Thanks for your reply!
I know shr/shl, but I want to optimize for the P4, hence I want to use the MMX versions, which are said to be way faster.
Cheers, Hannes
MMX instructions are meant to be used with parallelism for their power to be exploited.
For example, you can read 4 pixels and process them in one go using MMX if each pixel takes up 1 byte. In your example, such parallelism does not exist, hence I think it is useless and wasteful to use MMX in such situations.
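To make the "several pixels in one go" idea concrete without any MMX at all, here is a small portable C sketch that adds eight one-byte lanes packed into a 64-bit integer, with per-byte wraparound (the same effect as a single MMX paddb). The function names are hypothetical, and this is only an illustration of packed processing, not code from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: adds eight byte lanes at once, each wrapping mod 256. */
static uint64_t add_bytes_packed(uint64_t a, uint64_t b)
{
    const uint64_t high = 0x8080808080808080ULL;   /* top bit of every lane */
    /* Add the low 7 bits of each lane; a carry can reach bit 7 but not the next lane. */
    uint64_t low_sum = (a & ~high) + (b & ~high);
    /* Fold the top bits back in with XOR, so no carry crosses a lane boundary. */
    return low_sum ^ ((a ^ b) & high);
}

/* Hypothetical reference: the same addition, one lane at a time. */
static uint64_t add_bytes_scalar(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t lane = (uint8_t)((a >> (8 * i)) + (b >> (8 * i)));
        out |= (uint64_t)lane << (8 * i);
    }
    return out;
}

/* Aborts if the packed and scalar versions ever disagree. */
static void check_lanes(uint64_t a, uint64_t b)
{
    assert(add_bytes_packed(a, b) == add_bytes_scalar(a, b));
}
```

The packed version does in a handful of operations what the scalar loop does in eight iterations, which is the kind of gain the single shift in the original question cannot offer.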
Yeah, I know that MMX is mainly used for packed data. However, I was told that shifting via MMX was faster than the usual shift. I just tried it; it isn't.
Besides: in the above code the shift instructions psllq/psrlq need MMX registers (instead of al, bl) for the shift count; that was the bug that prevented the whole thing from working.
Cheers, Hannes
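On the "truncation" worry from the first post: psllq/psrlq shift the whole 64-bit register, so bits pushed above bit 31 by the left shift are not discarded the way they are in the 32-bit C expression. The following plain C sketch models that difference. The function names are hypothetical, and the model assumes movd zero-extends the 32-bit value into the 64-bit register; treat it as a way to reason about the behaviour, not as a measurement of the real instructions.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit shifts, as in the C expression: the left shift drops bits above bit 31. */
static uint32_t shift32(uint32_t x, unsigned left, unsigned right)
{
    assert(left < 32 && right < 32);
    return (x << left) >> right;
}

/* Hypothetical model of movd / psllq / psrlq / movd: the same shifts applied
   to a zero-extended 64-bit value, returning the low 32 bits. */
static uint32_t shift64_model(uint32_t x, unsigned left, unsigned right)
{
    assert(left < 64 && right < 64);
    uint64_t q = x;
    q <<= left;      /* bits above bit 31 survive here */
    q >>= right;     /* ...and can come back down into the result */
    return (uint32_t)q;
}

/* Same model with the intermediate value masked to 32 bits,
   which restores the behaviour of the C expression. */
static uint32_t shift64_masked(uint32_t x, unsigned left, unsigned right)
{
    assert(left < 32 && right < 32);
    uint64_t q = ((uint64_t)x << left) & 0xFFFFFFFFu;
    return (uint32_t)(q >> right);
}
```

With x = 0xFFFFFFFF, left = 8 and right = 24, shift32 and shift64_masked both give 0xFF, while shift64_model gives 0xFFFF. If the buffer can hold stale bits above BitsLeft, a 64-bit shift pair would therefore need an extra mask (or 32-bit lane shifts) to match the C result.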
Btw, even if you are using shl/shr, you have to use cl for a variable shift count, not al or bl. Sorry for not looking hard enough at your code. :toothy
Thanks for your help!
You are welcome :U