Lately I have been playing with the MMX and SSE extensions. There are some limitations compared with the 32-bit instructions. Can someone advise me?
How can I set entire register to -1?
How do I invert a register? NOT instruction?
How can I use constants (immediates) inside the instructions? E.g. "add mmx1,10"
Can I have INC and DEC instructions?
And in general I thought it would be good to make a set of macros for MMX SSE code that would do the missing instructions. What do you think?
SolidCode,
What you can do with the instructions is exhaustively defined by the actual opcodes, and you get this from the Intel reference material. They just happen to be different opcodes from the ones used with the normal integer instructions. You may be able to construct macros that will do close to what you want, but they will be made up of the existing opcodes; there is no other choice.
Here I have some things, but they are quite awkward.
;reverse the order of the four dwords in the given xmm register
xmm_swapdw macro reg:REQ
shufps reg,reg,1Bh
endm
;rotate bits of xmm register left by 32 bits
xmm_rol macro reg:REQ
shufps reg,reg,93h
endm
;rotates bits right
xmm_ror macro reg:REQ
shufps reg,reg,39h
endm
;fill the XMM register with one of its dwords
;specify the dword by its number (0..3)
xmm_filldw macro reg:REQ,fillpos:REQ
if (fillpos) GT 3
.ERR <fill position must be 0..3>
exitm
endif
;0,1,2,3 -> 00h,55h,0AAh,0FFh: the same 2-bit index in all four selector fields
shufps reg,reg,(fillpos)*55h
endm
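To illustrate what the shuffle constants above do, here is a small sketch. It assumes xmm0 holds the dwords d3:d2:d1:d0 (written high to low) before each instruction; the lane names are only labels for this example.

```asm
; each line starts again from xmm0 = d3:d2:d1:d0 (high to low)
shufps xmm0,xmm0,1Bh    ; -> d0:d1:d2:d3  (dword order reversed)
shufps xmm0,xmm0,93h    ; -> d2:d1:d0:d3  (rotated left by 32 bits)
shufps xmm0,xmm0,39h    ; -> d0:d3:d2:d1  (rotated right by 32 bits)
shufps xmm0,xmm0,55h    ; -> d1:d1:d1:d1  (dword 1 copied to all four)
shufps xmm0,xmm0,0AAh   ; -> d2:d2:d2:d2  (dword 2 copied to all four)
```

The immediate is four 2-bit fields, lowest field first, and each field picks the source dword for one destination dword.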
PCMPEQB MM0, MM0 (or PCMPEQW/PCMPEQD) can be used to set all bits to 1...
Actually it is a VERY interesting theme: making constants at runtime... I think we should have a thread on it here...
This is a topic I am also interested in, SSE2, SSE3, SSE4 also.
NOT can be simulated by unsigned max - unsigned num.
NOT can also be simulated by unsigned num xor unsigned max.
The max values for:
byte 255
word 65535
dword 4294967295
qword 18446744073709551615
Example
mov al, 255
sub al, num ; num is an unsigned byte
versus
mov al, 255
xor al, num ; xor does a single bit subtract without borrow 1 - 0 = 1, 1 - 1 = 0
versus
not al
The MMX,SSE equivalent codes are
.data
align 16 ; movaps/movdqa require a 16-byte aligned operand
max dq -1, -1
MMX
movq mm1, qword ptr max
pxor mm0, mm1
SSE
movaps xmm1, oword ptr max
xorps xmm0, xmm1
SSE2
movdqa xmm1, oword ptr max
pxor xmm0, xmm1
My CPU doesn't support most of these opcodes so they aren't tested.
The theory is sound though.
Quote from: SolidCode on June 12, 2006, 07:38:58 AM
How can I use constants (immediates) inside the instructions? E.g. "add mmx1,10"
Can I have INC and DEC instructions?
All the rest of your question has been answered so I will address these last two. You can't do immediate values but you can do variables in memory. So to do the add you simply set up a 10 decimal value in memory to use. Inc and Dec are the same way, you would have to do an ADD with a variable in memory where the value is set to 1 or -1.
For sse/sse2 the variables in memory have to be 16 byte aligned or you will get exceptions.
.data
align 16
one_variable dd 1,1,1,1 ; defines 4 dwords for use with sse/sse2
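As a sketch of the "add mmx1,10" case from the question, done with a constant in memory as described above: the name ten_variable is made up for this example, and paddd on an XMM register assumes SSE2.

```asm
.data
align 16
ten_variable dd 10,10,10,10          ; hypothetical name: four dwords of 10

.code
paddd xmm0, oword ptr ten_variable   ; add 10 to each dword of xmm0
paddd xmm0, oword ptr one_variable   ; "INC" each dword, using the 1,1,1,1 constant
```

With MMX registers the same idea should work with a qword constant and `paddd mm0, qword ptr ...`.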
Thank you, asmfan! That was exactly what I needed. It works just fine for MMX regs.
mm_minusOne macro reg:REQ
pcmpeqb reg,reg
endm
Is there a one-instruction version for XMM? Or should I use the following two-instruction version?
xmm_minusOne macro reg:REQ
cmpss reg,reg,0 ;set lowest 32 bits to all ones (unless they hold a NaN, which compares unequal)
shufps reg,reg,00 ;copy lowest 32 bits to all the other dwords
endm
To set any reg to zero we can use the following macro:
xmm_Zero macro reg:REQ
xorps reg,reg
endm
Thanks dsouza123, I know the case with xor. The entire idea was to do things quickly and without use of memory, as memory access slows down the process.
Maybe this will work?
xmm_not macro reg:REQ,dummyreg:REQ
xmm_minusOne dummyreg
xorps reg,dummyreg
endm
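A possible variant if SSE2 is available: the integer compare does not care whether the register happens to hold a NaN bit pattern, which the floating-point compares do. This is only a sketch; xmm7 as the scratch register is an arbitrary choice.

```asm
; xmm0 = value to invert, xmm7 = scratch register (arbitrary choice)
pcmpeqd xmm7,xmm7    ; every dword equals itself -> all 128 bits set
pxor    xmm0,xmm7    ; xmm0 = NOT xmm0
```

No memory is touched, and the previous contents of xmm7 do not matter.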
Mark_Larson.
Again, I wanted to avoid memory usage. I know how to do it with memory.
Quote from: SolidCode on June 13, 2006, 02:20:30 AM
Mark_Larson.
Again, I wanted to avoid memory usage. I know how to do it with memory.
I never saw where you typed in your message that you didn't want to use memory. So that's why I gave you that answer. There is no INC/DEC instruction for MMX/SSE/SSE2, and no ADD with an immediate. So you have to use an MMX/SSE ADD/SUB with two registers or with a register and memory. If you can set up your registers to have the values you need to add without using memory, then no problem. Most of your questions are pretty easy stuff, and if you spend time looking at the MMX/SSE/SSE2 instruction set you can figure it all out for yourself. I recommend you download the Instruction Set Reference for P4 from Intel's website. That would answer a lot of your questions.
; simulating INC/DEC with no memory accesses
; XMM0 has the value we want to increment by 1.
mov eax,1
movd xmm7,eax
pshufd xmm7,xmm7,0 ; broadcast the low dword to all four dwords
paddd xmm0,xmm7
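A sketch of another way to get INC/DEC that touches neither memory nor a general-purpose register (SSE2 assumed, xmm7 again just a scratch register): build -1 in every dword and subtract or add it.

```asm
pcmpeqd xmm7,xmm7    ; xmm7 = -1 in every dword
psubd   xmm0,xmm7    ; xmm0 - (-1) = xmm0 + 1  -> "INC" each dword
paddd   xmm0,xmm7    ; xmm0 + (-1)             -> "DEC" each dword

; if a real +1 constant is wanted instead:
psrld   xmm7,31      ; 0FFFFFFFFh shr 31 = 1 in every dword
```

The same pattern should work on MMX registers with pcmpeqd/psubd/paddd on mm0..mm7.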
Change the CMPSS to a CMPPS and you can make it just one instruction.
Is there a one-instruction version for XMM? Or should I use the following two-instruction version?
xmm_minusOne macro reg:REQ
cmpps reg,reg,0 ;set all 128 bits to ones (a dword holding a NaN compares unequal and becomes 0)
endm
Mark_Larson
I am sorry if I have offended you. You are right. I forgot to write about the "no memory access" limitation.
Thank you for the "cmppd" idea. Now it really is done in one instruction.
xmm_minusOne macro reg:REQ
cmppd reg,reg,00
endm
I am using the "IA-32 Intel® Architecture Software Developer's Manual". Is there a better reference for MMX,SSEx instructions? Can you give me a link for download? Is there something in HLP or CHM format like there is for the 32-bit instructions? I see you have some good experience working with these extensions.
All
I have just started seriously digging into the MMX,SSE extensions.
I hoped that this thread could become a place for others interested in learning them, and a source of macros that would help newbies like me start using them more easily and find ways to do things known from IA-32, e.g. setting regs to 0 or -1, inverting regs and doing other similar things.
Is there a way to rotate bits (ROL or ROR) by bits in MMX,SSE? So far I have only seen a way to rotate bits by dwords. This is good but useless for me now. I need at least words. I think I showed the macro with dwords in one of my earlier answers.
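As far as I know there is no packed rotate instruction in MMX/SSE/SSE2, but a rotate inside each word can be put together from two shifts and an OR. A rough sketch for SSE2 follows; the macro name and the scratch register parameter are made up for this example.

```asm
; rotate each 16-bit word of an XMM register left by n bits (n = 1..15)
; tmp is a scratch XMM register that gets overwritten
xmm_rolw macro reg:REQ,tmp:REQ,n:REQ
    movdqa tmp,reg        ; keep a copy of the original words
    psllw  reg,n          ; bits that stay in the word, moved up
    psrlw  tmp,16-(n)     ; bits that fall off the top, moved to the bottom
    por    reg,tmp        ; combine both parts
endm

; usage: rotate every word of xmm0 left by 3 bits, xmm7 is scratch
xmm_rolw xmm0,xmm7,3
```

Rotating right by n is the same as rotating left by 16-n. For dwords, pslld/psrld with 32-(n) should do the same job, and on MMX registers movq replaces movdqa.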
Greetings to all.
I have been raising my questions on wasm.ru. Those guys have helped some more.
By now I have an include that implements numerous macros utilizing MMX-SSE2 instructions to do different operations.
Your opinion is welcome.
I have updated this xmmcode.inc. It is better commented now, a couple of macros were added for SSE instructions that MASM 6.15 cannot make (movsd, cmpsd) because they are named same as 32-bit instructions.
Some macros were modified / added.
[attachment deleted by admin]
Great work! :)
Now we just need some examples of how to use these macros. :U
I have a few macros, if there is interest: an unfinished set of SSE equates for the fluxus scripting language, including push/pop of the whole state machine as a stack mechanism, which might be useful in situations where you want to do recursion.
It was supposed to be fluxus, but limited to the 8 3D coordinates I can keep in the 8 XMM registers, which is enough for cubes creating a 3D tree with recursion, but I never got around to finishing it.
.data?
statestack dd 65536 dup (?)
dd 65536 dup (?)
.data
ALIGN 16
vectconst0 REAL4 0.0,0.0,0.0,0.0 ;room for constanst used in fluxusprogram
vectconst1 REAL4 0.0,0.0,0.0,0.0
vectconst2 REAL4 0.0,0.0,0.0,0.0
vectconst3 REAL4 0.0,0.0,0.0,0.0
vectconst4 REAL4 0.0,0.0,0.0,0.0
vectconst5 REAL4 0.0,0.0,0.0,0.0
vectconst6 REAL4 0.0,0.0,0.0,0.0
vectconst7 REAL4 0.0,0.0,0.0,0.0
vectconst8 REAL4 0.0,0.0,0.0,0.0
tmpxmm0 REAL4 0.0,0.0,0.0,0.0 ;temporary storage
tmpxmm1 REAL4 0.0,0.0,0.0,0.0
tmpxmm2 REAL4 0.0,0.0,0.0,0.0
tmpxmm3 REAL4 0.0,0.0,0.0,0.0
tmpxmm4 REAL4 0.0,0.0,0.0,0.0
tmpxmm5 REAL4 0.0,0.0,0.0,0.0
tmpxmm6 REAL4 0.0,0.0,0.0,0.0
tmpxmm7 REAL4 0.0,0.0,0.0,0.0
tmpf0 REAL4 0.0,0.0,0.0,0.0
tmpf1 REAL4 0.0,0.0,0.0,0.0
tmpf2 REAL4 0.0,0.0,0.0,0.0
X equ 0
Y equ 4
Z equ 8
.code
; #########################################################################
PUSHSTATE MACRO
FXSAVE [ebx]
add ebx,512
ENDM
POPSTATE MACRO
sub ebx,512
FXRSTOR [ebx]
ENDM
TRANSLATE MACRO vectconst
addps XMM0,vectconst
addps XMM1,vectconst
addps XMM2,vectconst
addps XMM3,vectconst
addps XMM4,vectconst
addps XMM5,vectconst
addps XMM6,vectconst
addps XMM7,vectconst
ENDM
SCALE MACRO vectconst
mulps XMM0,vectconst
mulps XMM1,vectconst
mulps XMM2,vectconst
mulps XMM3,vectconst
mulps XMM4,vectconst
mulps XMM5,vectconst
mulps XMM6,vectconst
mulps XMM7,vectconst
ENDM
start:
;initialize statestack
finit
lea ebx,statestack
add ebx,511
and ebx,0FFFFFE00h ;round up to a 512-byte boundary
I have only a PIII and I have problems with some instructions:
pavgw ;this one, I don't see how to translate
fcomip st(0),st(1) ;translated as
.data
FPUDATA DQ 0
.code
FSTP FPUDATA
FCOM FPUDATA
If someone finds a translation for pavgw, I'll take it.
ToutEnMasm
ToutEnMasm
The Pentium III has SSE so pavgw will work with 64 bit MMX values.
The 64 bit values are really a packed array of four unsigned 16 bit words.
The instruction pavgw is an average of two packed word arrays.
pavgw dest, source ; dest has to be a register, source a register or 64 bit memory
.data
warr0 dw 0, 1, 2, 3
warr1 dw 5, 5, 6, 6
warr2 dw 0, 0, 0, 0
.code ; destination, source
movq mm0, qword ptr warr0 ; load mm0 with warr0
pavgw mm0, qword ptr warr1 ; add corresponding words, add 1, shr 1, store in dest
; dest0 = (dest0 + source0 + 1) shr 1
; dest1 = (dest1 + source1 + 1) shr 1
; dest2 = (dest2 + source2 + 1) shr 1
; dest3 = (dest3 + source3 + 1) shr 1
movq qword ptr warr2, mm0 ; store mm0 into warr2
using the values in the .data section
0+5+1 = 6 / 2 = 3
1+5+1 = 7 / 2 = 3
2+6+1 = 9 / 2 = 4
3+6+1 = 10 / 2 = 5
result
warr2 dw 3, 3, 4, 5
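For a CPU or assembler that really cannot do pavgw, the same rounded average can be sketched with plain MMX instructions, using the identity (a + b + 1) shr 1 = (a or b) - ((a xor b) shr 1). The register assignment below is arbitrary.

```asm
; mm0 = a, mm1 = b (four unsigned words each), mm2 = scratch
movq  mm2,mm0
por   mm0,mm1      ; a or b
pxor  mm2,mm1      ; a xor b
psrlw mm2,1        ; (a xor b) shr 1, per word
psubw mm0,mm2      ; mm0 = (a + b + 1) shr 1, same result as pavgw mm0,mm1
```

Checked by hand against the values above: 0,5 gives 5 - 2 = 3; 1,5 gives 5 - 2 = 3; 2,6 gives 6 - 2 = 4; 3,6 gives 7 - 2 = 5. The intermediate sum never overflows 16 bits, which is the point of doing it this way.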
Is fcomip really ficom ?
The fcom in the .code section doesn't appear to correspond to anything about pavgw.
It looks like it stores the value in ST(0) in FPUDATA then pops it off the stack,
then compares what is now in ST(0) with FPUDATA.
What is your code supposed to do ?
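If the goal is to replace fcomip st(0),st(1), the usual pre-P6 sequence might be closer than the store-and-compare version. This is a sketch; note that it overwrites AX.

```asm
fcomp  st(1)    ; compare ST(0) with ST(1), set C0/C2/C3, pop the stack
fnstsw ax       ; copy the FPU status word into AX
sahf            ; AH -> flags: C0 -> CF, C2 -> PF, C3 -> ZF
; ja / jb / je can now be used as after fcomip st(0),st(1)
```

As far as I know FCOMI/FCOMIP and CMOVcc came in with the P6 family, so a PIII should execute them; if MASM rejects them, it may only need a .686 directive (and .xmm for pavgw).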
Hello,
the code is the fireworks demo that I got working on the PIII
http://www.masm32.com/board/index.php?topic=1133.msg8290#msg8290
According to the Intel manual, my translation of FCOMIP seems very close (perhaps missing a flag and the interrupt in case of an exception)
Quote
Opcode Instruction Description
FCOMI ST, ST(i)    Compare ST(0) with ST(i) and set status flags accordingly.
FCOMIP ST, ST(i)   Compare ST(0) with ST(i), set status flags accordingly, and pop register stack.
FUCOMI ST, ST(i)   Compare ST(0) with ST(i), check for ordered values, and set status flags accordingly.
FUCOMIP ST, ST(i)  Compare ST(0) with ST(i), check for ordered values, set status flags accordingly, and pop register stack.
The cmov instructions are easy to translate, but transfers from MMX or FPU registers can only be 32 bits wide,
like
fld dword ptr[edi+eax+4]
or
movd dword ptr [edi+eax],MM0
It works (I get the effects in black and white), but it needs a little more work for a perfect translation
ToutEnMasm