what to do with MMX SSE limitations?

Started by SolidCode, June 12, 2006, 07:38:58 AM


SolidCode

Lately I have been playing with the MMX and SSE extensions. They have some limitations compared with the 32-bit integer instructions. Can someone advise me?
How can I set entire register to -1?
How do I invert a register? NOT instruction?
How can I use constants (immediates) inside the instructions? E.g. "add mmx1,10"
Can I have INC and DEC instructions?

And in general, I thought it would be good to make a set of macros for MMX/SSE code that would stand in for the missing instructions. What do you think?

hutch--

SolidCode,

What you can do with these instructions is fully determined by the actual opcodes, which are documented in the Intel reference material. They just happen to be different opcodes from the ones used with the normal integer instructions. You may be able to construct macros that come close to what you want, but they will have to be built from the existing opcodes; there is no other choice.

SolidCode

Here I have some things, but they are quite awkward.

;reverse the order of the four dwords in the given xmm register
xmm_swapdw  macro reg:REQ
    shufps  reg,reg,1Bh
endm

;rotate bits of xmm register left by 32 bits
xmm_rol macro reg:REQ
    shufps  reg,reg,93h
endm

;rotates bits right
xmm_ror macro reg:REQ
    shufps  reg,reg,39h
endm

;fill XMM register with one of its dwords
;specify the dword by its number (0..3)
xmm_filldw macro reg:REQ,fillpos:REQ
    if (fillpos) GT 3
        .ERR <fill position of dword is too big>
        exitm
    endif
    ;selectors 0,1,2,3 become shuffle masks 00h,55h,0AAh,0FFh
    shufps  reg,reg,(fillpos)*55h
endm

asmfan

PCMPEQB MM0, MM0 (or PCMPEQD) can be used to set all bits to 1...
Actually it is a VERY interesting theme: making constants at runtime... I think we should have a thread about it here...
Russia is a weird place

dsouza123

This is a topic I am also interested in, SSE2, SSE3, SSE4 also.

NOT can be simulated by unsigned max- unsigned num.
NOT can also be simulated by xor unsigned num, unsigned max.

The max values for:
byte 255
word 65535
dword 4294967295
qword 18446744073709551615

Example
mov al, 255
sub  al, num  ; num is an unsigned byte

versus

mov al, 255
xor al, num  ; xor does a single bit subtract without borrow  1 - 0 = 1,  1 - 1 = 0

versus

not al

The MMX/SSE equivalents are
.data
  align 16         ; movaps/movdqa need a 16-byte aligned operand
  max dq -1, -1

MMX
movq mm1, qword ptr max
pxor mm0, mm1

SSE
movaps xmm1, oword ptr max
xorps xmm0, xmm1

SSE2
movdqa xmm1, oword ptr max
pxor xmm0, xmm1

My CPU doesn't support most of these opcodes so they aren't tested.
The theory is sound though.

Mark_Larson

Quote from: SolidCode on June 12, 2006, 07:38:58 AM
How can I use constants (immediates) inside the instructions? E.g. "add mmx1,10"
Can I have INC and DEC instructions?

  All the rest of your question has been answered so I will address these last two.  You can't use immediate values, but you can use variables in memory.  So to do the add, you simply set up the value 10 (decimal) in memory and add that.  INC and DEC work the same way: you do an ADD with a variable in memory whose value is 1 or -1.

For sse/sse2 the variables in memory have to be 16 byte aligned or you will get exceptions.


.data
align 16
one_variable   dd   1,1,1,1       ; defines 4 dwords for use with sse/sse2
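
As a minimal sketch of the memory-operand approach described above (untested here; it assumes SSE2, that `.686` and `.xmm` are in effect, and that `one_variable` is the 16-byte-aligned constant defined above), INC and DEC on four packed dwords could look like this:

```asm
.code
; Assumes .686 / .xmm and the 16-byte aligned one_variable from the .data section.
paddd   xmm0, oword ptr one_variable   ; "INC": add 1 to each of the 4 dwords
psubd   xmm0, oword ptr one_variable   ; "DEC": subtract 1 from each of the 4 dwords
```

The same constant serves both directions, so a separate -1 variable is not strictly needed.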


BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

SolidCode

Thank you, asmfan! That was exactly what I needed. It works just fine for MMX regs.
mm_minusOne macro reg:REQ
    pcmpeqb reg,reg
endm


Is there a one-instruction version for XMM? Or should I use the following two-instruction version?
xmm_minusOne macro reg:REQ
    cmpss   reg,reg,0    ;set lowest 32 bits to all ones
    shufps   reg,reg,00  ;copy lowest 32 bits to all the other bits
endm


To set any reg to zero we can use the following macro:
xmm_Zero macro reg:REQ
    xorps   reg,reg
endm


Thanks dsouza123, I know the case with xor. The entire idea was to do things quickly and without use of memory, as memory access slows down the process.
Maybe this will work?
xmm_not macro reg:REQ,dummyreg:REQ
    xmm_minusOne  dummyreg
    xorps   reg,dummyreg
endm


Mark_Larson.
Again, I wanted to avoid memory usage. I know how to do it with memory.

Mark_Larson

Quote from: SolidCode on June 13, 2006, 02:20:30 AM
Mark_Larson.
Again, I wanted to avoid memory usage. I know how to do it with memory.


  I never saw where you said in your message that you didn't want to use memory.  So that's why I gave you that answer.  There is no INC/DEC instruction for MMX/SSE/SSE2, and no ADD with an immediate.  So you have to use an MMX/SSE ADD/SUB with two registers or with a register and memory. If you can set up your registers to have the values you need to add without using memory, then no problem.  Most of your questions are pretty easy stuff, and if you spend time looking at the MMX/SSE/SSE2 instruction set you can figure it all out for yourself.  I recommend you download the Instruction Set Reference for P4 from Intel's website.  That would answer a lot of your questions.


; simulating INC/DEC with no memory accesses
; XMM0 has the value we want to increment by 1.
mov eax,1
movd xmm7,eax
pshufd xmm7,xmm7,0000b   ; broadcast the low dword to all four dwords
paddd xmm0,xmm7
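
A variation on the sequence above, offered as a sketch and not as the poster's method: with SSE2, PCMPEQD of a register with itself leaves -1 in every dword, and subtracting -1 increments. It assumes xmm7 is free to use as scratch.

```asm
; Assumes SSE2 (.686 / .xmm) and that xmm7 may be overwritten.
pcmpeqd xmm7, xmm7      ; every dword of xmm7 = 0FFFFFFFFh (-1)
psubd   xmm0, xmm7      ; "INC": x - (-1) = x + 1 in each dword
paddd   xmm0, xmm7      ; "DEC": x + (-1) = x - 1 in each dword
```

This avoids both the memory access and the trip through a general-purpose register.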



Change the CMPSS to a CMPPS and you can make it just one instruction.


xmm_minusOne macro reg:REQ
    cmpps reg,reg,0 ;set all 128 bits to ones
endm

SolidCode

Mark_Larson
I am sorry if I have offended you. You are right. I forgot to write about the "no memory access" limitation.
Thank you for the "cmppd" idea. Now it really is done in one instruction.
xmm_minusOne macro reg:REQ
    cmppd   reg,reg,00
endm
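
One caveat, with a sketch of a way around it: CMPPS/CMPPD with predicate 0 is a floating-point equality test, so any lane that happens to hold a NaN bit pattern compares unequal to itself and comes out as all zeros. If SSE2 is available, an integer compare does not have that problem. The macro name below is hypothetical.

```asm
; Hypothetical helper, assumes SSE2 (.686 / .xmm).
xmm_minusOneInt macro reg:REQ
    pcmpeqd reg,reg     ;integer compare: every bit set, whatever reg held before
endm
```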

I am using the "IA-32 Intel® Architecture Software Developer's Manual". Is there a better reference for the MMX and SSEx instructions? Can you give me a download link? Is there something in HLP or CHM format, like there is for the 32-bit instructions? I see you have a good deal of experience working with these extensions.

All
I have just started seriously digging into the MMX and SSE extensions.
I hoped that this thread could become a place for others interested in learning them, and a source of macros that would help newbies like me start using them more easily and find ways to do the things we are used to from IA-32, e.g. setting regs to 0 or -1, inverting regs and similar things.
Is there a way to rotate (ROL or ROR) by individual bits in MMX/SSE? So far I have only seen a way to rotate by whole dwords. That is good but useless for me now; I need at least word granularity. I think I showed the macros for dwords in one of my earlier answers.
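
Packed rotates are not part of MMX/SSE/SSE2, but a rotate within each word can be put together from two shifts and an OR. A sketch, assuming SSE2, a scratch register, and a constant count from 1 to 15; the macro name is made up for illustration:

```asm
; Hypothetical helper: rotate each of the 8 words in reg left by count bits.
; Assumes SSE2 (.686 / .xmm); tmp is overwritten; count is a constant 1..15.
xmm_rolw macro reg:REQ,tmp:REQ,count:REQ
    movdqa  tmp,reg
    psllw   reg,count           ;bits shifted left, low bits become 0
    psrlw   tmp,16-(count)      ;bits that fell off the top, moved to the bottom
    por     reg,tmp             ;combine both halves of the rotate
endm
```

Usage would be along the lines of `xmm_rolw xmm0,xmm1,3`. Swapping PSLLW and PSRLW gives a rotate right, and PSLLD/PSRLD with `32-(count)` would do the same per dword.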

SolidCode

Greetings to all.
I have been raising my questions on wasm.ru. Those guys have helped some more.
By now I have an include that implements numerous macros utilizing MMX-SSE2 instructions to do different operations.
Your opinion is welcome.

I have updated this xmmcode.inc. It is better commented now, and a couple of macros were added for SSE instructions that MASM 6.15 cannot assemble (movsd, cmpsd) because they have the same names as 32-bit instructions.
Some macros were modified / added.

[attachment deleted by admin]

Mark Jones

Great work! :)

Now we just need some examples of how to use these macros. :U
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

daydreamer

I have a few macros, if there is interest: an unfinished set of SSE equates for the fluxus script language, including push/pop of the whole state machine in the manner of a stack, which might be useful in situations where you want to do recursion.
It was supposed to be fluxus, but limited to the maximum of eight 3D coordinates I can keep in the eight XMM registers, which is enough for cubes that build a 3D tree by recursion. I never got around to finishing it.


.data?
    statestack  dd 65536 dup (?)
                dd 65536 dup (?)

.data
ALIGN 16
vectconst0   REAL4 0.0,0.0,0.0,0.0 ;room for constanst used in fluxusprogram
vectconst1   REAL4 0.0,0.0,0.0,0.0
vectconst2   REAL4 0.0,0.0,0.0,0.0
vectconst3   REAL4 0.0,0.0,0.0,0.0
vectconst4   REAL4 0.0,0.0,0.0,0.0
vectconst5   REAL4 0.0,0.0,0.0,0.0
vectconst6   REAL4 0.0,0.0,0.0,0.0
vectconst7   REAL4 0.0,0.0,0.0,0.0
vectconst8   REAL4 0.0,0.0,0.0,0.0
tmpxmm0      REAL4 0.0,0.0,0.0,0.0 ;temporary storage
tmpxmm1      REAL4 0.0,0.0,0.0,0.0
tmpxmm2      REAL4 0.0,0.0,0.0,0.0
tmpxmm3      REAL4 0.0,0.0,0.0,0.0
tmpxmm4      REAL4 0.0,0.0,0.0,0.0
tmpxmm5      REAL4 0.0,0.0,0.0,0.0
tmpxmm6      REAL4 0.0,0.0,0.0,0.0
tmpxmm7      REAL4 0.0,0.0,0.0,0.0
tmpf0        REAL4 0.0,0.0,0.0,0.0
tmpf1        REAL4 0.0,0.0,0.0,0.0
tmpf2        REAL4 0.0,0.0,0.0,0.0
X   equ 0
Y   equ 4
Z   equ 8
.code

; #########################################################################
    PUSHSTATE MACRO
    FXSAVE [ebx]
    add ebx,512
    ENDM
    POPSTATE MACRO
    sub ebx,512
    FXRSTOR [ebx]
   
    ENDM
   
    TRANSLATE MACRO vectconst
    addps XMM0,vectconst
    addps XMM1,vectconst
    addps XMM2,vectconst
    addps XMM3,vectconst
    addps XMM4,vectconst
    addps XMM5,vectconst
    addps XMM6,vectconst
    addps XMM7,vectconst

    ENDM
    SCALE MACRO vectconst
    mulps XMM0,vectconst
    mulps XMM1,vectconst
    mulps XMM2,vectconst
    mulps XMM3,vectconst
    mulps XMM4,vectconst
    mulps XMM5,vectconst
    mulps XMM6,vectconst
    mulps XMM7,vectconst
    ENDM
   
   start:
    ;initialize statestack
    finit
    lea ebx,statestack
    add ebx,511
    and ebx,0FFFFFE00h ;round up to a multiple of 512


ToutEnMasm


I have only a PIII and I have problems with some instructions

pavgw   ;this one, I don't see how to translate

fcomip st(0),st(1) ;translated as

.data
FPUDATA DQ 0
.code
    FSTP FPUDATA
    FCOM FPUDATA

If someone finds a translation for pavgw, I'll take it.

                                   ToutEnMasm



dsouza123

ToutEnMasm

The Pentium III has SSE so pavgw will work with 64 bit MMX values.

The 64 bit values are really a packed array of four unsigned 16 bit words.

The instruction pavgw is an average of two packed word arrays.

pavgw dest, source   dest has to be a register, source register or 64 bit memory


.data
  warr0 dw 0, 1, 2, 3
  warr1 dw 5, 5, 6, 6
  warr2 dw 0, 0, 0, 0

.code                         ; destination, source
  movq  mm0, qword ptr warr0  ; load mm0 with warr0
  pavgw mm0, qword ptr warr1  ; add corresponding words, add 1, shr 1, store in dest
                              ; dest0 = (dest0 + source0 + 1) shr 1   (unsigned)
                              ; dest1 = (dest1 + source1 + 1) shr 1
                              ; dest2 = (dest2 + source2 + 1) shr 1
                              ; dest3 = (dest3 + source3 + 1) shr 1
  movq  qword ptr warr2, mm0  ; store mm0 into warr2


using the values in the .data section

0+5+1 =  6 / 2 = 3
1+5+1 =  7 / 2 = 3
2+6+1 =  9 / 2 = 4
3+6+1 = 10 / 2 = 5

result

warr2 dw 3, 3, 4, 5
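
For a CPU or assembler without PAVGW, the same rounded average can be built from plain MMX instructions using the identity (a + b + 1) shr 1 = (a or b) - ((a xor b) shr 1), which also never overflows a word. A sketch reusing the warr0, warr1 and warr2 arrays above; mm1 and mm2 are assumed to be free as scratch:

```asm
; Plain-MMX stand-in for pavgw; assumes .mmx and that mm1, mm2 may be overwritten.
  movq  mm0, qword ptr warr0  ; a
  movq  mm1, qword ptr warr1  ; b
  movq  mm2, mm0              ; copy of a
  por   mm0, mm1              ; a or b
  pxor  mm2, mm1              ; a xor b
  psrlw mm2, 1                ; (a xor b) shr 1, per word
  psubw mm0, mm2              ; (a or b) - ((a xor b) shr 1)
  movq  qword ptr warr2, mm0  ; store the four averages
```

With the values in the .data section this gives the same warr2 dw 3, 3, 4, 5 as the PAVGW version.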


Is fcomip really ficom ?

The fcom in the .code section doesn't appear to correspond to anything about pavgw.

It looks like it stores the value in ST(0) in FPUDATA then pops it off the stack,
then compares what is now in ST(0) with FPUDATA.

What is your code supposed to do ?

ToutEnMasm

Hello,
the code is the fireworks demo that I got working on the PIII
http://www.masm32.com/board/index.php?topic=1133.msg8290#msg8290
Going by the Intel manual, my translation of FCOMIP seems very close (it is perhaps missing a flag, and an interrupt in case of an exception)

Quote
Instruction          Description
FCOMI ST, ST(i)      Compare ST(0) with ST(i) and set status flags accordingly.
FCOMIP ST, ST(i)     Compare ST(0) with ST(i), set status flags accordingly, and pop register stack.
FUCOMI ST, ST(i)     Compare ST(0) with ST(i), check for ordered values, and set status flags accordingly.
FUCOMIP ST, ST(i)    Compare ST(0) with ST(i), check for ordered values, set status flags accordingly, and pop register stack.
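
Where FCOMIP is not available, the usual substitute is FCOMP followed by a transfer of the FPU status word into EFLAGS. A sketch, assuming AX may be overwritten:

```asm
; Rough equivalent of  fcomip st(0),st(1)  ; note that AX is clobbered.
fcomp   st(1)       ; compare ST(0) with ST(1), set C0/C2/C3, pop the stack
fnstsw  ax          ; copy the FPU status word to AX
sahf                ; AH -> EFLAGS: C0->CF, C2->PF, C3->ZF
; ja / jb / je can now be used as after an unsigned integer compare
```

Unlike storing ST(0) to memory first, this compares against ST(1) directly and leaves the flags where the conditional jumps and CMOVcc expect them.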

The cmov instructions are easy to translate, but transfers from MMX or FPU registers can only be 32 bits wide,
like
fld dword ptr[edi+eax+4]
or
movd dword ptr [edi+eax],MM0

It works (I get the effects in black and white), but it needs a little more work for a perfect translation
                               ToutEnMasm