using mmx/sse for temporary storage

kwadrofonik · March 05, 2007, 05:28:10 PM

Hi again,

I'm trying to use MMX for temporary storage but for some reason it crashes my program:


movd         mm0, eax

...

movd         ecx, mm0

Do I need to initialize mmx or shut off the FPU? I'm not using any floating point math.

Is this even a good idea? Is it that much faster than pushing/popping?

PBrennick · March 05, 2007, 05:38:54 PM

Hi kwadrofonik,
Yes, you need to initialize it.

Code:


        .386
        .mmx
        .xmm
        .model      flat,stdcall
        option      casemap:none

... and it is an excellent idea.

Paul

u · March 05, 2007, 08:33:27 PM

Just be sure to use "emms" (if you use MMX) before calling any Windows API that might use the FPU: I've seen it crash calls to GDI procs.
SSE is safe, at least :)
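
A minimal sketch of the emms advice above, assuming a MASM32 setup like the one used elsewhere in this thread. The value is copied out of mm0 and emms is issued before the call, on the assumption that the API may clobber the shared MMX/FPU state. MessageBox is only a stand-in for any API call, and the names saved, szCap and szMsg are made up for illustration.

```masm
.686
.model flat,stdcall
option casemap:none
.mmx

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib

.data
  szCap db "emms sketch",0
  szMsg db "API call done",0

.data?
  saved dd ?              ; hypothetical spill slot for the parked value

.code
start:
  mov   eax, 1234
  movd  mm0, eax          ; park the value in an MMX register

  ; ... code that makes no API calls and uses no FPU instructions ...

  movd  saved, mm0        ; copy it out first: mm0 may not survive the call
  emms                    ; mark the FPU tag word empty before the API runs
  invoke MessageBox, 0, ADDR szMsg, ADDR szCap, MB_OK

  mov   ecx, saved        ; reload from memory, not from mm0
  invoke ExitProcess, 0
end start
```

Once the value has to be copied to memory before every call anyway, the MMX register only pays off in stretches of code that make no calls at all.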

dsouza123 · March 05, 2007, 08:57:50 PM

The following works; it requires a CPU that supports MMX instructions.

Code:


.686
.model flat,stdcall
option casemap:none
.mmx

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib

.data
  num1  dd 12
  num2  dd 13
  num3  dd 0
  szCap db "Test example using MMX",0
  szFmt db "%lu : First number",13,10
        db "%lu : Second number",13,10
        db "%lu : Sum of the numbers",0
  szBuf db 256 dup (0)

.code
start:

  movd  mm1, num1
  movd  mm2, num2
  paddd mm1, mm2
  movd  num3, mm1

;  emms           ; would be needed if any FPU instructions used later

  invoke wsprintf, ADDR szBuf, ADDR szFmt, num1, num2, num3
  invoke MessageBox, 0, ADDR szBuf, ADDR szCap, MB_OK

  invoke ExitProcess, 0

end start


ic2 · March 06, 2007, 12:12:18 AM

Yes, as PBrennick said, that will work, but only for MASM. Use the order below so it works for both MASM and POASM. Now you only have to change the name and not the position.

Code:

Not:
.386
.mmx
.xmm
.model      flat,stdcall
option      casemap:none


Use:
.386
.model      flat,stdcall
option      casemap:none
.mmx        ;.p2
.xmm

hutch-- · March 06, 2007, 12:27:07 AM

If you are talking about 32-bit registers, you are probably better off using the uninitialised data section, as I vaguely remember there being a problem with repeatedly shifting an integer (GP) register to either an MMX or XMM register and back. Unless you know for sure that either MMX or XMM is faster than storing the register value in memory, memory is a better option with no initialisation required and no other problems.

Code:


.data?
  _eax dd ?
.code
  mov _eax, eax
  ; more code
  mov eax, _eax

raymond · March 06, 2007, 03:15:11 AM

Plus the fact that Windows XP programmers found a new toy with MMX and used it for the simplest math tasks. Thus, if you call an API function, you risk seeing your data in MMX registers being trashed. The same applies to floating point data in FPU registers which are the same ones used for MMX.

I would strongly recommend you use memory to store your data unless you are certain that Windows functions will not be called until you retrieve that data. (Not all Windows functions use the MMX/FPU registers, but those that do have not been identified in their description.)

Raymond

dsouza123 · March 06, 2007, 12:32:15 PM

What about using fsave or
copying the mmx registers to something like the following along with fstenv?

.data?
_mmx0 dq ?
_mmx1 dq ?
_mmx2 dq ?
_mmx3 dq ?
_mmx4 dq ?
_mmx5 dq ?
_mmx6 dq ?
_mmx7 dq ?
_menv db 28 dup (?)    ; fstenv stores 28 bytes in 32-bit protected mode

OR

.data?
_mmx0 dd ?, ?
_mmx1 dd ?, ?
_mmx2 dd ?, ?
_mmx3 dd ?, ?
_mmx4 dd ?, ?
_mmx5 dd ?, ?
_mmx6 dd ?, ?
_mmx7 dd ?, ?
_menv db 28 dup (?)    ; fstenv stores 28 bytes in 32-bit protected mode

So with Raymond's info even using regular FPU instructions can be problematic.

What about the SSE2 registers: are they saved, or are they not touched by API calls?
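
On the SSE2 question, a hedged sketch: the XMM registers are separate from the FPU stack, so no emms is needed. As far as I know, though, the 32-bit Windows calling conventions do not promise to preserve xmm0-xmm7 across a call, so the same caveat about API calls applies. This assumes an SSE2-capable CPU and an assembler version whose .xmm directive accepts the SSE2 forms of movd. The proc name park is made up.

```masm
.686
.model flat,stdcall
option casemap:none
.xmm

.code
park proc
  movd  xmm0, eax         ; SSE2: copy eax into the low dword of xmm0

  ; ... code that makes no calls ...

  movd  ecx, xmm0         ; copy the parked value back into a GP register
  ret
park endp
end
```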

raymond · March 07, 2007, 03:32:26 AM

Quote: What about using fsave

The fsave instruction is quite slow and requires additional memory for FPU internal registers other than the data registers. See:

http://www.ray.masmcode.com/tutorial/fpuchap3.htm#fsave

Saving data in FPU/MMX registers and then saving those in memory instead of saving only the necessary data directly in memory would be a useless long detour.

Raymond

dsouza123 · March 07, 2007, 01:31:24 PM

So if only 3 MMX registers were used for temporary storage,
for example mm0, mm1, mm2, and only those 3 were saved
before an API call that used either MMX or FPU instructions,
would restoring those three registers be enough,
or would the fstenv/fldenv instructions also be needed?

Code:


.data?
  _mmx0 dd ?, ?    ; an MMX register only needs 64 bits (two dwords)
  _mmx1 dd ?, ?    ; saving an FPU register needs an extra word per register ( 80 = 64 + 16 ):  dd ?,?  dw ?
  _mmx2 dd ?, ?
  _mmx3 dd ?, ?
  _mmx4 dd ?, ?
  _mmx5 dd ?, ?
  _mmx6 dd ?, ?
  _mmx7 dd ?, ?
  _menv db 28 dup (?)    ; fstenv stores 28 bytes in 32-bit protected mode

.code
  movq qword ptr _mmx0, mm0    ; the MMX registers are named mm0..mm7
  movq qword ptr _mmx1, mm1
  movq qword ptr _mmx2, mm2

  fstenv _menv     ; is this necessary to be saved and restored
                   ; for MMX to work if the FPU was used in API call(s) ?

  ; API call(s)

  movq mm0, qword ptr _mmx0
  movq mm1, qword ptr _mmx1
  movq mm2, qword ptr _mmx2

  fldenv _menv     ; along with this restoring instruction

kwadrofonik · March 10, 2007, 05:33:40 AM

I've come to a sad realization (Cancel/Allow?) hehe j/k

The MMX registers have such high latency that it's not worthwhile to use them as storage. MOVD from imm to gpr takes 2 cycles on a P4 and a whopping 5 ticks from gpr to imm. What's wrong with this picture???

I tried replacing some of my local variables with imm and my code is equal to or slower than before. So unless you're using SIMD instructions, MMX is a waste of time.

http://www.tommesani.com/MMXLatency.html

dsouza123 · March 10, 2007, 12:24:03 PM

The idea is to cover latency issues by interleaving other instructions
so by the time the first instruction is done it can be immediately used.

It seems counterintuitive that transferring between ALU and MMX registers would
be slower than ALU and memory.

Whether the memory access is served from the instruction prefetch buffer (16 bytes),
from a hit in the xx-byte cache line, from the L1 cache, or from the L2 cache
would have a definite effect on the speed.

An immediate is different in that it is (hopefully) loaded in the cache line with the instruction,
(that is why alignment can be an issue) versus a memory reference in a data section or on the stack.
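
The interleaving idea above can be sketched roughly as follows. This is only an illustration: the proc name interleave and the filler instructions are made up, and whether it actually hides the MOVD latency depends on the CPU, so it would need to be timed.

```masm
.686
.model flat,stdcall
option casemap:none
.mmx

.code
interleave proc
  movd  mm0, eax          ; start the slow GP -> MMX transfer
  add   edx, 8            ; independent work that touches neither eax nor mm0
  lea   ecx, [ecx+4]      ; more independent work while the transfer completes
  movd  eax, mm0          ; the value is read back only after the filler work
  emms                    ; leave the FPU state clean for the caller
  ret
interleave endp
end
```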

dsouza123 · March 10, 2007, 03:33:01 PM

There don't appear to be any MMX instructions that load an immediate value into a register.
Apart from the 8-bit shift counts (e.g. psllq mm0, 8), the only operands are MMX or ALU registers and memory addresses.

Latency on P4s is high for many instructions because, among other issues,
they have very deep pipelines.

On most other architectures such as
Intel P3, PM, Core, Core 2
AMD K7, K8 and the soon coming K10 (Barcelona)
the pipelines are considerably shorter and have many optimizations to reduce latency.
