Adding Velocity Array to Location Array (Movement)

Started by OceanJeff32, March 25, 2005, 03:49:42 AM

OceanJeff32

Ok, Mark, I have a question now:

Say I have two arrays of 32-bit floating-point numbers that I want to add together: one is X (position), the other is XV (velocity).

As I understand it, loading all eight XMM registers first, then adding the velocities to all eight, then storing all eight is faster than looping through the arrays with a single register, because of the dependencies between instructions.

Right?

Jeff C
::)
Any good programmer knows that every job, large or small, is equally large to the programmer!

Mark_Larson

Yep, you are correct. How much it helps depends on how long the read from memory takes and how long the instruction itself takes: you want to make sure a register has been updated with the proper value before you try to use it. Usually you don't need, or want, to use all 8 XMM registers; I usually use a few. Experiment and see which gives the best performance.

Additional tip: if you are using a P4, it is sometimes faster to use PSHUFD than MOVAPS. PSHUFD has 4 cycles of latency and a throughput of 2 cycles; MOVAPS has 6 cycles of latency and a throughput of 1 cycle. The trade-off is that MOVAPS can execute in parallel with other SSE instructions, whereas PSHUFD can't. So time, time, time your code to see which way is fastest for you.

This comes down to execution ports. PSHUFD goes through port 1 along with all the other SSE instructions. The exception is the SSE move instructions, which go through port 0 (the same applies to FP, MMX, and SSE2). For instructions to actually execute in parallel, they have to go through separate ports. That's why you see a lot of people talking about mixing FP and ALU code: FP code goes through port 1 (except FP moves, which also go through port 0). Most ALU instructions can go through either port 0 or port 1 (some, such as the shift instructions, only go through port 1). So any ALU instruction that can go through port 0 can run in parallel with FP/MMX/SSE/SSE2 instructions running on port 1.


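movaps  xmm0,[esi]        ; load 4 packed floats from the 16-byte aligned address in esi

; is equivalent to

pshufd  xmm0,[esi],0E4h   ; 0E4h = 11 10 01 00b: dword k of the result = dword k of [esi]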
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm