
3D Vector Normalize with SSE2

Started by johnsa, June 03, 2008, 08:59:37 PM


johnsa

Here is a vector normalization function I wrote a while back. Can anyone see any potential performance improvements? (I haven't really looked at SSE3/SSE4 options, but I would if it makes it faster. :)


align 4
Vector3D_Normalize PROC uses esi ebx ptrVR:DWORD, ptrV1:DWORD

    mov esi,ptrV1               ; esi -> source vector <x,y,z,w>
    mov ebx,ptrVR               ; ebx -> result vector
    movaps xmm0,[esi]           ; load <x,y,z,w> (must be 16-byte aligned)
    movaps xmm3,xmm0            ; keep a copy for the final scale
    mulps xmm0,xmm0             ; <x*x, y*y, z*z, w*w>
    pshufd xmm1,xmm0,00000001b  ; low dword of xmm1 = y*y
    pshufd xmm2,xmm0,00000010b  ; low dword of xmm2 = z*z
    addss xmm0,xmm1             ; low dword = x*x + y*y
    addss xmm0,xmm2             ; low dword = x*x + y*y + z*z
    rsqrtss xmm1,xmm0           ; approximate 1/length (~12-bit accurate)
    pshufd xmm1,xmm1,00000000b  ; broadcast it to all four dwords
    mulps xmm3,xmm1             ; scale <x,y,z,w> (note: w is scaled too)
    movaps [ebx],xmm3           ; store the normalized vector

    ret
Vector3D_Normalize ENDP
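For anyone who wants to experiment without assembling it, here is a sketch of the same algorithm in C with SSE intrinsics (my own helper name and unaligned loads, not part of the original proc). Like the MASM version it uses the fast approximate rsqrt, so results are only accurate to roughly 12 mantissa bits, and the w component gets scaled too:

```c
#include <xmmintrin.h>  /* SSE */

/* Sketch of the proc above using intrinsics (hypothetical helper name). */
static void vec3_normalize(float out[4], const float in[4])
{
    __m128 v   = _mm_loadu_ps(in);           /* <x, y, z, w>              */
    __m128 sq  = _mm_mul_ps(v, v);           /* <x*x, y*y, z*z, w*w>      */
    /* pull y*y and z*z down to the low lane, like the pshufd pair */
    __m128 y2  = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(0, 0, 0, 1));
    __m128 z2  = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(0, 0, 0, 2));
    __m128 len2 = _mm_add_ss(_mm_add_ss(sq, y2), z2); /* x2+y2+z2 in lane 0 */
    __m128 r   = _mm_rsqrt_ss(len2);         /* approximate 1/length      */
    r = _mm_shuffle_ps(r, r, _MM_SHUFFLE(0, 0, 0, 0)); /* broadcast       */
    _mm_storeu_ps(out, _mm_mul_ps(v, r));    /* note: w is scaled too     */
}
```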

c0d1f1ed

It looks fine for normalizing one vector. To really get the highest possible performance, though, you might want to check whether you can use a structure-of-arrays data layout and normalize four vectors in parallel, like this:


movaps xmm0, x      ; load 4 different x values
movaps xmm1, y      ; load 4 different y values
movaps xmm2, z      ; load 4 different z values
movaps xmm3, xmm0   ; keep copies for the final scale
movaps xmm4, xmm1
movaps xmm5, xmm2

mulps xmm0, xmm0    ; x*x for all four vectors
mulps xmm1, xmm1    ; y*y
mulps xmm2, xmm2    ; z*z
addps xmm0, xmm1
addps xmm0, xmm2    ; x*x + y*y + z*z, four sums at once
rsqrtps xmm0, xmm0  ; four approximate reciprocal lengths in one go
mulps xmm3, xmm0
mulps xmm4, xmm0
mulps xmm5, xmm0    ; scale all components

movaps x, xmm3      ; store 4 normalized vectors
movaps y, xmm4
movaps z, xmm5
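An intrinsics sketch of the same SoA idea (again my own naming; `x`, `y`, `z` each hold one component from four different vectors):

```c
#include <xmmintrin.h>  /* SSE */

/* Normalize four vectors at once from SoA arrays (hypothetical helper). */
static void vec3_normalize_soa4(float x[4], float y[4], float z[4])
{
    __m128 vx = _mm_loadu_ps(x), vy = _mm_loadu_ps(y), vz = _mm_loadu_ps(z);
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx),
                                        _mm_mul_ps(vy, vy)),
                             _mm_mul_ps(vz, vz));   /* 4 squared lengths */
    __m128 r = _mm_rsqrt_ps(len2);   /* 4 approximate 1/length values    */
    _mm_storeu_ps(x, _mm_mul_ps(vx, r));
    _mm_storeu_ps(y, _mm_mul_ps(vy, r));
    _mm_storeu_ps(z, _mm_mul_ps(vz, r));
}
```

One rsqrtps here does the work of four rsqrtss calls in the AoS version, which is where most of the win comes from.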

johnsa

Thanks for this!
I did consider SoA instead of AoS, but I'd really like to keep my data grouped in logical units wherever possible. That said, the one other thing I'm not sure about with an SoA implementation is this: although the actual code is considerably faster (4 full vectors per iteration), I'm wondering what the effect on the cache would be like.

For example, if you have an object with 10,000 vectors and you keep the data SoA instead of AoS, the three reads from memory will a) have a large stride (possible cache line contention? although 3 or 4 streams should be OK given the set-associative nature of the cache) and b) not follow a single sequential forward read pattern, so every iteration would probably have to re-load cache lines, worse still possibly in between the reads for x, y and z due to the distance between them?

Just thinking out loud here :) I will try both approaches on large sets and see what the difference is exactly. Possibly the reduced amount of code and only needing one rsqrt makes up for any cache penalties.
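To make the comparison concrete, the two layouts being weighed up look something like this (illustrative types, not from the engine). With AoS, one 64-byte cache line holds four whole vectors; with SoA, the x, y and z values of element i live about 40KB apart, so each iteration touches three separate cache lines / prefetch streams:

```c
#include <stddef.h>

/* AoS: one vector per struct; sequential access is a single stream. */
typedef struct { float x, y, z, w; } Vec4AoS;

enum { NVERTS = 10000 };

/* SoA: the same 10,000 vectors split into three component arrays;
   x[i], y[i], z[i] are NVERTS*4 bytes apart from each other. */
typedef struct {
    float x[NVERTS];
    float y[NVERTS];
    float z[NVERTS];
} Vec3SoA;
```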


c0d1f1ed

Modern CPUs have prefetchers that can track several distinct data streams, so I don't think there's a lot to worry about. Intel's optimization documents recommend SoA for everything, so I expect they also have the necessary cache enhancements for it.

By the way, rsqrt is actually a very fast instruction. It computes an approximation that is only accurate to about 12 mantissa bits. That should be fine for vector normalization, but be careful in more precision-sensitive applications.
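If you need more accuracy but still want to avoid sqrtss + divss, the standard trick is one Newton-Raphson iteration on top of the rsqrt estimate, which roughly doubles the accurate bits. A minimal scalar sketch (helper name is mine):

```c
#include <xmmintrin.h>  /* SSE: _mm_rsqrt_ss */

/* One Newton-Raphson step on top of rsqrtss: r' = r * (1.5 - 0.5*x*r*r).
   Standard refinement; takes the ~12-bit estimate to near full single
   precision at the cost of a few extra multiplies. */
static float rsqrt_refined(float x)
{
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x))); /* rough guess */
    return r * (1.5f - 0.5f * x * r * r);                 /* NR iteration */
}
```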

johnsa

Good point. Yeah, rsqrt is only being used in this case for a 3D engine setup, nothing requiring too much precision. I have a separate version of the proc with sqrtss for full precision.

An interesting observation I've made with my original proc (dealing with one AoS vector at a time):
on my test machine I've set up some timing code calling the function 64,000,000 times.

If I pass in a pointer to the same vector as both the source and output (ptrV1 and ptrVR):
the proc's cycle count comes in around 28, and it takes about 850ms for 64,000,000 calls.

However, if I pass in two different pointers: 23 cycles, and 640ms, for the same 64,000,000 iterations.

So there seems to be a 200ms overall penalty for reading from and then writing to the same address (which is quite large).

I'm trying to figure out what cache-optimization rule would be causing this 200ms penalty compared to using two different pointers, which are both 16-byte aligned and sequential in memory.

johnsa

Subsequently I did some more testing and experimentation; I even ran the code through VTune.

VTune came back saying that bus utilization was low (which I suppose is to be expected, as this is computation-intensive rather than memory-intensive).
It also reported that FP Assist was high, which indicates denormals or underflows. So this brings a few things up:

First, I set the MXCSR register to flush-to-zero and denormals-are-zero, which should improve performance under those conditions... yet VTune still came back with the FP Assist insight and suggested setting FTZ and DAZ on... ???
Secondly, what could possibly cause denormals or underflow with that code?
It takes vector 1, normalizes it and stores it into vector 1...

Thirdly, the performance hit from using the same address for the input and output vector is massive; it almost halves the speed.
At first I thought it might be cache-related; then I thought it could have something to do with the underflow, from constantly re-normalizing the same vector. But I'm not sure how, as the length would be unit length and re-normalizing it should have no effect?

PS: I'm still running this on the single-vector AoS version of the normalize, with an entire <x,y,z,w> vector passed in a single xmm reg.
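For reference, the FTZ/DAZ setup mentioned above can be done from C like this (a sketch; in MASM the equivalent is stmxcsr / or / ldmxcsr with bit 15 for FTZ and bit 6 for DAZ):

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE (FTZ, MXCSR bit 15)   */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE (DAZ, MXCSR bit 6) */

/* Turn on both denormal shortcuts for the current thread:
   FTZ flushes denormal *results* to zero, DAZ treats denormal *inputs*
   as zero. Both trade IEEE correctness for speed. */
static void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

Note that MXCSR is per-thread state, so this has to be done in every thread that runs the SSE code.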

johnsa

On another point, I reckon the w coordinate of the vector should remain unchanged (another advantage of the SoA method). With AoS, I guess one should just restore that part of the register to 0 or 1?
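One SSE2-only way to do that restore is a bitwise select that keeps the original w lane after the mulps has scaled all four components (hypothetical helper; the post leaves the choice open):

```c
#include <xmmintrin.h>  /* SSE  */
#include <emmintrin.h>  /* SSE2 */

/* Combine the normalized xyz with the original (unscaled) w lane. */
static __m128 keep_w(__m128 normalized, __m128 original)
{
    /* mask with all bits set in lane 3 (w) only */
    const __m128 wmask = _mm_castsi128_ps(_mm_set_epi32(-1, 0, 0, 0));
    return _mm_or_ps(_mm_and_ps(wmask, original),       /* original w   */
                     _mm_andnot_ps(wmask, normalized)); /* scaled x,y,z */
}
```

In MASM the same idea is an andps/andnps/orps sequence against a 16-byte mask constant.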

c0d1f1ed

Quote from: johnsa on June 09, 2008, 04:53:26 PM
I set the mxcsr register to flush to zero and denormals are zero which should improve the performance under those conditions...

Some processors actually don't implement those features. Or they do, but it still incurs some performance penalty.

QuoteSecondly... what could possibly cause denormals or underflow with that code?

Multiplication can cause denormals when you already started with a small component (e.g. 0.0001 squared is 0.00000001, and repeated squaring keeps shrinking the magnitude). Underflow can happen if a component is already a denormal.
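To put a number on where this kicks in for single precision: the smallest normal float is FLT_MIN (~1.18e-38), so squaring any component below about sqrt(FLT_MIN) (~1.08e-19) already produces a denormal. A quick check (my helper, just for illustration):

```c
#include <math.h>   /* fpclassify, FP_SUBNORMAL */
#include <float.h>  /* FLT_MIN                  */

/* Does squaring x land in the denormal (subnormal) range? */
static int square_is_denormal(float x)
{
    float sq = x * x;
    return sq != 0.0f && fpclassify(sq) == FP_SUBNORMAL;
}
```

So ordinary vector components like 0.0001 are nowhere near the danger zone; the squares only go denormal for extremely small inputs (or uninitialized garbage).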

Anyway, you might want to try enabling exceptions for denormals and underflow. That could help you locate exactly where it happens and what the input data was. It's hard to eliminate, though. Just make sure you don't work with uninitialized components.

QuoteAt first I thought it might be cache related, second I thought that could have something to do with the underflow, by constantly re-normalizing the same vector.. but i'm not sure how as the length would be unit length and re-normalizing it would have no effect?

It could be cache-related: the cache eviction policy might cause data you no longer need to stick around longer than data you will reuse. But it could also be the constant re-normalizing. Due to numerical imprecision it's quite possible that you don't get exactly the same vector back, and it slowly 'turns'. When the components become really small you get denormals and underflow... Only focused testing will reveal what the cause really is.

johnsa

The input vector is <2.0,3.0,4.0,1.0>, which gets normalized and stored into vector2.
So every iteration performs the same normalization on the same input values; I don't see how that would be denormal, and if it were, it would be completely unavoidable.
If I run it so that the normalized vector goes back into the input, then the values alternate every time it normalizes, e.g. x=0.0312456, then x=0.03115995, then x=0.0312456 and so on.