
SSE Weirdness...

Started by johnsa, June 09, 2008, 11:38:53 PM


johnsa


Ok... this piece of code basically just reads a value and writes it back (with some waffle in between to make sure there should be no read/write stall)..
This version, using normal 32-bit dwords with mov eax,[esi] ... mov [edi],eax, runs at EXACTLY the same speed whether esi/edi point to two different align-4 dwords or to the same address.

mov esi,offset testdd1
mov edi,offset testdd2
mov eax,[esi]
pshufd xmm0,xmm0,0
movaps xmm1,xmm0
xor ebx,ebx
add ebx,10
add ebx,20
sub ebx,5
mov [edi],eax


Now do exactly the same thing with align 16 and two xmmwords, using movaps... and all of a sudden, if esi = edi, the code runs about 15% slower.. make esi != edi (as below) and it speeds up... what the heck?

mov esi,offset testdd1
mov edi,offset testdd2
movaps xmm0,[esi]
pshufd xmm0,xmm0,0
movaps xmm1,xmm0
xor ebx,ebx
add ebx,10
add ebx,20
sub ebx,5
movaps [edi],xmm0

Neo

Cool observation.  It probably has something to do with the way your processor checks and updates caches; it may be completely different on another processor, or it may be common to most processors.  Writing to a location in one of the primary data read cache lines (usually 64 consecutive bytes each?) means that the caches need to be updated, whereas if the locations being written aren't in any caches, the data can be sent to memory while other instructions are running.  If you read from the same cache line soon after, the write instruction needs to be mostly complete before the read instruction can get its data.  That may not be the reason at all; it's just a hunch.

johnsa

It does seem to be cache related.. I've even added more padding code after the write to ensure time between the write and subsequent re-read at the beginning of the loop.

What is odd is that it doesn't affect 32-bit regs, only SIMD.. and assuming all data is aligned, using movaps or movdqa, and ensuring the read and write are the same size.. I can't find any type of stall in the Intel reference that should occur in this scenario.. It puzzles me.. and it seems to make a huge performance difference, we're talking 200ms on a 1 million iteration loop.. that's big for something so trivial.

hutch--

John,

One of the things I have learnt on late Intel hardware is that leading and trailing instructions often affect the speed of the code you are working on. Just try this to see if it affects your code speed. Pad the space before the procedure, if you are using MASM, with about 4k of nops, align the procedure to 16 bytes, then see if there is any speed difference.

Another choice is to allocate memory set to execute, align it to a page boundary and then copy the entire procedure into dynamic memory. In pseudo code,

[copy]
pmem = alloc(len_myproc + 8k)
memalign pmem, 4096
memcopy myproc, pmem, len_myproc
[/copy]

johnsa

This thread was basically a follow-on from what I found with my SIMD vector normalize function in the other thread.. So I tried your suggestion, the proc is aligned 16 and I've put 4096 nops in front of it.. the result is the same..

Here is the proc, perhaps you can time/test it on a different machine to see what happens.. but as the dummy code a few posts up indicates, it seems to be a general "memory-access" thing and not specific to this routine.



Vector3D STRUCT
x REAL4 0.0
y REAL4 0.0
z REAL4 0.0
w REAL4 0.0
Vector3D ENDS

rept 4096
nop
endm

align 16
Vector3D_Normalize PROC ptrVR:DWORD, ptrV1:DWORD

mov esi,ptrV1
mov edi,ptrVR
movaps xmm0,[esi]           ; xmm0 = [x,y,z,w]
movaps xmm3,xmm0            ; keep the original vector
mulps xmm0,xmm0             ; xmm0 = [x*x,y*y,z*z,w*w]
pshufd xmm1,xmm0,00000001b  ; lane 0 = y*y
pshufd xmm2,xmm0,00000010b  ; lane 0 = z*z
addss xmm0,xmm1             ; x*x + y*y
addss xmm0,xmm2             ; + z*z
rsqrtss xmm1,xmm0           ; approx 1/sqrt(len^2)
pshufd xmm1,xmm1,00000000b  ; broadcast to all four lanes
mulps xmm3,xmm1             ; scale the original vector
movaps [edi],xmm3

ret
Vector3D_Normalize ENDP



Now try calling that routine in some timing loop and make ptrV1 and ptrVR the same, then try it with two different addresses... in my .data I have the following:

.data

align 16
myvector1 Vector3D < 2.0, 3.0, 4.0, 1.0 >
myvector2 Vector3D < 2.0, 3.0, 4.0, 1.0 >

NightWare

A test on my Core 2 (10,000,000 iterations):
if esi != edi => 4 cycles
if esi = edi => 23 cycles
and no idea why...

c0d1f1ed

My guess is that it's caused by the write buffer. It makes sense that it has 64-bit entries, meaning that an SSE register would span two entries. This complicates reading back from it.

It would be interesting to see how it performs with MMX...

johnsa

Ok, with MMX registers the result is much like it is with the general-purpose 32-bit regs. There doesn't seem to be any penalty for reading/writing the same address.

Which makes the write buffer 64-bit entry option more likely.. although I'm not sure why the write buffer would struggle so much more to write two 64-bit chunks and then read them back from the same address, as opposed to reading from another xmmword.

johnsa

Ok, I think I have it worked out.. the problem is store-forwarding (section 3.6.7.3 in the Intel optimization manual) and it affects Core 2, Pentium M, Core Duo and Solo.

If a load reads from an address that has just been stored to, the store data is forwarded to the load, and the load has to wait until that data is available. It also applies to load/store pairs whose addresses alias at 4KB strides.

The problem should actually happen with 32-bit regs, MMX and XMM ....

After some experimenting with the loop by putting padding code after the store, I eventually found that about 40 opcodes must pass (I guess this depends on opcode timings, throughput, latency etc.) before the xmm store/load penalty is avoided... with that padding in place, esi == edi and esi != edi both perform the same.

Strangely enough, with the MMX and GP 32-bit regs I experience no such penalty in any case.. or perhaps the latency of making the stored data available is so short that the other work in the timing loop is enough for it to be available by the subsequent load.