
Memory Speed, QPI and Multicore

Started by johnsa, August 24, 2011, 10:32:47 AM


johnsa

Many thanks for the results guys, very helpful!
Hutch, good point; sorry, I completely forgot to attach the exe ;)

My timings on my Core i7 (rough sketches of two of the loop styles follow the list):

350 ms, REP MOVSD
524 ms, MOV COPY
418 ms, UNROLLED MOV COPY
230 ms, SIMD COPY
158 ms, SIMD WRITE
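
For anyone without the exe, two of the variants look roughly like this (a sketch of the techniques named, not the exact test code; pSrc, pDst and BYTE_COUNT are illustrative):

; REP MOVSD (ecx counts dwords)
mov esi,pSrc
mov edi,pDst
mov ecx,BYTE_COUNT/4
rep movsd

; SIMD WRITE (write-only fill, one 64-byte cache line per iteration)
mov edi,pDst
mov ecx,BYTE_COUNT/64
xorps xmm0,xmm0
align 16
@@:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi,64
dec ecx
jnz short @B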

johnsa

OK, so I'm now very satisfied with memory throughput. Given my machine, and what you could theoretically achieve with 2500 MHz PC3-20000 RAM, QPI at 6.4 GT/s and some overclocking, 48 GB/s is quite achievable (if you really wanted to push it that far).

So now it's back to threading/multicore and why I'm not getting any extra performance from it. I took the same piece of test code and ran it on core 1, cores 1+2, cores 1+2+3+4 and cores 1+3+5+7; the multi-core combinations all yield the same results. I tried fiddling with affinities and priorities and noticed no difference.
Moving from a single core to two takes the total time from 24 ms to 17 ms; adding further cores does nothing.
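
The pinning itself was just the usual affinity-mask call per worker, something like this (a sketch; TransformWorker, pWorkItem and hWorker are illustrative names, and the mask obviously changes for each core):

; create the worker suspended, pin it to one logical processor, then let it run
invoke CreateThread,NULL,0,ADDR TransformWorker,pWorkItem,CREATE_SUSPENDED,NULL
mov hWorker,eax
invoke SetThreadAffinityMask,hWorker,4 ; bit 2 = third logical processor
invoke SetThreadPriority,hWorker,THREAD_PRIORITY_ABOVE_NORMAL
invoke ResumeThread,hWorker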

I know that the test code sits at about 7.2 GB/s of memory bandwidth utilisation, out of the roughly 15 GB/s available to me. In theory this should mean I could get the code running x4 before it becomes memory-bound.

I then removed the read/write from RAM and the two prefetch instructions, leaving the code as pure compute. Now the threading helps, going from 6 ms to 2 ms with the extra three cores: not linear, but pretty good scaling across four cores, about a 3x speedup.

So here is the code for the vertex AOS batch transform, with the memory access lines commented out. I can't work out why this would saturate memory when it's coming in far under the maximum throughput, unless some form of cache pollution is the limiting factor.


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx ; return address (the proc returns with jmp ebx)
pop eax ; pDestVertexBuffer
pop esi ; pSrcVertexBuffer
pop edi ; pMatrix
pop ecx ; vertexCount

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

align 4
@@:
;movaps xmm4,[esi] ;x1 y1 z1 w1
;prefetchnta [esi+64]

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.

;prefetchnta [eax+64]
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm6
addps xmm5,xmm7
addps xmm5,xmm3

;movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef
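
For completeness, the proc is still called in the normal way; with prologue:none it pops its own arguments and returns through EBX, so the stack is balanced on exit, but EAX, EBX, ECX, ESI and EDI are all clobbered (the names below are illustrative):

invoke TransformVertexBatch4,pDstVerts,pSrcVerts,ADDR worldMatrix,numVerts
; all five dwords (return address + 4 args) are popped inside the proc,
; so the stack is clean on return - just don't expect EBX/ESI/EDI to survive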


Is there ANY way I can do this with some form of memory access that will actually scale to more than 1.3x? I've tried converting the memory load/store to non-temporal, but that was even slower overall.

johnsa

I've tried using combinations of non-temporal loads/stores and prefetching; no gains to be had from that.
I've also tried reorganizing the code to use different memory access patterns, as follows:

Data:
|0.......................|1.......................|2.......................|3.......................

|012301230123|012301230123|012301230123|012301230123|

And

|0.....1....2.....3.....|0.....1....2.....3.....|0.....1....2.....3.....|0.....1....2.....3.....

0, 1, 2, 3 being the core/thread that accesses that part of the data

This last option, having the batch transform call process blocks of, say, 4096 vertices and keeping the separate cores' memory accesses as close together as possible without interleaving, seems to be the best, but only by a sub-millisecond amount over the whole run.
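
For the first arrangement the per-core split is just the obvious contiguous partition, roughly (a sketch; threadIndex is an illustrative name and vertexCount is assumed to be a multiple of 4):

mov eax,vertexCount
shr eax,2 ; vertices per core
mov ecx,threadIndex ; 0..3
imul ecx,eax
imul ecx,sizeof(Vertex3D) ; byte offset of this core's block
mov esi,pSrcVertexBuffer
add esi,ecx ; this core's source pointer
mov edi,pDestVertexBuffer
add edi,ecx ; this core's destination pointer
; eax = per-core vertex count, esi/edi = per-core src/dst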

hutch--

John,

My rough guess is that there is memory contention with multiple threads accessing the same memory range; it may be something like a memory lock taken for each thread as it accesses the common range. Now I wonder what would happen if you placed each memory block at a different location for each thread/core, to try and avoid the potential contention?

johnsa

That's sort of what I was thinking. The first data arrangement tried to avoid that by having each core work on a block of, say, 20 MB, but I'm wondering if it isn't to do with cache lines and their associativity. The problem is that the vertex data needs to be 16-byte aligned, so aren't the memory-address-to-cache-line mappings always going to conflict?

hutch--

What about, with 4 threads, using four separate memory allocations that are large enough to ensure one does not overlap the other? The size need only be big enough to break the cache. The problem then is that you may end up with page thrashing instead, which could be a lot slower again. It depends very much on the internal architecture of the processor as to the best technique for getting parallel memory reads and writes without contention between the reads and writes across different threads. Ideally you would have each core able to access a different memory address without contention between them, and this characteristic is likely to be very processor dependent.
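
Something along these lines, for example (a sketch; BLOCK_BYTES and threadBuf0..threadBuf3 are illustrative names):

; four separate allocations, each started at a different 128-byte offset so the
; four working sets do not all map onto the same cache sets (offsets stay
; 16-byte aligned for movaps)
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
mov threadBuf0,eax
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
add eax,128
mov threadBuf1,eax
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
add eax,256
mov threadBuf2,eax
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
add eax,384
mov threadBuf3,eax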

dedndave

Ideally, you would use CPUID to identify packages, cores and the cache configuration, then base the algorithm parameters on that information.
I have to imagine that some high-end software somewhere does something along these lines.
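
Something like this, perhaps, for the cache side of it (a sketch using the Intel deterministic cache parameters leaf; the proper approach is to walk every index and check the cache-level field in EAX bits 7-5, here index 3 is simply assumed to be the L3):

; returns in EAX the size in bytes of the cache described by CPUID leaf 4, index 3
; (typically the L3 on a Core i7) - a sketch, not production code
GetCacheSizeBytes proc
push ebx
push esi
push edi
mov eax,4 ; deterministic cache parameters
mov ecx,3 ; cache index - assumed here to be the L3
cpuid
mov esi,ebx
shr esi,22 ; EBX[31:22] = ways - 1
mov edi,ebx
shr edi,12
and edi,3FFh ; EBX[21:12] = physical line partitions - 1
and ebx,0FFFh ; EBX[11:0] = line size - 1
inc esi
inc edi
inc ebx
inc ecx ; ECX = number of sets - 1
mov eax,esi
imul eax,edi
imul eax,ebx
imul eax,ecx ; size = ways * partitions * line size * sets
pop edi
pop esi
pop ebx
ret
GetCacheSizeBytes endp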

dioxin

Have you tried comparing your results with those from a standard benchmark program?
This one has a free "lite" version which does memory and cache speed tests:
http://www.sisoftware.net/

Paul.

johnsa

So I decided to download and try Intel's VTune to see if it would shed some light on where things are going wrong here. It shows some interesting results, but to be honest I'm struggling to interpret them or derive any idea of how to improve the situation.

For reference, here is the function; it has been identified by VTune as the critical area (obviously).


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx
pop eax
pop esi
pop edi
pop ecx

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

align 4
@@:
movaps xmm4,[esi] ;x1 y1 z1 w1
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm6
addps xmm5,xmm7
addps xmm5,xmm3

movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef


The output from VTune is as follows:

CPI Rate: 1.820 (the ideal would be 0.25; a high CPI is typically caused by long-latency memory, instruction starvation, stalls or branch misprediction)
Retire Stalls: 0.710
LLC Miss: 0.204 (a high number of cycles is spent waiting for LLC (last-level cache) loads)
Execution Stalls: 0.511 (the percentage of cycles with no micro-operations executed is high; look for long-latency operations at code regions with high execution stalls)


Function: TransformVertexBatch4
CPU_CLK_UNHALTED.THREAD: 25,078,000,000
INST_RETIRED.ANY: 6,592,000,000
CPI Rate: 3.804
Retire Stalls: 0.813
LLC Miss: 0.317
LLC Load Misses Serviced By Remote DRAM: 0.000
Contested Accesses: 0.000
Instruction Starvation: -0.108
Branch Mispredict: 0.001
Execution Stalls: 0.702

(The CPI figure is just the ratio of the first two counters: 25,078,000,000 unhalted clocks / 6,592,000,000 instructions retired is roughly 3.8 clocks per instruction, against the 0.25 the core could ideally sustain.)

Zooming into the code view of that function:
(Attached as CSV)

I can immediately see a massive delay on the xmm4 load from ESI, which makes me think it's really struggling to get the data, and that this data probably isn't in the L3 cache. That's odd, as I've set up the outer code to block in chunks of 4096 vertices (4096 x 48 bytes, roughly 192 KB).


hutch--

John,

An unusual suggestion: load your stack variables in an orthodox manner, so that your PUSH/POP ratio is identical, and see if this affects the timing. Also make sure you use RET rather than the JMP at the end; I have seen code drop dead in the past with junk like this in front of it. For a slightly higher instruction count (MOV versus POP) you may solve some of the problem here.

johnsa

OK, tried that; it's slightly slower than the unorthodox pop/jmp method. The bulk of the delay in this function seems to come from waiting for the movaps xmm4,[esi] (the next instruction stalls immensely), then on the first of the three mulps and the last of the addps.
The movaps [eax],xmm5 is also slowed down, I guess because it's waiting for the previous addps.

dioxin

I'm not 100% clear what your information shows.
What are the numbers in the spreadsheet next to the instructions? Clock cycles that the instruction takes over a number of loops? Why do some have no figure?
What's BTK?


Anyway, things I'd look at are:
Put the movaps xmm4,[esi] AHEAD of the loop, then repeat it just after the add esi,sizeof(Vertex3D) inside the loop.
That way the load of the next value can take place while the calculation on the current value is in progress.


AMD CPUs prefer loops aligned on a 16- or 32-byte boundary, not 4 bytes. Maybe Intel is different, but it's worth a try.


Use PREFETCH to begin the fetching of data well in advance so it's in the cache when it's needed.

Paul.

johnsa

BTK is a macro (branch hint: taken). I've tried with and without it; there is no difference, so I've taken it out.
I've aligned the loop to 16 and taken your suggestion to move the load out of the loop and repeat it inside, to hide some of the load latency. I've added a prefetchnta back in.

The delay now happens on the first mulps, which is stalling waiting for the movaps xmm4,[esi] AGAIN, even with the prefetch hint. The other odd thing is that I tried changing the movaps [eax],xmm5 to movntdq, but that ends up slowing the whole thing down massively rather than helping. I've also removed all the threading and am trying to profile this as a single execution against a large buffer, so it's now just iterating sequentially (in a predictable manner) through a 200 MB buffer of vertices.

updated code:


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx
pop eax
pop esi
pop edi
pop ecx

movaps xmm4,[esi] ;x1 y1 z1 w1

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

dec ecx

align 16
@@:
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.
prefetchnta [esi+128]
movaps xmm4,[esi] ;x1 y1 z1 w1

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm3
addps xmm5,xmm6
addps xmm5,xmm7

movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
;BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef
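
For reference, the streaming-store experiment mentioned above amounted to roughly this (a sketch, not the exact code that was timed; movntps is the float flavour of the non-temporal store, movntdq the integer one):

; inside the loop, in place of the movaps store:
movntps [eax],xmm5 ; non-temporal (write-combining) store, bypasses the cache
add eax,sizeof(Vertex3D)

; after the loop, before the jmp ebx:
sfence ; drain and order the streaming stores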

dioxin

John,
Put the PREFETCH hint AFTER the movaps xmm4,[esi], not before it, and try values of esi+1024 or more.
I'm not sure how big your data is, but if it's many KB or many MB then you need to prefetch quite a long way in advance. If your data set is quite small then you should try to prefetch it all ahead of the loop that processes it.

I get a speed improvement of near 50% with the PREFETCH. Do you get any improvement at all?

Paul.

johnsa

I've moved the prefetch around to a few places and changed the offset from 900 up to 2048. Same results. The data is quite large: there are 4 million vertices at 48 bytes each. The prefetch doesn't seem to be helping me at all, and neither does a streaming store, which I thought would help given such a huge data set... hmm.