
Memory Speed, QPI and Multicore

Started by johnsa, August 24, 2011, 10:32:47 AM


johnsa

Many thanks for the results guys, very helpful!
Hutch, good point; sorry, I completely forgot to attach the exe ;)

My timings on my Core i7 (rough sketches of two of the loop styles follow the list):

350 ms, REP MOVSD
524 ms, MOV COPY
418 ms, UNROLLED MOV COPY
230 ms, SIMD COPY
158 ms, SIMD WRITE
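
For anyone without the exe, two of the variants look roughly like this (a sketch of the techniques named, not the exact test code; pSrc, pDst and BYTE_COUNT are illustrative):

; REP MOVSD (ecx counts dwords)
mov esi,pSrc
mov edi,pDst
mov ecx,BYTE_COUNT/4
rep movsd

; SIMD WRITE (write-only fill, one 64-byte cache line per iteration)
mov edi,pDst
mov ecx,BYTE_COUNT/64
xorps xmm0,xmm0
align 16
@@:
movaps [edi],xmm0
movaps [edi+16],xmm0
movaps [edi+32],xmm0
movaps [edi+48],xmm0
add edi,64
dec ecx
jnz short @B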

johnsa

OK, so I'm now very satisfied with memory throughput. Given my machine, and what you could theoretically achieve with 2500 MHz PC3-20000 RAM, QPI at 6.4 GT/s and some overclocking, 48 GB/s is quite achievable (if you really wanted to push it that far).

So now it's back to threading/multicore and why I'm not getting any extra performance from it. I took the same piece of test code and ran it on core 1, cores 1+2, cores 1+2+3+4 and cores 1+3+5+7; the multi-core combinations all yield the same results. I tried fiddling with affinities and priorities and noticed no difference.
Moving from a single core to two takes the total time from 24 ms to 17 ms; adding further cores does nothing.
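
The pinning itself was just the usual affinity-mask call per worker, something like this (a sketch; TransformWorker, pWorkItem and hWorker are illustrative names, and the mask obviously changes for each core):

; create the worker suspended, pin it to one logical processor, then let it run
invoke CreateThread,NULL,0,ADDR TransformWorker,pWorkItem,CREATE_SUSPENDED,NULL
mov hWorker,eax
invoke SetThreadAffinityMask,hWorker,4 ; bit 2 = third logical processor
invoke SetThreadPriority,hWorker,THREAD_PRIORITY_ABOVE_NORMAL
invoke ResumeThread,hWorker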

I know that the test code sits at about 7.2 GB/s of memory bandwidth utilisation, out of the roughly 15 GB/s available to me. In theory this should mean I could get the code running x4 before it becomes memory-bound.

I then removed the read/write from RAM and the two prefetch instructions, leaving the code as pure compute. Now the threading helps, going from 6 ms to 2 ms with the extra three cores: not linear, but pretty good scaling across four cores, about a 3x speedup.

So here is the code for the vertex AOS batch transform, with the memory access lines commented out. I can't work out why this would saturate memory when it's coming in far under the maximum throughput, unless some form of cache pollution is the limiting factor.


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx ; return address (the proc returns with jmp ebx)
pop eax ; pDestVertexBuffer
pop esi ; pSrcVertexBuffer
pop edi ; pMatrix
pop ecx ; vertexCount

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

align 4
@@:
;movaps xmm4,[esi] ;x1 y1 z1 w1
;prefetchnta [esi+64]

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.

;prefetchnta [eax+64]
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm6
addps xmm5,xmm7
addps xmm5,xmm3

;movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef
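
For completeness, the proc is still called in the normal way; with prologue:none it pops its own arguments and returns through EBX, so the stack is balanced on exit, but EAX, EBX, ECX, ESI and EDI are all clobbered (the names below are illustrative):

invoke TransformVertexBatch4,pDstVerts,pSrcVerts,ADDR worldMatrix,numVerts
; all five dwords (return address + 4 args) are popped inside the proc,
; so the stack is clean on return - just don't expect EBX/ESI/EDI to survive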


Is there ANY way I can do this with some form of memory access that will actually scale to more than 1.3x? I've tried converting the memory load/store to non-temporal, but that was even slower overall.

johnsa

I've tried using combinations of non-temporal loads/stores and prefetching; no gains to be had from that.
I've also tried reorganizing the code to use different memory access patterns, as follows:

Data:
|0.......................|1.......................|2.......................|3.......................

|012301230123|012301230123|012301230123|012301230123|

And

|0.....1....2.....3.....|0.....1....2.....3.....|0.....1....2.....3.....|0.....1....2.....3.....

0, 1, 2, 3 being the core/thread that accesses that part of the data

This last option, having the batch transform call process blocks of, say, 4096 vertices and keeping the separate cores' memory accesses as close together as possible without interleaving, seems to be the best, but only by a sub-millisecond amount over the whole run.
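
For the first arrangement the per-core split is just the obvious contiguous partition, roughly (a sketch; threadIndex is an illustrative name and vertexCount is assumed to be a multiple of 4):

mov eax,vertexCount
shr eax,2 ; vertices per core
mov ecx,threadIndex ; 0..3
imul ecx,eax
imul ecx,sizeof(Vertex3D) ; byte offset of this core's block
mov esi,pSrcVertexBuffer
add esi,ecx ; this core's source pointer
mov edi,pDestVertexBuffer
add edi,ecx ; this core's destination pointer
; eax = per-core vertex count, esi/edi = per-core src/dst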

hutch--

John,

My rough guess is that there is memory contention with multiple threads accessing the same memory range; it may be something like a memory lock taken for each thread as it accesses the common range. Now I wonder what would happen if you placed each memory block at a different location for each thread/core, to try and avoid the potential contention?

johnsa

That's sort of what I was thinking. The first data arrangement tried to avoid that by having each core work on a block of, say, 20 MB, but I'm wondering if it isn't to do with cache lines and their associativity. The problem is that the vertex data needs to be 16-byte aligned, so aren't the memory-address-to-cache-line mappings always going to conflict?

hutch--

What about, with 4 threads, using four separate memory allocations that are large enough to ensure one does not overlap the other? The size need only be big enough to break the cache. The problem then is that you may end up with page thrashing instead, which could be a lot slower again. It depends very much on the internal architecture of the processor as to the best technique for getting parallel memory reads and writes without contention between the reads and writes across different threads. Ideally you would have each core able to access a different memory address without contention between them, and this characteristic is likely to be very processor dependent.
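
Something along these lines, for example (a sketch; BLOCK_BYTES and threadBuf0..threadBuf3 are illustrative names):

; four separate allocations, each started at a different 128-byte offset so the
; four working sets do not all map onto the same cache sets (offsets stay
; 16-byte aligned for movaps)
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
mov threadBuf0,eax
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
add eax,128
mov threadBuf1,eax
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
add eax,256
mov threadBuf2,eax
invoke VirtualAlloc,NULL,BLOCK_BYTES+512,MEM_COMMIT or MEM_RESERVE,PAGE_READWRITE
add eax,384
mov threadBuf3,eax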

dedndave

Ideally, you would use CPUID to identify packages, cores and the cache configuration, then base the algorithm parameters on that information.
I have to imagine that some high-end software somewhere does something along these lines.
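
Something like this, perhaps, for the cache side of it (a sketch using the Intel deterministic cache parameters leaf; the proper approach is to walk every index and check the cache-level field in EAX bits 7-5, here index 3 is simply assumed to be the L3):

; returns in EAX the size in bytes of the cache described by CPUID leaf 4, index 3
; (typically the L3 on a Core i7) - a sketch, not production code
GetCacheSizeBytes proc
push ebx
push esi
push edi
mov eax,4 ; deterministic cache parameters
mov ecx,3 ; cache index - assumed here to be the L3
cpuid
mov esi,ebx
shr esi,22 ; EBX[31:22] = ways - 1
mov edi,ebx
shr edi,12
and edi,3FFh ; EBX[21:12] = physical line partitions - 1
and ebx,0FFFh ; EBX[11:0] = line size - 1
inc esi
inc edi
inc ebx
inc ecx ; ECX = number of sets - 1
mov eax,esi
imul eax,edi
imul eax,ebx
imul eax,ecx ; size = ways * partitions * line size * sets
pop edi
pop esi
pop ebx
ret
GetCacheSizeBytes endp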

dioxin

Have you tried comparing your results with those from a standard benchmark program?
This one has a free "lite" version which does memory and cache speed tests:
http://www.sisoftware.net/

Paul.

johnsa

So I decided to download and try Intel's VTune to see if it would shed some light on where things are going wrong here. It shows some interesting results, but to be honest I'm struggling to interpret them or derive any idea of how to improve the situation.

For reference, here is the function; it has been identified by VTune as the critical area (obviously).


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx
pop eax
pop esi
pop edi
pop ecx

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

align 4
@@:
movaps xmm4,[esi] ;x1 y1 z1 w1
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm6
addps xmm5,xmm7
addps xmm5,xmm3

movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef


The output from VTune is as follows:

CPI Rate: 1.820 (the ideal would be 0.25; a high CPI is typically caused by long-latency memory, instruction starvation, stalls or branch misprediction)
Retire Stalls: 0.710
LLC Miss: 0.204 (a high number of cycles is spent waiting for LLC (last-level cache) loads)
Execution Stalls: 0.511 (the percentage of cycles with no micro-operations executed is high; look for long-latency operations at code regions with high execution stalls)


Function: TransformVertexBatch4
CPU_CLK_UNHALTED.THREAD: 25,078,000,000
INST_RETIRED.ANY: 6,592,000,000
CPI Rate: 3.804
Retire Stalls: 0.813
LLC Miss: 0.317
LLC Load Misses Serviced By Remote DRAM: 0.000
Contested Accesses: 0.000
Instruction Starvation: -0.108
Branch Mispredict: 0.001
Execution Stalls: 0.702

(The CPI figure is just the ratio of the first two counters: 25,078,000,000 unhalted clocks / 6,592,000,000 instructions retired is roughly 3.8 clocks per instruction, against the 0.25 the core could ideally sustain.)

Zooming into the code view of that function:
(Attached as CSV)

I can immediately see a massive delay on the xmm4 load from ESI, which makes me think it's really struggling to get the data, and that this data probably isn't in the L3 cache. That's odd, as I've set up the outer code to block in chunks of 4096 vertices (4096 x 48 bytes, roughly 192 KB).


hutch--

John,

An unusual suggestion: load your stack variables in an orthodox manner, so that your PUSH/POP ratio is identical, and see if this affects the timing. Also make sure you use RET rather than the JMP at the end; I have seen code drop dead in the past with junk like this in front of it. For a slightly higher instruction count (MOV versus POP) you may solve some of the problem here.

johnsa

OK, tried that; it's slightly slower than the unorthodox pop/jmp method. The bulk of the delay in this function seems to come from waiting for the movaps xmm4,[esi] (the next instruction stalls immensely), then on the first of the three mulps and the last of the addps.
The movaps [eax],xmm5 is also slowed down, I guess because it's waiting for the previous addps.

dioxin

I'm not 100% clear what your information shows.
What are the numbers in the spreadsheet next to the instructions? Clock cycles that the instruction takes over a number of loops? Why do some have no figure?
What's BTK?


Anyway, things I'd look at are:
Put the movaps xmm4,[esi] AHEAD of the loop, then repeat it just after the add esi,sizeof(Vertex3D) inside the loop.
That way the load of the next value can take place while the calculation on the current value is in progress.


AMD CPUs prefer loops aligned on a 16- or 32-byte boundary, not 4 bytes. Maybe Intel is different, but it's worth a try.


Use PREFETCH to begin the fetching of data well in advance so it's in the cache when it's needed.

Paul.

johnsa

BTK is a macro (branch hint: taken). I've tried with and without it; there is no difference, so I've taken it out.
I've aligned the loop to 16 and taken your suggestion to move the load out of the loop and repeat it inside, to hide some of the load latency. I've added a prefetchnta back in.

The delay now happens on the first mulps, which is stalling waiting for the movaps xmm4,[esi] AGAIN, even with the prefetch hint. The other odd thing is that I tried changing the movaps [eax],xmm5 to movntdq, but that ends up slowing the whole thing down massively rather than helping. I've also removed all the threading and am trying to profile this as a single execution against a large buffer, so it's now just iterating sequentially (in a predictable manner) through a 200 MB buffer of vertices.

updated code:


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx
pop eax
pop esi
pop edi
pop ecx

movaps xmm4,[esi] ;x1 y1 z1 w1

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

dec ecx

align 16
@@:
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.
prefetchnta [esi+128]
movaps xmm4,[esi] ;x1 y1 z1 w1

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm3
addps xmm5,xmm6
addps xmm5,xmm7

movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
;BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef
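
For reference, the streaming-store experiment mentioned above amounted to roughly this (a sketch, not the exact code that was timed; movntps is the float flavour of the non-temporal store, movntdq the integer one):

; inside the loop, in place of the movaps store:
movntps [eax],xmm5 ; non-temporal (write-combining) store, bypasses the cache
add eax,sizeof(Vertex3D)

; after the loop, before the jmp ebx:
sfence ; drain and order the streaming stores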

dioxin

John,
Put the PREFETCH hint AFTER the movaps xmm4,[esi], not before it, and try values of esi+1024 or more.
I'm not sure how big your data is, but if it's many KB or many MB then you need to prefetch quite a long way in advance. If your data set is quite small then you should try to prefetch it all ahead of the loop that processes it.

I get a speed improvement of near 50% with the PREFETCH. Do you get any improvement at all?

Paul.

johnsa

I've moved the prefetch around to a few places and changed the offset from 900 up to 2048. Same results. The data is quite large: there are 4 million vertices at 48 bytes each. The prefetch doesn't seem to be helping me at all, and neither does a streaming store, which I thought would help given such a huge data set... hmm.