The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: johnsa on August 24, 2011, 10:32:47 AM

Title: Memory Speed, QPI and Multicore
Post by: johnsa on August 24, 2011, 10:32:47 AM
Hey all,

So I've started some threads on this topic in the past, but I've recently bought a new machine, done some more coding, and wanted to pose some questions along with my findings.

Firstly, the machine specs: Core i7 970 (4 cores + HT = 8 logical cores), 2.8GHz, 6GB DDR3-2000 OCZ, QPI = 6.4GT/s.
Based on this spec, the research I've done says memory transfer rates should work out as follows:

3.2GHz (QPI 6.4 / 2)
x 2 bits/Hz (double data rate)
x 20 (QPI link width) <--- this is the number of "lanes"
x (64/80) (data bits / flit bits)
x 2 (bidirectional)
/ 8 (bits/byte)
= 25.6 GB/s combined, or 12.8 GB/s in each direction

So in theory a write operation should (bearing in mind other overheads) get up to around 20GB/s, and a copy should max out at about 12GB/s in each direction.

So based on this I went back to some memory transfer profiling code as follows:


; 517ms = 2Gb. (8Gb/s 4in/4out). (200Mb buffer copied 10 times).
; 341ms = 2Gb. (12Gbs 6in/6out). (2Mb buffer copied 1000 times).
; 328ms = 2Gb. (same). (200kb buffer copied 10000 times).
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/64
align 4
copy2:
movdqa xmm0,[esi]
movdqa xmm1,[esi+16]
movdqa xmm2,[esi+32]
movdqa xmm3,[esi+48]
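; (note: movntdq is a non-temporal/streaming store, so the writes below
;  bypass the cache and go straight out to memory)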
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add esi,64
add edi,64
dec ecx
jnz short copy2

timer_end
    print ustr$(eax)," ms, SIMD COPY",13,10
   
; 295ms = 2Gb. (6GB/s out only). (200Mb buffer copied 10 times).
; 248ms = 2Gb. (6GB/s out only). (2Mb buffer copied 1000).
timer_begin 1000, HIGH_PRIORITY_CLASS

mov edi,buffer2
mov ecx,MEM_SIZE/64
pxor xmm0,xmm0
pxor xmm1,xmm1
pxor xmm2,xmm2
pxor xmm3,xmm3
align 4
write0:
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add edi,64
dec ecx
jnz short write0

timer_end
    print ustr$(eax)," ms, SIMD WRITE",13,10


What this seems to show is that read and write operations can run in parallel, and each appears to be capped on its own (i.e. a loop that only writes doesn't get the benefit of the reduced traffic and still saturates at the same point).
I'm fine with that. What is odd to me is that the figures I'm getting are FAR lower than the theoretical limit.

With the copy loop I have:

   ; 517ms = 2Gb. (8Gb/s 4in/4out). (200Mb buffer copied 10 times).
   ; 341ms = 2Gb. (12Gbs 6in/6out). (2Mb buffer copied 1000 times).
   ; 328ms = 2Gb. (same). (200kb buffer copied 10000 times).

With the write-only loop:

   ; 295ms = 2Gb. (6GB/s out only). (200Mb buffer copied 10 times).
   ; 248ms = 2Gb. (6GB/s out only). (2Mb buffer copied 1000).

This seems to be half or a quarter of what it should be.
Firstly, am I missing something here? Perhaps others with similar spec machines could test these loops (or suggest improvements to get it closer to the theoretical limit).

The second part of the exercise, and the reason why I've run these tests, is to establish a saturation point for other algorithms that I'm looking to multi-thread (make multi-core capable).
If I know that the memory transfer limit for a combined read/write loop is 12GB/s, and I have an algorithm/procedure that currently executes 3GB/s worth of data transfer, my logic says the remainder of the time is spent in computation etc., which should allow for parallel execution (in this case 4x). I.e. creating 4 threads should then saturate at the 12GB/s mark, and adding more threads/cores would have no benefit. The idea is to identify this programmatically so I can decide how best to allocate tasks to cores, as I've discussed in my previous posts.
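Something along these lines is what I have in mind for the "programmatic" part (an untested sketch; the proc name and the MB/s units are placeholders, not my real code):

; Rough sketch: pick a thread count from measured bandwidth figures.
; busLimitMBs  = measured saturation point of the read/write loop (MB/s)
; perThreadMBs = bandwidth one instance of the algorithm uses (MB/s)
; numCores     = physical core count
align 16
EstimateThreadCount proc busLimitMBs:DWORD, perThreadMBs:DWORD, numCores:DWORD
    mov eax,busLimitMBs
    xor edx,edx
    div perThreadMBs            ; eax = how many copies fit under the bus limit
    cmp eax,numCores
    jbe @F
    mov eax,numCores            ; never more threads than physical cores
@@:
    test eax,eax
    jnz @F
    mov eax,1                   ; always at least one thread
@@:
    ret
EstimateThreadCount endp

; e.g. invoke EstimateThreadCount,12000,3000,4 gives 4 for the case above.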

My test case was once again a batched vertex/matrix transform, which at present takes 40ms for 3.6 million vertices on one core (+- 3.6Gb of data, well below the saturation point of memory/QPI).
I added a second core and the time went down from 40ms to 33ms; adding a 3rd and 4th gave no improvement. So I'm confused as to where the issue is here. In theory, even if my benchmarks above are correct (and the spec is a complete over-estimation), adding the second core should roughly have doubled performance, as there is no locking or sync required. I tried different arrangements, with the data interleaved between the two cores as well as with totally separate batches (which seems to be faster). The only reason I can think of here is that the interleaving causes cache issues, as both cores end up updating data in the same cache line.
Title: Re: Memory Speed, QPI and Multicore
Post by: MichaelW on August 24, 2011, 11:08:40 AM
I don't really know what to make of your results, but I do have some questions. Are you restricting the test code to a particular core or cores? Is there a way to effectively disable HT for the test, to eliminate the overhead of two logical cores sharing the same execution core? And how does code that uses REP MOVSD/STOSD instead of SSE compare?
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 24, 2011, 11:28:30 AM
For the memory profiling code I'm not running any threads, just a single core, as it's going to saturate anyway.

In terms of REP MOVSD/STOSD, they come in about 15% slower than the posted version (which I took through a few iterations to get it faster than the REP versions, as the idea is to get as close to that theoretical limit as possible).

With regards to my batched vertex/matrix transform, I don't take HT into account; I have it running on either 2 or 4 threads with core affinities set. I would hope that the scheduler/OS would favour real cores over HT ones. If not, it could possibly reduce the efficiency of 2 threads, but then switching to 4 threads should be running on at least 2 real separate cores, and both combinations seem to top out at way under the memory-bus saturation point I would have expected.

I know I have had this argument before with people, but I have yet to see someone post a REAL example that actually scales with multiple cores. My example should scale between 1-4 cores nearly linearly (based on the lack of locking required and memory utilisation).
At best I constantly see the same results come out from EVERY single piece of test code:
1 core : X ms
2 cores: 80% of X ms
3 cores: 75% of X ms
4 cores and up: still 75% of X ms...
Throwing more cores at problems never seems to yield more than a 20-30% improvement in overall performance, even when the memory/bus should allow it.

Title: Re: Memory Speed, QPI and Multicore
Post by: MichaelW on August 24, 2011, 12:12:31 PM
For my first question, it seems to me that if the system controls what runs where, the test code will not necessarily run on a core by itself or necessarily run on the same core throughout the test.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 24, 2011, 12:26:04 PM
I do force the affinity for the threads, so hopefully they'll run on the same core throughout, but like most things in Windows, the documentation says the OS can and will override anything it wants if it feels it needs to :)
So given the constraints of running under an OS, I do everything in my power to ensure the threads are created and linked to a specific core and stay on that one. For example, something along these lines:
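(This is a from-memory sketch rather than my actual code; ThreadProc and coreIndex are placeholders.)

; Sketch: create the worker suspended, pin it to one logical CPU, then let it run.
.data
coreIndex dd 0              ; which logical CPU to pin this worker to (placeholder)
.data?
hThread   dd ?
threadId  dd ?
.code
    invoke CreateThread,NULL,0,ADDR ThreadProc,NULL,CREATE_SUSPENDED,ADDR threadId
    mov hThread,eax
    mov ecx,coreIndex
    mov eax,1
    shl eax,cl                                        ; affinity mask = 1 shl coreIndex
    invoke SetThreadAffinityMask,hThread,eax          ; hard-bind the thread to that CPU
    invoke SetThreadIdealProcessor,hThread,coreIndex  ; and hint the scheduler as well
    invoke ResumeThread,hThread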
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 24, 2011, 12:40:59 PM
This is just from memory, but apparently an additional core yields about a 1.8 times processing increase, where a hyperthreaded core adds about a 30% improvement. It may be worth trying the test piece on a Core2-series processor that does not have hyperthreading, to see if the core increase is anything like linear.

From the reference material I remember, hyperthreading is effectively a faster form of task switching on any given core; it is not an independent core.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 24, 2011, 03:44:58 PM
Tried it on my Core2 Duo, same sort of relative gains.
On my i7 I've now even tried distributing the tasks across cores 0,2,4,6 instead of 0,1,2,3 to see if that would alter the outcome or avoid the HT cores... no difference.
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 24, 2011, 04:41:56 PM
On my i7 running Win7 64-bit you can watch all 8 cores at the same time. It's probably worth seeing what the core load distribution is like, to know if all of them are being used.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 25, 2011, 08:50:36 AM
OK... I've tried monitoring the cores from the process monitor... they don't seem to go above 50% on the graph.

I think before I look at this threading issue we should go back to the memory throughput... can anyone else run those tests to see what their read/write and write-only throughput is?
I would expect to be seeing 12GB/s read and 12GB/s write (so roughly 25GB/s combined), assuming you're on an i7 with DDR3 and QPI 6.4.
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 25, 2011, 11:06:16 AM
john,

If you put together a test piece, I will happily try it out on both quads, a Core2 quad and an i7 quad.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 25, 2011, 12:29:49 PM
Gladly :)

So I've solved my memory throughput issue... I'm a twit. I'm running OCZ 2000MHz XMP-compliant RAM, but my machine was still set to default timings. I enabled XMP/Profile1, put in the right CAS etc. timings, and my memory throughput doubled!




;* 3.2GHz (QPI 6.4 / 2)
;* x 2 bits/Hz (double data rate)
;* x 20 (QPI link width) <--- this is the number of "lanes"
;* x (64/80) (data bits / flit bits)
;* x 2 (bidirectional)
;* / 8 (bits/byte)
;* = 25.6 GB/s combined, or 12.8 GB/s in each direction
; OR 19.2 GB/s for QPI 4.8 at 2.4GHz

include c:\masm32\include\masm32rt.inc

.data?

.data

buffer1 dd 0
buffer2 dd 0
MEM_SIZE equ (1024*1024)*200 ; 200Mb

.code

start:
invoke VirtualAlloc,NULL,MEM_SIZE,MEM_COMMIT,PAGE_EXECUTE_READWRITE
mov buffer1,eax
invoke VirtualAlloc,NULL,MEM_SIZE,MEM_COMMIT,PAGE_EXECUTE_READWRITE
mov buffer2,eax

; 586ms = 2Gb.
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/4
rep movsd

timer_end
    print ustr$(eax)," ms, REP MOVSD",13,10
; 1150ms = 2Gb.
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/4
align 4
copy0:
mov eax,[esi]
mov [edi],eax
add esi,4
add edi,4
dec ecx
jnz short copy0

timer_end
   print ustr$(eax)," ms, MOV COPY",13,10

; 904ms = 2Gb.
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/16
align 4
copy1:
mov eax,[esi]
mov ebx,[esi+4]
mov [edi],eax
mov [edi+4],ebx

mov eax,[esi+8]
mov ebx,[esi+12]
mov [edi+8],eax
mov [edi+12],ebx

add esi,16
add edi,16
dec ecx
jnz short copy1

timer_end
   print ustr$(eax)," ms, UNROLLED MOV COPY",13,10

; 517ms = 2Gb. (8Gb/s 4in/4out). (200Mb buffer copied 10 times).
; 341ms = 2Gb. (12Gbs 6in/6out). (2Mb buffer copied 1000 times).
; 328ms = 2Gb. (same). (200kb buffer copied 10000 times).
; 200ms = 2Gb. 19.5Gb/s. (9.75Gb/s in and out) (2Mb buffer copied 1000 times).
; 270ms = 2Gb. 14.8Gb/s  (200Mb copied 10 times).
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/64
align 4
copy2:
movdqa xmm0,[esi]
movdqa xmm1,[esi+16]
movdqa xmm2,[esi+32]
movdqa xmm3,[esi+48]
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add esi,64
add edi,64
dec ecx
jnz short copy2

timer_end
    print ustr$(eax)," ms, SIMD COPY",13,10
   
; 295ms = 2Gb. (6GB/s out only). (200Mb buffer copied 10 times).
; 248ms = 2Gb. (6GB/s out only). (2Mb buffer copied 1000).
; 156ms = 2Gb 12.8Gb/s out only. (2Mb buffer copied 1000).
; 156ms = 2Gb 12.8Gb/s out only. (200Mb buffer copied 10 times).
; 128ms = 2Gb 15.6Gb/s out only. (200Mb buffer copied 10 times).
timer_begin 10, HIGH_PRIORITY_CLASS

mov edi,buffer2
mov ecx,MEM_SIZE/64
pxor xmm0,xmm0
pxor xmm1,xmm1
pxor xmm2,xmm2
pxor xmm3,xmm3
align 4
write0:
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add edi,64
dec ecx
jnz short write0

timer_end
    print ustr$(eax)," ms, SIMD WRITE",13,10
   
invoke VirtualFree,buffer1,MEM_SIZE,MEM_RELEASE
invoke VirtualFree,buffer2,MEM_SIZE,MEM_RELEASE
invoke ExitProcess,0

end start


ml /c /coff test.asm
link /nologo /release /machine:ix86 /subsystem:console test

So in review I'm now getting close to max throughput: 15.6GB/s pure write, 15GB/s read/write of large data (200MB), and 19.5GB/s read/write when the data is smaller (2MB).

This has, understandably, doubled the performance of my other test piece of vertex/matrix transforms. Now back to investigating the core/multicore issue, as it should have even more room to scale before saturating memory.
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 25, 2011, 01:16:58 PM
John,

This is the result on my Core2 quad.


911 ms, REP MOVSD
936 ms, MOV COPY
938 ms, UNROLLED MOV COPY
589 ms, SIMD COPY
302 ms, SIMD WRITE
Press any key to continue ...


Just a comment: you will get many more people who will try it if you build the example. I had to find timers.asm and add the include line to get it to work.
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 25, 2011, 01:46:43 PM
Just had a quick play with the mov copy and nothing makes it faster. The cache pollution seems to be the limiting factor on write.


    .data?
      esp_ dd ?
    .code
    push ebx
    mov esi,buffer1
    mov edi,buffer2
    mov ecx,MEM_SIZE/16
    xor edx, edx
    push ebp
    mov esp_, esp
  align 4
  copy0:
    mov eax, [esi+edx]
    mov ebx, [esi+edx+4]
    mov ebp, [esi+edx+8]
    mov esp, [esi+edx+12]
    mov [edi+edx], eax
    mov [edi+edx+4], ebx
    mov [edi+edx+8], ebp
    mov [edi+edx+12], esp
    add edx, 16
    dec ecx
    jnz short copy0
    mov esp, esp_
    pop ebp
    pop ebx

timer_end
   print ustr$(eax)," ms, MOV COPY",13,10
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 25, 2011, 02:03:51 PM
Here are the timings on my i7.


434 ms, REP MOVSD
604 ms, MOV COPY
547 ms, UNROLLED MOV COPY
308 ms, SIMD COPY
194 ms, SIMD WRITE
Press any key to continue ...
Title: Re: Memory Speed, QPI and Multicore
Post by: dedndave on August 25, 2011, 02:17:58 PM
not sure if John is interested in my old P4 results   :P

i am using a p4 prescott w/htt, SSE3 @ 3 GHz

i changed the preamble to get it to assemble...
        include \masm32\include\masm32rt.inc
        .686
        .mmx
        .xmm
        include \masm32\macros\timers.asm


i also added the following code to control the h/t cores (threads - whatever)
start:
        invoke  GetCurrentProcess
        invoke  SetProcessAffinityMask,eax,dwCoreMask
        invoke  Sleep,750


i set dwCoreMask to either 1 to bind to a single core, or 3 to allow both

dwCoreMask = 1
2243 ms, REP MOVSD
1995 ms, MOV COPY
2021 ms, UNROLLED MOV COPY
1198 ms, SIMD COPY
522 ms, SIMD WRITE

2238 ms, REP MOVSD
2027 ms, MOV COPY
2011 ms, UNROLLED MOV COPY
1196 ms, SIMD COPY
523 ms, SIMD WRITE


dwCoreMask = 3
2166 ms, REP MOVSD
1897 ms, MOV COPY
1929 ms, UNROLLED MOV COPY
1197 ms, SIMD COPY
520 ms, SIMD WRITE

2253 ms, REP MOVSD
1996 ms, MOV COPY
1985 ms, UNROLLED MOV COPY
1219 ms, SIMD COPY
521 ms, SIMD WRITE
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 25, 2011, 02:29:55 PM
Many thanks for the results guys, very helpful!
Hutch, good point, sorry, I completely forgot to attach the exe ;)

My timings on my core i7:

350 ms, REP MOVSD
524 ms, MOV COPY
418 ms, UNROLLED MOV COPY
230 ms, SIMD COPY
158 ms, SIMD WRITE
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 26, 2011, 01:53:00 PM
OK, so I'm now very satisfied with memory throughput. Given my machine, and what you could theoretically achieve with 2500MHz PC3-20000 RAM, QPI 6.4 and some overclocking, 48GB/s is quite achievable (if you really wanted to push it that far).

So now it's back to threading/multicore and why I'm not getting any extra performance from it. I took the same piece of test code and tried the following:
run it on core 1, cores 1+2, cores 1+2+3+4, and cores 1+3+5+7; all yield the same results. I tried fiddling with affinities and priorities, no difference noticed.
Moving from a single core to 2 takes the total time from 24ms to 17ms; adding further cores, nothing.

I know that the test code sits at about 7.2GB/s of memory bandwidth utilisation out of a total of 15GB/s or so available to me. In theory this should mean I could get the code running 4x before it becomes memory-bound.

I then removed the read/write from RAM and the two prefetch instructions, leaving the code as PURE compute. Now the threading helps, going from 6ms to 2ms with the extra 3 cores. Not linear, but pretty good scaling across 4 cores at a 3x speed-up.

So here is the code for the vertex AoS batch transform with those lines commented out. I can't work out why this would saturate memory when it's coming in far under the maximum throughput, unless it's some form of cache pollution that's the limiting factor.


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx
pop eax
pop esi
pop edi
pop ecx

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

align 4
@@:
;movaps xmm4,[esi] ;x1 y1 z1 w1
;prefetchnta [esi+64]

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.

;prefetchnta [eax+64]
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm6
addps xmm5,xmm7
addps xmm5,xmm3

;movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef


Is there ANY way I can do this with some form of memory access that will actually scale to more than 1.3x? I've tried converting the memory load/store to non-temporal, but that was even slower overall.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 26, 2011, 02:37:45 PM
I've tried using combinations of non-temporal loads/stores and prefetching... no gains to be had from that.
I've also tried re-organizing the code to use different memory access patterns, as:

Data:
|0.......................|1.......................|2.......................|3.......................

|012301230123|012301230123|012301230123|012301230123|

And

|0.....1....2.....3.....|0.....1....2.....3.....|0.....1....2.....3.....|0.....1....2.....3.....

0,1,2,3 being the core/thread accessing the data to process

This last option, having each call to the batch transform process blocks of say 4096 vertices and keeping the separate cores' memory accesses as close together as possible without interleaving, seems to be the best... but only by a sub-ms amount over the whole run. The split looks roughly like the sketch below.
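(Sketch only; totalVerts, numThreads, threadIndex and the pointer names are placeholders for illustration, not my real variables.)

; Contiguous split: thread t processes vertices [t*chunk .. t*chunk+chunk),
; with chunk rounded to a whole number of 4096-vertex blocks so no two
; cores ever write into the same block.
BLOCK_VERTS equ 4096
.data
totalVerts      dd 3600000         ; placeholder values for illustration
numThreads      dd 4
threadIndex     dd 0
chunkVerts      dd 0
pSrcVertexBuffer dd 0              ; the real vertex buffer pointer goes here
pThreadSrc      dd 0
.code
    mov eax,totalVerts
    xor edx,edx
    div numThreads                 ; eax = raw vertices per thread
    and eax,-BLOCK_VERTS           ; round down to a multiple of 4096...
    mov chunkVerts,eax             ; ...the last thread also takes the remainder
    ; base pointer for thread t:
    mov eax,chunkVerts
    mul threadIndex                ; eax = first vertex index for this thread
    mov ecx,sizeof(Vertex3D)
    mul ecx                        ; eax = byte offset into the vertex buffer
    add eax,pSrcVertexBuffer
    mov pThreadSrc,eax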
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 26, 2011, 02:43:45 PM
John,

Rough guess: there is memory contention with multiple threads accessing the same memory range; it may be something like a memory lock for each thread as it accesses the common range. Now I wonder what would happen if you set each memory block at a different location for each thread/core, to try and avoid the potential contention?
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 26, 2011, 02:50:42 PM
That's sort of what I was thinking. The first data arrangement tried to avoid that by having each core work on a block of say 20MB, but I'm wondering if it's not to do with cache lines and their associativity... the problem is the vertex data needs to be 16-byte aligned, so the memory-address-to-cache-line mapping is ALWAYS going to conflict?
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 26, 2011, 04:42:35 PM
What about, with 4 threads, four separate memory allocations that are large enough to ensure one does not overlap the other? The size need only be big enough to break the cache. Now the problem is you may end up with page thrashing instead, which could be a lot slower again. It depends very much on the internal architecture of the processor as to what the best technique is to get parallel memory reads and writes without contention between the reads and writes across different threads. If it is possible, you would be after each core being able to access a different memory address without contention between them, and this characteristic is likely to be very processor-dependent.
Title: Re: Memory Speed, QPI and Multicore
Post by: dedndave on August 26, 2011, 05:05:47 PM
ideally, you would use CPUID to identify packages, cores, and cache configuration
then, base the algorithm parameters on that information
i have to imagine that some high-end software, somewhere does something along these lines   :red
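something like this would get the basics (untested sketch - the cache line size from the cpuid leaf 1 CLFLUSH field, and the logical CPU count from GetSystemInfo):

; untested sketch - cache line size and logical CPU count
.data
sysInfo     SYSTEM_INFO <>
lineSize    dd 0
logicalCPUs dd 0
.code
    push ebx
    mov eax,1
    cpuid
    movzx eax,bh              ; ebx bits 15:8 = CLFLUSH size in qwords
    shl eax,3                 ; x8 = cache line size in bytes (64 on these CPUs)
    mov lineSize,eax
    pop ebx
    invoke GetSystemInfo,ADDR sysInfo
    mov eax,sysInfo.dwNumberOfProcessors
    mov logicalCPUs,eax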
Title: Re: Memory Speed, QPI and Multicore
Post by: dioxin on August 26, 2011, 05:45:28 PM
Have you tried comparing your results with those from a standard benchmark program?
This one has a free "lite" version which does memory and cache speed tests:
http://www.sisoftware.net/

Paul.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 29, 2011, 01:34:30 PM
So I decided to download and try Intel's VTune to see if it would shed some light on where things are going wrong here... It shows some interesting results, but to be honest I'm struggling to interpret them or derive any idea as to how to improve the situation.

For reference, here is the function... it's been identified by VTune as the critical area (obviously).


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx
pop eax
pop esi
pop edi
pop ecx

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

align 4
@@:
movaps xmm4,[esi] ;x1 y1 z1 w1
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm6
addps xmm5,xmm7
addps xmm5,xmm3

movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef


The output from VTune is as follows:

CPI Rate: 1.820 (ideal would be 0.25; a high value is caused by long-latency memory, instruction starvation, stalls or branch misprediction)
Retire Stalls: 0.710
LLC Miss: 0.204 (a high number of cycles spent waiting for LLC (last-level cache) loads)
Execution Stalls: 0.511 (the percentage of cycles with no micro-operations executed is high; look for long-latency operations at code regions with high execution stalls)


Function:                                   TransformVertexBatch4
CPU_CLK_UNHALTED.THREAD:                    25,078,000,000
INST_RETIRED.ANY:                           6,592,000,000
CPI Rate:                                   3.804
Retire Stalls:                              0.813
LLC Miss:                                   0.317
LLC Load Misses Serviced By Remote DRAM:    0.000
Contested Accesses:                         0.000
Instruction Starvation:                     -0.108
Branch Mispredict:                          0.001
Execution Stalls:                           0.702

Zooming into the code view of that function:
(Attached as CSV)

I can immediately see a massive delay on the xmm4 load from ESI, which makes me think it's really struggling to get the data and that this data probably isn't in the L3 cache... which is odd, as I've set up the outer code to block in chunks of 4096 vertices (+- 192KB).

Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 29, 2011, 01:57:38 PM
John,

An unusual suggestion: load your stack variables in an orthodox manner so that your PUSH/POP ratio is identical, and see if this affects the timing. Also make sure you use RET rather than the JMP at the end. I have seen code drop dead in the past from having junk like this in front of it. For a slightly higher instruction count (MOV versus POP) you may solve some of the problem here.
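Something along these lines is what I mean (rough sketch, not tested, just your loop body inside an orthodox frame; the name is only changed so it does not clash with yours):

align 16
TransformVertexBatch4o proc uses esi edi pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD

    mov eax,pDestVertexBuffer   ; read the arguments the orthodox way
    mov esi,pSrcVertexBuffer
    mov edi,pMatrix
    mov ecx,vertexCount

    movaps xmm0,[edi]           ;aeim
    movaps xmm1,[edi+16]        ;bfjn
    movaps xmm2,[edi+32]        ;cgko
    movaps xmm3,[edi+48]        ;dhlp

  align 4
  @@:
    movaps xmm4,[esi]           ;x1 y1 z1 w1
    pshufd xmm5,xmm4,00000000b  ;x1 x1 x1 x1
    pshufd xmm6,xmm4,01010101b  ;y1 y1 y1 y1
    pshufd xmm7,xmm4,10101010b  ;z1 z1 z1 z1

    add esi,sizeof(Vertex3D)

    mulps xmm5,xmm0
    mulps xmm6,xmm1
    mulps xmm7,xmm2

    addps xmm5,xmm6
    addps xmm5,xmm7
    addps xmm5,xmm3

    movaps [eax],xmm5
    add eax,sizeof(Vertex3D)

    dec ecx
    jnz short @B

    ret                         ; balanced frame, RET instead of JMP

TransformVertexBatch4o endp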
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 29, 2011, 02:05:09 PM
OK, tried that; it's slightly slower than the unorthodox pop/jmp method. The bulk of the delay in this function seems to come from waiting for the movaps xmm4,[esi] (the next instruction stalls immensely), and then from the first of the 3 mulps and the last addps.
The movaps [eax],xmm5 is also slowed down, I guess because it's waiting for the previous addps.
Title: Re: Memory Speed, QPI and Multicore
Post by: dioxin on August 29, 2011, 02:05:52 PM
I'm not 100% clear what your information shows.
What are the numbers in the spreadsheet next to instructions? Clock cycles that the instruction takes over a number of loops? Why do some have no figure?
What's BTK?


Anyway, things I'd look at are:
Put the movaps xmm4,[esi] AHEAD of the loop and then repeat it as the instruction after add esi,sizeof(Vertex3D).
That way the load of the next value can take place while the calculation on the current value is happening.


AMD CPUs prefer loops aligned on a 16- or 32-byte boundary, not 4-byte. Maybe Intel is different, but it's worth a try.


Use PREFETCH to begin the fetching of data well in advance so it's in the cache when it's needed.

Paul.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 29, 2011, 02:42:52 PM
BTK is a macro (branch-taken hint). I've tried with and without it; there is no difference, so I've taken it out.
I've aligned the loop to 16 and taken your suggestion to move the load out of the loop and repeat it inside, to hide some of the load latency. I've also added a prefetchnta back in.

The delay now happens on the first mulps, which is stalling waiting for the movaps xmm4,[esi] AGAIN... even with the prefetch hint. The other odd thing is that I tried changing the movaps [eax],xmm5 to movntdq, but that ends up slowing the whole thing down massively rather than helping. I've also removed all the threading and am trying to profile this as a single execution against a large buffer, so it's now just iterating sequentially (in a predictable manner) through a 200MB buffer of vertices.

updated code:


option prologue:none
option epilogue:none
align 16
TransformVertexBatch4 proc pDestVertexBuffer:DWORD, pSrcVertexBuffer:DWORD, pMatrix:DWORD, vertexCount:DWORD
pop ebx
pop eax
pop esi
pop edi
pop ecx

movaps xmm4,[esi] ;x1 y1 z1 w1

movaps xmm0,[edi] ;aeim
movaps xmm1,[edi+16] ;bfjn
movaps xmm2,[edi+32] ;cgko
movaps xmm3,[edi+48] ;dhlp

dec ecx

align 16
@@:
pshufd xmm5,xmm4,00000000b ;x1 x1 x1 x1
pshufd xmm6,xmm4,01010101b ;y1 y1 y1 y1
pshufd xmm7,xmm4,10101010b ;z1 z1 z1 z1

add esi,sizeof(Vertex3D) ; do this here to allow xmm4 to load.
prefetchnta [esi+128]
movaps xmm4,[esi] ;x1 y1 z1 w1

mulps xmm5,xmm0
mulps xmm6,xmm1
mulps xmm7,xmm2

addps xmm5,xmm3
addps xmm5,xmm6
addps xmm5,xmm7

movaps [eax],xmm5
add eax,sizeof(Vertex3D)

dec ecx
;BTK
jnz short @B

jmp ebx
TransformVertexBatch4 endp
option prologue:PrologueDef
option epilogue:EpilogueDef
Title: Re: Memory Speed, QPI and Multicore
Post by: dioxin on August 29, 2011, 02:53:44 PM
John,
put the PREFETCH hint AFTER the movaps xmm4,[esi], not before it, and try values of esi + 1024 or more.
I'm not sure how big your data is, but if it's many KB or many MB then you need to prefetch quite a long way in advance. If your data set is quite small then you need to try to prefetch it all ahead of the loop that processes it.

I get a speed improvement of near 50% with the PREFETCH. Do you get any improvement at all?

Paul.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 29, 2011, 03:01:39 PM
I moved the prefetch around to a few places and changed the offset from 900 to 2048. Same results. The data is quite large: there are 4 million vertices at 48 bytes each. The prefetch doesn't seem to be helping me at all, and neither does a streaming store, which I thought would help in this case given such a huge data set... hmm.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 29, 2011, 03:16:10 PM
OK... I tried something quite different. At the moment my source data and destination data are two totally separate large buffers...

i.e. [ESI] => 200MB of input data
[EAX] => a separate 200MB of output data

If I change the store from going to [EAX] to storing back over the original vertex, the code doubles in speed. VTune is still telling me that there are massive delays, but the overall loop goes from 16ms to 10ms. Obviously this isn't ideal, because I don't want to overwrite my original vertex... but what it does suggest is that perhaps I should change the data structure so the original and transformed versions are adjacent, rather than in two TOTALLY separate buffers...

So overall the performance is much better (I guess due to the locality of the data being read and written)... however, having this executed by multiple threads still doesn't add any performance... and with all the issues VTune reports, I'm sure this code is still far from optimal. Something like the layout sketched below is what I have in mind.
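(Sketch only; the struct and field names here are made up for illustration, and Vertex3D stays whatever it already is.)

; Keep the source vertex and its transformed result adjacent in memory
; instead of in two separate 200MB buffers.
VertexPair STRUCT
    src  Vertex3D <>        ; original vertex (read)
    dst  Vertex3D <>        ; transformed result (written right next to it)
VertexPair ENDS

; the loop would then step one VertexPair at a time, roughly:
;   movaps xmm4,[esi].VertexPair.src
;   ...transform...
;   movaps [esi].VertexPair.dst,xmm5
;   add esi,sizeof(VertexPair)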
Title: Re: Memory Speed, QPI and Multicore
Post by: hutch-- on August 29, 2011, 03:18:56 PM
John,

With data of that size you must be getting some memory page thrashing and this may be the limiting factor or at least one of them.
Title: Re: Memory Speed, QPI and Multicore
Post by: johnsa on August 29, 2011, 03:24:12 PM
I reorganised my vertices according to the above post so I can see the result; the improvement holds, 16ms down to 10ms (better), but I'm sure there's a way to get more from multiple cores on this (although it is now hitting 10.4GB/s worth of data processed,
at 32 bytes per iteration: a 16-byte vertex read in and written out again).