Memory Speed, QPI and Multicore

Started by johnsa, August 24, 2011, 10:32:47 AM


johnsa

Hey all,

So I've started some threads in the past on this topic, but I've recently bought a new machine, done some more coding, and wanted to pose some questions along with my findings.

Firstly, the machine specs: Core i7 970 (4 cores + HT = 8 logical cores), 2.8 GHz, 6GB DDR3 2000MHz OCZ, QPI = 6.4 GT/s.
Based on this machine spec the research I've done should yield memory transfer rates as follows:

3.2 GHz (QPI 6.4 GT/s / 2)
x 2 bits/Hz (double data rate)
x 20 (QPI link width) <--- this is the number of "lanes"
x (64/80) (data bits/flit bits)
x 2 (bidirectional)
/ 8 (bits/byte)
= 25.6 GB/s total, or 12.8 GB/s in each direction

So in theory a write operation should (bearing in mind other overheads) get up to somewhere around 20 GB/s, and a copy should max out at about 12 GB/s in each direction.

So based on this I went back to some memory transfer profiling code as follows:


; 517ms = 2Gb. (8Gb/s 4in/4out). (200Mb buffer copied 10 times).
; 341ms = 2Gb. (12Gbs 6in/6out). (2Mb buffer copied 1000 times).
; 328ms = 2Gb. (same). (200kb buffer copied 10000 times).
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/64
align 4
copy2:
movdqa xmm0,[esi]
movdqa xmm1,[esi+16]
movdqa xmm2,[esi+32]
movdqa xmm3,[esi+48]
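; movntdq stores are non-temporal: they bypass the cache, so the destination
; lines are not read in and evicted for every 64-byte block written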
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add esi,64
add edi,64
dec ecx
jnz short copy2

timer_end
    print ustr$(eax)," ms, SIMD COPY",13,10
   
; 295ms = 2Gb. (6GB/s out only). (200Mb buffer copied 10 times).
; 248ms = 2Gb. (6GB/s out only). (2Mb buffer copied 1000).
timer_begin 1000, HIGH_PRIORITY_CLASS

mov edi,buffer2
mov ecx,MEM_SIZE/64
pxor xmm0,xmm0
pxor xmm1,xmm1
pxor xmm2,xmm2
pxor xmm3,xmm3
align 4
write0:
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add edi,64
dec ecx
jnz short write0

timer_end
    print ustr$(eax)," ms, SIMD WRITE",13,10


What this seems to show is that read and write operations can run in parallel and each appears to be capped separately (i.e. a loop that only writes doesn't get the benefit of the reduced traffic and still saturates at the same point).
I'm fine with that. What is odd is that the figures I'm getting are FAR lower than the theoretical limit.

With the copy loop I have:

   ; 517ms = 2Gb. (8Gb/s 4in/4out). (200Mb buffer copied 10 times).
   ; 341ms = 2Gb. (12Gbs 6in/6out). (2Mb buffer copied 1000 times).
   ; 328ms = 2Gb. (same). (200kb buffer copied 10000 times).

With the write-only loop:

   ; 295ms = 2Gb. (6GB/s out only). (200Mb buffer copied 10 times).
   ; 248ms = 2Gb. (6GB/s out only). (2Mb buffer copied 1000).

This seems to be half or a quarter of what it should be; the 341 ms copy, for example, works out to 2 GB / 0.341 s, which is about 5.9 GB/s of data moved, or roughly 11.7 GB/s of combined read+write traffic.
Firstly, am I missing something here? Perhaps others with similar-spec machines could test these loops (or suggest improvements to get closer to the theoretical limit).

The second part of the exercise, and the reason I've run these tests, is to establish a saturation point for other algorithms that I'm looking to multi-thread (make multi-core aware).
If I know that the memory transfer limit for a combined read/write loop is 12 GB/s, and I have an algorithm/procedure that currently generates 3 GB/s worth of data transfer, my logic says the remainder of the time is spent in computation etc., which should allow for parallel execution (in this case 4x). I.e. creating 4 threads should saturate at the 12 GB/s, and adding more threads/cores would bring no benefit. The idea is to identify this programmatically so I can decide how best to allocate tasks to cores, as I've discussed in my previous posts.
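As a rough worked example of that reasoning (using the figures above, so an estimate only, not a measurement):

12 GB/s (combined read/write ceiling observed)
/ 3 GB/s (traffic generated by one instance of the algorithm)
= ~4 instances before the memory bus, rather than the core count, becomes the bottleneck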

My test case was once again a batched vertex/matrix transform, which at present takes 40ms for 3.6 million vertices on one core (roughly 3.6 GB of data, well below the saturation point of memory/QPI).
I added a second core and the time went down from 40ms to 33ms; adding a third and fourth brought no further improvement, so I'm confused as to where the issue is. In theory, even if my benchmarks above are correct (and the spec is a complete over-estimation), adding the second core should have roughly doubled performance, as there is no locking or sync required. I tried different arrangements, with the data interleaved between the two cores as well as split into totally separate batches (which seems to be faster; the only reason I can think of is that the interleaving caused cache problems, with both cores updating data in the same cache line).
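On the cache-line point, one simple guard against two cores writing into the same 64-byte line is to pad each thread's working slot out to a full line. A minimal sketch of the data layout (the names here are made up, not from my actual test code):

.data
thread0_counter dd 0
                db 60 dup (0)   ; pad slot 0 out to a full 64-byte cache line
thread1_counter dd 0
                db 60 dup (0)   ; pad slot 1 the same way

With the two dwords 64 bytes apart they can never land in the same cache line, so the cores stop invalidating each other's line on every update.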

MichaelW

I don't really know what to make of your results, but I do have some questions. Are you restricting the test code to a particular core or cores? Is there a way to effectively disable HT for the test, to eliminate the overhead of two logical cores sharing the same execution core? And how does code that uses REP MOVSD/STOSD instead of SSE compare?

johnsa

With the memory profiling code I'm not running any threads, just single-core, as it's going to saturate anyway.

In terms of REP MOVSD/STOSD, they come in about 15% slower than the posted version (which I took through a few iterations to get it faster than the REP versions, as the idea is to get as close to that theoretical limit as possible).

With regards to my batched vertex/matrix transform, I don't take HT into account; I have it running on either 2 or 4 threads with core affinities set. I would hope that the scheduler/OS would favour real cores over HT ones; if not, it could reduce the efficiency of 2 threads, but then switching to 4 threads should be running on at least 2 real, separate cores, and both combinations seem to top out at well under the memory-bus saturation point I would have expected.

I know I have had this argument before with people, but I have yet to see someone post a REAL example that actually scales with multiple cores. My example should scale between 1-4 cores nearly linearly (based on the lack of locking required and memory utilisation).
At best I constantly see the same results come out from EVERY single piece of test code:
1 core : X ms
2 cores: 80% of X ms
3 cores: 75% of X ms
4 cores and up: still 75% of X ms...
Throwing more cores at a problem never seems to yield more than a 20-30% improvement in overall performance, even when the memory/bus should allow it.


MichaelW

For my first question, it seems to me that if the system controls what runs where, the test code will not necessarily run on a core by itself or necessarily run on the same core throughout the test.

johnsa

I do force the affinity for the threads, so hopefully they'll run on the same core throughout... but like most things in Windows, the documentation says the OS can and will override anything it wants if it feels it needs to :)
So, given the constraints of running under an OS, I do everything in my power to ensure the threads are created, bound to a specific core, and stay on it.
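For what it's worth, this is roughly what that binding looks like; a cut-down sketch rather than my actual worker setup (hThread and WorkerProc are just placeholder names):

; create the worker suspended, pin it to logical core 0, then let it run
invoke CreateThread, NULL, 0, OFFSET WorkerProc, NULL, CREATE_SUSPENDED, NULL
mov hThread, eax
invoke SetThreadAffinityMask, hThread, 1      ; affinity mask bit 0 = core 0
invoke SetThreadIdealProcessor, hThread, 0    ; hint the scheduler as well
invoke ResumeThread, hThread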

hutch--

This is just from memory, but apparently an additional real core yields about a 1.8x processing increase, where a hyperthreaded core adds about a 30% improvement. It may be worth trying the test piece on a Core2 series processor that does not have hyperthreading, to see if the core increase is anything like linear.

From the reference material I remember, hyperthreading is effectively a faster form of task switching on any given core; it is not an independent core.

johnsa

Tried it on my Core2 duo, same sort of relative gains.
On my i7 I've now even tried distributing the tasks across cores 0,2,4,6 instead of 0,1,2,3 to see if that would alter the outcome or avoid the HT cores... no difference.

hutch--

On my i7 running Win7 64-bit you can watch all 8 cores at the same time. It's probably worth seeing what the core load distribution is like, to know if all of them are being used.

johnsa

Ok... I've tried monitoring the cores from process monitor... they don't seem to be going up above 50% on the graph..

I think before I look at this threading issue we should go back to the memory throughput... can anyone else run those tests to see what their read/write and write-only throughput is?
I would expect to be seeing 12 GB/s read and 12 GB/s write (so roughly 25 GB/s combined), assuming you're on an i7 with DDR3 and QPI at 6.4 GT/s.

hutch--

john,

If you put together a test piece, I will happily try it out on both quads, a Core2 quad and an i7 quad.

johnsa

Gladly :)

So I've solved my memory throughput issue... I'm a twit... I'm running OCZ 2000MHz XMP-compliant RAM, but my machine was still set to default timings. Enabling XMP/Profile1 and putting in the right CAS etc. timings doubled my memory throughput!




;* 3.2 GHz (QPI 6.4 GT/s / 2)
;* x 2 bits/Hz (double data rate)
;* x 20 (QPI link width) <--- this is the number of "lanes"
;* x (64/80) (data bits/flit bits)
;* x 2 (bidirectional)
;* / 8 (bits/byte)
;* = 25.6 GB/s total, or 12.8 GB/s in each direction
;  OR 19.2 GB/s total for QPI 4.8 GT/s at 2.4 GHz

include c:\masm32\include\masm32rt.inc

.data?

.data

buffer1 dd 0
buffer2 dd 0
MEM_SIZE equ (1024*1024)*200 ; 200Mb

.code

start:
invoke VirtualAlloc,NULL,MEM_SIZE,MEM_COMMIT,PAGE_EXECUTE_READWRITE
mov buffer1,eax
invoke VirtualAlloc,NULL,MEM_SIZE,MEM_COMMIT,PAGE_EXECUTE_READWRITE
mov buffer2,eax
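; note: VirtualAlloc returns page-aligned (4KB) blocks, so the 16-byte
; alignment that movdqa needs on the source buffer is always satisfied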

; 586ms = 2Gb.
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/4
rep movsd

timer_end
    print ustr$(eax)," ms, REP MOVSD",13,10
; 1150ms = 2Gb.
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/4
align 4
copy0:
mov eax,[esi]
mov [edi],eax
add esi,4
add edi,4
dec ecx
jnz short copy0

timer_end
   print ustr$(eax)," ms, MOV COPY",13,10

; 904ms = 2Gb.
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/16
align 4
copy1:
mov eax,[esi]
mov ebx,[esi+4]
mov [edi],eax
mov [edi+4],ebx

mov eax,[esi+8]
mov ebx,[esi+12]
mov [edi+8],eax
mov [edi+12],ebx

add esi,16
add edi,16
dec ecx
jnz short copy1

timer_end
   print ustr$(eax)," ms, UNROLLED MOV COPY",13,10

; 517ms = 2Gb. (8Gb/s 4in/4out). (200Mb buffer copied 10 times).
; 341ms = 2Gb. (12Gbs 6in/6out). (2Mb buffer copied 1000 times).
; 328ms = 2Gb. (same). (200kb buffer copied 10000 times).
; 200ms = 2Gb. 19.5Gb/s. (9.75Gb/s in and out) (2Mb buffer copied 1000 times).
; 270ms = 2Gb. 14.8Gb/s  (200Mb copied 10 times).
timer_begin 10, HIGH_PRIORITY_CLASS

mov esi,buffer1
mov edi,buffer2
mov ecx,MEM_SIZE/64
align 4
copy2:
movdqa xmm0,[esi]
movdqa xmm1,[esi+16]
movdqa xmm2,[esi+32]
movdqa xmm3,[esi+48]
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add esi,64
add edi,64
dec ecx
jnz short copy2

timer_end
    print ustr$(eax)," ms, SIMD COPY",13,10
   
; 295ms = 2Gb. (6GB/s out only). (200Mb buffer copied 10 times).
; 248ms = 2Gb. (6GB/s out only). (2Mb buffer copied 1000).
; 156ms = 2Gb 12.8Gb/s out only. (2Mb buffer copied 1000).
; 156ms = 2Gb 12.8Gb/s out only. (200Mb buffer copied 10 times).
; 128ms = 2Gb 15.6Gb/s out only. (200Mb buffer copied 10 times).
timer_begin 10, HIGH_PRIORITY_CLASS

mov edi,buffer2
mov ecx,MEM_SIZE/64
pxor xmm0,xmm0
pxor xmm1,xmm1
pxor xmm2,xmm2
pxor xmm3,xmm3
align 4
write0:
movntdq [edi],xmm0
movntdq [edi+16],xmm1
movntdq [edi+32],xmm2
movntdq [edi+48],xmm3
add edi,64
dec ecx
jnz short write0

timer_end
    print ustr$(eax)," ms, SIMD WRITE",13,10
   
invoke VirtualFree,buffer1,MEM_SIZE,MEM_RELEASE
invoke VirtualFree,buffer2,MEM_SIZE,MEM_RELEASE
invoke ExitProcess,0

end start


ml /c /coff test.asm
link /nologo /release /machine:ix86 /subsystem:console test

So in review, I'm now getting close to max throughput: 15.6 GB/s pure write, about 15 GB/s combined read/write on large data (200MB), and 19.5 GB/s combined read/write when the data is smaller (2MB).

Understandably, this has doubled the performance of my other test piece, the vertex/matrix transforms. Now back to investigating the multicore issue, as it should have even more room to scale before saturating memory.

hutch--

John,

This is the result on my Core2 quad.


911 ms, REP MOVSD
936 ms, MOV COPY
938 ms, UNROLLED MOV COPY
589 ms, SIMD COPY
302 ms, SIMD WRITE
Press any key to continue ...


Just a comment: you will get many more people who will try it if you build the example. I had to find timers.asm and add the include line to get it to work.

hutch--

Just had a quick play with the mov copy and nothing makes it faster. The cache pollution seems to be the limiting factor on write.


    .data?
      esp_ dd ?
    .code
    push ebx
    mov esi,buffer1
    mov edi,buffer2
    mov ecx,MEM_SIZE/16
    xor edx, edx
    push ebp
    mov esp_, esp
  align 4
  copy0:
    mov eax, [esi+edx]
    mov ebx, [esi+edx+4]
    mov ebp, [esi+edx+8]
    mov esp, [esi+edx+12]
    mov [edi+edx], eax
    mov [edi+edx+4], ebx
    mov [edi+edx+8], ebp
    mov [edi+edx+12], esp
    add edx, 16
    dec ecx
    jnz short copy0
    mov esp, esp_
    pop ebp
    pop ebx

timer_end
   print ustr$(eax)," ms, MOV COPY",13,10
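
One more variation that might be worth a try (untested here): MOVNTI is the plain-register counterpart of MOVNTDQ, so the stores skip the cache while the loads stay ordinary MOVs, which should take the destination lines out of the pollution picture:

    mov esi,buffer1
    mov edi,buffer2
    mov ecx,MEM_SIZE/4
    xor edx,edx
  align 4
  ntcopy0:
    mov eax,[esi+edx]
    movnti [edi+edx],eax    ; non-temporal store, no destination line fill
    add edx,4
    dec ecx
    jnz short ntcopy0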

hutch--

Here are the timings on my i7.


434 ms, REP MOVSD
604 ms, MOV COPY
547 ms, UNROLLED MOV COPY
308 ms, SIMD COPY
194 ms, SIMD WRITE
Press any key to continue ...

dedndave

not sure if John is interested in my old P4 results   :P

i am using a p4 prescott w/htt, SSE3 @ 3 GHz

i changed the preamble to get it to assemble...
        include \masm32\include\masm32rt.inc
        .686
        .mmx
        .xmm
        include \masm32\macros\timers.asm


i also added the following code to control the h/t cores (threads - whatever)
start:
        invoke  GetCurrentProcess
        invoke  SetProcessAffinityMask,eax,dwCoreMask
        invoke  Sleep,750


i set dwCoreMask to either 1 to bind to a single core, or 3 to allow both
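
dwCoreMask isn't shown in the snippet; a plain dword (or an equate) is all it needs to be, for example:

        .data
dwCoreMask  dd 1    ; 1 = first logical core only, 3 = both logical cores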

dwCoreMask = 1
2243 ms, REP MOVSD
1995 ms, MOV COPY
2021 ms, UNROLLED MOV COPY
1198 ms, SIMD COPY
522 ms, SIMD WRITE

2238 ms, REP MOVSD
2027 ms, MOV COPY
2011 ms, UNROLLED MOV COPY
1196 ms, SIMD COPY
523 ms, SIMD WRITE


dwCoreMask = 3
2166 ms, REP MOVSD
1897 ms, MOV COPY
1929 ms, UNROLLED MOV COPY
1197 ms, SIMD COPY
520 ms, SIMD WRITE

2253 ms, REP MOVSD
1996 ms, MOV COPY
1985 ms, UNROLLED MOV COPY
1219 ms, SIMD COPY
521 ms, SIMD WRITE