With a recent set of algos and tests with benchmarks using Michael's timing macros, I was a little perplexed at the level of variation on the PIVs I use, so I tried out a different test today to see if I could make sense of why such a reliable timing method still suffered from a level of variation that was not being seen on other hardware.
I have been working with PIVs since about 2001, and with a lot of practice I have found that they are very sensitive to leading and trailing opcode selection: effectively, the speed of a given set of instructions will be influenced by what precedes it and what comes after it.
The test piece runs an empty proc with a selectable count of NOPs (db 90h) as padding AFTER the test code, to do what is reasonably normal after a potential stall: fill up the free cycles with other instructions. NOPs were chosen because they are reasonably well behaved in terms of throughput across multiple pipelines. The algos are tested with the same procedure but with the test code preceding the NOPs, and the timings have been adjusted by subtracting the second empty timing from each of the algorithm timings.
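Something along these lines, as a sketch only (the equate name and the count here are illustrative, not the actual test piece from the attachment):

NCNT equ 16                   ; trailing NOP count, varied between runs

align 16
algo0:
    ; algorithm under test goes here (left empty for the overhead shell)
    REPEAT NCNT
      nop                     ; db 90h padding after the test code
    ENDM
    ret

Timing the empty version of the shell gives the overhead figure that is subtracted from each algo timing.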
With a low count of NOPs the test algos are reasonably predictable, but as the NOP count increases very substantial timing differences occur, and in many cases some of the algos come in faster than the empty test piece even though they do a lot more work than it does.
The variation in the original set of tests using Michael's timing code on my PIVs in fact has nothing to do with the macro design; it just happens to be a characteristic of instruction scheduling on PIVs, and it is a phenomenon I have seen from my earliest 1.5 GHz PIV to the latest EMT64 that I built a year or so ago.
To test this out if you are using a PIV, just vary the number of NOPs via the equate at the beginning and you will see some very strange numbers pop up when you run the tests.
[attachment deleted by admin]
1. Does a Q6600 count as a PIV?
2. If so, do minus times count as strange numbers?
sinsi,
It probably only qualifies as a Q6600, but if you are getting mixed positive / negative numbers then it probably has similar characteristics to a PIV, and this means that any given sequence of instructions is dependent on its leading and trailing mnemonics. The interesting part is that some of the algos that have negative results are actually faster than the empty algo shell that is used to clock the overhead.
The idea came from something that is well known on PIV cores: a drop-through jump is badly predicted the first time and often produces a stall, and what I was trying to do was determine whether the trailing NOPs would be absorbed by the hole left by the stall. Now it seems, in part, that the algos most affected by the trailing NOPs did not leave as big a hole as those that had a conditional jump in them. The effect, before the NOPs were added, appears to have been masked by the RET afterwards, which often produces a stall.
I'm not quite sure what's going on here, but I certainly get weird results on an AMD. All over the place, lots of negatives.
Jim,
It was just a hunch, from a practice I have used for some years, of inserting NOPs into critical code to try and find holes left by stalls. Often you can plug a couple of NOPs in after an instruction and it will not make the algo any slower, but add one or two more and the timing drops by a large amount.
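For example (the instruction choice here is invented purely for illustration):

    mul ecx         ; long-latency instruction that may leave a hole
    nop             ; one or two of these can come for free...
    nop
    nop             ; ...but one more may shift the timing by a lot
    add eax, edx    ; dependent follow-on work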
With the test piece I am guessing that the CALL / RET code was masking much of the difference between the algos, but adding the trailing NOPs seems to have changed the results a lot, even though the NOP count for each algo is the same.
If I have it right, the algos that are most affected by the trailing NOPs are the ones that are fastest by themselves, whereas the ones that absorb the NOPs have a trailing stall that was being masked by the RET.
If the timing macros use RDTSC and you're running on a multi-core computer, the problem is probably that the timestamp counters of the CPUs are out of sync. When a thread changes to another core, the timing goes off. One way to solve the problem is to assign processor affinities to your threads if that's acceptable for your program. If not, you may need to do lots of funky stuff to determine the differences between the timestamp counters. I had huge problems trying to benchmark my screensaver on multi-core CPUs until I realised this problem.
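For instance, the timing thread can be pinned to the first CPU before the test and restored afterwards. A minimal sketch in MASM (mask value 1 means CPU 0):

.data?
  oldmask dd ?

.code
    invoke GetCurrentThread
    invoke SetThreadAffinityMask, eax, 1        ; pin to CPU 0
    mov oldmask, eax                            ; previous mask (0 on failure)

    ; ... RDTSC-based timing runs here, always on the same core ...

    invoke GetCurrentThread
    invoke SetThreadAffinityMask, eax, oldmask  ; restore the old mask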
Hope this helps! :bg
I don't have any way to test this, but I have doubts that the main thread in a normal app would change processors while it was running.
Quote from: MichaelW on June 01, 2008, 09:30:36 AM
I don't have any way to test this, but I have doubts that the main thread in a normal app would change processors while it was running.
The main thread in a normal app frequently changes processors unless you set its affinity. Perhaps it's not in every operating system or under every operating condition, and the thread scheduler is probably entered at least several times before a switch is likely to have occurred, but at least when I was testing my screensaver, the thread was switching CPUs, resulting in some negative and inflated results. The way to check this without even writing any code is to go to the Task Manager under the Performance tab (if you're running Windows) and run a CPU-intensive, single-threaded program. It tends to favour one CPU over the other, but the sum of the time will be 100% of one CPU, split between two CPUs. I was testing on a dual-core AMD CPU running Windows Vista 64-bit Business, if that makes a difference.
The QueryThreadCycleTime function can be used to get cycle timing for a given thread, i.e. excluding time not spent in the thread, but the main downside of doing that instead of setting the processor affinity is that the system call could take a few thousand clock cycles every time you want to read the cycle time. Plus, if you set the processor affinity, you'll mostly ensure that the thread stays in its CPU's cache, but there may be issues with interrupts taking up that CPU's time if you've got I/O going.
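A minimal sketch of what that looks like in MASM (Vista or later; the prototype is declared by hand here on the assumption that it is not in the standard include set):

QueryThreadCycleTime PROTO :DWORD, :DWORD      ; kernel32, Vista and later

.data?
  cyc0 dq ?
  cyc1 dq ?

.code
    invoke GetCurrentThread                    ; pseudo-handle is accepted
    mov esi, eax
    invoke QueryThreadCycleTime, esi, ADDR cyc0
    ; ... code under test ...
    invoke QueryThreadCycleTime, esi, ADDR cyc1
    mov eax, DWORD PTR cyc1                    ; 64-bit subtract: cyc1 - cyc0
    mov edx, DWORD PTR cyc1+4
    sub eax, DWORD PTR cyc0
    sbb edx, DWORD PTR cyc0+4                  ; EDX:EAX = cycles charged to the thread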
Thanks for the info Neo, it sounds like the dual AMD 64 is doing something different if it's distributing a single thread across both cores. My young brother has a dual core AMD, but sad to say he is leaving for Europe tomorrow, so I won't get a chance to get him to test anything for a couple of months.
Quote from: hutch-- on May 31, 2008, 07:26:11 AM
With a recent set of algos and tests with benchmarks using Michael's timing macros, I was a little perplexed at the level of variation on the PIVs I use, so I tried out a different test today to see if I could make sense of why such a reliable timing method still suffered from a level of variation that was not being seen on other hardware.
hmm, with the p4 intel introduced some new stuff: the automatic data prefetcher, a trace cache instead of a code cache, etc... so i suppose they made changes in the p4 to obtain the benefit of that new stuff, but theory and practice... (it could explain the erratic results).
Quote from: Neo on June 01, 2008, 10:13:36 AM
The main thread in a normal app frequently changes processors unless you set its affinity. Perhaps it's not in every operating system or under every operating condition, and the thread scheduler is probably entered at least several times before a switch is likely to have occurred, but at least when I was testing my screensaver, the thread was switching CPUs, resulting in some negative and inflated results. The way to check this without even writing any code is to go to the Task Manager under the Performance tab (if you're running Windows) and run a CPU-intensive, single-threaded program. It tends to favour one CPU over the other, but the sum of the time will be 100% of one CPU, split between two CPUs. I was testing on a dual-core AMD CPU running Windows Vista 64-bit Business, if that makes a difference.
i don't think there are "frequent changes" in the sense you describe; i think the processors "share" the work to do (cut a task in 2 parts, and each processor does "his" part...), then go on to the next piece of work and repeat, etc..., and here is the test that makes me think that (parallelism, with the automatic data prefetching of one thread benefiting the other)...
Results of the tests :
Test 1 (default hardware), done in 1041289 cycles
Test 2 (attempt core2 ADP usage), done in 944306 cycles
Test 3 (attempt core2 ADP usage + different algos), done in 12 cycles
Test 4 (attempt core4 ADP usage), done in 11 cycles
Test 1 (default hardware), done in 5121560 cycles
Test 2 (attempt core2 ADP usage), done in 5182781 cycles
Test 3 (attempt core2 ADP usage + different algos), done in 11 cycles
Test 4 (attempt core4 ADP usage), done in 11 cycles
Press ENTER to quit...
for this test, the first series is with 3mb and the second with 6mb (i have a 4mb l2 cache on my core 2). as you can see, it's possible to outperform intel's hardware prefetch system with a clever use of the cache (but not if the cache has to change...)... i haven't tested it on a (single) p4, but theoretically it should also be faster (because of the automatic data prefetcher)... if it's also faster on a p4, it should also be faster on a p3 if you prefetch the data in the main program or in the thread...
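A rough illustration of the manual prefetch idea (the loop, registers and prefetch distance here are invented for the example; the actual test is in the attachment):

; illustration only: copy loop with a software prefetch hint ahead of
; the read pointer (PREFETCHNTA needs SSE; 512 bytes ahead is a guess)
copy_loop:
    prefetchnta [esi+512]   ; pull a cache line in ahead of use
    mov eax, [esi]
    mov [edi], eax
    add esi, 4
    add edi, 4
    dec ecx
    jnz copy_loop

In real code you would issue roughly one prefetch per 64-byte line rather than one per element, and tune the distance to the loop's latency.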
[attachment deleted by admin]
Quote from: NightWare on June 01, 2008, 10:32:20 PM
i don't think there are "frequent changes" in the sense you describe; i think the processors "share" the work to do (cut a task in 2 parts, and each processor does "his" part...), then go on to the next piece of work and repeat, etc..., and here is the test that makes me think that (parallelism, with the automatic data prefetching of one thread benefiting the other)...
I have absolutely no idea what you're talking about. I was talking about within a single thread, meaning that at any point in time, it is only running on one CPU. CPUs have no native concept of threads; it's the operating system that determines where to schedule a given thread. The operating system (let alone the CPU) doesn't know what my program is trying to do over millisecond time frames, so an operating system trying to improve cache performance by scheduling the thread on different CPUs is infeasible. Plus, in the vast majority of cases, having a single thread running on a different CPU each time means that cache performance is much worse than the thread running on the same CPU each time.
"frequently" is a relative term, but it was usually switching a few times a second, which is frequent enough to completely destroy some timing results.
Quote: ... it sounds like the dual AMD 64 is doing something different if it's distributing a single thread across both cores.
I am seeing the same behavior with my Pentium D and Vista; the single thread is being distributed across both cores. Attached is a screenshot of perfmon and p4timing (I really had to compress the image to get the file size down). My timings were all over the place too.
[attachment deleted by admin]
Thanks for the graph Greg, it shows that the processing load is being shared across both cores, which is an interesting development in that the threaded model is not necessarily the best representation of how multicore hardware is utilised.
Yeah, it's interesting how each core kind of goes opposite the other.
I attached another screenshot from Process Explorer taken when p4timing was running, it also shows that it is running on both cores.
[attachment deleted by admin]
Similar effect, but it does show that cross-core scheduling is occurring within a single thread, and this is probably being done at a hardware level. Won't these be fun with 32 cores? :bg
Greg,
I just had a piece of genius: tweak the source to set the NOP count to zero if it will work, or 1 if it won't, then run it and see if the ratio from one core to the other changes. The NOP padding does not address memory at all and it may just get shoved off to another core as there are no dependencies.
Setting the NOP count to 0 did change the ratio from one core to another. I attached the screen shot.
----------------------
empty procedure timing
----------------------
3156 empty procedure
-------------
positive pass
-------------
609 abs0 herge
656 abs1 evlncrn8
906 abs2 jimg 1
657 abs3 rockoon 1
657 abs4 rockoon 2
657 abs5 drizz
844 abs6 jj2007
703 abs7 hutch 1
703 abs8 hutch 2
797 abs9 Nightware 1
875 abs10 Nightware 2
656 abs11 jimg 2
-------------
negative pass
-------------
672 abs0 herge
641 abs1 evlncrn8
657 abs2 jimg 1
640 abs3 rockoon 1
656 abs4 rockoon 2
657 abs5 drizz
657 abs6 jj2007
672 abs7 hutch 1
657 abs8 hutch 2
688 abs9 Nightware 1
657 abs10 Nightware 2
641 abs11 jimg 2
[attachment deleted by admin]
Thanks for testing the idea, it means the cores are sharing the single thread load with no tricks.
I finally got an opportunity to run some related tests on a Core 2 Duo, under Windows XP SP3. In the code below the timed routine essentially uses a simulation of the birthday problem to determine the probabilities over a limited range. It runs in about 100s on my 500MHz P3, versus about 15s on the Core 2 Duo. IIRC the processor is 2.2GHz / 1066MHz / 4MB, and a modified version of my CPUID code returns:
GenuineIntel Family 6 Model 15 Stepping 13
Intel Brand String: <ignored, did not have time to update the code for this>
Features: FPU TSC CX8 CMPXCHG16B CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3
Typical results:
systemAffinityMask: 00000003
processAffinityMask: 00000003
5 2.7
6 4.0
7 5.6
...
15625ms
processAffinityMask: 00000001
...
15625ms
processAffinityMask: 00000002
...
15593ms
The time for a process affinity mask of 10b was consistently lower than the other two. I did not have time to capture the traces for the CPU Usage History. For all of the tests, the CPU Usage in the left panel showed close to 50%. For the first test the traces showed the load divided between the cores roughly 85/15, and for the second and third tests 100/0 and 0/100.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
probability REAL8 0.0
bdays dd 24 dup(0)
hProcess dd 0
processAffinityMask dd 0
systemAffinityMask dd 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
align 16
calc:
    mov ebx, 5                          ; group size, 5 through 23 people
  .WHILE ebx < 24
    xor ebp, ebp                        ; trials that produced a shared birthday
    mov esi, 1000000                    ; trials per group size
    .WHILE esi
        xor edi, edi
        .WHILE edi < ebx                ; fill the group with random birthdays 1-365
            invoke nrandom, 365
            inc eax
            mov [bdays+edi*4], eax
            inc edi
        .ENDW
        xor ecx, ecx
        .WHILE ecx < ebx                ; compare every pair for a match
            xor edx, edx
            .WHILE edx < ebx
                .IF ecx != edx
                    mov eax, [bdays+ecx*4]
                    .IF [bdays+edx*4] == eax
                        inc ebp         ; match found, this trial is done
                        jmp @F
                    .ENDIF
                .ENDIF
                inc edx
            .ENDW
            inc ecx
        .ENDW
      @@:
        dec esi
    .ENDW
    push ebp                            ; matches / 10000 = percent probability
    fild DWORD PTR [esp]                ; (matches out of 1000000 trials, x 100)
    push 10000
    fild DWORD PTR [esp]
    add esp, 8
    fdiv
    fstp probability
    invoke crt_printf, chr$("%d%c%3.1f%c"), ebx, 9, probability, 10
    inc ebx
  .ENDW
    ret
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke GetCurrentProcess
    mov hProcess, eax
    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask
    print "systemAffinityMask: "
    print uhex$(systemAffinityMask),13,10
    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10
    invoke SetPriorityClass, hProcess, HIGH_PRIORITY_CLASS

    ; pass 1: whatever affinity the process started with (both cores here)
    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    ; pass 2: restrict the process to core 0
    invoke SetProcessAffinityMask, hProcess, 1
    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask
    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10
    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    ; pass 3: restrict the process to core 1
    invoke SetProcessAffinityMask, hProcess, 2
    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask
    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10
    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    invoke SetPriorityClass, hProcess, NORMAL_PRIORITY_CLASS
    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Cool results. It looks like the timing discrepancy could be due to I/O interrupts (including timers) all being handled on core 0 and the thread scheduler preferring core 0 (which may be because of the order in which hardware gives locked access to memory). On core 1, the lower time would be due to not having to handle interrupts or switch cores.
Quote: On core 1, the lower time would be due to not having to handle interrupts or switch cores.
That seems to me to be the most likely explanation. The main thing I was trying to determine was whether splitting a thread across two cores would slow it down. It would not seem to, but I'm not convinced that the thread is being run on both cores. I originally intended to have each of the 1000000 loops determine which core it was running on and increment a count that would be displayed after the loop terminated, but I ran out of time to implement this, or even to determine whether it would be reasonably possible.
MichaelW,
I ran your program on my Pentium D 940 system. I attached a screen shot of Process Explorer - System Information. With the processAffinityMask set at 3 it was pretty much evenly divided between both cores. With the processAffinityMask set to 1 it was almost entirely running on core 0, but there was a very small portion running on core 1. With the processAffinityMask set to 2 it was almost entirely running on core 1, but there was a very small portion running on core 0. The times were pretty close on all three.
systemAffinityMask: 00000003
processAffinityMask: 00000003
5 2.7
6 4.0
7 5.6
8 7.4
9 9.5
10 11.7
11 14.1
12 16.8
13 19.5
14 22.3
15 25.3
16 28.4
17 31.5
18 34.7
19 37.9
20 41.1
21 44.3
22 47.5
23 50.7
21437ms
processAffinityMask: 00000001
5 2.7
6 4.0
7 5.6
8 7.4
9 9.5
10 11.7
11 14.1
12 16.8
13 19.5
14 22.3
15 25.3
16 28.4
17 31.5
18 34.7
19 37.9
20 41.1
21 44.3
22 47.5
23 50.7
21375ms
processAffinityMask: 00000002
5 2.7
6 4.0
7 5.6
8 7.4
9 9.5
10 11.7
11 14.1
12 16.8
13 19.5
14 22.3
15 25.3
16 28.4
17 31.5
18 34.7
19 37.9
20 41.1
21 44.3
22 47.5
23 50.7
21391ms
[attachment deleted by admin]
Thanks for testing. So it looks like Vista and XP distribute the work differently, and the choice of affinity mask has no significant effect on the performance of a single thread. After several hours of looking I still cannot find any way for a thread to determine which core it is running on, and without this capability I can see no way of knowing how the work is actually being distributed.
Quote: After several hours of looking I still cannot find any way for a thread to determine which core it is running on, and without this capability I can see no way of knowing how the work is actually being distributed.
Programmatically, I don't know either. In Process Explorer, in the System Information window that shows the CPU usage graphs for both cores, you can hover the mouse cursor over the graph and it tells you which process that portion of the graph is representing. It doesn't tell you anything about which thread though.
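For what it's worth, a sketch of one possible approach (not tested here): Vista adds GetCurrentProcessorNumber, and on older systems the initial APIC ID from CPUID can stand in for a core number, with the caveat that the APIC-ID-to-core mapping is an assumption that is typical on this era of hardware but not guaranteed:

; returns a per-core identifier in EAX: the initial APIC ID from
; CPUID leaf 1, EBX bits 31-24 (assumes CPUID is available)
align 16
GetApicId proc
    push ebx
    mov eax, 1
    cpuid
    mov eax, ebx
    shr eax, 24
    pop ebx
    ret
GetApicId endp

Calling this from inside each of the 1000000 loops and bumping a per-ID counter would give the distribution the post above was after.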