PIV timing quirks.

Started by hutch--, May 31, 2008, 07:26:11 AM


hutch--

With a recent set of algos and tests with benchmarks using Michael's timing macros, I was a little perplexed by the level of variation on the PIVs I use, so I tried a different test today to see if I could make sense of why such a reliable timing method still suffered from a level of variation that was not being seen on other hardware.

I have been working with PIVs since about 2001 and, with a lot of practice, I have found that they are very sensitive to leading and trailing opcode selection: effectively, the speed of a given set of instructions is influenced by what precedes it and what comes after it.

The test piece runs an empty proc with a selectable count of NOPs (db 90h) as padding AFTER the test code, to do what is reasonably normal after a potential stall: fill up the free cycles with other instructions. NOPs were chosen because they are reasonably well behaved in terms of throughput in multiple pipelines. The algos are tested with the same procedure but with the test code preceding the NOPs, and the timings have been adjusted by subtracting the second, empty timing from each of the algorithm timings.

With a low count of NOPs the test algos are reasonably predictable, but as the NOP count increases very substantial timing differences occur, and in many cases some of the algos time faster than the empty test piece even though they do a lot more work than it does.

The variation in the original set of tests using Michael's timing code on my PIVs in fact has nothing to do with the macro design; it just happens to be a characteristic of instruction scheduling on PIVs, and it is a phenomenon I have seen from my earliest 1.5 GHz PIV to the latest EMT64 that I built a year or so ago.

To test this out, if you are using a PIV, just vary the number of NOPs with the equate at the beginning and you will get some very strange numbers popping up when you run the tests.

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

sinsi

1. Does a Q6600 count as a PIV?
2. If so, do minus times count as strange numbers?
Light travels faster than sound, that's why some people seem bright until you hear them.

hutch--

sinsi,

It probably qualifies as a Q6600, but if you are getting mixed positive/negative numbers then it probably has similar characteristics to a PIV, and this means that the timing of any given sequence of instructions is dependent on its leading and trailing mnemonics. The interesting part is that some of the algos with negative results are actually faster than the empty algo shell that is used to clock the overhead.

The idea came from something that is well known on PIV cores: a drop-through jump is badly predicted the first time and often produces a stall, and what I was trying to do was determine whether the trailing NOPs would be absorbed by the hole left by the stall. It now seems, in part, that the algos most affected by the trailing NOPs did not leave as big a hole as those that had a conditional jump in them. The result before adding the NOPs appears to have been masked by the RET afterwards, which often produces a stall.

Jimg

I'm not quite sure what's going on here, but I certainly get weird results on an AMD.  All over the place, lots of negatives.

hutch--

Jim,

It was just a hunch from a practice I have had for some years of inserting NOPs in critical code to try to find holes left by stalls. Often you can plug a couple of NOPs in after an instruction and it will not make the algo any slower, but add one or two more and the timing drops by a large amount.

With the test piece I am guessing that the CALL / RET code was masking much of the difference in the algos but by adding the trailing nops it seems to have changed the results a lot, even though the nop count for each algo is the same.

If I have it right, the algos that are most affected by the trailing NOPs are the ones that are fastest by themselves, whereas the ones that absorb the NOPs have a trailing stall that was being masked by the RET.

Neo

If the timing macros use RDTSC and you're running on a multi-core computer, the problem is probably that the timestamp counters of the CPUs are out of sync.  When a thread changes to another core, the timing goes off.  One way to solve the problem is to assign processor affinities to your threads if that's acceptable for your program.  If not, you may need to do lots of funky stuff to determine the differences between the timestamp counters.  I had huge problems trying to benchmark my screensaver on multi-core CPUs until I realised this problem.

Hope this helps!  :bg

MichaelW

I don't have any way to test this, but I have doubts that the main thread in a normal app would change processors while it was running.
eschew obfuscation

Neo

Quote from: MichaelW on June 01, 2008, 09:30:36 AM
I don't have any way to test this, but I have doubts that the main thread in a normal app would change processors while it was running.
The main thread in a normal app frequently changes processors unless you set its affinity.  Perhaps it's not in every operating system or under every operating condition, and the thread scheduler is probably entered at least several times before a switch is likely to have occurred, but at least when I was testing my screensaver, the thread was switching CPUs, resulting in some negative and inflated results.  The way to check this without even writing any code is to go to the Task Manager under the Performance tab (if you're running Windows) and run a CPU-intensive, single-threaded program.  It tends to favour one CPU over the other, but the sum of the time will be 100% of one CPU, split between two CPUs.  I was testing on a dual-core AMD CPU running Windows Vista 64-bit Business, if that makes a difference.

The QueryThreadCycleTime function can be used to get cycle timing for a given thread, i.e. excluding time not spent in the thread, but the main downside of doing that instead of setting the processor affinity is that the system call could take a few thousand clock cycles every time you want to read the cycle time.  Plus, if you set the processor affinity, you'll mostly ensure that the thread stays in its CPU's cache, but there may be issues with interrupts taking up that CPU's time if you've got I/O going.

hutch--

Thanks for the info Neo, it sounds like the dual AMD 64 is doing something different if it's distributing a single thread across both cores. My young brother has a dual-core AMD, but sad to say he is leaving for Europe tomorrow, so I won't get a chance to get him to test anything for a couple of months.

NightWare

Quote from: hutch-- on May 31, 2008, 07:26:11 AM
With a recent set of algos and tests with benchmarks using Michael's timing macros, I was a little perplexed by the level of variation on the PIVs I use, so I tried a different test today to see if I could make sense of why such a reliable timing method still suffered from a level of variation that was not being seen on other hardware.
Hmm, with the P4 Intel introduced some new features: the automatic data prefetcher, a trace cache instead of a code cache, etc. So I suppose they made changes in the P4 to obtain the benefit of those new features, but theory and practice differ... (it could explain the erratic results).
Quote from: Neo on June 01, 2008, 10:13:36 AM
The main thread in a normal app frequently changes processors unless you set its affinity. Perhaps it's not in every operating system or under every operating condition, and the thread scheduler is probably entered at least several times before a switch is likely to have occurred, but at least when I was testing my screensaver, the thread was switching CPUs, resulting in some negative and inflated results. The way to check this without even writing any code is to go the Task Manager under the Performance tab (if you're running Windows) and run a CPU-intensive, single-threaded program. It tends to favour one CPU over the other, but the sum of the time will be 100% of one CPU, split between two CPUs. I was testing on a dual-core AMD CPU running Windows Vista 64-bit Business, if that makes a difference.
I don't think there are "frequent changes" in the sense you said. I think the processors "share" the work to be done (a task is cut in two parts and each processor does "his" part), then they go on to the next piece of work and repeat the process, etc. Here are the tests that make me think that (parallelism, with the automatic data prefetching of one thread benefiting the other)...
Results of the tests:

Test 1 (default hardware), done in 1041289 cycles
Test 2 (attempt core2 ADP usage), done in 944306 cycles
Test 3 (attempt core2 ADP usage + differents algos), done in 12 cycles
Test 4 (attempt core4 ADP usage), done in 11 cycles

Test 1 (default hardware), done in 5121560 cycles
Test 2 (attempt core2 ADP usage), done in 5182781 cycles
Test 3 (attempt core2 ADP usage + differents algos), done in 11 cycles
Test 4 (attempt core4 ADP usage), done in 11 cycles


Press ENTER to quit...

For this test, the first series is with 3 MB and the second with 6 MB (I have a 4 MB L2 cache on my Core 2). As you can see, it's possible to outperform Intel's hardware system with clever use of the cache (but not if the cache has to change...). I haven't tested it on a (single) P4, but theoretically it should also be faster there (because of the automatic data prefetcher)... and if it's also faster on a P4, it should also be faster on a P3 if you prefetch the data in the main program or in the thread...


[attachment deleted by admin]

Neo

Quote from: NightWare on June 01, 2008, 10:32:20 PM
I don't think there are "frequent changes" in the sense you said. I think the processors "share" the work to be done (a task is cut in two parts and each processor does "his" part), then they go on to the next piece of work and repeat the process, etc. Here are the tests that make me think that (parallelism, with the automatic data prefetching of one thread benefiting the other)...
I have absolutely no idea what you're talking about.  I was talking about within a single thread, meaning that at any point in time, it is only running on one CPU.  CPUs have no native concept of threads; it's the operating system that determines where to schedule a given thread.  The operating system (let alone the CPU) doesn't know what my program is trying to do over millisecond time frames, so an operating system trying to improve cache performance by scheduling the thread on different CPUs is infeasible.  Plus, in the vast majority of cases, having a single thread running on a different CPU each time means that cache performance is much worse than the thread running on the same CPU each time.

"frequently" is a relative term, but it was usually switching a few times a second, which is frequent enough to completely destroy some timing results.

GregL

Quote... it sounds like the dual AMD 64 is doing something different if it's distributing a single thread across both cores.

I am seeing the same behavior with my Pentium D and Vista, the single thread is being distributed across both cores. Attached is a screenshot of perfmon and p4timing (I really had to compress the image to get the file size down). My timings were all over the place too.



[attachment deleted by admin]

hutch--

Thanks for the graph Greg, it shows that the processing load is being shared across both cores, which is an interesting development in that the threaded model is not necessarily the best representation of how to utilise multicore hardware.

GregL

Yeah, it's interesting how each core kind of goes opposite the other.

I attached another screenshot from Process Explorer taken when p4timing was running, it also shows that it is running on both cores.



[attachment deleted by admin]

hutch--

Similar effect, but it does show that cross-core scheduling is occurring within a single thread, and this is probably being done at a hardware level. Won't these be fun with 32 cores? :bg