The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: hutch-- on May 31, 2008, 07:26:11 AM

Title: PIV timing quirks.
Post by: hutch-- on May 31, 2008, 07:26:11 AM
With a recent set of algos and tests with benchmarks using Michael's timing macros I was a little perplexed at the level of variation on the PIVs I use, so I tried a different test today to see if I could make sense of why such a reliable timing method still suffered from a level of variation that was not being seen on other hardware.

I have been working with PIVs since about 2001 and with a lot of practice I have found that they are very sensitive to leading and trailing opcode selection; effectively the speed of a given set of instructions will be influenced by what precedes it and what comes after it.

The test piece runs an empty proc with a selectable count of NOPs (db 90h) as padding AFTER the test code, to do what is reasonably normal after a potential stall: fill up the free cycles with other instructions. NOPs were chosen because they are reasonably well behaved in terms of throughput in multiple pipelines. The algos are tested with the same procedure but with the test code preceding the NOPs, and the timings have been adjusted by subtracting the second, empty timing from each of the algorithm timings.

With a low count of NOPs the test algos are reasonably predictable, but as the NOP count increases very substantial timing differences occur, and in many cases some of the algos time faster than the empty test piece even though they do a lot more work.

The variation in the original set of tests using Michael's timing code on my PIVs in fact has nothing to do with the macro design; it just happens to be a characteristic of instruction scheduling in PIVs, and it is a phenomenon I have seen from my earliest 1.5 gig PIV to the latest EMT64 that I built a year or so ago.

To test this out if you are using a PIV, just vary the number of NOPs with the equate at the beginning and you will get some very strange numbers popping up from running the tests.
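Roughly, the test pieces look like this (a minimal sketch only; the equate and label names here are illustrative, not the ones from the attachment):

NOPCOUNT equ 16                 ; vary this equate and rebuild

align 16
empty_test proc
    ; in the non-empty versions the algo being timed goes here
    REPEAT NOPCOUNT
      db 90h                    ; trailing NOP padding after the test code
    ENDM
    ret
empty_test endp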

[attachment deleted by admin]
Title: Re: PIV timing quirks.
Post by: sinsi on May 31, 2008, 08:05:45 AM
1. Does a Q6600 count as a PIV?
2. If so, do minus times count as strange numbers?
Title: Re: PIV timing quirks.
Post by: hutch-- on May 31, 2008, 09:26:44 AM
sinsi,

It probably qualifies as a Q6600, but if you are getting mixed positive/negative numbers then it probably has similar characteristics to a PIV, and this means that any given sequence of instructions is dependent on its leading and trailing mnemonics. The interesting part is that some of the algos that have negative results are actually faster than the empty algo shell that is used to clock the overhead.

The idea came from something that is well known on PIV cores, that a drop-through jump is badly predicted the first time and often produces a stall, and what I was trying to do was determine if the trailing nops would be absorbed by the hole left by the stall. Now it seems in part that the algos most affected by the trailing nops did not leave as big a hole as those that had a conditional jump in them. The result before adding the nops appears to have been masked by the RET afterwards, which often produces a stall.
Title: Re: PIV timing quirks.
Post by: Jimg on May 31, 2008, 02:56:54 PM
I'm not quite sure what's going on here, but I certainly get weird results on an AMD.  All over the place, lots of negatives.
Title: Re: PIV timing quirks.
Post by: hutch-- on May 31, 2008, 03:07:58 PM
Jim,

It was just a hunch from a practice I have had for some years of inserting nops in critical code to try to find holes left by stalls. Often you can plug a couple of nops after an instruction and it will not make the algo any slower, but add one or two more and the timing drops by a large amount.

With the test piece I am guessing that the CALL / RET code was masking much of the difference in the algos but by adding the trailing nops it seems to have changed the results a lot, even though the nop count for each algo is the same.

If I have it right, the algos that are the most affected by the trailing nops are the ones that are fastest by themselves, whereas the ones that absorb the nops have a trailing stall that was being masked by the RET.
Title: Re: PIV timing quirks.
Post by: Neo on June 01, 2008, 05:09:58 AM
If the timing macros use RDTSC and you're running on a multi-core computer, the problem is probably that the timestamp counters of the CPUs are out of sync.  When a thread changes to another core, the timing goes off.  One way to solve the problem is to assign processor affinities to your threads if that's acceptable for your program.  If not, you may need to do lots of funky stuff to determine the differences between the timestamp counters.  I had huge problems trying to benchmark my screensaver on multi-core CPUs until I realised this problem.
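For example, something like this at the start of the timing code keeps every RDTSC read on the same core (a minimal sketch with the masm32 macros; error handling omitted):

    include \masm32\include\masm32rt.inc
    .586                                    ; make sure RDTSC is enabled

    .data
      t1 dq 0

    .code
start:
    invoke GetCurrentThread                 ; pseudo-handle for the current thread
    invoke SetThreadAffinityMask, eax, 1    ; keep this thread on CPU 0 only

    rdtsc                                   ; first timestamp read
    mov dword ptr t1, eax
    mov dword ptr t1+4, edx

    ; ----- the code to be timed goes here -----

    rdtsc                                   ; second read, now from the same core
    sub eax, dword ptr t1
    sbb edx, dword ptr t1+4

    print ustr$(eax)," cycles (low dword)",13,10
    inkey "Press any key to exit..."
    exit
end start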

Hope this helps!  :bg
Title: Re: PIV timing quirks.
Post by: MichaelW on June 01, 2008, 09:30:36 AM
I don't have any way to test this, but I have doubts that the main thread in a normal app would change processors while it was running.
Title: Re: PIV timing quirks.
Post by: Neo on June 01, 2008, 10:13:36 AM
Quote from: MichaelW on June 01, 2008, 09:30:36 AM
I don't have any way to test this, but I have doubts that the main thread in a normal app would change processors while it was running.
The main thread in a normal app frequently changes processors unless you set its affinity. Perhaps it's not in every operating system or under every operating condition, and the thread scheduler is probably entered at least several times before a switch is likely to have occurred, but at least when I was testing my screensaver, the thread was switching CPUs, resulting in some negative and inflated results. The way to check this without even writing any code is to go to the Task Manager's Performance tab (if you're running Windows) and run a CPU-intensive, single-threaded program. It tends to favour one CPU over the other, but the sum of the time will be 100% of one CPU, split between two CPUs. I was testing on a dual-core AMD CPU running Windows Vista 64-bit Business, if that makes a difference.

The QueryThreadCycleTime function can be used to get cycle timing for a given thread, i.e. excluding time not spent in the thread, but the main downside of doing that instead of setting the processor affinity is that the system call could take a few thousand clock cycles every time you want to read the cycle time. Plus, if you set the processor affinity, you'll mostly ensure that the thread stays in its CPU's cache, but there may be issues with interrupts taking up that CPU's time if you've got I/O going.
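A rough sketch of the call, loaded dynamically since the function only exists on Vista and later (the variable names are just for illustration):

    .data
      szKernel32   db "kernel32.dll",0
      szQTCT       db "QueryThreadCycleTime",0
      threadCycles dq 0
      pQTCT        dd 0
    .code
    invoke GetModuleHandle, ADDR szKernel32
    invoke GetProcAddress, eax, ADDR szQTCT
    mov pQTCT, eax                      ; NULL on pre-Vista systems

    invoke GetCurrentThread
    push OFFSET threadCycles            ; PULONG64 CycleTime
    push eax                            ; HANDLE ThreadHandle
    call pQTCT                          ; BOOL QueryThreadCycleTime(hThread, pCycles)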
Title: Re: PIV timing quirks.
Post by: hutch-- on June 01, 2008, 11:36:57 AM
Thanks for the info Neo, it sounds like the dual AMD 64 is doing something different if it's distributing a single thread across both cores. My young brother has a dual core AMD but sad to say he is leaving for Europe tomorrow so I won't get a chance to get him to test anything for a couple of months.
Title: Re: PIV timing quirks.
Post by: NightWare on June 01, 2008, 10:32:20 PM
Quote from: hutch-- on May 31, 2008, 07:26:11 AM
With a recent set of algos and tests with benchmarks using Michael's timing macros I was a little perplexed at the level of variation on the PIVs I use, so I tried a different test today to see if I could make sense of why such a reliable timing method still suffered from a level of variation that was not being seen on other hardware.
Hmm, with the P4 Intel introduced some new stuff: the automatic data prefetcher, a trace cache instead of a code cache, etc. So I suppose they made changes in the P4 to obtain the benefit of that new stuff, but theory and practice... (it could explain the erratic results).
Quote from: Neo on June 01, 2008, 10:13:36 AM
The main thread in a normal app frequently changes processors unless you set its affinity. Perhaps it's not in every operating system or under every operating condition, and the thread scheduler is probably entered at least several times before a switch is likely to have occurred, but at least when I was testing my screensaver, the thread was switching CPUs, resulting in some negative and inflated results. The way to check this without even writing any code is to go the Task Manager under the Performance tab (if you're running Windows) and run a CPU-intensive, single-threaded program. It tends to favour one CPU over the other, but the sum of the time will be 100% of one CPU, split between two CPUs. I was testing on a dual-core AMD CPU running Windows Vista 64-bit Business, if that makes a difference.
I don't think there are "frequent changes" in the sense you said; I think the processors "share" the work to be done (cut a task in two parts, and each processor does "its" part), then go on to the next piece of work and repeat the process, etc. Here is the test that makes me think that (parallelism, with the automatic data prefetching of one thread benefiting the other)...
Results of the tests :

Test 1 (default hardware), done in 1041289 cycles
Test 2 (attempt core2 ADP usage), done in 944306 cycles
Test 3 (attempt core2 ADP usage + differents algos), done in 12 cycles
Test 4 (attempt core4 ADP usage), done in 11 cycles

Test 1 (default hardware), done in 5121560 cycles
Test 2 (attempt core2 ADP usage), done in 5182781 cycles
Test 3 (attempt core2 ADP usage + differents algos), done in 11 cycles
Test 4 (attempt core4 ADP usage), done in 11 cycles


Press ENTER to quit...

For this test, the first series is with 3 MB and the second with 6 MB (I have a 4 MB L2 cache on my Core 2). As you can see, it's possible to outperform Intel's hardware system with clever use of the cache (but not if the cache has to change...). I haven't tested it on a (single) P4, but theoretically it should also be faster (because of the automatic data prefetcher). If it's also faster on a P4, it should also be faster on a P3 if you prefetch data in the main program or in the thread...


[attachment deleted by admin]
Title: Re: PIV timing quirks.
Post by: Neo on June 01, 2008, 10:50:45 PM
Quote from: NightWare on June 01, 2008, 10:32:20 PM
I don't think there are "frequent changes" in the sense you said; I think the processors "share" the work to be done (cut a task in two parts, and each processor does "its" part), then go on to the next piece of work and repeat the process, etc. Here is the test that makes me think that (parallelism, with the automatic data prefetching of one thread benefiting the other)...
I have absolutely no idea what you're talking about.  I was talking about within a single thread, meaning that at any point in time, it is only running on one CPU.  CPUs have no native concept of threads; it's the operating system that determines where to schedule a given thread.  The operating system (let alone the CPU) doesn't know what my program is trying to do over millisecond time frames, so an operating system trying to improve cache performance by scheduling the thread on different CPUs is infeasible.  Plus, in the vast majority of cases, having a single thread running on a different CPU each time means that cache performance is much worse than the thread running on the same CPU each time.

"frequently" is a relative term, but it was usually switching a few times a second, which is frequent enough to completely destroy some timing results.
Title: Re: PIV timing quirks.
Post by: GregL on June 01, 2008, 11:32:31 PM
Quote... it sounds like the dual AMD 64 is doing something different if it's distributing a single thread across both cores.

I am seeing the same behavior with my Pentium D and Vista: the single thread is being distributed across both cores. Attached is a screenshot of perfmon and p4timing (I really had to compress the image to get the file size down). My timings were all over the place too.



[attachment deleted by admin]
Title: Re: PIV timing quirks.
Post by: hutch-- on June 02, 2008, 12:47:29 AM
Thanks for the graph Greg, it shows that the processing load is being shared across both cores, which is an interesting development in that the threaded model is not necessarily the best representation of how to utilise multicore hardware.
Title: Re: PIV timing quirks.
Post by: GregL on June 02, 2008, 01:48:19 AM
Yeah, it's interesting how each core kind of goes opposite the other.

I attached another screenshot from Process Explorer taken while p4timing was running; it also shows that it is running on both cores.



[attachment deleted by admin]
Title: Re: PIV timing quirks.
Post by: hutch-- on June 02, 2008, 02:38:15 AM
Similar effect, but it does show that cross-core scheduling is occurring within a single thread, and this is probably being done at a hardware level. Won't these be fun with 32 cores?  :bg
Title: Re: PIV timing quirks.
Post by: hutch-- on June 02, 2008, 04:06:01 AM
Greg,

I just had a piece of genius: tweak the source to set the nop count to zero if it will work, or 1 if it won't, then run it and see if the ratio from one core to another changes. The nop padding does not address memory at all and it may just get shoved off to another core as there are no dependencies.
Title: Re: PIV timing quirks.
Post by: GregL on June 02, 2008, 05:20:39 AM
Setting the NOP count to 0 did change the ratio from one core to another. I attached the screen shot.


----------------------
empty procedure timing
----------------------
3156 empty procedure
-------------
positive pass
-------------
609 abs0 herge
656 abs1 evlncrn8
906 abs2 jimg 1
657 abs3 rockoon 1
657 abs4 rockoon 2
657 abs5 drizz
844 abs6 jj2007
703 abs7 hutch 1
703 abs8 hutch 2
797 abs9 Nightware 1
875 abs10 Nightware 2
656 abs11 jimg 2
-------------
negative pass
-------------
672 abs0 herge
641 abs1 evlncrn8
657 abs2 jimg 1
640 abs3 rockoon 1
656 abs4 rockoon 2
657 abs5 drizz
657 abs6 jj2007
672 abs7 hutch 1
657 abs8 hutch 2
688 abs9 Nightware 1
657 abs10 Nightware 2
641 abs11 jimg 2




[attachment deleted by admin]
Title: Re: PIV timing quirks.
Post by: hutch-- on June 02, 2008, 08:06:47 AM
Thanks for testing the idea, it means the cores are sharing the single thread load with no tricks.
Title: Re: PIV timing quirks.
Post by: MichaelW on June 04, 2008, 11:39:56 PM
I finally got an opportunity to run some related tests on a Core 2 Duo, under Windows XP SP3. In the code below the timed routine essentially uses a simulation of the birthday problem to determine the probabilities over a limited range. It runs in about 100s on my 500MHz P3, versus about 15s on the Core 2 Duo. IIRC the processor is 2.2GHz / 1066MHz / 4MB, and a modified version of my CPUID code returns:

GenuineIntel  Family 6  Model 15  Stepping 13
Intel Brand String: <ignored, did not have time to update the code for this>
Features: FPU TSC CX8 CMPXCHG16B CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Typical results:

systemAffinityMask: 00000003
processAffinityMask: 00000003

5       2.7
6       4.0
7       5.6
...

15625ms

processAffinityMask: 00000001

...

15625ms

processAffinityMask: 00000002

...

15593ms


The time for a process affinity mask of 10b was consistently lower than the other two. I did not have time to capture the traces for the CPU Usage History. For all of the tests, the CPU Usage in the left panel showed close to 50%. For the first test the traces showed the load divided between the cores roughly 85/15, and for the second and third tests 100/0 and 0/100.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      probability         REAL8 0.0
      bdays               dd    24 dup(0)
      hProcess            dd    0
      processAffinityMask dd    0
      systemAffinityMask  dd    0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

calc:
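    ; birthday-problem simulation: for group sizes 5 to 23, run 1,000,000
    ; random trials each and record the fraction with a shared birthday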

    mov ebx, 5
    .WHILE ebx < 24
        xor ebp, ebp
        mov esi, 1000000
        .WHILE esi
            xor edi, edi
            .WHILE edi < ebx
                invoke nrandom, 365
                inc eax
                mov [bdays+edi*4], eax
                inc edi
            .ENDW
            xor ecx, ecx
            .WHILE ecx < ebx
                xor edx, edx
                .WHILE edx < ebx
                    .IF ecx != edx
                        mov eax, [bdays+ecx*4]
                        .IF [bdays+edx*4] == eax
                            inc ebp
                            jmp @F
                        .ENDIF
                    .ENDIF
                    inc edx
                .ENDW
                inc ecx
            .ENDW
          @@:
            dec esi
        .ENDW
        push ebp
        fild DWORD PTR [esp]
        push 10000
        fild DWORD PTR [esp]
        add esp, 8
        fdiv
        fstp probability
        invoke crt_printf, chr$("%d%c%3.1f%c"), ebx, 9, probability, 10
        inc ebx
    .ENDW
    ret

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke GetCurrentProcess
    mov hProcess, eax

    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask

    print "systemAffinityMask: "
    print uhex$(systemAffinityMask),13,10
    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10

    invoke SetPriorityClass, hProcess, HIGH_PRIORITY_CLASS

    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    invoke SetProcessAffinityMask, hProcess, 1

    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask

    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10

    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    invoke SetProcessAffinityMask, hProcess, 2

    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask

    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10

    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    invoke SetPriorityClass, hProcess, NORMAL_PRIORITY_CLASS

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Title: Re: PIV timing quirks.
Post by: Neo on June 05, 2008, 12:09:01 AM
Cool results.  It looks like the timing discrepancy could be due to I/O interrupts (including timers) all being handled on core 0 and the thread scheduler preferring core 0 (which may be because of the order in which hardware gives locked access to memory).  On core 1, the lower time would be due to not having to handle interrupts or switch cores.
Title: Re: PIV timing quirks.
Post by: MichaelW on June 05, 2008, 12:57:28 AM
QuoteOn core 1, the lower time would be due to not having to handle interrupts or switch cores.

That seems to me to be the most likely explanation. The main thing I was trying to determine was whether splitting a thread across two cores would slow it down. It would not seem to, but I'm not convinced that the thread is being run on both cores. I originally intended to have each of the 1000000 loops determine which core it was running on and increment a count that would be displayed after the loop terminated, but I ran out of time to implement this or even to determine whether it would be reasonably possible.
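If it is reasonably possible, one candidate would be the initial APIC ID that CPUID function 1 returns in bits 24-31 of EBX, assuming the IDs on a two-core system are simply 0 and 1. An untested sketch:

    .data
      coreCount dd 2 dup(0)           ; hits recorded on core 0 and core 1
    .code
align 16
count_core proc
    ; record which logical processor this call executed on; note that
    ; CPUID also overwrites EAX, ECX and EDX, so preserve them as well
    ; if they are live at the call site
    push ebx
    mov eax, 1
    cpuid                             ; EBX[31:24] = initial APIC ID
    shr ebx, 24
    and ebx, 1                        ; two logical processors assumed
    inc DWORD PTR [coreCount+ebx*4]
    pop ebx
    ret
count_core endp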
Title: Re: PIV timing quirks.
Post by: GregL on June 05, 2008, 08:01:53 PM
MichaelW,

I ran your program on my Pentium D 940 system. I attached a screen shot of Process Explorer - System Information. With the processAffinityMask set at 3 it was pretty much evenly divided between both cores. With the processAffinityMask set to 1 it was almost entirely running on core 0, but there was a very small portion running on core 1. With the processAffinityMask set to 2 it was almost entirely running on core 1, but there was a very small portion running on core 0. The times were pretty close on all three.


systemAffinityMask: 00000003
processAffinityMask: 00000003

5       2.7
6       4.0
7       5.6
8       7.4
9       9.5
10      11.7
11      14.1
12      16.8
13      19.5
14      22.3
15      25.3
16      28.4
17      31.5
18      34.7
19      37.9
20      41.1
21      44.3
22      47.5
23      50.7

21437ms

processAffinityMask: 00000001

5       2.7
6       4.0
7       5.6
8       7.4
9       9.5
10      11.7
11      14.1
12      16.8
13      19.5
14      22.3
15      25.3
16      28.4
17      31.5
18      34.7
19      37.9
20      41.1
21      44.3
22      47.5
23      50.7

21375ms

processAffinityMask: 00000002

5       2.7
6       4.0
7       5.6
8       7.4
9       9.5
10      11.7
11      14.1
12      16.8
13      19.5
14      22.3
15      25.3
16      28.4
17      31.5
18      34.7
19      37.9
20      41.1
21      44.3
22      47.5
23      50.7

21391ms






[attachment deleted by admin]
Title: Re: PIV timing quirks.
Post by: MichaelW on June 05, 2008, 10:35:42 PM
Thanks for testing. So it looks like Vista and XP distribute the work differently, and the choice of affinity mask has no significant effect on the performance of a single thread. After several hours of looking I still cannot find any way for a thread to determine which core it is running on, and without this capability I can see no way of knowing how the work is actually being distributed.
Title: Re: PIV timing quirks.
Post by: GregL on June 05, 2008, 11:12:06 PM
QuoteAfter several hours of looking I still cannot find any way for a thread to determine which core it is running on, and without this capability I can see no way of knowing how the work is actually being distributed.

Programmatically, I don't know either. In Process Explorer, in the System Information window that shows the CPU usage graphs for both cores, you can hover the mouse cursor over the graph and it tells you which process that portion of the graph represents. It doesn't tell you anything about which thread though.