PIV timing quirks.

Started by hutch--, May 31, 2008, 07:26:11 AM


hutch--

Greg,

I just had a piece of genius: tweak the source to set the NOP count to zero if that will work, or to 1 if it won't, then run it and see whether the ratio from one core to another changes. The NOP padding does not address memory at all, so with no dependencies it may just get shoved off to another core.
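A minimal sketch of the kind of switch I mean (the NOPCOUNT equate and procedure name here are only illustrative, not the actual test source):


    NOPCOUNT equ 0          ; set to 0 or 1 and rebuild, then compare core usage

    align 16
    pad_test proc
        REPEAT NOPCOUNT
          nop               ; padding only - no memory access, no dependencies
        ENDM
        ret
    pad_test endp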

GregL

Setting the NOP count to 0 did change the ratio from one core to another. I attached the screen shot.


----------------------
empty procedure timing
----------------------
3156 empty procedure
-------------
positive pass
-------------
609 abs0 herge
656 abs1 evlncrn8
906 abs2 jimg 1
657 abs3 rockoon 1
657 abs4 rockoon 2
657 abs5 drizz
844 abs6 jj2007
703 abs7 hutch 1
703 abs8 hutch 2
797 abs9 Nightware 1
875 abs10 Nightware 2
656 abs11 jimg 2
-------------
negative pass
-------------
672 abs0 herge
641 abs1 evlncrn8
657 abs2 jimg 1
640 abs3 rockoon 1
656 abs4 rockoon 2
657 abs5 drizz
657 abs6 jj2007
672 abs7 hutch 1
657 abs8 hutch 2
688 abs9 Nightware 1
657 abs10 Nightware 2
641 abs11 jimg 2


[attachment deleted by admin]

hutch--

Thanks for testing the idea; it means the cores are sharing the single-thread load with no tricks.

MichaelW

I finally got an opportunity to run some related tests on a Core 2 Duo, under Windows XP SP3. In the code below the timed routine essentially uses a simulation of the birthday problem to determine the probabilities over a limited range. It runs in about 100s on my 500MHz P3, versus about 15s on the Core 2 Duo. IIRC the processor is 2.2GHz / 1066MHz / 4MB, and a modified version of my CPUID code returns:

GenuineIntel  Family 6  Model 15  Stepping 13
Intel Brand String: <ignored, did not have time to update the code for this>
Features: FPU TSC CX8 CMPXCHG16B CMOV CLFSH FXSR HTT MMX SSE SSE2 SSE3

Typical results:

systemAffinityMask: 00000003
processAffinityMask: 00000003

5       2.7
6       4.0
7       5.6
...

15625ms

processAffinityMask: 00000001

...

15625ms

processAffinityMask: 00000002

...

15593ms


The time for a process affinity mask of 10b was consistently lower than the other two. I did not have time to capture the traces for the CPU Usage History. For all of the tests, the CPU Usage in the left panel showed close to 50%. For the first test the traces showed the load divided between the cores roughly 85/15, and for the second and third tests 100/0 and 0/100.
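For reference, the quantity the inner loops estimate is the standard birthday-collision probability; this closed form is not part of the posted code, just the textbook expression the simulation approximates:

    P(n) = 1 - \prod_{k=1}^{n-1} \left(1 - \frac{k}{365}\right)

which gives roughly 2.7% at n = 5 and 50.7% at n = 23, matching the figures the program prints.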


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      probability         REAL8 0.0
      bdays               dd    24 dup(0)
      hProcess            dd    0
      processAffinityMask dd    0
      systemAffinityMask  dd    0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 16

; Simulate the birthday problem: for each group size in 5..23, run
; 1,000,000 trials of random birthdays and count how many trials contain
; at least one shared birthday, then print the result as a percentage.
calc:

    mov ebx, 5                          ; ebx = current group size
    .WHILE ebx < 24
        xor ebp, ebp                    ; ebp = trials with a shared birthday
        mov esi, 1000000                ; esi = trial counter
        .WHILE esi
            xor edi, edi
            .WHILE edi < ebx
                invoke nrandom, 365
                inc eax
                mov [bdays+edi*4], eax
                inc edi
            .ENDW
            xor ecx, ecx
            .WHILE ecx < ebx
                xor edx, edx
                .WHILE edx < ebx
                    .IF ecx != edx
                        mov eax, [bdays+ecx*4]
                        .IF [bdays+edx*4] == eax
                            inc ebp
                            jmp @F
                        .ENDIF
                    .ENDIF
                    inc edx
                .ENDW
                inc ecx
            .ENDW
          @@:
            dec esi
        .ENDW
        ; matches / 10000 = percentage of the 1,000,000 trials
        push ebp
        fild DWORD PTR [esp]
        push 10000
        fild DWORD PTR [esp]
        add esp, 8
        fdiv                            ; st(1) / st(0)
        fstp probability
        invoke crt_printf, chr$("%d%c%3.1f%c"), ebx, 9, probability, 10
        inc ebx
    .ENDW
    ret

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke GetCurrentProcess
    mov hProcess, eax

    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask

    print "systemAffinityMask: "
    print uhex$(systemAffinityMask),13,10
    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10

    invoke SetPriorityClass, hProcess, HIGH_PRIORITY_CLASS

    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    invoke SetProcessAffinityMask, hProcess, 1

    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask

    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10

    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    invoke SetProcessAffinityMask, hProcess, 2

    invoke GetProcessAffinityMask, hProcess, ADDR processAffinityMask,
                                   ADDR systemAffinityMask

    print "processAffinityMask: "
    print uhex$(processAffinityMask),13,10,13,10

    invoke nseed, 12345678
    invoke GetTickCount
    push eax
    call calc
    print chr$(13,10)
    invoke GetTickCount
    pop edx
    sub eax, edx
    print ustr$(eax),"ms",13,10,13,10

    invoke SetPriorityClass, hProcess, NORMAL_PRIORITY_CLASS

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Neo

Cool results. It looks like the timing discrepancy could be due to I/O interrupts (including timers) all being handled on core 0, and to the thread scheduler preferring core 0 (which may be because of the order in which the hardware grants locked access to memory). On core 1, the lower time would be due to not having to handle interrupts or switch cores.

MichaelW

Quote from Neo: "On core 1, the lower time would be due to not having to handle interrupts or switch cores."

That seems to me to be the most likely explanation. The main thing I was trying to determine was whether splitting a thread across two cores would slow it down. It does not seem to, but I'm not convinced that the thread is actually being run on both cores. I originally intended to have each of the 1,000,000 loops determine which core it was running on and increment a count that would be displayed after the loop terminated, but I ran out of time to implement this, or even to determine whether it would be reasonably possible.
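For what it's worth, one way this is often done on processors of this generation is to read the initial APIC ID from CPUID function 1 (bits 31:24 of EBX); on a dual-core part without HyperThreading the two cores normally report IDs 0 and 1. This is only a sketch of that idea, with made-up names, not code from the test program:


    .data
      coreHits dd 2 dup(0)          ; hypothetical per-core hit counters
    .code

    align 16
    WhichCore proc
        ; return the initial APIC ID of the executing logical processor in eax
        push ebx
        mov eax, 1
        cpuid                       ; overwrites eax, ebx, ecx, edx (ebx restored below)
        mov eax, ebx
        shr eax, 24                 ; EBX[31:24] = initial APIC ID
        and eax, 0FFh
        pop ebx
        ret
    WhichCore endp

    ; each pass of the 1,000,000 loop could then do something like
    ;     call WhichCore
    ;     inc DWORD PTR [coreHits+eax*4]   ; assumes the two cores report IDs 0 and 1

The thread could migrate between the CPUID and the increment, but as a rough statistic over a million passes that should not matter much.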

GregL

MichaelW,

I ran your program on my Pentium D 940 system. I attached a screen shot of Process Explorer - System Information. With the processAffinityMask set to 3 the load was pretty much evenly divided between both cores. With the processAffinityMask set to 1 it was almost entirely running on core 0, with a very small portion on core 1. With the processAffinityMask set to 2 it was almost entirely running on core 1, with a very small portion on core 0. The times were pretty close on all three.


systemAffinityMask: 00000003
processAffinityMask: 00000003

5       2.7
6       4.0
7       5.6
8       7.4
9       9.5
10      11.7
11      14.1
12      16.8
13      19.5
14      22.3
15      25.3
16      28.4
17      31.5
18      34.7
19      37.9
20      41.1
21      44.3
22      47.5
23      50.7

21437ms

processAffinityMask: 00000001

5       2.7
6       4.0
7       5.6
8       7.4
9       9.5
10      11.7
11      14.1
12      16.8
13      19.5
14      22.3
15      25.3
16      28.4
17      31.5
18      34.7
19      37.9
20      41.1
21      44.3
22      47.5
23      50.7

21375ms

processAffinityMask: 00000002

5       2.7
6       4.0
7       5.6
8       7.4
9       9.5
10      11.7
11      14.1
12      16.8
13      19.5
14      22.3
15      25.3
16      28.4
17      31.5
18      34.7
19      37.9
20      41.1
21      44.3
22      47.5
23      50.7

21391ms


[attachment deleted by admin]

MichaelW

Thanks for testing. So it looks like Vista and XP distribute the work differently, and the choice of affinity mask has no significant effect on the performance of a single thread. After several hours of looking I still cannot find any way for a thread to determine which core it is running on, and without this capability I can see no way of knowing how the work is actually being distributed.

GregL

Quote from MichaelW: "After several hours of looking I still cannot find any way for a thread to determine which core it is running on, and without this capability I can see no way of knowing how the work is actually being distributed."

Programmatically, I don't know either. In Process Explorer, in the System Information window that shows the CPU usage graphs for both cores, you can hover the mouse cursor over a graph and it tells you which process that portion of the graph represents. It doesn't tell you anything about which thread, though.
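For newer systems there is an API for this, though it won't help on XP: kernel32 on Windows Vista and later exports GetCurrentProcessorNumber, which returns the number of the processor the calling thread is running on. A minimal sketch, assuming the include files prototype it (otherwise it can be reached with GetProcAddress):


    invoke GetCurrentProcessorNumber
    ; eax = zero-based number of the core this thread is executing on
    print ustr$(eax)," = current processor",13,10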