Why does my CPU behave like this??

Started by houyunqing, October 25, 2008, 06:57:00 AM

houyunqing

My processor is an Intel Core(TM) Duo for Centrino, 1.67 GHz.
It has 3 ALUs in one core, so I'm doing a test to see how well it executes instructions in parallel.

Below, the macro testmac is the set of instructions to be executed in parallel, and testmac1000 in the middle of the code is another macro containing 1000 copies of testmac.

When I execute the following code with the process at realtime priority, the time it takes to finish the loop varies between 0.203 and 0.219 seconds (minimum 1.13 cycles per testmac).
The strange thing is, when I remove instruction 3, the time varies between 0.187 and 0.204 seconds (minimum 1.04 cycles per testmac),
and when I remove instruction 2 as well, the time varies between 0.172 and 0.188 seconds (minimum 0.957 cycles per testmac).

Why should there be a difference?
The core has 4 decoders, so decoding shouldn't be the factor causing the extra latency, right?
It can retire up to 4 instructions per cycle and my code only needs 3 per cycle, so retirement can't be the problem either, right?
The maximum size of the loop (with 3 instructions present) is 27K and the minimum is 9K, so it's no larger than the cache; this one I'm sure is not the problem.
So what is causing the extra latency? Is it due to some instruction/decoding caching mechanism?

testmac macro
   add eax, 1    ;instruction 1
   add ebx, 1    ;instruction 2
   add edx, 1    ;instruction 3
endm
   mov      ecx, 100000
   mov      eax, 0
   mov      ebx, 0
   mov      edx, 0
align 16
   @@loop:
      testmac1000  ; this is just a macro containing 1000 testmac
      testmac1000
      testmac1000
      sub   ecx, 1
   jnz      @@loop
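
The post does not show testmac1000 itself. A minimal sketch of how such a container macro might be written, assuming MASM's REPEAT directive (this is a guess at the definition, not the poster's actual code):

```asm
; Hypothetical definition; the original post omits it.
; Expands testmac 1000 times back to back, with no loop overhead in between.
testmac1000 macro
    REPEAT 1000
        testmac
    ENDM
endm
```

With all three adds present, each testmac assembles to 9 bytes (three 3-byte add reg, imm8 encodings), so the three testmac1000 expansions come to 3 x 1000 x 9 = 27,000 bytes, which matches the 27K figure in the post. Likewise 0.203 s x 1.67 GHz / (100000 x 3000) comes to about 1.13 cycles per testmac, the minimum quoted for the three-instruction case.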

dsouza123

There are many things that affect timings:
the method of timing (what code is involved),
the number of iterations (try a billion).

On a purely instruction level, remember that in addition to affecting
the independent registers involved, each instruction also affects (as side effects)
the shared flags AF CF OF PF SF ZF, so the instructions aren't truly
independent, and the order becomes important for the eventual correct value
of the flags.

Internally there can be register renaming, issues of fitting in the 16 byte
(32 byte for some newer CPUs) internal CPU instruction cache/fetch buffer (L0?),
and decoding of the x86 instructions into micro-ops (uops).

Try recoding using SSE2, which operates on four 32-bit integer values in parallel in one instruction.
Try the paddd SSE2 instruction: flags aren't affected and overflow results
in rollover, so max value + 1 becomes 0.
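
As a rough sketch of the paddd idea, assuming an assembler that accepts SSE2 mnemonics (older ML versions may not) and with register choices that are purely illustrative:

```asm
    ; Assumes .686 and .xmm are in effect.
    ; Build a vector of four 32-bit ones in xmm1 without touching memory.
    mov     eax, 1
    movd    xmm1, eax           ; xmm1 = [1, 0, 0, 0]
    pshufd  xmm1, xmm1, 0       ; broadcast lane 0: xmm1 = [1, 1, 1, 1]
    pxor    xmm0, xmm0          ; xmm0 = four 32-bit counters, all zero

    paddd   xmm0, xmm1          ; add 1 to all four counters; EFLAGS is not modified
```

A single paddd here stands in for four separate add reg, 1 instructions, and each 32-bit lane wraps independently from 0FFFFFFFFh to 0.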

For ALU instructions, what about substituting inc reg for add reg, 1 and dec reg for sub reg, 1?
With those, the carry flag isn't affected.

As for variations, Windows has another level of priorities it imposes on applications
so that low-priority processes don't completely starve for execution cycles. Even
with yours at realtime priority, your app sometimes gets bumped for a lower-priority
process that temporarily got a priority boost from the OS so it wouldn't starve.

There are other things that affect timings, but you are insightful to look at minimum, maximum and range.
--------------------------------------------
For real insight into the CPU (current generations, both manufacturers), study this excellent article:
http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719

houyunqing

Quote from: dsouza123 on October 25, 2008, 12:25:25 PM
hmm...
add overwrites all flags, so it's not dependent on any previous flags writes, right?
Register renaming in my code is impossible. Each time a register is used it's dependent on its previous value
and...haha, did i use inc/dec anywhere?

so for the priority thing, i have two cores and my cpu usage never goes higher than 55%, in this case would my realtime process still be paused? I guess that's unlikely...

when my moods turn better i'll sit down and finish reading that article... i really know nothing about AMD's architectures...

thanks for your help!

dsouza123

Even though add overwrites the flags, they still have to have the correct values
in the correct serial order of execution. Otherwise, if an instruction that acts on
the value of one or more flags were present between the adds, the execution would be uncertain.

At the very least, the last add in the group and its flag values would have to be retired last.

The issues are parallelism and dependencies: though the registers are independent,
the flags are not, and their values for serial execution must remain correct.
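
To illustrate the point with a hypothetical variation (not part of the original test), consider a flag-reading instruction placed between the adds:

```asm
    add  eax, 1     ; writes CF, OF, SF, ZF, AF, PF
    adc  ebx, 0     ; reads CF from the add above, so it must see that add's result
    add  edx, 1     ; overwrites all six flags again
    jz   wrapped    ; tests ZF from the last add only
    ; ... fall-through path ...
wrapped:            ; hypothetical label, for illustration only
```

In the original testmac nothing reads the flags between the adds; the only flag consumer in the loop is the jnz, which tests ZF from sub ecx, 1.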

The inc and dec were suggestions for alternatives to add and sub to see what the effect
would be by substituting them in, not that you used them.

Try reading   http://www.flounder.com/affinity.htm
it has some additional information about priorities and starvation, see Balance Set Manager.

On the whole, my point was that there are so many variables below the surface
(OS priorities, other processes, CPU uops, scheduling, memory cache misses, cache line size, etc.)
that results depend on all these factors, and they can't be abstracted away.

So the only thing that works is testing different code variations on a particular OS, CPU and RAM,
and checking fastest, slowest and average execution timings.
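
One way to take such timings is sketched below with the rdtsc instruction. The use of esi and edi as scratch registers is an assumption for illustration, and this is not the timers.asm implementation used later in the thread:

```asm
    ; Assumes .586 or later so that cpuid and rdtsc assemble.
    xor     eax, eax
    cpuid                   ; serialize: earlier instructions complete first
    rdtsc                   ; edx:eax = time-stamp counter
    mov     esi, eax        ; keep the start count in esi (low) and edi (high)
    mov     edi, edx

    ; ... code under test goes here (must preserve esi and edi) ...

    xor     eax, eax
    cpuid                   ; serialize again before the second read
    rdtsc
    sub     eax, esi        ; edx:eax = elapsed count (end minus start)
    sbb     edx, edi
```

Note that cpuid overwrites eax, ebx, ecx and edx, so the registers used by the code under test have to be initialized after the first rdtsc. Repeating the measurement and recording the minimum, maximum and average gives the range discussed above.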

Changing one small part of the puzzle will affect the outcome of the whole.

By the way, welcome to the forum!
A thirst for understanding is commendable.

dsouza123

Rereading your code, the size, 27KB down to 9KB, is a substantial chunk of the L1 instruction cache.
The OS has to swap between the various threads, so you will get plenty of cache misses, which
affect timings. Even with a dual core, multiple threads will contend for space in the caches;
after all, even though L1 isn't shared between cores, it is shared among threads.

MichaelW

The processors have long since become so complex that it's difficult to isolate a particular behavior and measure it effectively. I currently have only a P3 to test this on, and for a P3, according to Agner Fog's Optimization Manuals, an add reg, immed generates 1 uop that can go to either port 0 or 1, whichever is vacant first.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000

    FOR N, <10,100,1000,10000>

      print "N = "
      print ustr$(N),":",13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
        REPEAT N
          add eax, 1
        ENDM
      counter_end
      print ustr$(eax),13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
        REPEAT N
          add eax, 1
          add ebx, 1
        ENDM
      counter_end
      print ustr$(eax),13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
        REPEAT N
          add eax, 1
          add ebx, 1
          add ecx, 1
        ENDM
      counter_end
      print ustr$(eax),13,10

      counter_begin 1000, HIGH_PRIORITY_CLASS
        REPEAT N
          add eax, 1
          add ebx, 1
          add ecx, 1
          add edx, 1
        ENDM
      counter_end
      print ustr$(eax),13,10,13,10

    ENDM

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Typical results on my P3:

N = 10:
5
5
10
15

N = 100:
95
97
146
206

N = 1000:
1006
1253
1554
2384

N = 10000:
15824
30776
46700
61755


It looks to me like the processor is using both execution units effectively. I think at the higher repeat counts the results are being increasingly dominated by code-size effects (the test instructions are 3 bytes each).

woonsan

Execute the code without an operating system (without a preemptive scheduler).

houyunqing

Quote from: Stephanos on October 26, 2008, 01:32:11 AM
Execute the code without an operating system (without a preemptive scheduler).
ah... if only i knew how to write a programme that runs without OS...oh, maybe i can use DOS to do it?

woonsan

Quote from: houyunqing on October 26, 2008, 02:32:38 AM
ah... if only i knew how to write a programme that runs without OS...oh, maybe i can use DOS to do it?

Yes, that is a good idea.