Two Instructions in One Atomic Operation

Started by Neo, July 12, 2009, 02:19:29 AM


MichaelW

Quote from: Neo on July 12, 2009, 07:15:59 PM
Also, when you're doing 1,000,000,000 iterations, the effect of serialization is negligible. A few clock cycles out of billions doesn't matter, so I have no idea why you're pushing it so blindly.

You appear to be missing the point of serialization. Before each read of the TSC, serialization ensures that any pending ops have finished. Without serialization you can have instructions overlapping the TSC read, potentially introducing a relatively large error per iteration, depending on the overlapping instructions.

In my experience, executing under Windows, it's not possible to get consistent cycle counts in a single pass through a block of code. The counter2 macros were an attempt to minimize the number of loops, and while it is possible to get consistent results on most processors for a relatively small number of loops, say ~10, with a single loop there is a large amount of scatter.

eschew obfuscation

Neo

Quote from: MichaelW on July 12, 2009, 09:06:50 PM
You appear to be missing the point of serialization. Before each read of the TSC, serialization ensures that any pending ops have finished.
Serialization has its place, but likewise, you seem to be missing the point of not serializing. Suppose you were to time a single instruction, which it sounds like you want to do. Serializing means that you measure the latency of that instruction; not serializing means that you measure its throughput. Maybe you want to know the latency, in which case you can use a macro to serialize around my non-serialized thread timing function, but for much of what I'm interested in, the throughput is what matters, precisely because in the original code, before the timing was inserted, many operations were running at once. Serializing artificially makes the code look much slower than it really is, unless the code is full of long dependency chains.
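As a rough illustration of that distinction, here is a toy pipeline model (the latency and issue-rate numbers are made up, not measurements of any real CPU):

```python
# Toy model only: a serialized measurement charges each instruction its
# full latency, while a stream of independent instructions is limited
# only by how often a new one can be issued.

def serialized_cycles(n, latency):
    """Every instruction waits for the previous one to retire."""
    return n * latency

def pipelined_cycles(n, latency, issue_interval):
    """Independent instructions overlap: the first pays full latency,
    the rest start every issue_interval cycles."""
    if n == 0:
        return 0
    return latency + (n - 1) * issue_interval

# Hypothetical numbers: 3-cycle latency, one new instruction per cycle.
print(serialized_cycles(100, 3))    # 300
print(pipelined_cycles(100, 3, 1))  # 102
```

With those invented numbers, serializing around each instruction reports roughly 3x the time the same 100 independent instructions would take back to back.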

Quote from: MichaelW
Within my experience, executing under Windows, it's not possible to get consistent cycle counts in a single pass through a block of code. The counter2 macros were an attempt to minimize the number of loops, and while it is possible to get consistent results on most processors for a relatively small number of loops, say ~10, for one loop there is a large amount of scatter.
The big things that throw off cycle times are:

  • The 1 kHz timer tick, if present; it takes about 50,000 clocks
  • The thread scheduler, running about once every 25 ms (or maybe it was 40 ms); it takes about 200,000 clocks (or more)
  • Other interrupt handlers, occurring sporadically; they take between 200,000 and 2,000,000 clocks
  • Other threads, which will run for at least 25 ms (or 40 ms), which works out to at least 40,000,000 clocks on this laptop
For short timing runs, these are all way off the scale, so I basically just eliminate points that are sufficiently far from the curve of best fit (it's a bit more complicated than that, but that's it in a nutshell). That means I can get reliable results by measuring maybe 100 times and throwing away maybe 10% of the data. It's even quite consistent when another program is maxing out the CPU, although the caching is much worse in that case, so it runs about 1.5 times longer. Here's 3 runs each of 2 Mersenne Twister implementations while BOINC was using roughly 100% CPU:


Also, I'm not implementing this particular function for use in Windows, since it depends on the thread scheduler.  This is to get thread cycle timings in PwnOS so that those bad data don't appear in the first place.  It'll only count time actually spent in that thread, like QueryThreadCycleTime, only much faster and more stable, so the results should be even more reliable.  The idea really is to help make it easier and faster for everyone to time more reliably, by giving an alternative approach.

I'll be sure to post data when I get the thread scheduler and other stuff up and running in PwnOS, but it won't be very comparable with timings done in Windows (in terms of testing the timing function), since there'll be much less overhead.
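A minimal sketch of the outlier-rejection idea described above, simplified to a median-distance filter rather than a curve of best fit (the sample values are invented: most iterations cost ~100 cycles, a few are hit by an interrupt and cost far more):

```python
# Sketch only: drop timing samples that are far above the median,
# on the theory that interrupts and context switches are orders of
# magnitude larger than the code being measured.
from statistics import median

def filter_outliers(samples, factor=3.0):
    """Keep samples within `factor` times the median; drop the rest."""
    m = median(samples)
    return [s for s in samples if s <= factor * m]

samples = [100, 101, 99, 100, 50000, 102, 100, 200000, 98, 100]
clean = filter_outliers(samples)
print(len(clean), max(clean))  # the interrupt-sized spikes are gone
```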

Sorry if I've sounded rude; I get frustrated when I keep not being able to explain myself clearly.  :(

Neo

Quote from: dedndave on July 12, 2009, 08:04:04 PM
well - i am just letting you know what has worked for us
we're always open to learning new tricks
it would be great to see your timing code to see if we can apply it for our needs
another approach would be for you to look at some of the algorithms we have timed in the past
see how the measurements compare
the laboratory subforum is full of material to play with
Thanks for your understanding.  I'll try not to disappoint when I finally get PwnOS up and running, and I hope that the data actually work out in the end.  I think what I might do for comparison is to time individual iterations of a loop and then the whole loop:

  • with interrupts disabled using raw rdtsc times with and without serialization, then
  • with interrupts enabled using this approach with and without serialization.
I suspect that the sum of timing each iteration of a loop without serialization should be closer to the time of the whole loop in one shot than the sum of timing each iteration with serialization.  However, I could be wrong.  I think we can all agree that the sum of times with serialization will be larger than the time for the whole loop.  The question is whether the times without serialization are smaller than the time for the whole loop, and if so, whether that error is larger in magnitude.

Doing the test with and without interrupts will ensure that the timer is getting accurate thread cycle times instead of total cycle times.

Sorry again about such a silly argument when I just haven't been explaining myself clearly.  :red

dedndave

well - the applications are different
unless i misunderstand, your code is always running
in our application, we run a quick console mode app to get some timing values - and that's it
we don't care if it takes 10,000 clocks in overhead (well, we do, really - lol)
if it takes 10,000 +/-1 clocks every time, that would be ok
+/-1 would be great resolution, to us

even so - using IRET makes serialization "inexpensive" time-wise

PUSHFD
CALL FAR PTR LabelA
(rdtsc, etc)


LabelA PROC FAR
IRET
LabelA ENDP

i really like that - no more hassling with cpuid
don't have to "warm up" IRET, either - lol

MichaelW

Quote from: Neo on July 12, 2009, 11:40:58 PM
Suppose you were to time a single instruction, which it sounds like you want to. Serializing means that you measure the latency of that instruction; not serializing means that you measure the throughput of the instruction.

Timing a single instruction is not workable by any method that I know of because you cannot reliably isolate a single instruction. For consistency when serializing with CPUID you must control the function number in EAX, and for the CPUID that follows the timed instruction the instruction that sets the function number may or may not execute in parallel with the timed instruction. Without serializing, a similar problem exists for the instructions on both sides of the timed instruction. So depending on whether or not you serialize, the result may represent the latency or throughput of the timed instruction, or it may represent the latency or throughput of the timed instruction plus the latency or throughput of one or both adjacent instructions.

MichaelW

I had my doubts, and it took me a total of several hours to fumble my way through this. The code does not fully simulate an INT n, so it likely would not work for a real handler, but within this limited test it appears to work.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data     
      fw FWORD 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    pushfd
    mov DWORD PTR fw, handler
    mov WORD PTR fw+4, cs
    call FWORD PTR fw

    inkey "Press any key to exit..."
    exit

  handler:
    pushad
    print "in handler",13,10
    popad
    iretd
       
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


The remaining problem is that I still can't see any reasonable way to serialize with an IRET.

dedndave

tell me what instruction executes just prior to RDTSC... (hint - it isn't CALL)

PUSHFD
CALL FAR PTR LabelA
RDTSC
.
.
.

LabelA PROC FAR
IRET
LabelA ENDP

MichaelW

Obviously the IRETD. But whereas with CPUID for the second serialization, the timed instructions could not be isolated from a fast:

xor eax, eax

With IRETD for the second serialization, the timed instructions cannot be isolated from a much slower:

pushfd
call FWORD PTR

To me the main incentive for using a different serializing instruction is to completely isolate the timed instructions from the timing instructions, and to do that the serializing instruction must be freestanding.

dedndave

any place you are inserting PUSHFD/CALL(IRET), is a place where we are currently using an XOR EAX,EAX/CPUID
besides the fact that PUSHFD/CALL(IRET) is always faster than XOR EAX,EAX/CPUID,
there is an added advantage of not having to preserve registers (EBX, etc, as well as the RDTSC values in EDX:EAX)
another advantage: all processors support PUSHFD/CALL(IRET) - some do not support CPUID (we don't have to make that test, either)
one last advantage: PUSHFD/CALL(IRET) is always the same number of clock cycles - CPUID varies, even after warm-up
this is a win-win all the way around

dedndave

i was thinking of having an empty handler, but you could place the code to be timed in the handler
for the cases where we want to serialize without running the code - use the IRET at the end of the handler to make an empty one

;------------------------------------------------------------

timed PROC FAR

;code to be timed goes here

;----------------------------------------

empty PROC FAR

IRET

empty ENDP

;----------------------------------------

timed ENDP

;------------------------------------------------------------

;to get the overhead and start times:

PUSHFD
CALL FAR PTR empty
RDTSC
PUSH EDX
PUSH EAX
PUSHFD
CALL FAR PTR empty
RDTSC
PUSH EDX
PUSH EAX

;now, to run the timed code:

PUSHFD
CALL FAR PTR timed
RDTSC
;edx:eax now contain the end time
POP EAX
POP EDX
;edx:eax now contain the start time
POP EAX
POP EDX
;edx:eax now contain the overhead reference time

of course, you have to play with the registers and subtract at the end, but you get the idea (that is all after the last RDTSC)
gotta love it !  :U
as an old friend of mine used to say, "That's slicker than hot snot on a glass doorknob!"
of course, he always said that after looking at one of my circuit designs - lol
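In plain arithmetic, the register shuffling and subtraction that comes after the last RDTSC works out as follows (a sketch; the function name and timestamp values are invented stand-ins for the three EDX:EAX readings):

```python
# The three timestamps, in the order they were taken:
#   ref   - after the first empty serialized call (overhead reference)
#   start - after the second empty serialized call
#   end   - after the timed code
def net_cycles(ref, start, end):
    """(start - ref) measures one empty serialize+RDTSC round trip,
    so subtracting it from (end - start) removes the overhead."""
    overhead = start - ref
    return (end - start) - overhead

# e.g. 110 cycles of overhead, timed code taking 110 + 250 cycles:
print(net_cycles(1000, 1110, 1470))  # 250
```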

dedndave

well - it looked great on paper - lol
calling a far routine isn't as easy as it used to be
i will figure it out and mod the code a bit
i never give up - lol
and
if it doesn't fit, force it; if it breaks, it needed replacement, anyways

MichaelW

In real mode the results may be different, but in protected mode, running on a P3, xor eax,eax / cpuid is significantly faster.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      fw FWORD 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

  handler:

    iretd

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    mov DWORD PTR fw, handler
    mov WORD PTR fw+4, cs

    invoke Sleep, 4000

    REPEAT 10
      counter_begin 1000, HIGH_PRIORITY_CLASS
        xor eax, eax
        cpuid
      counter_end
      print ustr$(eax)," cycles, xor eax,eax | cpuid",13,10
    ENDM

    REPEAT 10
      counter_begin 1000, HIGH_PRIORITY_CLASS
        pushfd
        call FWORD PTR fw
      counter_end
      print ustr$(eax)," cycles, pushfd | call FWORD PTR | iretd",13,10
    ENDM

    inkey "Press any key to exit..."
    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd


The CPUID execution time does vary with the function, but I have never noticed it varying significantly for a given function.

And I'm not convinced that "warming up" the CPUID instruction serves any useful purpose. In the Intel application note PDF where I saw this being done, the programmer had failed to control the CPUID function number, an obvious error with an effect likely much larger than the lack of a "warm up".

The problem I had trying to make my version work was that I had forgotten about IRETD. No matter how I arranged the stack the IRET would fault (c0000005 (access violation)). And then I noticed the 66h operand-size prefix on the instruction, and knew what the problem was.

dedndave

this is what i get on my prescott, Michael

391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
390 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
396 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
583 cycles, pushfd | call FWORD PTR | iretd
599 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
585 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd

something is fishy - lol
i wonder if the tsc is running with the same multiplier on the prescott
that is a consistent 5:1 ratio compared to yours

btw - if you set the PROC type to "FAR", the CS is pushed for you

pushfd
call timed
.
.
.
timed proc far
iret
timed endp

it took me a while to figure that out - in the old DOS days, i had to use FAR PTR to do the same thing

MichaelW

I think the P4 TSC runs at the internal clock speed, but updates in step with the external clock. The IPC for the P4 is lower than that for the P2/P3 and, I think, the post-P4 processors, but for at least most instructions the difference is nothing like 5:1. I can't recall checking the cycle counts for CPUID on a P4, so for all I know those counts are normal.

dedndave

well, the machine is quite fast - i am very happy with its performance
it takes roughly 3 seconds to run through an empty loop with ecx set to 0 (2^32 = ~4.3e+9 iterations using LOOP)
but the clock cycle counts are always very high compared to all others in the forum
the 5:1 ratio would make sense, as it has a clock multiplier of x15 (bus clock 200 MHz - cpu clock 3 GHz)
i found one place where it mentions a different "rate of tick" between CPUs, but it doesn't say much more about it
http://en.wikipedia.org/wiki/Time_Stamp_Counter
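As a sanity check on the empty-loop figure (a sketch assuming a 32-bit LOOP, so ECX wraps through 2^32 counts, and the 3 GHz clock mentioned above):

```python
# With ECX = 0, a 32-bit LOOP decrements through the full register
# range, i.e. 2**32 iterations, before ECX reaches zero again.
iterations = 2**32          # 4,294,967,296
seconds = 3.0               # the run time reported above
clock_hz = 3.0e9            # assumed 3 GHz Prescott clock
cycles_per_iter = seconds * clock_hz / iterations
print(iterations, round(cycles_per_iter, 2))
```

That works out to roughly 2 clocks per LOOP iteration, which is at least a plausible order of magnitude; a count anywhere near 1.8e+19 (the 64-bit range) would take decades at 3 GHz.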