This could be a silly question, and there may be no good answer. I'm trying to devise a way to get reliable thread cycle times in my OS without switching to privilege level 0. I've got it down to the point that if I can do:
movdqa xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeTotal ;CycleTimeStart is in the upper qword
rdtsc
atomically, it'll be a piece of cake to calculate the thread cycle time while taking hardly any clock cycles itself. It's pretty unlikely that the thread scheduler will actually kick in between those two instructions, but I wouldn't want that case to produce erroneous results. The count would be too large by however much time passes before the thread starts running again, so a second call might return a smaller number than the first. If I switch the instruction order, I could get a negative number if the time spent waiting is longer than the time executed before that first time slice.
I can't really have a lock on the data, 'cause the thread scheduler updates it, meaning that the app would have to be able to prevent the thread scheduler from running.
Any ideas?
not sure about the atomic thing - sounds like it may explode or cause warts
but, there is nothing wrong with switching to HIGH_PRIORITY_CLASS for a very brief measurement period
cpuid should be used prior to rdtsc, as it serializes the instruction
Michael's timing macro also demonstrates "warming up" the cpuid instruction a couple times
also, if the timing routine needs to be run on a multi-core machine, you should use SetProcessAffinityMask
to select a single core during the measurement period (you can set it back to original value when done)
that way, the tsc values come from the same core
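something like this, roughly - just a sketch, the variable names are mine and error checking is omitted:
.data
procMask dd 0
sysMask  dd 0
.code
invoke GetCurrentProcess
mov    esi, eax
invoke GetProcessAffinityMask, esi, ADDR procMask, ADDR sysMask
invoke SetProcessAffinityMask, esi, 1          ;pin to core 0 for the measurement
;timing code goes here
invoke SetProcessAffinityMask, esi, procMask   ;restore the original mask when done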
Oh wait, I think I've just thought of a simple solution that doesn't need them to be atomic:
NotValidTime:
rdtsc
movdqa xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeStart ;CycleTimeTotal is in the upper qword
shl rdx,32
or rax,rdx
movq rcx,xmm0
sub rax,rcx
js NotValidTime ;Will have a negative number iff the thread scheduler kicked in in between rdtsc and movdqa
Note that the thread scheduler setting CycleTimeStart and CycleTimeTotal is atomic relative to this code, since I won't allow this thread to run until my thread scheduler has set both values. It's also technically possible to livelock on this loop if someone set the scheduler to kick in every APIC cycle or two (about 5-25 clock cycles each), but that'd be ridiculous. :lol
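For reference, here is a sketch of the scheduler-side bookkeeping that this scheme implies (the register holding the thread-data pointer and the exact syntax are my assumptions, not PwnOS's actual code):
;on preempting the thread (rbx = pointer to its PL3ThreadData block, assumed)
rdtsc
shl rdx,32
or rax,rdx ;rax = TSC now
sub rax,[rbx].PL3ThreadData.CycleTimeStart ;cycles used in the slice that just ended
add [rbx].PL3ThreadData.CycleTimeTotal,rax
;on resuming the thread, just before it runs again
rdtsc
shl rdx,32
or rax,rdx
mov [rbx].PL3ThreadData.CycleTimeStart,rax ;start of the new slice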
Quote from: dedndave on July 12, 2009, 02:38:28 AM
not sure about the atomic thing - sounds like it may explode or cause warts
but, there is nothing wrong with switching to HIGH_PRIORITY_CLASS for a very brief measurement period
cpuid should be used prior to rdtsc, as it serializes the instruction
Michael's timing macro also demonstrates "warming up" the cpuid instruction a couple times
also, if the timing routine needs to be run on a multi-core machine, you should use SetProcessAffinityMask
to select a single core during the measurement period (you can set it back to original value when done)
that way, the tsc values come from the same core
To clarify, this isn't for Windows code; it's for the API of my own OS, PwnOS. I have control over the thread scheduling and all that jazz, so I can do things that Windows wouldn't let me get away with and give some extra useful operations to the programs involved. :wink
I should put in something to serialize it as you suggest, now that you mention it, but cpuid takes quite a bit of time; I wonder if there's a serializing instruction that doesn't take so long. The fence instructions may not work since rdtsc doesn't use memory.
If the thread scheduler is unlikely to interfere, you could minimize the effects by collecting a large number of counts and averaging them.
The CPUID instruction is relatively slow, and the execution time varies with the function number in EAX. The lowest cycle count is for function 0, 79 cycles on my P3.
In the timing macros I use an empty reference loop to get the cycle count for the timing instructions, including the serializing instructions, and then subtract it from the cycle count for the working loop.
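In outline the idea is something like this (a simplified single-pass sketch, not the actual macro source; register preservation is ignored and "overhead" is assumed to be a DWORD variable):
; reference pass: serializing and timing instructions only
xor eax, eax
cpuid
rdtsc
mov esi, eax
xor eax, eax
cpuid
rdtsc
sub eax, esi
mov overhead, eax
; working pass: the same sequence with the code under test in the middle,
; looped and averaged, then overhead is subtracted from the result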
Per the Intel System Programming Guide the non-privileged serializing instructions are CPUID, IRET, and RSM, and I can't see any reasonable way to use IRET or RSM.
hmmmm - IRET has some possibilities
push the flags
push the right "code segment" - whatever that means in protected mode - lol
push the return address (to the rdtsc instruction, of course)
i.e. - does not necessarily require an INT
Well, I don't think I'd spring for IRET, 'cause it seems like the kind of thing that may have unexpected behaviour in PL3 in 64-bit mode, and/or it'd take more time than CPUID.
What if I added a dependency somehow, like:
NotValidTime:
pxor xmm0,xmm0
rdtsc
shl rdx,32
or rax,rdx
movq xmm0,rax
psubq xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeStart ;CycleTimeTotal is in the upper qword
movq rax,xmm0
test rax,rax
js NotValidTime ;Will have a negative number iff the thread scheduler kicked in in between rdtsc and psubq
pshufd xmm0,xmm0,1110b
movq rdx,xmm0
sub rax,rdx
Then the psubq has to occur after the rdtsc, and the CycleTimeTotal component in xmm0 will be negative, so it gets subtracted below instead of added.
well, the idea is to ensure that all other instructions are complete prior to executing the rdtsc
with processors that perform out-of-order execution and have multiple cores,
it seems the best approach if you want highly accurate and repeatable readings
i think IRET may be a great solution - i wasn't aware that it serialized instructions until Michael's post above
it is much faster than cpuid, and is supported on all processors that provide rdtsc
cpuid has so many issues
on some cyrix cpu's, for example, cpuid has to be enabled
Well, there are arguments for either side, and thank you for yours, but until I encounter an issue with not serializing, that's what I'll do, and here's my loose reasoning:
RDTSC can take 60 or 70 clocks on this machine, plus the overhead of the other instructions, coming to maybe 80-100 clocks total. The standard deviation of the time to run this function would be larger than the time of a short instruction or two, so serializing still wouldn't make it produce reliable results for the really short times. Maybe it'd be worthwhile to serialize for timing the transcendental instructions, but it's usually clear that they take a while. As such, I'd rather minimize the impact on times by not serializing, to get more realistic results for timing several instructions (since serializing isn't what would've happened in the code without the timing). It'll also allow more frequent timing with less of an impact on the overall time.
In the end, though, the biggest reason is that I could just provide a macro that serializes then calls this, so that people have the option to serialize or not, instead of putting the serialization in the API function and not giving people the option (short of rewriting the API function themselves).
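For example, the wrapper could be as simple as this (the macro and function names here are hypothetical, just to show the idea):
GetThreadCyclesSerialized MACRO
    xor  eax,eax
    cpuid                       ;;serialize first (note cpuid also clobbers rbx/rcx/rdx)
    call GetThreadCycleTime     ;;the non-serialized API function described above
ENDM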
Thanks for the healthy discussion! :U
i didn't realize rdtsc took so many clock cycles
at any rate, it should be nearly the same number each time, so it should have no impact on variations
it is a question of what you want for accuracy and repeatability
both Michael and myself are coming at the problem from a different angle than you, perhaps
we have tried to improve our ability to clock algorithms
so, you see, in our app, the size of the overhead isn't an issue
the variation in overhead clock cycles, from one run to the next, is what we try to reduce
there has been considerable work in the forum on this subject
if you d/l the counter2.zip at the link below, you can examine Michael's code for ideas
http://www.masm32.com/board/index.php?topic=770.0
Curious here, and I am thinking aloud, has anyone considered a hardware solution to code timing? I mean, how much latency would be involved in latching some particular hardware line, like say a parallel port pin, from "Ring 0"? Then build a simple parallel port device which "started counting clocks" (period really) when the pin went low, and stopped when it went high, then latched this value (in "clocks") for reading by the host?
Or a PCI pin, for that matter. "Free power" available using PCI...
i think the parallel port gives +5v, to some limited current - cmos circuits today require very little
one problem i foresee is that windows doesn't allow direct manipulation, so a driver is involved, i guess
maybe you can do direct i/o in ring0 - i dunno
another problem is, you would have to provide a counter
the tsc counter is internal to the cpu
my machine runs at 3 GHz with an X15 multiplier, so i would only be able to acquire the 200 MHz bus clock
counting the bus clock would have a 15 cycle granularity - and a 200 MHz counter would require a special circuit
even if i could get the 3 GHz signal, it would require a fairly expensive microwave frequency counter
Quote from: dedndave on July 12, 2009, 11:29:52 AM
there has been considerable work in the forum on this subject
if you d/l the counter2.zip at the link below, you can examine Michael's code for ideas
http://www.masm32.com/board/index.php?topic=770.0
That kind of massive-iteration timing is exactly what I'm trying to avoid, and now able to avoid. With my performance viewer (http://www.masm32.com/board/index.php?topic=11804.0), I can get accurate, repeatable timing results on very small timings, because it's able to clean up the noisy data. You can even see the distribution of times if you run something multiple times instead of just getting one number.
Also, when you're doing 1,000,000,000 iterations, the effect of serialization is negligible. A few clock cycles out of billions doesn't matter, so I have no idea why you're pushing it so blindly.
well - i am just letting you know what has worked for us
we're always open to learning new tricks
it would be great to see your timing code to see if we can apply it for our needs
another approach would be for you to look at some of the algorithms we have timed in the past
see how the measurements compare
the laboratory subforum is full of material to play with
Quote from: Neo on July 12, 2009, 07:15:59 PM
Also, when you're doing 1,000,000,000 iterations, the effect of serialization is negligible. A few clock cycles out of billions doesn't matter, so I have no idea why you're pushing it so blindly.
You appear to be missing the point of serialization. Before each read of the TSC, serialization ensures that any pending ops have finished. Without serialization you can have instructions overlapping the TSC read, potentially introducing a relatively large error
per iteration, depending on the overlapping instructions.
Within my experience, executing under Windows, it's not possible to get consistent cycle counts in a single pass through a block of code. The counter2 macros were an attempt to minimize the number of loops, and while it is possible to get consistent results on most processors for a relatively small number of loops, say ~10, for one loop there is a large amount of scatter.
Quote from: MichaelW on July 12, 2009, 09:06:50 PM
You appear to be missing the point of serialization. Before each read of the TSC, serialization ensures that any pending ops have finished.
Serialization has its place, but likewise, you seem to be missing the point of not serializing. Suppose you were to time a single instruction, which it sounds like you want to. Serializing means that you measure the latency of that instruction; not serializing means that you measure the throughput of the instruction. Maybe you want to know the latency, so you can use a macro to serialize and call my non-serialized thread timing function, but for much of what I'm interested in, the throughput is what matters, specifically
because in the code as it was before the timing was added, there were many operations running at once. Serializing artificially makes it look like the code would take much longer than it otherwise would, unless the code is full of big dependency chains.
Quote
Within my experience, executing under Windows, it's not possible to get consistent cycle counts in a single pass through a block of code. The counter2 macros were an attempt to minimize the number of loops, and while it is possible to get consistent results on most processors for a relatively small number of loops, say ~10, for one loop there is a large amount of scatter.
The big things that throw off cycle times are:
- The 1kHz tick, if present; it takes about 50,000 clocks
- The thread scheduler, occurring about once every 25ms (or maybe it was 40ms); it takes about 200,000 clocks (or more)
- Other interrupt handlers, occurring sporadically; they take between 200,000 and 2,000,000 clocks
- Other threads, which will run for at least 25ms (or 40ms), which works out to be at least 40,000,000 clocks on this laptop
For short timing runs, these are all way off the scale, so I basically just eliminate points that are sufficiently far from the curve of best fit (it's a bit more complicated than that, but that's it in a nutshell). That means that I can get reliable results by measuring maybe 100 times and just throwing away maybe 10% of the data. It's even very consistent when another program is maxing out the CPU, although the caching is much worse in that case, so it's about 1.5 times longer. Here are 3 runs each of 2 Mersenne Twister implementations while BOINC was using roughly 100% CPU:
(http://ndickson.files.wordpress.com/2009/07/mtheavyload.png)
Also,
I'm not implementing this particular function for use in Windows, since it depends on the thread scheduler. This is to get thread cycle timings in PwnOS so that those bad data don't appear in the first place. It'll only count time actually spent in that thread, like QueryThreadCycleTime, only much faster and more stable, so the results should be even more reliable. The idea really is to help make it easier and faster for everyone to time more reliably, by giving an alternative approach.
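(For comparison, the Windows call referred to above is used roughly like this; hThread and the QWORD variable are assumed to be set up elsewhere:)
;cycles QWORD 0 in .data, hThread an open thread handle
invoke QueryThreadCycleTime, hThread, ADDR cycles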
I'll be sure to post data when I get the thread scheduler and other stuff up and running in PwnOS, but it won't be very comparable with timings done in Windows (in terms of testing the timing function), since there'll be much less overhead.
Sorry if I've sounded rude; I get frustrated when I keep not being able to explain myself clearly. :(
Quote from: dedndave on July 12, 2009, 08:04:04 PM
well - i am just letting you know what has worked for us
we're always open to learning new tricks
it would be great to see your timing code to see if we can apply it for our needs
another approach would be for you to look at some of the algorithms we have timed in the past
see how the measurements compare
the laboratory subforum is full of material to play with
Thanks for your understanding. I'll try not to disappoint when I finally get PwnOS up and running, and I hope that the data actually work out in the end. I think what I might do for comparison is to time individual iterations of a loop and then the whole loop:
- with interrupts disabled using raw rdtsc times with and without serialization, then
- with interrupts enabled using this approach with and without serialization.
I suspect that the sum of timing each iteration of a loop without serialization should be closer to the time of the whole loop in one shot than the sum of timing each iteration with serialization. However, I could be wrong. I think we can all agree that the sum of times with serialization will be larger than the time for the whole loop. The question is whether the times without serialization are smaller than the time for the whole loop, and if so, whether the magnitude of that error is larger.
Doing the test with and without interrupts will ensure that the timer is getting accurate thread cycle times instead of total cycle times.
Sorry again about such a silly argument when I just haven't been explaining myself clearly. :red
well - the applications are different
unless i misunderstand, your code is always running
in our application, we run a quick console mode app to get some timing values - and that's it
we don't care if it takes 10,000 clocks in overhead (well, we do, really - lol)
if it takes 10,000 +/-1 clocks every time, that would be ok
+/-1 would be great resolution, to us
even so - using IRET makes serialization "inexpensive" time-wise
PUSHFD
CALL FAR PTR LabelA
(rdtsc, etc)
LabelA PROC FAR
IRET
LabelA ENDP
i really like that - no more hassling with cpuid
don't have to "warm up" IRET, either - lol
Quote from: Neo on July 12, 2009, 11:40:58 PM
Suppose you were to time a single instruction, which it sounds like you want to. Serializing means that you measure the latency of that instruction; not serializing means that you measure the throughput of the instruction.
Timing a single instruction is not workable by any method that I know of because you cannot reliably isolate a single instruction. For consistency when serializing with CPUID you must control the function number in EAX, and for the CPUID that follows the timed instruction the instruction that sets the function number may or may not execute in parallel with the timed instruction. Without serializing, a similar problem exists for the instructions on both sides of the timed instruction. So depending on whether or not you serialize, the result may represent the latency or throughput of the timed instruction, or it may represent the latency or throughput of the timed instruction plus the latency or throughput of one or both adjacent instructions.
I had my doubts, and it took several hours to fumble my way through this. The code does not fully simulate an INT n, so it likely would not work for a real handler, but within this limited test it appears to work.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
fw FWORD 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
pushfd
mov DWORD PTR fw, handler
mov WORD PTR fw+4, cs
call FWORD PTR fw
inkey "Press any key to exit..."
exit
handler:
pushad
print "in handler",13,10
popad
iretd
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
The remaining problem is that I still can't see any reasonable way to serialize with an IRET.
tell me what instruction executes just prior to RDTSC... (hint - it isn't CALL)
PUSHFD
CALL FAR PTR LabelA
RDTSC
.
.
.
LabelA PROC FAR
IRET
LabelA ENDP
Obviously the IRETD, but where using CPUID for the second serialization the timed instructions could not be isolated from a fast:
xor eax, eax
Using IRETD for the second serialization the timed instructions could not be isolated from a much slower:
pushfd
call FWORD PTR
To me the main incentive for using a different serializing instruction is to completely isolate the timed instructions from the timing instructions, and to do that the serializing instruction must be freestanding.
any place you are inserting PUSHFD/CALL(IRET), is a place where we are currently using an XOR EAX,EAX/CPUID
besides the fact that PUSHFD/CALL(IRET) is always faster than XOR EAX,EAX/CPUID,
there is an added advantage of not having to preserve registers (EBX, etc, as well as the RDTSC values in EDX:EAX)
another advantage: all processors support PUSHFD/CALL(IRET) - some do not support CPUID (we don't have to make that test, either)
one last advantage: PUSHFD/CALL(IRET) is always the same number of clock cycles - CPUID varies, even after warm-up
this is a win-win all the way around
i was thinking of having an empty handler, but you could place the code to be timed in the handler
for the cases where we want to serialize without running the code - use the IRET at the end of the handler to make an empty one
;------------------------------------------------------------
timed PROC FAR
;code to be timed goes here
;----------------------------------------
empty PROC FAR
IRET
empty ENDP
;----------------------------------------
timed ENDP
;------------------------------------------------------------
;to get the overhead and start times:
PUSHFD
CALL FAR PTR empty
RDTSC
PUSH EDX
PUSH EAX
PUSHFD
CALL FAR PTR empty
RDTSC
PUSH EDX
PUSH EAX
;now, to run the timed code:
PUSHFD
CALL FAR PTR timed
RDTSC
;edx:eax now contain the end time
POP EAX
POP EDX
;edx:eax now contain the start time
POP EAX
POP EDX
;edx:eax now contain the overhead reference time
of course, you have to play with the registers and subtract at the end, but you get the idea (that is all after the last RDTSC)
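for example, the arithmetic after the last RDTSC might look something like this (register choices are mine - preserve ESI/EDI around it in a real program):
MOV EBX,EAX ;ECX:EBX = end time
MOV ECX,EDX
POP ESI ;EDI:ESI = start time
POP EDI
SUB EBX,ESI ;ECX:EBX = end - start
SBB ECX,EDI
POP EAX ;EDX:EAX = overhead reference time
POP EDX
SUB ESI,EAX ;EDI:ESI = start - reference = overhead
SBB EDI,EDX
SUB EBX,ESI ;ECX:EBX = net clocks for the timed code
SBB ECX,EDI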
gotta love it ! :U
as an old friend of mine used to say, "That's slicker than hot snot on a glass doorknob!"
of course, he always said that after looking at one of my circuit designs - lol
well - it looked great on paper - lol
calling a far routine isn't as easy as it used to be
i will figure it out and mod the code a bit
i never give up - lol
and
if it doesn't fit, force it; if it breaks, it needed replacement, anyways
In real mode the results may be different, but in protected mode, running on a P3, xor eax,eax | cpuid is significantly faster.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
fw FWORD 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
handler:
iretd
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov DWORD PTR fw, handler
mov WORD PTR fw+4, cs
invoke Sleep, 4000
REPEAT 10
counter_begin 1000, HIGH_PRIORITY_CLASS
xor eax, eax
cpuid
counter_end
print ustr$(eax)," cycles, xor eax,eax | cpuid",13,10
ENDM
REPEAT 10
counter_begin 1000, HIGH_PRIORITY_CLASS
pushfd
call FWORD PTR fw
counter_end
print ustr$(eax)," cycles, pushfd | call FWORD PTR | iretd",13,10
ENDM
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
The CPUID execution time does vary with the function, but I have never noticed it varying significantly for a given function.
And I'm not convinced that "warming up" the CPUID instruction serves any useful purpose. In the Intel application note PDF where I saw this being done the programmer had failed to control the CPUID function, an obvious error with an effect likely much larger than the lack of a "warm up".
The problem I had trying to make my version work was that I had forgotten about IRETD. No matter how I arranged the stack the IRET would fault (c0000005 (access violation)). And then I noticed the 66h operand-size prefix on the instruction, and knew what the problem was.
this is what i get on my prescott, Michael
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
390 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
396 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
583 cycles, pushfd | call FWORD PTR | iretd
599 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
585 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
something is fishy - lol
i wonder if the tsc is running with the same multiplier on the prescott
that is a consistent 5:1 ratio compared to yours
btw - if you set the PROC type to "FAR", the CS is pushed for you
pushfd
call timed
.
.
.
timed proc far
iret
timed endp
it took me a while to figure that out - in the old DOS days, i had to use FAR PTR to do the same thing
I think the P4 TSC runs at the internal clock speed, but updates in step with the external clock. The IPC for the P4 is lower than that for the P2/P3 and I think the post-P4 processors, but for at least most of the instructions the difference is nothing like 5:1. I can't recall checking the cycle counts for CPUID on a P4, so for all I know those counts are normal.
well, the machine is quite fast - i am very happy with its performance
it takes roughly 3 seconds to run through an empty loop with ecx set to 0 (1.8e+19 iterations using LOOP)
but the clock cycle counts are always very high compared to all others in the forum
the 5:1 ratio would make sense, as it has a clock multiplier of x15 (bus clock 200 MHz - cpu clock 3 GHz)
i find one place where it mentions different "rate of tick" between CPU's, but it doesn't say much more about it
http://en.wikipedia.org/wiki/Time_Stamp_Counter
any way you look at it...
PUSHFD+CALL far+IRET shouldn't be that bad - lol
you are using the indirect form of CALL - that would account for 2 or 3 clock cycles - not many
ohh - i bet i know what it is - the CPU has to check IRET for privilege level changes
still - no preservation of EBX is required - you could add PUSH EBX/POP EBX into your stream
it is not the absolute number of clocks, but the variations - i can't get any believable numbers out of my machine to play with it
EDIT
i added push ebx/pop ebx - the two methods are roughly the same - and they both jump around on my machine too - lol
dang - i need a way to benchmark code
There appears to be a lot of variation between processors in the cycle counts for the two instruction sequences. This is for an old AMD K5:
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
please run this for me Michael - i would like to see some real numbers - lol
i can't believe anything my machine tells me
P3:
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
K5:
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
thanks Michael
i guess we can put iret to rest, then - lol
it sounded so good - had to try
Hi,
What if you do something like:
pushfd | call (far) | pushfd | call (far) | iretd | work | iretd
to keep the call far timings out of the loop?
Pondering,
Steve N.
it is the IRET instruction that sucks up the clock cycles, i think
that is because the CPU has to check the privilege level of the return-to segment for exceptions
IRET needs an offset, segment, and flags on the stack - pushfd | call far is the fastest way to put them there
well - we could revector an interrupt and use INT nn, but that isn't going to be much faster and requires ring 0, i think
i am a bit of a n00b, so i am not sure about revectoring interrupts in protected mode windows - lol
might be able to use INT3 - that is special
Quote
pushfd | call (far) | pushfd | call (far) | iretd | work | iretd
The IRETDs would execute in the order called, even if they were physically separate.
Quote from: dedndave on July 14, 2009, 02:17:51 PM
it is the IRET instruction that sucks up the clock cycles, i think
that is because the CPU has to check the privilege level of the return-to segment for exceptions
Well that sorta was the question, does the CPU check the CALL segment/selector
in the same way?
Quote
IRET needs an offset, segment, and flags on the stack - pushfd | call far is the fastest way to put them there
Or just push them and avoid the calls altogether.
Quote
well - we could revector an interrupt and use INT nn, but that isn't going to be much faster and requires ring 0, i think
i am a bit of a n00b, so i am not sure about revectoring interrupts in protected mode windows - lol
might be able to use INT3 - that is special
Sounds like too much that way, unless there is some default set up. (I don't think so.)
Regards,
Steve N.
Quote
Or just push them and avoid the calls altogether.
that would be pushfd | push cs | push offset | jmp
INT3 is probably the best utilization of IRET
but, i doubt it is worth the effort because the CPUID method would still be faster by a few cycles
Quote from: dedndave on July 14, 2009, 03:40:03 PM
that would be pushfd | push cs | push offset | jmp
I would think
pushfd | push cs | push offset | iretd
to save a jump.
Steve N.
i see what you mean - that might actually be close to the cpuid method for time
i think it will be very close, but cpuid wins by 1 or 2 clocks - lol
pushfd
push cs
push LabelA
iret
LabelA: rdtsc
here - try it out - i get 9 clocks difference - better by 1 clock cycle
i have to calculate that because of my oddball cpu - lol
it would be good to see some real numbers posted
Hi,
For an AMD processor.
F:\>time6
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
104 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
104 cycles, pushfd | push cs | push offset | iretd
Press any key to exit...
Oh, well. I'm assuming that's from something like this?
pushfd
push cs
push LabelB
pushfd
push cs
push LabelA
iret
LabelA: rdtsc
mov ebx,eax
iret
LabelB: rdtsc
Regards,
Steve N.
Edit: Had the labels wrong...
Thanks, Steve
it is this sequence repeated several times:
pushfd
push cs
push $+6 ;return address = the byte after the iret below (push imm32 is 5 bytes, iret is 1)
db 0cfh ;iret
which is equivalent to:
pushfd
push cs
push LabelX
iret
LabelX:
it does not include the RDTSC inst, as that is common to both methods
the other one is:
push ebx
xor eax,eax
cpuid
pop ebx
i have a couple ideas to speed it up, but the same ideas could be applied to CPUID, as well - lol