Two Instructions in One Atomic Operation

Neo · July 12, 2009, 02:19:29 AM

This could be a silly question, and there may be no good answer. I'm trying to devise a way to get reliable thread cycle times in my OS without switching to privilege level 0. I've got it down to that if I can do:

Code Select

movdqa     xmm0,xmmword ptr fs[0].PL3ThreadData.CycleTimeTotal ;CycleTimeStart is in the upper qword
rdtsc

atomically, it'll be a piece of cake to calculate the thread cycle time while taking hardly any clock cycles itself. It's pretty unlikely that the thread scheduler will actually kick in between those two instructions, but I wouldn't want it to give results that cause errors in that case. The count would be larger by however much time until the thread starts running again, so a second call to it may give a smaller number. If I switch the instruction order, I could have a negative number if the time waited is larger than the time executed before that first time slice.

I can't really have a lock on the data, 'cause the thread scheduler updates it, meaning that the app would have to be able to prevent the thread scheduler from running.
Any ideas?

dedndave · July 12, 2009, 02:38:28 AM

not sure about the atomic thing - sounds like it may explode or cause warts
but, there is nothing wrong with switching to HIGH_PRIORITY_CLASS for a very brief measurement period
cpuid should be used prior to rdtsc, as it serializes the instruction
Michael's timing macro also demonstrates "warming up" the cpuid instruction a couple times
also, if the timing routine needs to be run on a multi-core machine, you should use SetProcessAffinity
to select a single core during the measurement period (you can set it back to original value when done)
that way, the tsc values come from the same core

Neo · July 12, 2009, 02:45:25 AM

Oh wait, I think I've just thought of a simple solution that doesn't need them to be atomic:

Code Select

NotValidTime:
    rdtsc
    movdqa  xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeStart ;CycleTimeTotal is in the upper qword
    shl     rdx,32
    or      rax,rdx
    movq    rcx,xmm0
    sub     rax,rcx
    js      NotValidTime   ;Will have a negative number iff the thread scheduler kicked in in between rdtsc and movdqa

Note that the thread scheduler setting CycleTimeStart and CycleTimeTotal is atomic relative to this code, since I won't allow this thread to run until my thread scheduler has set both values. It's also technically possible to livelock on this loop if someone set the scheduler to kick in every APIC cycle or two (about 5-25 clock cycles each), but that'd be ridiculous. :lol

Neo · July 12, 2009, 02:52:49 AM

Quote from: dedndave on July 12, 2009, 02:38:28 AM
not sure about the atomic thing - sounds like it may explode or cause warts
but, there is nothing wrong with switching to HIGH_PRIORITY_CLASS for a very brief measurement period
cpuid should be used prior to rdtsc, as it serializes the instruction
Michael's timing macro also demonstrates "warming up" the cpuid instruction a couple times
also, if the timing routine needs to be run on a multi-core machine, you should use SetProcessAffinity
to select a single core during the measurement period (you can set it back to original value when done)
that way, the tsc values come from the same core

To clarify, this isn't for in Windows code, this is for in the API of my own OS, PwnOS. I have control over the thread scheduling and all that jazz, so I can do things that Windows wouldn't let me get away with and give some extra useful operations to the programs involved. :wink

I should put in something to serialize it as you suggest, now that you mention it, but cpuid takes quite a bit of time; I wonder if there's a serializing instruction that doesn't take so long. The fence ones may not work since rdtssc doesn't use memory.

MichaelW · July 12, 2009, 03:35:04 AM

If the thread scheduler is unlikely to interfere, you could minimize the effects by collecting a large number of counts and averaging them.

The CPUID instruction is relatively slow, and the execution time varies with the function number in EAX. The lowest cycle count is for function 0, 79 cycles on my P3.

In the timing macros I use an empty reference loop to get the cycle count for the timing instructions, including the serializing instructions, and then subtract it from the cycle count for the working loop.

Per the Intel System Programming Guide the non-privileged serializing instructions are CPUID, IRET, and RSM, and I can't see any reasonable way to use IRET or RSM.

dedndave · July 12, 2009, 03:50:56 AM

hmmmm - IRET has some possibilities
push the flags
push the right "code segment" - whatever that means in protected mode - lol
push the return address (to the rdtsc instruction, of course)
i.e. - does not neccessarily require an INT

Neo · July 12, 2009, 04:57:23 AM

Well, I don't think I'd spring for IRET, 'cause it seems like the kind of thing that may have unexpected behaviour in PL3 in 64-bit mode, and/or it'd take more time than CPUID.

What if I added a dependency somehow, like:

Code Select

NotValidTime:
    pxor    xmm0,xmm0
    rdtsc
    shl     rdx,32
    or      rax,rdx
    movq    xmm0,rax
    psubq   xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeStart ;CycleTimeTotal is in the upper qword
    movq    rax,xmm0
    test    rax,rax
    js      NotValidTime   ;Will have a negative number iff the thread scheduler kicked in in between rdtsc and psubq
    pshufd  xmm0,xmm0,1110b
    movq    rdx,xmm0
    sub     rax,rdx

Then the psubq has to occur after the rdtsc, and the CycleTimeTotal component in xmm0 will be negative, so it gets subtracted below instead of added.

dedndave · July 12, 2009, 10:21:03 AM

well, the idea is to insure that all other instructions are complete prior to executing the rdtsc
with processors that perform out-of-order execution and have multiple cores,
it seems the best approach if you want highly accurate and repeatable readings
i think IRET may be a great soultion - i wasn't aware that it serialized instructions until Michael's post above
it is much faster than cpuid, and is supported on all processors that provide rdtsc
cpuid has so many issues
on some cyrix cpu's, for example, cpuid has to be enabled

Neo · July 12, 2009, 11:11:07 AM

Well, there are arguments for either side, and thank you for yours, but until I encounter an issue with not serializing, that's what I'll do, and here's my loose reasoning:

RDTSC can take 60 or 70 clocks on this machine, plus the overhead of the other instructions, coming to maybe 80-100 clocks total. The standard deviation of the time to run this function would be larger than the time of a short instruction or two, so serializing still wouldn't make it produce reliable results for the really short times. Maybe it'd be worthwhile to serialize for timing the transcendental instructions, but it's usually clear that they take a while. As such, I'd rather minimize the impact on times by not serializing, to get more realistic results for timing several instructions (since serializing isn't what would've happened in the code without the timing). It'll also allow more frequent timing with less of an impact on the overall time.

In the end, though, the biggest reason is that I could just provide a macro that serializes then calls this, so that people have the option to serialize or not, instead of putting the serialization in the API function and not giving people the option (short of rewriting the API function themselves).

Thanks for the healthy discussion! :U

dedndave · July 12, 2009, 11:19:29 AM

i didn't realize rdtsc took so many clock cycles
at any rate, it should be nearly the same number each time, so it should have no impact on variations
it is a question of what you want for accuracy and repeatability
both Michael and myself are coming at the problem from a different angle than you, perhaps
we have tried to improve our ability to clock algorithms
so, you see, in our app, the size of the overhead isn't an issue
the variations in overhead clock cycles, from one run to the next, is what we try to reduce

dedndave · July 12, 2009, 11:29:52 AM

there has been considerable work in the forum on this subject
if you d/l the counter2.zip at the link below, you can examine Michael's code for ideas

http://www.masm32.com/board/index.php?topic=770.0

Mark Jones · July 12, 2009, 04:05:12 PM

Curious here, and I am thinking aloud, has anyone considered a hardware solution to code timing? I mean, how much latency would be involved in latching some particular hardware line, like say a parallel port pin, from "Ring 0"? Then build a simple parallel port device which "started counting clocks" (period really) when the pin went low, and stopped when it went high, then latched this value (in "clocks") for reading by the host?

Or a PCI pin, for that matter. "Free power" available using PCI...

dedndave · July 12, 2009, 05:22:22 PM

i think the parallel port gives +5v, to some limited current - cmos circuits today require very little
one problem i foresee is that windows doesn't allow direct manipulation, so a driver is involved, i guess
maybe you can do direct i/o in ring0 - i dunno
another problem is, you would have to provide a counter
the tsc counter is internal to the cpu
my machine runs at 3 Ghz with an X15 multiplier, so i would only be able to acquire the 200 mhz bus clock
counting the bus clock would have a 15 cycle ganularity - and a 200 Mhz counter would require a special circuit
even if i could get the 3 Ghz signal, it would require a fairly expensive microwave frequency counter

Neo · July 12, 2009, 07:15:59 PM

Quote from: dedndave on July 12, 2009, 11:29:52 AM
there has been considerable work in the forum on this subject
if you d/l the counter2.zip at the link below, you can examine Michael's code for ideas

http://www.masm32.com/board/index.php?topic=770.0

That kind of massive-iteration timing is exactly what I'm trying to avoid, and now able to avoid. With my performance viewer, I can get accurate, repeatable timing results on very small timings, because it's able to clean up the noisy data. You can even see the distribution of times if you run something multiple times instead of just getting one number.

Also, when you're doing 1,000,000,000 iterations, the affect of serialization is negligible. A few clock cycles out of billions doesn't matter, so I have no idea why you're pushing it so blindly.

dedndave · July 12, 2009, 08:04:04 PM

well - i am just letting you know what has worked for us
we're always open to learning new tricks
it would be great to see your timing code to see if we can apply it for our needs
another approach would be for you to look at some of the algorithms we have timed in the past
see how the measurements compare
the laboratory subforum is full of material to play with

News:

Two Instructions in One Atomic Operation