This could be a silly question, and there may be no good answer. I'm trying to devise a way to get reliable thread cycle times in my OS without switching to privilege level 0. I've got it down to the point that if I can do:
movdqa xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeTotal ;CycleTimeStart is in the upper qword
rdtsc
atomically, it'll be a piece of cake to calculate the thread cycle time while taking hardly any clock cycles itself. It's pretty unlikely that the thread scheduler will actually kick in between those two instructions, but I wouldn't want that case to produce erroneous results. The count would be too large by however much time passes before the thread starts running again, so a second call might return a smaller number than the first. If I switch the instruction order, I could get a negative number if the time spent waiting is longer than the time executed before that first time slice.
I can't really have a lock on the data, 'cause the thread scheduler updates it, meaning that the app would have to be able to prevent the thread scheduler from running.
Any ideas?
not sure about the atomic thing - sounds like it may explode or cause warts
but, there is nothing wrong with switching to HIGH_PRIORITY_CLASS for a very brief measurement period
cpuid should be used prior to rdtsc, as it serializes the instruction
Michael's timing macro also demonstrates "warming up" the cpuid instruction a couple times
also, if the timing routine needs to be run on a multi-core machine, you should use SetProcessAffinityMask
to select a single core during the measurement period (you can set it back to original value when done)
that way, the tsc values come from the same core
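something like this, roughly - just a sketch, the variable names are mine and error checking is omitted:
.data
procMask dd 0
sysMask  dd 0
.code
invoke GetCurrentProcess
mov    esi, eax
invoke GetProcessAffinityMask, esi, ADDR procMask, ADDR sysMask
invoke SetProcessAffinityMask, esi, 1          ;pin to core 0 for the measurement
;timing code goes here
invoke SetProcessAffinityMask, esi, procMask   ;restore the original mask when done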
Oh wait, I think I've just thought of a simple solution that doesn't need them to be atomic:
NotValidTime:
rdtsc
movdqa xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeStart ;CycleTimeTotal is in the upper qword
shl rdx,32
or rax,rdx
movq rcx,xmm0
sub rax,rcx
js NotValidTime ;Will have a negative number iff the thread scheduler kicked in in between rdtsc and movdqa
Note that the thread scheduler setting CycleTimeStart and CycleTimeTotal is atomic relative to this code, since I won't allow this thread to run until my thread scheduler has set both values. It's also technically possible to livelock on this loop if someone set the scheduler to kick in every APIC cycle or two (about 5-25 clock cycles each), but that'd be ridiculous. :lol
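For reference, here is a sketch of the scheduler-side bookkeeping that this scheme implies (the register holding the thread-data pointer and the exact syntax are my assumptions, not PwnOS's actual code):
;on preempting the thread (rbx = pointer to its PL3ThreadData block, assumed)
rdtsc
shl rdx,32
or rax,rdx ;rax = TSC now
sub rax,[rbx].PL3ThreadData.CycleTimeStart ;cycles used in the slice that just ended
add [rbx].PL3ThreadData.CycleTimeTotal,rax
;on resuming the thread, just before it runs again
rdtsc
shl rdx,32
or rax,rdx
mov [rbx].PL3ThreadData.CycleTimeStart,rax ;start of the new slice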
Quote from: dedndave on July 12, 2009, 02:38:28 AM
not sure about the atomic thing - sounds like it may explode or cause warts
but, there is nothing wrong with switching to HIGH_PRIORITY_CLASS for a very brief measurement period
cpuid should be used prior to rdtsc, as it serializes the instruction
Michael's timing macro also demonstrates "warming up" the cpuid instruction a couple times
also, if the timing routine needs to be run on a multi-core machine, you should use SetProcessAffinityMask
to select a single core during the measurement period (you can set it back to original value when done)
that way, the tsc values come from the same core
To clarify, this isn't for Windows code; it's for the API of my own OS, PwnOS. I have control over the thread scheduling and all that jazz, so I can do things that Windows wouldn't let me get away with and give some extra useful operations to the programs involved. :wink
I should put in something to serialize it as you suggest, now that you mention it, but cpuid takes quite a bit of time; I wonder if there's a serializing instruction that doesn't take so long. The fence instructions may not work since rdtsc doesn't use memory.
If the thread scheduler is unlikely to interfere, you could minimize the effects by collecting a large number of counts and averaging them.
The CPUID instruction is relatively slow, and the execution time varies with the function number in EAX. The lowest cycle count is for function 0, 79 cycles on my P3.
In the timing macros I use an empty reference loop to get the cycle count for the timing instructions, including the serializing instructions, and then subtract it from the cycle count for the working loop.
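In outline the idea is something like this (a simplified single-pass sketch, not the actual macro source; register preservation is ignored and "overhead" is assumed to be a DWORD variable):
; reference pass: serializing and timing instructions only
xor eax, eax
cpuid
rdtsc
mov esi, eax
xor eax, eax
cpuid
rdtsc
sub eax, esi
mov overhead, eax
; working pass: the same sequence with the code under test in the middle,
; looped and averaged, then overhead is subtracted from the result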
Per the Intel System Programming Guide the non-privileged serializing instructions are CPUID, IRET, and RSM, and I can't see any reasonable way to use IRET or RSM.
hmmmm - IRET has some possibilities
push the flags
push the right "code segment" - whatever that means in protected mode - lol
push the return address (to the rdtsc instruction, of course)
i.e. - does not necessarily require an INT
Well, I don't think I'd spring for IRET, 'cause it seems like the kind of thing that may have unexpected behaviour in PL3 in 64-bit mode, and/or it'd take more time than CPUID.
What if I added a dependency somehow, like:
NotValidTime:
pxor xmm0,xmm0
rdtsc
shl rdx,32
or rax,rdx
movq xmm0,rax
psubq xmm0,xmmword ptr fs:[0].PL3ThreadData.CycleTimeStart ;CycleTimeTotal is in the upper qword
movq rax,xmm0
test rax,rax
js NotValidTime ;Will have a negative number iff the thread scheduler kicked in in between rdtsc and psubq
pshufd xmm0,xmm0,1110b
movq rdx,xmm0
sub rax,rdx
Then the psubq has to occur after the rdtsc, and the CycleTimeTotal component in xmm0 will be negative, so it gets subtracted below instead of added.
well, the idea is to ensure that all other instructions are complete prior to executing the rdtsc
with processors that perform out-of-order execution and have multiple cores,
it seems the best approach if you want highly accurate and repeatable readings
i think IRET may be a great solution - i wasn't aware that it serialized instructions until Michael's post above
it is much faster than cpuid, and is supported on all processors that provide rdtsc
cpuid has so many issues
on some cyrix cpu's, for example, cpuid has to be enabled
Well, there are arguments for either side, and thank you for yours, but until I encounter an issue with not serializing, that's what I'll do, and here's my loose reasoning:
RDTSC can take 60 or 70 clocks on this machine, plus the overhead of the other instructions, coming to maybe 80-100 clocks total. The standard deviation of the time to run this function would be larger than the time of a short instruction or two, so serializing still wouldn't make it produce reliable results for the really short times. Maybe it'd be worthwhile to serialize for timing the transcendental instructions, but it's usually clear that they take a while. As such, I'd rather minimize the impact on times by not serializing, to get more realistic results for timing several instructions (since serializing isn't what would've happened in the code without the timing). It'll also allow more frequent timing with less of an impact on the overall time.
In the end, though, the biggest reason is that I could just provide a macro that serializes then calls this, so that people have the option to serialize or not, instead of putting the serialization in the API function and not giving people the option (short of rewriting the API function themselves).
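For example, the wrapper could be as simple as this (the macro and function names here are hypothetical, just to show the idea):
GetThreadCyclesSerialized MACRO
    xor  eax,eax
    cpuid                       ;;serialize first (note cpuid also clobbers rbx/rcx/rdx)
    call GetThreadCycleTime     ;;the non-serialized API function described above
ENDM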
Thanks for the healthy discussion! :U
i didn't realize rdtsc took so many clock cycles
at any rate, it should be nearly the same number each time, so it should have no impact on variations
it is a question of what you want for accuracy and repeatability
both Michael and myself are coming at the problem from a different angle than you, perhaps
we have tried to improve our ability to clock algorithms
so, you see, in our app, the size of the overhead isn't an issue
the variation in overhead clock cycles, from one run to the next, is what we try to reduce
there has been considerable work in the forum on this subject
if you d/l the counter2.zip at the link below, you can examine Michael's code for ideas
http://www.masm32.com/board/index.php?topic=770.0
Curious here, and I am thinking aloud, has anyone considered a hardware solution to code timing? I mean, how much latency would be involved in latching some particular hardware line, like say a parallel port pin, from "Ring 0"? Then build a simple parallel port device which "started counting clocks" (period really) when the pin went low, and stopped when it went high, then latched this value (in "clocks") for reading by the host?
Or a PCI pin, for that matter. "Free power" available using PCI...
i think the parallel port gives +5v, to some limited current - cmos circuits today require very little
one problem i foresee is that windows doesn't allow direct manipulation, so a driver is involved, i guess
maybe you can do direct i/o in ring0 - i dunno
another problem is, you would have to provide a counter
the tsc counter is internal to the cpu
my machine runs at 3 GHz with an X15 multiplier, so i would only be able to acquire the 200 MHz bus clock
counting the bus clock would have a 15 cycle granularity - and a 200 MHz counter would require a special circuit
even if i could get the 3 GHz signal, it would require a fairly expensive microwave frequency counter
Quote from: dedndave on July 12, 2009, 11:29:52 AM
there has been considerable work in the forum on this subject
if you d/l the counter2.zip at the link below, you can examine Michael's code for ideas
http://www.masm32.com/board/index.php?topic=770.0
That kind of massive-iteration timing is exactly what I'm trying to avoid, and now able to avoid. With my performance viewer (http://www.masm32.com/board/index.php?topic=11804.0), I can get accurate, repeatable timing results on very small timings, because it's able to clean up the noisy data. You can even see the distribution of times if you run something multiple times instead of just getting one number.
Also, when you're doing 1,000,000,000 iterations, the effect of serialization is negligible. A few clock cycles out of billions doesn't matter, so I have no idea why you're pushing it so blindly.
well - i am just letting you know what has worked for us
we're always open to learning new tricks
it would be great to see your timing code to see if we can apply it for our needs
another approach would be for you to look at some of the algorithms we have timed in the past
see how the measurements compare
the laboratory subforum is full of material to play with
Quote from: Neo on July 12, 2009, 07:15:59 PM
Also, when you're doing 1,000,000,000 iterations, the effect of serialization is negligible. A few clock cycles out of billions doesn't matter, so I have no idea why you're pushing it so blindly.
You appear to be missing the point of serialization. Before each read of the TSC, serialization ensures that any pending ops have finished. Without serialization you can have instructions overlapping the TSC read, potentially introducing a relatively large error
per iteration, depending on the overlapping instructions.
Within my experience, executing under Windows, it's not possible to get consistent cycle counts in a single pass through a block of code. The counter2 macros were an attempt to minimize the number of loops, and while it is possible to get consistent results on most processors for a relatively small number of loops, say ~10, for one loop there is a large amount of scatter.
Quote from: MichaelW on July 12, 2009, 09:06:50 PM
You appear to be missing the point of serialization. Before each read of the TSC, serialization ensures that any pending ops have finished.
Serialization has its place, but likewise, you seem to be missing the point of not serializing. Suppose you were to time a single instruction, which it sounds like you want to. Serializing means that you measure the latency of that instruction; not serializing means that you measure the throughput of the instruction. Maybe you want to know the latency, so you can use a macro to serialize and call my non-serialized thread timing function, but for much of what I'm interested in, the throughput is what matters, specifically
because in the code as it was before the timing was added, there were many operations running at once. Serializing artificially makes it look like the code would take much longer than it otherwise would, unless the code is full of big dependency chains.
Quote
Within my experience, executing under Windows, it's not possible to get consistent cycle counts in a single pass through a block of code. The counter2 macros were an attempt to minimize the number of loops, and while it is possible to get consistent results on most processors for a relatively small number of loops, say ~10, for one loop there is a large amount of scatter.
The big things that throw off cycle times are:
- The 1kHz tick, if present; it takes about 50,000 clocks
- The thread scheduler, occurring about once every 25ms (or maybe it was 40ms); it takes about 200,000 clocks (or more)
- Other interrupt handlers, occurring sporadically; they take between 200,000 and 2,000,000 clocks
- Other threads, which will run for at least 25ms (or 40ms), which works out to be at least 40,000,000 clocks on this laptop
For short timing runs, these are all way off the scale, so I basically just eliminate points that are sufficiently far from the curve of best fit (it's a bit more complicated than that, but that's it in a nutshell). That means that I can get reliable results by measuring maybe 100 times and just throwing away maybe 10% of the data. It's even very consistent when another program is maxing out the CPU, although the caching is much worse in that case, so it's about 1.5 times longer. Here are 3 runs each of 2 Mersenne Twister implementations while BOINC was using roughly 100% CPU:
(http://ndickson.files.wordpress.com/2009/07/mtheavyload.png)
Also,
I'm not implementing this particular function for use in Windows, since it depends on the thread scheduler. This is to get thread cycle timings in PwnOS so that those bad data don't appear in the first place. It'll only count time actually spent in that thread, like QueryThreadCycleTime, only much faster and more stable, so the results should be even more reliable. The idea really is to help make it easier and faster for everyone to time more reliably, by giving an alternative approach.
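(For comparison, the Windows call referred to above is used roughly like this; hThread and the QWORD variable are assumed to be set up elsewhere:)
;cycles QWORD 0 in .data, hThread an open thread handle
invoke QueryThreadCycleTime, hThread, ADDR cycles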
I'll be sure to post data when I get the thread scheduler and other stuff up and running in PwnOS, but it won't be very comparable with timings done in Windows (in terms of testing the timing function), since there'll be much less overhead.
Sorry if I've sounded rude; I get frustrated when I keep not being able to explain myself clearly. :(
Quote from: dedndave on July 12, 2009, 08:04:04 PM
well - i am just letting you know what has worked for us
we're always open to learning new tricks
it would be great to see your timing code to see if we can apply it for our needs
another approach would be for you to look at some of the algorithms we have timed in the past
see how the measurements compare
the laboratory subforum is full of material to play with
Thanks for your understanding. I'll try not to disappoint when I finally get PwnOS up and running, and I hope that the data actually work out in the end. I think what I might do for comparison is to time individual iterations of a loop and then the whole loop:
- with interrupts disabled using raw rdtsc times with and without serialization, then
- with interrupts enabled using this approach with and without serialization.
I suspect that the sum of timing each iteration of a loop without serialization should be closer to the time of the whole loop in one shot than the sum of timing each iteration with serialization. However, I could be wrong. I think we can all agree that the sum of times with serialization will be larger than the time for the whole loop. The question is whether the times without serialization are smaller than the time for the whole loop, and if so, whether the magnitude of that error is larger.
Doing the test with and without interrupts will ensure that the timer is getting accurate thread cycle times instead of total cycle times.
Sorry again about such a silly argument when I just haven't been explaining myself clearly. :red
well - the applications are different
unless i misunderstand, your code is always running
in our application, we run a quick console mode app to get some timing values - and that's it
we don't care if it takes 10,000 clocks in overhead (well, we do, really - lol)
if it takes 10,000 +/-1 clocks every time, that would be ok
+/-1 would be great resolution, to us
even so - using IRET makes serialization "inexpensive" time-wise
PUSHFD
CALL FAR PTR LabelA
(rdtsc, etc)
LabelA PROC FAR
IRET
LabelA ENDP
i really like that - no more hassling with cpuid
don't have to "warm up" IRET, either - lol
Quote from: Neo on July 12, 2009, 11:40:58 PM
Suppose you were to time a single instruction, which it sounds like you want to. Serializing means that you measure the latency of that instruction; not serializing means that you measure the throughput of the instruction.
Timing a single instruction is not workable by any method that I know of because you cannot reliably isolate a single instruction. For consistency when serializing with CPUID you must control the function number in EAX, and for the CPUID that follows the timed instruction the instruction that sets the function number may or may not execute in parallel with the timed instruction. Without serializing, a similar problem exists for the instructions on both sides of the timed instruction. So depending on whether or not you serialize, the result may represent the latency or throughput of the timed instruction, or it may represent the latency or throughput of the timed instruction plus the latency or throughput of one or both adjacent instructions.
I had my doubts, and it took several hours to fumble my way through this. The code does not fully simulate an INT n, so it likely would not work for a real handler, but within this limited test it appears to work.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
fw FWORD 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
pushfd
mov DWORD PTR fw, handler
mov WORD PTR fw+4, cs
call FWORD PTR fw
inkey "Press any key to exit..."
exit
handler:
pushad
print "in handler",13,10
popad
iretd
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
The remaining problem is that I still can't see any reasonable way to serialize with an IRET.
tell me what instruction executes just prior to RDTSC... (hint - it isn't CALL)
PUSHFD
CALL FAR PTR LabelA
RDTSC
.
.
.
LabelA PROC FAR
IRET
LabelA ENDP
Obviously the IRETD, but where using CPUID for the second serialization the timed instructions could not be isolated from a fast:
xor eax, eax
Using IRETD for the second serialization the timed instructions could not be isolated from a much slower:
pushfd
call FWORD PTR
To me the main incentive for using a different serializing instruction is to completely isolate the timed instructions from the timing instructions, and to do that the serializing instruction must be freestanding.
any place you are inserting PUSHFD/CALL(IRET), is a place where we are currently using an XOR EAX,EAX/CPUID
besides the fact that PUSHFD/CALL(IRET) is always faster than XOR EAX,EAX/CPUID,
there is an added advantage of not having to preserve registers (EBX, etc, as well as the RDTSC values in EDX:EAX)
another advantage: all processors support PUSHFD/CALL(IRET) - some do not support CPUID (we don't have to make that test, either)
one last advantage: PUSHFD/CALL(IRET) is always the same number of clock cycles - CPUID varies, even after warm-up
this is a win-win all the way around
i was thinking of having an empty handler, but you could place the code to be timed in the handler
for the cases where we want to serialize without running the code - use the IRET at the end of the handler to make an empty one
;------------------------------------------------------------
timed PROC FAR
;code to be timed goes here
;----------------------------------------
empty PROC FAR
IRET
empty ENDP
;----------------------------------------
timed ENDP
;------------------------------------------------------------
;to get the overhead and start times:
PUSHFD
CALL FAR PTR empty
RDTSC
PUSH EDX
PUSH EAX
PUSHFD
CALL FAR PTR empty
RDTSC
PUSH EDX
PUSH EAX
;now, to run the timed code:
PUSHFD
CALL FAR PTR timed
RDTSC
;edx:eax now contain the end time
POP EAX
POP EDX
;edx:eax now contain the start time
POP EAX
POP EDX
;edx:eax now contain the overhead reference time
of course, you have to play with the registers and subtract at the end, but you get the idea (that is all after the last RDTSC)
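for example, the arithmetic after the last RDTSC might look something like this (register choices are mine - preserve ESI/EDI around it in a real program):
MOV EBX,EAX ;ECX:EBX = end time
MOV ECX,EDX
POP ESI ;EDI:ESI = start time
POP EDI
SUB EBX,ESI ;ECX:EBX = end - start
SBB ECX,EDI
POP EAX ;EDX:EAX = overhead reference time
POP EDX
SUB ESI,EAX ;EDI:ESI = start - reference = overhead
SBB EDI,EDX
SUB EBX,ESI ;ECX:EBX = net clocks for the timed code
SBB ECX,EDI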
gotta love it ! :U
as an old friend of mine used to say, "That's slicker than hot snot on a glass doorknob!"
of course, he always said that after looking at one of my circuit designs - lol
well - it looked great on paper - lol
calling a far routine isn't as easy as it used to be
i will figure it out and mod the code a bit
i never give up - lol
and
if it doesn't fit, force it; if it breaks, it needed replacement, anyways
In real mode the results may be different, but in protected mode, running on a P3, xor eax,eax | cpuid is significantly faster.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
fw FWORD 0
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
handler:
iretd
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
mov DWORD PTR fw, handler
mov WORD PTR fw+4, cs
invoke Sleep, 4000
REPEAT 10
counter_begin 1000, HIGH_PRIORITY_CLASS
xor eax, eax
cpuid
counter_end
print ustr$(eax)," cycles, xor eax,eax | cpuid",13,10
ENDM
REPEAT 10
counter_begin 1000, HIGH_PRIORITY_CLASS
pushfd
call FWORD PTR fw
counter_end
print ustr$(eax)," cycles, pushfd | call FWORD PTR | iretd",13,10
ENDM
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
79 cycles, xor eax,eax | cpuid
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
107 cycles, pushfd | call FWORD PTR | iretd
The CPUID execution time does vary with the function, but I have never noticed it varying significantly for a given function.
And I'm not convinced that "warming up" the CPUID instruction serves any useful purpose. In the Intel application note PDF where I saw this being done the programmer had failed to control the CPUID function, an obvious error with an effect likely much larger than the lack of a "warm up".
The problem I had trying to make my version work was that I had forgotten about IRETD. No matter how I arranged the stack the IRET would fault (c0000005 (access violation)). And then I noticed the 66h operand-size prefix on the instruction, and knew what the problem was.
this is what i get on my prescott, Michael
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
390 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
396 cycles, xor eax,eax | cpuid
391 cycles, xor eax,eax | cpuid
583 cycles, pushfd | call FWORD PTR | iretd
599 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
585 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
583 cycles, pushfd | call FWORD PTR | iretd
something is fishy - lol
i wonder if the tsc is running with the same multiplier on the prescott
that is a consistent 5:1 ratio compared to yours
btw - if you set the PROC type to "FAR", the CS is pushed for you
pushfd
call timed
.
.
.
timed proc far
iret
timed endp
it took me a while to figure that out - in the old DOS days, i had to use FAR PTR to do the same thing
I think the P4 TSC runs at the internal clock speed, but updates in step with the external clock. The IPC for the P4 is lower than that for the P2/P3 and I think the post-P4 processors, but for at least most of the instructions the difference is nothing like 5:1. I can't recall checking the cycle counts for CPUID on a P4, so for all I know those counts are normal.
well, the machine is quite fast - i am very happy with its performance
it takes roughly 3 seconds to run through an empty loop with ecx set to 0 (1.8e+19 iterations using LOOP)
but the clock cycle counts are always very high compared to all others in the forum
the 5:1 ratio would make sense, as it has a clock multiplier of x15 (bus clock 200 MHz - cpu clock 3 GHz)
i find one place where it mentions different "rate of tick" between CPU's, but it doesn't say much more about it
http://en.wikipedia.org/wiki/Time_Stamp_Counter
any way you look at it...
PUSHFD+CALL far+IRET shouldn't be that bad - lol
you are using the indirect form of CALL - that would account for 2 or 3 clock cycles - not many
ohh - i bet i know what it is - the CPU has to check IRET for privilege level changes
still - no preservation of EBX is required - you could add PUSH EBX/POP EBX into your stream
it is not the absolute number of clocks, but the variations - i can't get any believable numbers out of my machine to play with it
EDIT
i added push ebx/pop ebx - the two methods are roughly the same - and they both jump around on my machine too - lol
dang - i need a way to benchmark code
There appears to be a lot of variation between processors in the cycle counts for the two instruction sequences. This is for an old AMD K5:
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
please run this for me Michael - i would like to see some real numbers - lol
i can't believe anything my machine tells me
P3:
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
K5:
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
thanks Michael
i guess we can put iret to rest, then - lol
it sounded so good - had to try
Hi,
What if you do something like:
pushfd | call (far) | pushfd | call (far) | iretd | work | iretd
to keep the call far timings out of the loop?
Pondering,
Steve N.
it is the IRET instruction that sucks up the clock cycles, i think
that is because the CPU has to check the privilege level of the return-to segment for exceptions
IRET needs an offset, segment, and flags on the stack - pushfd | call far is the fastest way to put them there
well - we could revector an interrupt and use INT nn, but that isn't going to be much faster and requires ring 0, i think
i am a bit of a n00b, so i am not sure about revectoring interrupts in protected mode windows - lol
might be able to use INT3 - that is special
Quote
pushfd | call (far) | pushfd | call (far) | iretd | work | iretd
The IRETDs would execute in the order called, even if they were physically separate.
Quote from: dedndave on July 14, 2009, 02:17:51 PM
it is the IRET instruction that sucks up the clock cycles, i think
that is because the CPU has to check the privilege level of the return-to segment for exceptions
Well that sorta was the question, does the CPU check the CALL segment/selector
in the same way?
Quote
IRET needs an offset, segment, and flags on the stack - pushfd | call far is the fastest way to put them there
Or just push them and avoid the calls altogether.
Quote
well - we could revector an interrupt and use INT nn, but that isn't going to be much faster and requires ring 0, i think
i am a bit of a n00b, so i am not sure about revectoring interrupts in protected mode windows - lol
might be able to use INT3 - that is special
Sounds like too much that way, unless there is some default set up. (I don't think so.)
Regards,
Steve N.
Quote
Or just push them and avoid the calls altogether.
that would be pushfd | push cs | push offset | jmp
INT3 is probably the best utilization of IRET
but, i doubt it is worth the effort because the CPUID method would still be faster by a few cycles
Quote from: dedndave on July 14, 2009, 03:40:03 PM
that would be pushfd | push cs | push offset | jmp
I would think
pushfd | push cs | push offset | iretd
to save a jump.
Steve N.
i see what you mean - that might actually be close to the cpuid method for time
i think it will be very close, but cpuid wins by 1 or 2 clocks - lol
pushfd
push cs
push LabelA
iret
LabelA: rdtsc
here - try it out - i get 9 clocks difference - better by 1 clock cycle
i have to calculate that because of my oddball cpu - lol
it would be good to see some real numbers posted
Hi,
For an AMD processor.
F:\>time6
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
104 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
104 cycles, pushfd | push cs | push offset | iretd
Press any key to exit...
Oh, well. I'm assuming that's from something like this?
pushfd
push cs
push LabelB
pushfd
push cs
push LabelA
iret
LabelA: rdtsc
mov ebx,eax
iret
LabelB: rdtsc
Regards,
Steve N.
Edit: Had the labels wrong...
Thanks, Steve
it is this sequence repeated several times:
pushfd
push cs
push $+6 ;return address = the byte after the iret below (push imm32 is 5 bytes, iret is 1)
db 0cfh ;iret
which is equivalent to:
pushfd
push cs
push LabelX
iret
LabelX:
it does not include the RDTSC inst, as that is common to both methods
the other one is:
push ebx
xor eax,eax
cpuid
pop ebx
i have a couple ideas to speed it up, but the same ideas could be applied to CPUID, as well - lol