Two Instructions in One Atomic Operation

Started by Neo, July 12, 2009, 02:19:29 AM

dedndave

any way you look at it...
PUSHFD+CALL far+IRET shouldn't be that bad - lol
you are using the indirect form of CALL - that would account for 2 or 3 clock cycles - not many
ohh - i bet i know what it is - the CPU has to check IRET for privilege level changes
still - no preservation of EBX is required - you could add PUSH EBX/POP EBX into your stream
it is not the absolute number of clocks, but the variations - i can't get any believable numbers out of my machine to play with it

EDIT
i added push ebx/pop ebx - the two methods are roughly the same - and they both jump around on my machine too - lol
dang - i need a way to benchmark code

MichaelW

There appears to be a lot of variation between processors in the cycle counts for the two instruction sequences. This is for an old AMD K5:

14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
14 cycles, xor eax,eax | cpuid
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd
30 cycles, pushfd | call FWORD PTR | iretd


eschew obfuscation

dedndave

please run this for me Michael - i would like to see some real numbers - lol
i can't believe anything my machine tells me


[attachment deleted by admin]

MichaelW

P3:

80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
80 cycles, push ebx | xor eax,eax | cpuid | pop ebx
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd
90 cycles, pushfd | call (far) | iretd

K5:

14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
14 cycles, push ebx | xor eax,eax | cpuid | pop ebx
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd
32 cycles, pushfd | call (far) | iretd
29 cycles, pushfd | call (far) | iretd

dedndave

thanks Michael
i guess we can put iret to rest, then - lol
it sounded so good - had to try

FORTRANS

Hi,

   What if you do something like:

pushfd | call (far) | pushfd | call (far) | iretd | work | iretd

to keep the call far timings out of the loop?

Pondering,

Steve N.

dedndave

it is the IRET instruction that sucks up the clock cycles, i think
that is because the CPU has to check the privilege level of the return-to segment for exceptions
IRET needs an offset, segment, and flags on the stack - pushfd | call far is the fastest way to put them there
well - we could revector an interrupt and use INT nn, but that isn't going to be much faster and requires ring 0, i think
i am a bit of a n00b, so i am not sure about revectoring interrupts in protected mode windows - lol
might be able to use INT3 - that is special

MichaelW

Quote
pushfd | call (far) | pushfd | call (far) | iretd | work | iretd
The IRETDs would execute in the order called, even if they were physically separate.

FORTRANS

Quote from: dedndave on July 14, 2009, 02:17:51 PM
it is the IRET instruction that sucks up the clock cycles, i think
that is because the CPU has to check the privilege level of the return-to segment for exceptions

   Well that sorta was the question, does the CPU check the CALL segment/selector
in the same way?

Quote
IRET needs an offset, segment, and flags on the stack - pushfd | call far is the fastest way to put them there

   Or just push them and avoid the calls altogether.

Quote
well - we could revector an interrupt and use INT nn, but that isn't going to be much faster and requires ring 0, i think
i am a bit of a n00b, so i am not sure about revectoring interrupts in protected mode windows - lol
might be able to use INT3 - that is special

   Sounds like too much that way, unless there is some default setup.  (I don't
think so.)

Regards,

Steve N.

dedndave

Quote
Or just push them and avoid the calls altogether.

that would be pushfd | push cs | push offset | jmp
INT3 is probably the best utilization of IRET
but, i doubt it is worth the effort because the CPUID method would still be faster by a few cycles

FORTRANS

Quote from: dedndave on July 14, 2009, 03:40:03 PM
that would be pushfd | push cs | push offset | jmp

   I would think

pushfd | push cs | push offset | iretd

to save a jump.

Steve N.

dedndave

i see what you mean - that might actually be close to the cpuid method for time
i think it will be very close, but cpuid wins by 1 or 2 clocks - lol

        pushfd
        push    cs
        push    LabelA
        iret
LabelA: rdtsc


dedndave

here - try it out - i get 9 clocks difference - better by 1 clock cycle
i have to calculate that because of my oddball cpu - lol
it would be good to see some real numbers posted

[attachment deleted by admin]

FORTRANS

Hi,

   For an AMD processor.

F:\>time6
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
45 cycles, push ebx | xor eax,eax | cpuid | pop ebx
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
104 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
103 cycles, pushfd | push cs | push offset | iretd
104 cycles, pushfd | push cs | push offset | iretd
Press any key to exit...


   Oh, well.  I'm assuming that's from something like?


        pushfd
        push    cs
        push    LabelB
        pushfd
        push    cs
        push    LabelA
        iret
LabelA: rdtsc
        mov     ebx,eax
        iret
LabelB: rdtsc


Regards,

Steve N.

Edit:  Had the labels wrong...

dedndave

Thanks, Steve
it is this sequence repeated several times:

        pushfd
        push cs
        push $+6    ;address just past the iret (5-byte push + 1-byte iret)
        db 0cfh ;iret

which is equivalent to:

        pushfd
        push cs
        push LabelX
        iret
LabelX:

it does not include the RDTSC inst, as that is common to both methods
the other one is:

        push ebx
        xor  eax,eax
        cpuid
        pop  ebx

i have a couple ideas to speed it up, but the same ideas could be applied to CPUID, as well - lol
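
For context, here is a minimal sketch of how either sequence would sit in front of RDTSC in a timing routine. It is not taken from the deleted attachments: the use of ESI to hold the start count and the label name AfterIret are assumptions for illustration only, and it relies on both CPUID and IRETD being serializing instructions.

```asm
        ; Method 1: CPUID serializes, then RDTSC reads the counter
        push    ebx                 ; CPUID overwrites EBX
        xor     eax,eax             ; CPUID function 0
        cpuid
        pop     ebx
        rdtsc                       ; EDX:EAX = time-stamp counter
        mov     esi,eax             ; ESI (assumed) keeps the low dword of the start count

        ; ... code being timed goes here ...

        ; Method 2: IRETD serializes, then RDTSC reads the counter
        pushfd                      ; EFLAGS image for IRETD to pop
        push    cs                  ; code selector for IRETD to pop
        push    offset AfterIret    ; return EIP for IRETD to pop
        iretd                       ; "returns" to the very next instruction
AfterIret:
        rdtsc
        sub     eax,esi             ; low dword of the elapsed count
```

A real test would use the same method at both ends and subtract the overhead of an empty run, which appears to be what the cycle counts posted above measure.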