The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: dedndave on September 24, 2009, 12:36:21 PM

Title: STD instruction
Post by: dedndave on September 24, 2009, 12:36:21 PM
am i doing something wrong, or is STD one slow-ass instruction ? - lol - CLD seems to be fast enough
i have it at about 215 cycles on a p4 prescott
the following code is only about 25 cycles faster

        pushfd
        pop     eax
        or      eax,400h
        push    eax
        popfd

EDIT - i am about to write a loop to do a "manual" reverse scan - lol
Title: Re: STD instruction
Post by: MichaelW on September 24, 2009, 12:47:52 PM
Is it possible that your code is triggering an exception? On my P3 I get 13 cycles total for a STD followed by a CLD.
Title: Re: STD instruction
Post by: dedndave on September 24, 2009, 12:49:01 PM
it functions ok
i dunno what kind of exception it would generate ???  :eek
Title: Re: STD instruction
Post by: dedndave on September 24, 2009, 12:52:00 PM
i am using it to scan a bignum from the top down - to skip over unused bytes (FF's for negative - 0's for positive and unsigned)
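A minimal sketch of that top-down scan, assuming a little-endian positive value whose unused high bytes are zero. The `bignum` buffer and `BIGLEN` are made up for illustration, and a negative value would load 0FFh into AL instead.

```asm
    include \masm32\include\masm32rt.inc

    .data
; hypothetical 8-byte little-endian value: 3 significant bytes, 5 unused zero bytes
bignum  db 12h,34h,56h,0,0,0,0,0
BIGLEN  equ $-bignum

    .code
start:
    mov     edi,offset bignum+BIGLEN-1  ; point at the most significant byte
    mov     ecx,BIGLEN                  ; maximum number of bytes to examine
    xor     eax,eax                     ; AL = 0, the filler byte for positive values
    std                                 ; scan downward
    repe    scasb                       ; stops after the first byte that differs from AL
    cld                                 ; clear DF again (CLD leaves ZF alone)
    je      @F                          ; ZF set: every byte matched, length is 0
    inc     ecx                         ; include the byte that stopped the scan
@@:
    print   ustr$(ecx),13,10            ; significant length in bytes, 3 for the sample data
    exit
end start
```

For signed values a real routine probably also has to keep one filler byte whenever the top bit of the first significant byte disagrees with the sign. That step is left out here.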
Title: Re: STD instruction
Post by: MichaelW on September 24, 2009, 01:04:16 PM
I can't actually recall ever seeing an exception caused by leaving the direction flag set, but I have seen my application die because of it. For example, this code:

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    invoke Sleep, 3000
   
    counter_begin 1000, HIGH_PRIORITY_CLASS
        std
        cld
    counter_end
    print ustr$(eax),13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


Runs OK as is, but if I comment out the CLD, then it dies just as it starts displaying the results.
Title: Re: STD instruction
Post by: dedndave on September 24, 2009, 01:09:12 PM
on mine - lol - it continues to run
- displays the data
- but no cr/lf - lol - strange
Title: Re: STD instruction
Post by: MichaelW on September 24, 2009, 01:15:35 PM
I'm running Windows 2000. Perhaps on your system it continues to run because Windows is detecting the problem and correcting it, and that accounts for the lost cycles. Or perhaps the direction flag has been virtualized.

Edit: "virtualized" is the wrong term. What I mean is that Windows may be actively managing the direction flag to prevent problems with it being left set during a call to a CRT or API function that expects it to be clear.
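One way to stay out of that trouble is to keep the DF=1 window as small as possible and clear the flag before anything else gets called. A sketch, assuming a hypothetical helper `FillDown` and a made-up `buffer`:

```asm
    include \masm32\include\masm32rt.inc

    .data
buffer  db 16 dup("."),0

    .code
; hypothetical helper: stores nBytes copies of "x", working downward from pLast
FillDown proc uses edi pLast:DWORD, nBytes:DWORD
    mov     edi,pLast
    mov     ecx,nBytes
    mov     al,"x"
    std                         ; DF = 1 only for the string instruction itself
    rep     stosb
    cld                         ; back to DF = 0 before returning to the caller
    ret
FillDown endp

start:
    invoke  FillDown, offset buffer+15, 8
    print   offset buffer,13,10 ; safe: DF is clear again when StdOut is called
    inkey   "Press any key to exit..."
    exit
end start
```

Nothing between the STD and the CLD calls into the CRT or the API, so neither ever sees the flag set.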
Title: Re: STD instruction
Post by: dedndave on September 24, 2009, 01:16:35 PM
this is odd also...

std
cld
220 cycles

cld
5 cycles

cld
cld
100 cycles

i will play with the instruction placement
Title: Re: STD instruction
Post by: dedndave on September 24, 2009, 03:30:08 PM
this really sux, but....
the best solution seems to be

        mov     eax,400h
        pushfd
        or      [esp],eax
        popfd

all of that is faster than

        std
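To compare the two ways of setting the flag side by side, MichaelW's test program from earlier in the thread can be extended with a second timed block. This is only a sketch built on his macros. The label strings are made up, and the CLD in each block is there so the flag is clear again before `print` runs.

```asm
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm

    .code
start:
    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
        std
        cld
    counter_end
    print ustr$(eax)," cycles, std + cld",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
        mov     eax,400h        ; DF is bit 10 of EFLAGS
        pushfd
        or      [esp],eax       ; set DF in the saved copy
        popfd
        cld
    counter_end
    print ustr$(eax)," cycles, pushfd/or/popfd + cld",13,10

    inkey "Press any key to exit..."
    exit
end start
```

Both blocks pay for the same CLD, so the difference between the two numbers is roughly the difference between the two ways of setting DF.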
Title: Re: STD instruction
Post by: drizz on September 25, 2009, 10:45:26 AM
i think the best solution is not using std  :lol


direction_N=4

_std macro
  direction_N=-4
endm

_cld macro
  direction_N=4
endm

_lodsd macro
  mov eax,[esi]
  add esi,direction_N
endm

_stosd macro
  mov [edi],eax
  add edi,direction_N
endm

_lodsb macro
  mov al,[esi]
  add esi,direction_N/4
endm

_stosb macro
  mov [edi],al
  add edi,direction_N/4
endm

_lodsw macro
  mov ax,[esi]
  add esi,direction_N/2
endm

_stosw macro
  mov [edi],ax
  add edi,direction_N/2
endm
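A short usage sketch for these macros, copying four dwords from the top down. The `src` and `dst` arrays are made up for illustration. Note that `direction_N` is an assembly-time value, so `_std` and `_cld` affect the macros that follow them in the source text, not the ones that happen to run later.

```asm
    include \masm32\include\masm32rt.inc

direction_N = 4

_std macro
  direction_N = -4
endm

_cld macro
  direction_N = 4
endm

_lodsd macro
  mov eax,[esi]
  add esi,direction_N
endm

_stosd macro
  mov [edi],eax
  add edi,direction_N
endm

    .data
src dd 1,2,3,4
dst dd 4 dup(0)

    .code
start:
    mov     esi,offset src+12   ; last dword of the source
    mov     edi,offset dst+12   ; last dword of the destination
    mov     ecx,4
    _std                        ; macros below this line step by -4
@@:
    _lodsd
    _stosd
    dec     ecx
    jnz     @B
    _cld                        ; macros below this line step by +4 again
    print   ustr$(dst+12),13,10 ; last element of the copy
    exit
end start
```

The real direction flag is never touched, so there is nothing to restore before calling `print`.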

Title: Re: STD instruction
Post by: dedndave on September 25, 2009, 03:29:07 PM
well - i tried that Drizz
i am glad to see someone confirm my grief, though - lol
you see how i am using it
i have found a very good solution for this application
1) measure the integer length where using scasb to reduce length yields an advantage
2) skip it for shorter integers - just go ahead and evaluate the unused bytes   :U
by branching around the std/repz scasb/cld - we speed it up for short values
using the loop method works well if there aren't many unused bytes
if there are a lot - scasb kicks butt
what i may do is sample the length where there is an advantage and repz scasb in the up direction - tricky, huh   :U
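A rough sketch of branching around the std/repz scasb/cld for short values. It is one possible reading of the idea, not the actual bignum routine: the name `TrimLength`, the `sample` data and the cut-over value `SCAN_THRESHOLD` are all assumptions, and the threshold would have to be measured on the target machine.

```asm
    include \masm32\include\masm32rt.inc

SCAN_THRESHOLD equ 16       ; hypothetical cut-over length, to be found by timing

    .data
sample  db 78h,56h,34h,12h, 28 dup(0)   ; 32 bytes, little-endian, 4 significant

    .code
; returns in EAX the length of the value at pNum without its unused zero bytes
TrimLength proc uses edi pNum:DWORD, nBytes:DWORD
    mov     ecx,nBytes
    mov     edi,pNum
    lea     edi,[edi+ecx-1]     ; most significant byte
    xor     eax,eax             ; filler byte for positive values
    cmp     ecx,SCAN_THRESHOLD
    jb      byte_loop           ; short value: avoid STD/CLD altogether
    std
    repe    scasb
    cld
    je      finished            ; all bytes were filler, ECX = 0
    inc     ecx                 ; include the byte that stopped the scan
    jmp     finished
byte_loop:
    test    ecx,ecx
    jz      finished
@@:
    cmp     byte ptr [edi],al
    jne     finished            ; ECX = number of bytes still in use
    dec     edi
    dec     ecx
    jnz     @B
finished:
    mov     eax,ecx
    ret
TrimLength endp

start:
    invoke  TrimLength, offset sample, 32   ; long enough for the scasb path
    print   ustr$(eax),13,10
    invoke  TrimLength, offset sample, 8    ; takes the plain loop
    print   ustr$(eax),13,10
    exit
end start
```

Both calls should report the same length for the sample data. Only the path taken differs.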
Title: Re: STD instruction
Post by: herge on September 25, 2009, 07:29:40 PM

Hi *.*:

If I remember correctly, the Windows OS likes the direction flag up, i.e.
forward; it will react badly to a direction flag down.
Translation: it will crash, or maybe hang, or send a message to
Redmond, Washington, USA, to our friends at the Big M.

Always put direction flag up for Windows!

Regards: herge
Title: Re: STD instruction
Post by: dedndave on September 25, 2009, 08:19:55 PM
yah - we got that herge
we want to set it down temporarily
it is just very slow
i am guessing that the OS traps that instruction for some reason
Title: Re: STD instruction
Post by: herge on September 26, 2009, 03:56:36 AM
Hi dedndave:

This is a Windows driver site based, I believe, in New Hampshire, USA.
I find it has useful info on WinDbg from Microsoft.

http://www.osronline.com

Regards: herge
Title: Re: STD instruction
Post by: sinsi on September 26, 2009, 05:09:50 AM
About the only thing the intel docs say could be a problem is a partial flag register stall
Title: Re: STD instruction
Post by: dedndave on September 26, 2009, 08:29:35 AM
well - i tried repositioning the instruction in several places - no help
Title: Re: STD instruction
Post by: dedndave on September 26, 2009, 01:48:56 PM
well - i am wondering if this is a P4 only issue
Michael already assured us that it isn't a problem on P3's
what about the newer processors ? (duos quads etc)
Title: Re: STD instruction
Post by: Magnum on September 26, 2009, 02:39:23 PM
Quote from: dedndave on September 24, 2009, 01:16:35 PM
this is odd also...

std
cld
220 cycles

cld
5 cycles

cld
cld
100 cycles

i will play with the instruction placement

Something does not make any sense.
Using 2 cld statements caused a 20 fold increase in cycles?

Andy
Title: Re: STD instruction
Post by: dedndave on September 26, 2009, 04:24:24 PM
i know - lol
funky, huh
if you have a p4 processor - try it out
i am using xp mce2005 (pretty much the same as xp pro), sp2
i have a p4 prescott cpu
Title: Re: STD instruction
Post by: Magnum on September 26, 2009, 05:26:51 PM
I don't think that the std instruction is the problem.

I have several 16 bit programs that run fast that use that instruction.

Andy
Title: Re: STD instruction
Post by: dedndave on September 26, 2009, 05:58:48 PM
we are talking 32-bit code
apples and oranges
it has been confirmed by others - at least on a p4
Title: Re: STD instruction
Post by: FORTRANS on September 26, 2009, 08:36:49 PM
Hi,

   Have you booted to another OS, or is this only with one specific
OS?  I.e. is this a processor or OS problem?

Regards,

Steve N.
Title: Re: STD instruction
Post by: dedndave on September 26, 2009, 11:15:16 PM
no - i haven't Steve
i have too much crapolla on my drives at the moment, so it isn't practical for me to mess with that
i was hoping a few others might try it out in here
MichaelW says it is no problem for him - he is using a p3 under win2K, i think
i am only guessing that it is just one more "p4 handicap" to go with all the rest - lol
or - maybe the OS traps that instruction so it knows the direction has been changed
if that were the case, it shouldn't hiccup when you leave the flag set
who knows - i have a good work-around in mind, at least
Title: Re: STD instruction
Post by: jj2007 on September 26, 2009, 11:16:21 PM
Dave, these are Celeron M Win XP SP2 values:

13      cycles for std cld
6       cycles for cld
13      cycles for cld cld
Title: Re: STD instruction
Post by: dedndave on September 26, 2009, 11:20:08 PM
thanks Jochen
if i am not mistaken, a celeron is derived from a p4, no ?
Title: Re: STD instruction
Post by: jj2007 on September 26, 2009, 11:24:36 PM
Quote from: dedndave on September 26, 2009, 11:20:08 PM
thanks Jochen
if i am not mistaken, a celeron is derived from a p4, no ?

The Celeron M "Yonah" is a Core but not Core Duo. Definitely later than P4.
Title: Re: STD instruction
Post by: Astro on September 27, 2009, 08:48:11 PM
Hi,

Can you post your full code that you wrote? I'll test here.

Core2 Duo E6700, Win XP Pro SP3 and Vista Ultimate SP2.

Best regards,
Robin.
Title: Re: STD instruction
Post by: hutch-- on September 28, 2009, 12:22:58 AM
Dave,

If you can pop a small test piece, I have a real single core PIV 3.8 running win2k and a core series quad running XP sp3 to test it on.
Title: Re: STD instruction
Post by: dedndave on September 28, 2009, 07:47:23 AM
i attempted to make a simple timing program
the problem does not arise
the time i was getting was from the initialization section of my bignum to ascii routine
i measured the entire init code at ~245 cycles with std (i had commented out the repz scasb)
then, when the std was commented out, i measured about 30 cycles
thus, my conclusion that std was slow
this damn machine gives me such odd numbers
they jump around a lot too - very difficult for me to time things and learn optimization
so - now i have to go and figure out what other instructions, combined with std, are giving me trouble

as another example of my machine's inconsistency.....
i have a multiple-precision multiply-by-constant-to-divide snippet
in it, there are 5 large constant values (3 actually - 2 of them are the same value loaded into register twice)
the last one wants to be loaded as an immediate value "mov     edx,3906250"
but, with the others, i have placed the constant on the stack frame, and can load them via "mov     edx,[ebp-20]" or similar
so - loading the other 4 constants as either immediates, or from the stack frame, yields wide and varied results
4 constants - 2 ways to load - 16 possible combinations
the snippet can take from ~40 to ~80 cycles, depending on how i load these variables
if i load them all as immediates, it is the 80 cycles
if i load them all from the stack, it is the 80 cycles
if i load 2 of them from the stack frame and 2 of them as immediates, i get the ~40
slightly better results are obtained if i load the two constants immediate one time and off the stack the other
other combinations aren't as good
i have also tried pushing them, as well as a few other methods of loading them
i isolated that one piece of code and selected the loads that yielded the best times
i also re-ordered several instructions several ways to try and get the best time
then - put the code back in the loop and got the worst time ever - lol
i feel like i have to be fricken Carnac the Magnificent to optimize code - lol

(http://www.delawareliberal.net/wp-content/uploads/2009/05/carnac.jpg)
Pee Wee Herman, Michael Jackson, and Tom Cruise.......

(name two fruits and a vegetable)
Title: Re: STD instruction
Post by: dedndave on September 28, 2009, 01:42:38 PM
ok guys - i got a test that shows the issue on my machine...

CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2

...CLD
49      clock cycles
49      clock cycles
49      clock cycles

CLD...CLD
104     clock cycles
104     clock cycles
104     clock cycles

STD...CLD
239     clock cycles
238     clock cycles
238     clock cycles

program and source attached...
Title: Re: STD instruction
Post by: MichaelW on September 28, 2009, 02:34:49 PM

CPU 0: Fam 6 Mod 7 xFam 0 xMod 0 Type 0 Step 3 MMX SSE Cores: 1

...CLD
10      clock cycles
10      clock cycles
10      clock cycles

CLD...CLD
18      clock cycles
18      clock cycles
18      clock cycles

STD...CLD
18      clock cycles
18      clock cycles
18      clock cycles


It occurred to me that the problem could be due to an "errata" in your processor that the BIOS is correcting by applying a micro-code patch.

http://cseweb.ucsd.edu/~calder/papers/ICCD-06-HWPatch.pdf
Title: Re: STD instruction
Post by: dedndave on September 28, 2009, 03:16:15 PM
you may be right, Michael - that would make sense
i didn't see a specific mention of STD in that paper - that doesn't mean it isn't one of the cases he is speaking of
i was thinking it may be an overhead problem in the out-of-order operation scheme
the processor is looking through the code-stream for something else to chew on
and - STD causes that mechanism to operate more slowly due to the number of possible affected instructions
i am not a big fan of out-of-order instructions - lol
it kind of takes some of the fun out of programming in assembler, even if it does speed things up

EDIT - btw - that guy has several other interesting papers, as well - great site   :U

EDIT again - if i run the STD in a timer loop alone - i get 13 cycles (CLD after the timer loop)
Title: Re: STD instruction
Post by: dedndave on September 28, 2009, 06:00:50 PM
Hutch ? Astro ? Jochen ?
i thought you guys wanted to run this for me
i linked it again in case you missed that post
thanks

http://www.masm32.com/board/index.php?action=dlattach;topic=12368.0;id=6694
Title: Re: STD instruction
Post by: cobold on September 28, 2009, 07:26:12 PM
Dave,

I don't think this will help you, but I ran dftime on my machine (WIN XP SP2) with following "funny" results:


CPU 0: AMD Athlon(TM) XP 2000+ MMX+ SSE 3DNow!+ Cores: 1

...CLD
1       clock cycles
0       clock cycles
0       clock cycles

CLD...CLD
2       clock cycles
2       clock cycles
2       clock cycles

STD...CLD
0       clock cycles
0       clock cycles
0       clock cycles

Press any key to continue ...


rgds
cobold
Title: Re: STD instruction
Post by: dedndave on September 28, 2009, 07:56:53 PM
actually, it does help - lol
i can see that it is fast in all three cases
Title: Re: STD instruction
Post by: jj2007 on September 28, 2009, 08:22:27 PM
Quote from: dedndave on September 28, 2009, 06:00:50 PM
Hutch ? Astro ? Jochen ?
i thought you guys wanted to run this for me
i linked it again in case you missed that post
thanks

http://www.masm32.com/board/index.php?action=dlattach;topic=12368.0;id=6694

Here it is :bg

CPU 0: Intel(R) Celeron(R) M CPU        420  @ 1.60GHz MMX SSE3 Cores: 1

...CLD
10      clock cycles
10      clock cycles
10      clock cycles

CLD...CLD
19      clock cycles
19      clock cycles
19      clock cycles

STD...CLD
19      clock cycles
19      clock cycles
19      clock cycles
Title: Re: STD instruction
Post by: BlackVortex on September 28, 2009, 08:50:24 PM
CPU 0: Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz MMX SSE4.1 Cores: 2

...CLD
1       clock cycles
1       clock cycles
1       clock cycles

CLD...CLD
4       clock cycles
4       clock cycles
4       clock cycles

STD...CLD
52      clock cycles
52      clock cycles
52      clock cycles
Title: Re: STD instruction
Post by: FORTRANS on September 28, 2009, 10:27:56 PM

CPU 0: Intel(R) Pentium(R) M processor 1.70GHz MMX SSE2 Cores: 1

...CLD
10   clock cycles
10   clock cycles
10   clock cycles

CLD...CLD
19   clock cycles
19   clock cycles
19   clock cycles

STD...CLD
19   clock cycles
19   clock cycles
19   clock cycles

Press any key to continue ...
CPU 0: Fam 6 Mod 8 xFam 0 xMod 0 Type 0 Step 3 MMX SSE Cores: 1

...CLD
10   clock cycles
10   clock cycles
10   clock cycles

CLD...CLD
18   clock cycles
18   clock cycles
18   clock cycles

STD...CLD
18   clock cycles
18   clock cycles
18   clock cycles

Press any key to continue ...


HTH,

Steve N.
Title: Re: STD instruction
Post by: dedndave on September 29, 2009, 12:35:48 AM
thanks all
i am surprised to see it act up on the core duo
Title: Re: STD instruction
Post by: sinsi on September 29, 2009, 12:54:18 AM
Added a couple of tests, just for completeness.

CPU 0: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz MMX SSSE3 Cores: 4

...STD
15      clock cycles
15      clock cycles
15      clock cycles

...CLD
2       clock cycles
2       clock cycles
2       clock cycles

CLD...CLD
6       clock cycles
6       clock cycles
6       clock cycles

STD...CLD
53      clock cycles
52      clock cycles
52      clock cycles

STD...STD
28      clock cycles
28      clock cycles
28      clock cycles
Title: Re: STD instruction
Post by: dedndave on September 29, 2009, 01:39:39 AM
thanks Sinsi
50 cycles is still longer than it should be - although not nearly as bad as my p4 prescott - lol
Title: Re: STD instruction
Post by: ThexDarksider on October 13, 2009, 12:45:20 PM
I ran that too.

CPU 0: Intel(R) Atom(TM) CPU N270   @ 1.60GHz MMX SSSE3 Cores: 2

...CLD
30      clock cycles
19      clock cycles
22      clock cycles

CLD...CLD
30      clock cycles
30      clock cycles
29      clock cycles

STD...CLD
89      clock cycles
87      clock cycles
90      clock cycles


My CPU usage is at 30% minimal, mostly around 45%. If that matters, lol.

What are STD and CLD for, anyway? Are they for hooking interrupts or something?
Title: Re: STD instruction
Post by: Magnum on October 13, 2009, 01:30:02 PM
STD sets the direction flag and CLD clears the direction flag.

They are commonly used when searching.

Andy
Title: Re: STD instruction
Post by: FORTRANS on October 13, 2009, 01:41:37 PM
Hi,

   The direction flag controls how the string instructions are used.
The string instructions use DI (EDI) and SI (ESI) to access memory.


        MOVSB           ; Move String Byte.

Is equivalent to

        MOV     BYTE PTR [DI],[SI]      ; Move (copy) the byte from where DS:SI
                                        ; points to where ES:DI is pointing.
                                        ; Of course that kind of move is normally illegal,
                                        ; which is why the MOVS is useful.
        INC     DI      ; If the direction flag is clear.
        INC     SI      ; If the flag is set these would be DECrements.
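A small sketch of where the decrementing form earns its keep: shifting a string up by one byte inside its own buffer. The regions overlap, so the copy has to run from the top down. The `text` buffer is made up for illustration.

```asm
    include \masm32\include\masm32rt.inc

    .data
text    db "ABCDEFGH",0,0       ; one spare byte so the string can grow by one

    .code
start:
    mov     esi,offset text+7   ; last source byte ("H")
    mov     edi,offset text+8   ; last destination byte
    mov     ecx,8               ; bytes to move
    std                         ; ESI and EDI are decremented after each byte
    rep     movsb
    cld                         ; clear DF before calling anything else
    print   offset text,13,10   ; AABCDEFGH
    exit
end start
```

With the flag clear, the same REP MOVSB started from the bottom would overwrite each source byte before it was read and smear "A" across the whole buffer.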


HTH,

Steve N.
Title: Re: STD instruction
Post by: ThexDarksider on October 13, 2009, 04:39:48 PM
Oh I think I understand now. :bg
Title: Re: STD instruction
Post by: Astro on October 13, 2009, 04:44:23 PM
Sorry - chaos here as usual...

CPU 0: Intel(R) Core(TM)2 CPU          6700  @ 2.66GHz MMX SSSE3 Cores: 2

...CLD
2       clock cycles
2       clock cycles
2       clock cycles

CLD...CLD
6       clock cycles
6       clock cycles
6       clock cycles

STD...CLD
52      clock cycles
52      clock cycles
52      clock cycles


STD...CLD pair is a bit variable - between 52 and 56 (e.g. 56, 52, 52).

Best regards,
Astro.
Title: Re: STD instruction
Post by: dsouza123 on November 01, 2009, 03:40:06 PM

CPU 0: Intel(R) Core(TM)2 Duo CPU     P9500  @ 2.53GHz MMX SSE4.1 Cores: 2

...CLD
1       clock cycles
1       clock cycles
1       clock cycles

CLD...CLD
4       clock cycles
4       clock cycles
4       clock cycles

STD...CLD
52      clock cycles
53      clock cycles
51      clock cycles


and


CPU 0: AMD Athlon(tm) Processor MMX+ 3DNow!+ Cores: 1

...CLD
1       clock cycles
1       clock cycles
1       clock cycles

CLD...CLD
2       clock cycles
2       clock cycles
2       clock cycles

STD...CLD
1       clock cycles
0       clock cycles
0       clock cycles
Title: Re: STD instruction
Post by: UtillMasm on November 02, 2009, 04:58:21 AM
and
C:\mytest>DFtime.exe
CPU 0: Genuine Intel(R) CPU           T2400  @ 1.83GHz MMX SSE3 Cores: 2

...CLD
10      clock cycles
10      clock cycles
10      clock cycles

CLD...CLD
20      clock cycles
20      clock cycles
20      clock cycles

STD...CLD
20      clock cycles
20      clock cycles
20      clock cycles

Press any key to continue ...

C:\mytest>
Title: Re: STD instruction
Post by: frktons on August 31, 2010, 02:03:36 AM
This is quite an old thread, but, I've to say it is the first time a testing
routine gives me the correct SSE version, so here I post nevertheless.

Compliments, Dave, you got my correct CPU  :U

CPU 0: Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz MMX SSSE3 Cores: 2

...CLD
2       clock cycles
2       clock cycles
2       clock cycles

CLD...CLD
6       clock cycles
6       clock cycles
6       clock cycles

STD...CLD
52      clock cycles
52      clock cycles
52      clock cycles

Press any key to continue ...


By the way, what happened to this preliminary version?
Is it still preliminary or is it now a grown one?

Frank
Title: Re: STD instruction
Post by: dedndave on August 31, 2010, 04:09:57 AM
if you use the forum search tool, you can find newer versions
however, i had to put the overall project on hold
a complete implementation requires that i learn KMD's - it's on my list  :bg
Title: Re: STD instruction
Post by: mineiro on August 31, 2010, 06:04:04 AM

CPU 0: Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz MMX SSSE3 Cores: 2

...CLD
2       clock cycles
2       clock cycles
2       clock cycles

CLD...CLD
6       clock cycles
6       clock cycles
6       clock cycles

STD...CLD
52      clock cycles
52      clock cycles
52      clock cycles

Press any key to continue ...
Title: Re: STD instruction
Post by: Rockoon on August 31, 2010, 06:49:36 AM
Things haven't changed on AMDs:

CPU 0: AMD Phenom(tm) II X6 1055T Processor MMX+ SSE4a 3DNow!+ Cores: 6

...CLD
-1      clock cycles
0       clock cycles
-1      clock cycles

CLD...CLD
-1      clock cycles
-1      clock cycles
-1      clock cycles

STD...CLD
-1      clock cycles
-1      clock cycles
-1      clock cycles
Title: Re: STD instruction
Post by: KeepingRealBusy on August 31, 2010, 04:16:04 PM
Quote from: dedndave on September 28, 2009, 01:42:38 PM
ok guys - i got a test that shows the issue on my machine...

CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2

...CLD
49      clock cycles
49      clock cycles
49      clock cycles

CLD...CLD
104     clock cycles
104     clock cycles
104     clock cycles

STD...CLD
239     clock cycles
238     clock cycles
238     clock cycles

program and source attached...

Dave,

Here are my times on my P4:


CPU 0: Intel(R) Pentium(R) 4 CPU 3.20GHz MMX SSE2 Cores: 2

...CLD
45      clock cycles
45      clock cycles
45      clock cycles

CLD...CLD
93      clock cycles
93      clock cycles
93      clock cycles

STD...CLD
93      clock cycles
94      clock cycles
94      clock cycles

Press any key to continue ...


Dave
Title: Re: STD instruction
Post by: xandaz on September 04, 2010, 11:33:48 PM
   May i ask what you use to check how many cycles are spent?
   Thanks and bye
Title: Re: STD instruction
Post by: dedndave on September 05, 2010, 01:18:46 PM
i used MichaelW's timing macros

i found a solution that works pretty well
in this example, i wanted a cleared DF
if you want a set DF, you still need STD   :tdown
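        pushfd
        pop     edx             ;EDX = EFLAGS on entry
        test    dh,4            ;DF is bit 10 (400h), which is bit 2 of DH
        jz      @F              ;already clear - skip the slow CLD

        cld

@@:

;do string operations here (EDX must be preserved)

        test    dh,4            ;was DF set on entry ?
        jz      @F

        std                     ;yes - put it back

@@: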

the second test is only needed if you want to return the DF to its original state
if you just want to leave it cleared, omit that part
the net effect of the above code is the same as pushf/popf, but can be faster
PUSHFD is ok, but POPFD, CLD, and STD are slow
this code, when modified to 16-bit, solves the issue mentioned in the 16-bit sub-forum thread, as well
http://www.masm32.com/board/index.php?topic=14699.msg119416#msg119416
i still don't really have a handle on that problem yet - lol
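A sketch of what the 16-bit modification might look like, as a bare DOS program. This is an assumption about the intended change, not the code from the linked thread. DF is bit 10 of FLAGS (400h), which lands in bit 2 of DH after the POP, so the mask is 4.

```asm
        .model small
        .stack 100h
        .code
start:
        pushf
        pop     dx              ; DX = FLAGS on entry
        test    dh,4            ; DF is bit 10 of FLAGS = bit 2 of DH
        jz      @F              ; already clear - skip CLD

        cld

@@:

        ; forward string operations would go here (DX must be preserved)

        test    dh,4            ; was DF set on entry ?
        jz      @F

        std                     ; yes - put it back

@@:
        mov     ax,4C00h        ; exit to DOS
        int     21h
        end     start
```

It needs a 16-bit linker to build, and whatever goes in the middle has to leave DX alone or save it.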
Title: Re: STD instruction
Post by: KeepingRealBusy on September 05, 2010, 07:47:14 PM
dedndave,

Do you have any words of wisdom about the following two timings, note they are both P4's? Any reason that the last "STD...CLD" should be so different? For your P4, each test doubled in time, but for my P4, the last two tests were the same?

Quote
Posted by: dedndave
ok guys - i got a test that shows the issue on my machine...

CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2

...CLD
49      clock cycles
49      clock cycles
49      clock cycles

CLD...CLD
104     clock cycles
104     clock cycles
104     clock cycles

STD...CLD
239     clock cycles
238     clock cycles
238     clock cycles

program and source attached...

Dave,

Quote
Posted by: KeepingRealBusy
Here are my times on my P4:

Code:
CPU 0: Intel(R) Pentium(R) 4 CPU 3.20GHz MMX SSE2 Cores: 2

...CLD
45      clock cycles
45      clock cycles
45      clock cycles

CLD...CLD
93      clock cycles
93      clock cycles
93      clock cycles

STD...CLD
93      clock cycles
94      clock cycles
94      clock cycles

Press any key to continue ...

Dave

Dave.

Title: Re: STD instruction
Post by: dedndave on September 06, 2010, 10:22:53 AM
i don't know about any words of wisdom - lol

but - the last one is better - not worse
it is possible that the OS has something to do with it
i am not sure about the precise mechanics - maybe Clive or one of the other guys can shed some light
it may have something to do with the OS checking for privilege level

some time ago, MichaelW and i were playing with the IRET instruction
we wanted to use it as a serializing instruction to replace CPUID in the timing macros
i was surprised by how long it takes
but, i guess it is a similar case
the OS needs to verify that the target address of the IRET is allowable

this appears to be the same kind of issue
altering some of the flags requires lower (logically higher) privilege
so, when you change any flags directly, tests have to be made to ensure it is allowed
this could be upgraded in newer CPUs, as they only test the privilege-critical flags