am i doing something wrong, or is STD one slow-ass instruction ? - lol - CLD seems to be fast enough
i have it at about 215 cycles on a p4 prescott
the following code is only about 25 cycles faster
pushfd ;push EFLAGS
pop eax
or eax,400h ;set DF (bit 10)
push eax
popfd ;write the modified flags back
EDIT - i am about to write a loop to do a "manual" reverse scan - lol
Is it possible that your code is triggering an exception? On my P3 I get 13 cycles total for a STD followed by a CLD.
it functions ok
i dunno what kind of exception it would generate ??? :eek
i am using it to scan a bignum from the top down - to skip over unused bytes (FF's for negative - 0's for positive and unsigned)
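For anyone following along, here is a rough sketch of the kind of top-down scan being described. It is not the actual bignum routine; the name TrimTopBytes, the little-endian layout, and the calling convention are assumptions made for illustration only.

```asm
.386
.model flat, stdcall
option casemap:none

.code
; TrimTopBytes (hypothetical name, not from the original routine)
; in:  pNum   -> lowest byte of a little-endian bignum (assumed layout)
;      nBytes =  length in bytes, must be > 0
; out: eax = number of bytes up to and including the highest byte that
;            differs from the sign fill (0 if every byte is fill)
; note: a signed value may still need one fill byte kept to hold its sign
TrimTopBytes proc uses edi pNum:DWORD, nBytes:DWORD
    mov  edi,pNum
    mov  ecx,nBytes
    lea  edi,[edi+ecx-1]    ; edi -> top byte
    mov  al,[edi]
    sar  al,7               ; al = 0FFh for a negative value, 00h otherwise
    std                     ; scan downward
    repz scasb              ; stop at the first byte that differs from al
    cld                     ; restore DF before returning (CLD leaves ZF alone)
    jz   @F                 ; ZF still set: all bytes were fill, ecx = 0
    inc  ecx                ; count the byte that stopped the scan
@@:
    mov  eax,ecx
    ret
TrimTopBytes endp
end
```

The STD/CLD pair brackets only the REPZ SCASB, so the flag is never left set across a call.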
I can't actually recall ever seeing an exception caused by leaving the direction flag set, but I have seen my application die because of it. For example, this code:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
invoke Sleep, 3000
counter_begin 1000, HIGH_PRIORITY_CLASS
std
cld
counter_end
print ustr$(eax),13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
Runs OK as is, but if I comment out the CLD, then it dies just as it starts displaying the results.
on mine - lol - it continues to run
- displays the data
- but no cr/lf - lol - strange
I'm running Windows 2000. Perhaps on your system it continues to run because Windows is detecting the problem and correcting it, and that accounts for the lost cycles. Or perhaps the direction flag has been virtualized.
Edit: "virtualized" is the wrong term. What I mean is that Windows may be actively managing the direction flag to prevent problems with it being left set during a call to a CRT or API function that expects it to be clear.
this is odd also...
std
cld
220 cycles
cld
5 cycles
cld
cld
100 cycles
i will play with the instruction placement
this really sux, but....
the best solution seems to be
mov eax,400h
pushfd
or [esp],eax
popfd
all of that is faster than
std
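A side-by-side timing of the two sequences might look something like the sketch below. It reuses the counter_begin/counter_end macros from MichaelW's timers.asm exactly as they appear in his test program earlier in the thread; the loop count and the output labels are arbitrary choices, and the numbers will depend on the CPU.

```asm
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

.code
start:
    invoke Sleep, 3000

    ; case 1: plain STD, then CLD so DF is clear again
    counter_begin 1000, HIGH_PRIORITY_CLASS
      std
      cld
    counter_end
    print ustr$(eax)," cycles, std + cld",13,10

    ; case 2: set DF through the stack copy of EFLAGS, then CLD
    counter_begin 1000, HIGH_PRIORITY_CLASS
      pushfd
      or dword ptr [esp],400h   ; DF is bit 10 of EFLAGS
      popfd
      cld
    counter_end
    print ustr$(eax)," cycles, pushfd/or/popfd + cld",13,10

    inkey "Press any key to exit..."
    exit
end start
```

Both timed blocks end with CLD so that DF is clear whenever the macros or the print code run.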
i think the best solution is not using std :lol
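; direction_N is an assembly-time constant: _std/_cld select the direction
; for the macros that follow them in the source, they do not touch the real DF
direction_N=4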
_std macro
direction_N=-4
endm
_cld macro
direction_N=4
endm
_lodsd macro
mov eax,[esi]
add esi,direction_N
endm
_stosd macro
mov [edi],eax
add edi,direction_N
endm
_lodsb macro
mov al,[esi]
add esi,direction_N/4
endm
_stosb macro
mov [edi],al
add edi,direction_N/4
endm
_lodsw macro
mov ax,[esi]
add esi,direction_N/2
endm
_stosw macro
mov [edi],ax
add edi,direction_N/2
endm
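A small usage sketch for these macros, assuming they behave as written above: it copies four dwords from the top down without ever touching the real direction flag. The data names src and dst and the label copy_down are made up for the example, and only the dword macros are repeated to keep it short.

```asm
include \masm32\include\masm32rt.inc

; macros repeated from the post above so the example assembles on its own
direction_N=4
_std macro
direction_N=-4
endm
_cld macro
direction_N=4
endm
_lodsd macro
mov eax,[esi]
add esi,direction_N
endm
_stosd macro
mov [edi],eax
add edi,direction_N
endm

.data
src dd 1,2,3,4              ; hypothetical test data
dst dd 4 dup(0)

.code
start:
    mov esi,offset src+12   ; last dword of src
    mov edi,offset dst+12   ; last dword of dst
    mov ecx,4
    _std                    ; from here on the macros step by -4
copy_down:
    _lodsd
    _stosd
    dec ecx
    jnz copy_down
    _cld                    ; macros after this point step by +4 again
    exit
end start
```

Because the direction is fixed when the source is assembled, a single routine cannot switch direction at run time this way; it needs one expansion per direction.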
well - i tried that Drizz
i am glad to see someone confirm my grief, though - lol
you see how i am using it
i have found a very good solution for this application
1) measure the integer length where using scasb to reduce length yields an advantage
2) skip it for shorter integers - just go ahead and evaluate the unused bytes :U
by branching around the std/repz scasb/cld - we speed it up for short values
using the loop method works well if there aren't many unused bytes
if there are a lot - scasb kicks butt
what i may do is sample the length where there is an advantage and repz scasb in the up direction - tricky, huh :U
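The branch-around idea in steps 1 and 2 could be sketched roughly as follows. SCAN_THRESHOLD is a placeholder value, not a measured break-even point, and the register interface (esi, ecx, al) and the name TrimIfLong are assumptions for illustration.

```asm
.386
.model flat, stdcall
option casemap:none

SCAN_THRESHOLD equ 16       ; placeholder: measure the real break-even length

.code
; TrimIfLong (hypothetical name)
; in:  esi -> lowest byte of the value, ecx = length in bytes (> 0),
;      al  =  fill byte to skip (00h or 0FFh)
; out: ecx = bytes left to evaluate (unchanged for short values)
TrimIfLong proc
    cmp  ecx,SCAN_THRESHOLD
    jb   done               ; short value: branch around std/repz scasb/cld
    push edi
    lea  edi,[esi+ecx-1]    ; edi -> top byte
    std
    repz scasb              ; scan down while the bytes equal al
    cld
    pop  edi                ; POP and CLD do not change ZF
    jz   done               ; every byte was fill: ecx = 0
    inc  ecx                ; count the byte that stopped the scan
done:
    ret
TrimIfLong endp
end
```

Short values simply keep their full length, so they never pay for the STD/CLD pair.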
Hi *.*:
If I remember correctly, the Windows OS likes the direction flag up, i.e.
forward. It will react badly to a direction flag left down.
Translation: it will crash, or maybe hang, or send a message to
Redmond, Washington, USA, to our friends at the Big M.
Always put the direction flag up for Windows!
Regards: herge
yah - we got that herge
we want to set it down temporarily
it is just very slow
i am guessing that the OS traps that instruction for some reason
Hi dedndave:
This is a Windows driver site based, I believe, in New Hampshire, USA.
I find it has useful info on WinDbg from Microsoft.
http://www.osronline.com
Regards: herge
About the only thing the intel docs say could be a problem is a partial flag register stall
well - i tried repositioning the instruction in several places - no help
well - i am wondering if this is a P4 only issue
Michael already assured us that it isn't a problem on P3's
what about the newer processors ? (duos quads etc)
Quote from: dedndave on September 24, 2009, 01:16:35 PM
this is odd also...
std
cld
220 cycles
cld
5 cycles
cld
cld
100 cycles
i will play with the instruction placement
Something does not make any sense.
Using 2 cld statements caused a 20 fold increase in cycles?
Andy
i know - lol
funky, huh
if you have a p4 processor - try it out
i am using xp mce2005 (pretty much the same as xp pro), sp2
i have a p4 prescott cpu
I don't think that the std instruction is the problem.
I have several 16 bit programs that run fast that use that instruction.
Andy
we are talking 32-bit code
apples and oranges
it has been confirmed by others - at least on a p4
Hi,
Have you booted to another OS, or is this only with one specific
OS? I.e. is this a processor or OS problem?
Regards,
Steve N.
no - i haven't Steve
i have too much crapolla on my drives at the moment, so it isn't practical for me to mess with that
i was hoping a few others might try it out in here
MichaelW says it is no problem for him - he is using a p3 under win2K, i think
i am only guessing that it is just one more "p4 handicap" to go with all the rest - lol
or - maybe the OS traps that instruction so it knows the direction has been changed
if that were the case, it shouldn't hiccup when you leave the flag set
who knows - i have a good work-around in mind, at least
Dave, these are Celeron M Win XP SP2 values:
13 cycles for std cld
6 cycles for cld
13 cycles for cld cld
thanks Jochen
if i am not mistaken, a celeron is derived from a p4, no ?
Quote from: dedndave on September 26, 2009, 11:20:08 PM
thanks Jochen
if i am not mistaken, a celeron is derived from a p4, no ?
The Celeron M "Yonah" is a Core but not Core Duo. Definitely later than P4.
Hi,
Can you post your full code that you wrote? I'll test here.
Core2 Duo E6700, Win XP Pro SP3 and Vista Ultimate SP2.
Best regards,
Robin.
Dave,
If you can pop a small test piece, I have a real single core PIV 3.8 running win2k and a core series quad running XP sp3 to test it on.
i attempted to make a simple timing program
the problem does not arise
the time i was getting was from the initialization section of my bignum to ascii routine
i measured the entire init code at ~245 cycles with std (i had commented out the repz scasb)
then, when the std was commented out, i measured about 30 cycles
thus, my conclusion that std was slow
this damn machine gives me such odd numbers
they jump around a lot too - very difficult for me to time things and learn optimization
so - now i have to go and figure out what other instructions, combined with std, are giving me trouble
as another example of my machine's inconsistency.....
i have a multiple-precision multiply-by-constant-to-divide snippet
in it, there are 5 large constant values (3 actually - 2 of them are the same value loaded into register twice)
the last one wants to be loaded as an immediate value "mov edx,3906250"
but, with the others, i have placed the constant on the stack frame, and can load them via "mov edx,[ebp-20]" or similar
so - loading the other 4 constants as either immediates, or from the stack frame, yields wide and varied results
4 constants - 2 ways to load - 16 possible combinations
the snippet can take from ~40 to ~80 cycles, depending on how i load these variables
if i load them all as immediates, it is the 80 cycles
if i load them all from the stack, it is the 80 cycles
if i load 2 of them from the stack frame and 2 of them as immediates, i get the ~40
slightly better results are obtained if i load the two constants immediate one time and off the stack the other
other combinations aren't as good
i have also tried pushing them, as well as a few other methods of loading them
i isolated that one piece of code and selected the loads that yielded the best times
i also re-ordered several instructions several ways to try and get the best time
then - put the code back in the loop and got the worst time ever - lol
i feel like i have to be fricken Carnac the Magnificent to optimize code - lol
(http://www.delawareliberal.net/wp-content/uploads/2009/05/carnac.jpg)
Pee Wee Herman, Michael Jackson, and Tom Cruise.......
(name two fruits and a vegetable)
ok guys - i got a test that shows the issue on my machine...
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
...CLD
49 clock cycles
49 clock cycles
49 clock cycles
CLD...CLD
104 clock cycles
104 clock cycles
104 clock cycles
STD...CLD
239 clock cycles
238 clock cycles
238 clock cycles
program and source attached...
CPU 0: Fam 6 Mod 7 xFam 0 xMod 0 Type 0 Step 3 MMX SSE Cores: 1
...CLD
10 clock cycles
10 clock cycles
10 clock cycles
CLD...CLD
18 clock cycles
18 clock cycles
18 clock cycles
STD...CLD
18 clock cycles
18 clock cycles
18 clock cycles
It occurred to me that the problem could be due to an "errata" in your processor that the BIOS is correcting by applying a micro-code patch.
http://cseweb.ucsd.edu/~calder/papers/ICCD-06-HWPatch.pdf
you may be right, Michael - that would make sense
i didn't see a specific mention of STD in that paper - that doesn't mean it isn't one of the cases he is speaking of
i was thinking it may be an overhead problem in the out-of-order operation scheme
the processor is looking through the code-stream for something else to chew on
and - STD causes that mechanism to operate more slowly due to the number of possible affected instructions
i am not a big fan of out-of-order instructions - lol
it kind of takes some of the fun out of programming in assembler, even if it does speed things up
EDIT - btw - that guy has several other interesting papers, as well - great site :U
EDIT again - if i run the STD in a timer loop alone - i get 13 cycles (CLD after the timer loop)
Hutch ? Astro ? Jochen ?
i thought you guys wanted to run this for me
i linked it again in case you missed that post
thanks
http://www.masm32.com/board/index.php?action=dlattach;topic=12368.0;id=6694
Dave,
I don't think this will help you, but I ran dftime on my machine (WIN XP SP2) with following "funny" results:
CPU 0: AMD Athlon(TM) XP 2000+ MMX+ SSE 3DNow!+ Cores: 1
...CLD
1 clock cycles
0 clock cycles
0 clock cycles
CLD...CLD
2 clock cycles
2 clock cycles
2 clock cycles
STD...CLD
0 clock cycles
0 clock cycles
0 clock cycles
Press any key to continue ...
rgds
cobold
actually, it does help - lol
i can see that it is fast in all three cases
Quote from: dedndave on September 28, 2009, 06:00:50 PM
Hutch ? Astro ? Jochen ?
i thought you guys wanted to run this for me
i linked it again in case you missed that post
thanks
http://www.masm32.com/board/index.php?action=dlattach;topic=12368.0;id=6694
Here it is :bg
CPU 0: Intel(R) Celeron(R) M CPU 420 @ 1.60GHz MMX SSE3 Cores: 1
...CLD
10 clock cycles
10 clock cycles
10 clock cycles
CLD...CLD
19 clock cycles
19 clock cycles
19 clock cycles
STD...CLD
19 clock cycles
19 clock cycles
19 clock cycles
CPU 0: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz MMX SSE4.1 Cores: 2
...CLD
1 clock cycles
1 clock cycles
1 clock cycles
CLD...CLD
4 clock cycles
4 clock cycles
4 clock cycles
STD...CLD
52 clock cycles
52 clock cycles
52 clock cycles
CPU 0: Intel(R) Pentium(R) M processor 1.70GHz MMX SSE2 Cores: 1
...CLD
10 clock cycles
10 clock cycles
10 clock cycles
CLD...CLD
19 clock cycles
19 clock cycles
19 clock cycles
STD...CLD
19 clock cycles
19 clock cycles
19 clock cycles
Press any key to continue ...
CPU 0: Fam 6 Mod 8 xFam 0 xMod 0 Type 0 Step 3 MMX SSE Cores: 1
...CLD
10 clock cycles
10 clock cycles
10 clock cycles
CLD...CLD
18 clock cycles
18 clock cycles
18 clock cycles
STD...CLD
18 clock cycles
18 clock cycles
18 clock cycles
Press any key to continue ...
HTH,
Steve N.
thanks all
i am surprised to see it act up on the core duo
Added a couple of tests, just for completeness.
CPU 0: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz MMX SSSE3 Cores: 4
...STD
15 clock cycles
15 clock cycles
15 clock cycles
...CLD
2 clock cycles
2 clock cycles
2 clock cycles
CLD...CLD
6 clock cycles
6 clock cycles
6 clock cycles
STD...CLD
53 clock cycles
52 clock cycles
52 clock cycles
STD...STD
28 clock cycles
28 clock cycles
28 clock cycles
thanks Sinsi
50 cycles is still longer than it should be - although not nearly as bad as my p4 prescott - lol
I ran that too.
CPU 0: Intel(R) Atom(TM) CPU N270 @ 1.60GHz MMX SSSE3 Cores: 2
...CLD
30 clock cycles
19 clock cycles
22 clock cycles
CLD...CLD
30 clock cycles
30 clock cycles
29 clock cycles
STD...CLD
89 clock cycles
87 clock cycles
90 clock cycles
My CPU usage is at least 30%, mostly around 45%. If that matters, lol.
What are STD and CLD for, anyway? Are they for hooking interrupts or something?
STD sets the direction flag and CLD clears the direction flag.
They are commonly used when searching.
Andy
Hi,
The direction flag controls how the string instructions are used.
The string instructions use DI (EDI) and SI (ESI) to access memory.
MOVSB ; Move String Byte.
Is equivalent to
MOV BYTE PTR [DI],[SI] ; Move (copy) the byte from where DS:SI
; points to where ES:DI is pointing.
; Of course that kind of move is normally illegal,
; which is why the MOVS is useful.
INC DI ; If the direction flag is clear.
INC SI ; If the flag is set these would be DECrements.
HTH,
Steve N.
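To make the two directions concrete, here is a 32-bit sketch of what MOVSB does in each case, written with legal instructions. The proc names are made up, and the equivalence is only approximate: the real MOVSB does not use AL and does not change the arithmetic flags, while INC and DEC do.

```asm
.386
.model flat, stdcall
option casemap:none

.code
; in: esi -> source byte, edi -> destination byte

; roughly what MOVSB does when DF is clear (after CLD)
MovsbUp proc
    mov al,[esi]            ; copy one byte through AL
    mov [edi],al
    inc esi                 ; both pointers step up
    inc edi
    ret
MovsbUp endp

; roughly what MOVSB does when DF is set (after STD)
MovsbDown proc
    mov al,[esi]
    mov [edi],al
    dec esi                 ; both pointers step down
    dec edi
    ret
MovsbDown endp
end
```

With a REP prefix the same step simply repeats ECX times, which is why the flag matters for block copies and scans.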
Oh I think I understand now. :bg
Sorry - chaos here as usual...
CPU 0: Intel(R) Core(TM)2 CPU 6700 @ 2.66GHz MMX SSSE3 Cores: 2
...CLD
2 clock cycles
2 clock cycles
2 clock cycles
CLD...CLD
6 clock cycles
6 clock cycles
6 clock cycles
STD...CLD
52 clock cycles
52 clock cycles
52 clock cycles
STD...CLD pair is a bit variable - between 52 and 56 (e.g. 56, 52, 52).
Best regards,
Astro.
CPU 0: Intel(R) Core(TM)2 Duo CPU P9500 @ 2.53GHz MMX SSE4.1 Cores: 2
...CLD
1 clock cycles
1 clock cycles
1 clock cycles
CLD...CLD
4 clock cycles
4 clock cycles
4 clock cycles
STD...CLD
52 clock cycles
53 clock cycles
51 clock cycles
and
CPU 0: AMD Athlon(tm) Processor MMX+ 3DNow!+ Cores: 1
...CLD
1 clock cycles
1 clock cycles
1 clock cycles
CLD...CLD
2 clock cycles
2 clock cycles
2 clock cycles
STD...CLD
1 clock cycles
0 clock cycles
0 clock cycles
and
C:\mytest>DFtime.exe
CPU 0: Genuine Intel(R) CPU T2400 @ 1.83GHz MMX SSE3 Cores: 2
...CLD
10 clock cycles
10 clock cycles
10 clock cycles
CLD...CLD
20 clock cycles
20 clock cycles
20 clock cycles
STD...CLD
20 clock cycles
20 clock cycles
20 clock cycles
Press any key to continue ...
C:\mytest>
This is quite an old thread, but I have to say it is the first time a testing
routine has given me the correct SSE version, so I am posting here nevertheless.
Compliments, Dave, you got my CPU right :U
CPU 0: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz MMX SSSE3 Cores: 2
...CLD
2 clock cycles
2 clock cycles
2 clock cycles
CLD...CLD
6 clock cycles
6 clock cycles
6 clock cycles
STD...CLD
52 clock cycles
52 clock cycles
52 clock cycles
Press any key to continue ...
By the way, what happened to this preliminary version?
Is it still preliminary or it is now a grown one?
Frank
if you use the forum search tool, you can find newer versions
however, i had to put the overall project on hold
a complete implementation requires that i learn KMD's - it's on my list :bg
CPU 0: Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz MMX SSSE3 Cores: 2
...CLD
2 clock cycles
2 clock cycles
2 clock cycles
CLD...CLD
6 clock cycles
6 clock cycles
6 clock cycles
STD...CLD
52 clock cycles
52 clock cycles
52 clock cycles
Press any key to continue ...
Things haven't changed on AMD's:
CPU 0: AMD Phenom(tm) II X6 1055T Processor MMX+ SSE4a 3DNow!+ Cores: 6
...CLD
-1 clock cycles
0 clock cycles
-1 clock cycles
CLD...CLD
-1 clock cycles
-1 clock cycles
-1 clock cycles
STD...CLD
-1 clock cycles
-1 clock cycles
-1 clock cycles
Quote from: dedndave on September 28, 2009, 01:42:38 PM
ok guys - i got a test that shows the issue on my machine...
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
...CLD
49 clock cycles
49 clock cycles
49 clock cycles
CLD...CLD
104 clock cycles
104 clock cycles
104 clock cycles
STD...CLD
239 clock cycles
238 clock cycles
238 clock cycles
program and source attached...
Dave,
Here are my times on my P4:
CPU 0: Intel(R) Pentium(R) 4 CPU 3.20GHz MMX SSE2 Cores: 2
...CLD
45 clock cycles
45 clock cycles
45 clock cycles
CLD...CLD
93 clock cycles
93 clock cycles
93 clock cycles
STD...CLD
93 clock cycles
94 clock cycles
94 clock cycles
Press any key to continue ...
Dave
May i ask what you use to check how many cycles are spent?
Thanks and bye
i used MichaelW's timing macros
i found a solution that works pretty well
in this example, i wanted a cleared DF
if you want a set DF, you still need STD :tdown
pushfd
pop edx
test dh,4 ;DF is bit 10 of EFLAGS, which is bit 2 of DH
jz @F
cld
@@:
;do string operations here (EDX must be preserved)
test dh,4
jz @F
std
@@:
the second test is only needed if you want to return the DF to its original state
if you just want to leave it cleared, omit that part
the net effect of the above code is the same as pushf/popf, but can be faster
PUSHFD is ok, but POPFD, CLD, and STD are slow
this code, when modified to 16-bit, solves the issue mentioned in the 16-bit sub-forum thread, as well
http://www.masm32.com/board/index.php?topic=14699.msg119416#msg119416
i still don't really have a handle on that problem yet - lol
dedndave,
Do you have any words of wisdom about the following two timings, note they are both P4's? Any reason that the last "STD...CLD" should be so different? For your P4, each test doubled in time, but for my P4, the last two tests were the same?
Quote
Posted by: dedndave
ok guys - i got a test that shows the issue on my machine...
CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2
...CLD
49 clock cycles
49 clock cycles
49 clock cycles
CLD...CLD
104 clock cycles
104 clock cycles
104 clock cycles
STD...CLD
239 clock cycles
238 clock cycles
238 clock cycles
program and source attached...
Dave,
Quote
Posted by: KeepingRealBusy
Here are my times on my P4:
CPU 0: Intel(R) Pentium(R) 4 CPU 3.20GHz MMX SSE2 Cores: 2
...CLD
45 clock cycles
45 clock cycles
45 clock cycles
CLD...CLD
93 clock cycles
93 clock cycles
93 clock cycles
STD...CLD
93 clock cycles
94 clock cycles
94 clock cycles
Press any key to continue ...
Dave
Dave.
i don't know about any words of wisdom - lol
but - the last one is better - not worse
it is possible that the OS has something to do with it
i am not sure about the precise mechanics - maybe Clive or one of the other guys can shed some light
it may have something to do with the OS checking for privilege level
some time ago, MichaelW and i were playing with the IRET instruction
we wanted to use it as a serializing instruction to replace CPUID in the timing macros
i was surprised by how long it takes
but, i guess it is a similar case
the OS needs to verify that the target address of the IRET is allowable
this appears to be the same kind of issue
altering some of the flags requires lower (logically higher) privilege
so, when you change any flags directly, tests have to be made to ensure it is allowed
this could be improved in newer CPU's, if they only test the privilege-critical flags