STD instruction

Started by dedndave, September 24, 2009, 12:36:21 PM

dedndave

well - i tried repositioning the instruction in several places - no help

dedndave

well - i am wondering if this is a P4 only issue
Michael already assured us that it isn't a problem on P3's
what about the newer processors ? (duos quads etc)

Magnum

Quote from: dedndave on September 24, 2009, 01:16:35 PM
this is odd also...

std
cld
220 cycles

cld
5 cycles

cld
cld
100 cycles

i will play with the instruction placement

Something does not make any sense.
Using 2 cld statements caused a 20 fold increase in cycles?

Andy
Have a great day,
                         Andy

dedndave

i know - lol
funky, huh
if you have a p4 processor - try it out
i am using xp mce2005 (pretty much the same as xp pro), sp2
i have a p4 prescott cpu

Magnum

I don't think that the std instruction is the problem.

I have several 16 bit programs that run fast that use that instruction.

Andy

dedndave

we are talking 32-bit code
apples and oranges
it has been confirmed by others - at least on a p4

FORTRANS

Hi,

   Have you booted to another OS, or is this only with one specific
OS?  I.e. is this a processor or OS problem?

Regards,

Steve N.

dedndave

no - i haven't Steve
i have too much crapolla on my drives at the moment, so it isn't practical for me to mess with that
i was hoping a few others might try it out in here
MichaelW says it is no problem for him - he is using a p3 under win2K, i think
i am only guessing that it is just one more "p4 handicap" to go with all the rest - lol
or - maybe the OS traps that instruction so it knows the direction has been changed
if that were the case, it shouldn't hiccup when you leave the flag set
who knows - i have a good work-around in mind, at least

jj2007

Dave, these are Celeron M Win XP SP2 values:

13      cycles for std cld
6       cycles for cld
13      cycles for cld cld

dedndave

thanks Jochen
if i am not mistaken, a celeron is derived from a p4, no ?

jj2007

Quote from: dedndave on September 26, 2009, 11:20:08 PM
thanks Jochen
if i am not mistaken, a celeron is derived from a p4, no ?

The Celeron M "Yonah" is a Core but not Core Duo. Definitely later than P4.

Astro

Hi,

Can you post your full code that you wrote? I'll test here.

Core2 Duo E6700, Win XP Pro SP3 and Vista Ultimate SP2.

Best regards,
Robin.

hutch--

Dave,

If you can pop a small test piece, I have a real single core PIV 3.8 running win2k and a core series quad running XP sp3 to test it on.

dedndave

i attempted to make a simple timing program
the problem does not arise
the time i was getting was from the initialization section of my bignum to ascii routine
i measured the entire init code at ~245 cycles with std (i had commented out the repz scasb)
then, when the std was commented out, i measured about 30 cycles
thus, my conclusion that std was slow
this damn machine gives me such odd numbers
they jump around a lot too - very difficult for me to time things and learn optimization
so - now i have to go and figure out what other instructions, combined with std, are giving me trouble
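
For reference, a minimal sketch of a simple timing program of this kind. It is not the program described above. It assumes the MASM32 SDK is installed at \masm32 (masm32rt.inc and its print, ustr$ and exit macros); RUNS and best are hypothetical names. It times std/cld with rdtsc, serializes with cpuid, and keeps the minimum over many runs, since the smallest value is usually the least disturbed by interrupts and task switches.

```asm
; Sketch only: assumes the MASM32 SDK; RUNS and best are illustrative names.
    include \masm32\include\masm32rt.inc
    .586                            ; enable cpuid and rdtsc

RUNS    equ 1000                    ; hypothetical repeat count

    .data?
best    dd ?                        ; smallest cycle count seen so far

    .code
start:
    mov     best, -1                ; start at 0FFFFFFFFh
    mov     edi, RUNS
next_run:
    xor     eax, eax
    cpuid                           ; serialize before reading the counter
    rdtsc
    mov     esi, eax                ; low 32 bits of the start count

    std                             ; sequence under test
    cld                             ; leave the direction flag clear

    xor     eax, eax
    cpuid                           ; serialize again (clobbers eax, ebx, ecx, edx)
    rdtsc
    sub     eax, esi                ; elapsed cycles, measurement overhead included
    cmp     eax, best
    jae     @F
    mov     best, eax               ; keep the minimum
@@:
    dec     edi
    jnz     next_run

    print   ustr$(best), " cycles (minimum, overhead included)", 13, 10
    exit
end start
```

The figure includes the cost of cpuid and rdtsc themselves, so it is only useful for comparing one sequence against another (for example, the same loop with the std removed), not as an absolute cycle count.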

as another example of my machine's inconsistency.....
i have a multiple-precision multiply-by-constant-to-divide snippet
in it, there are 5 large constant values (3 actually - 2 of them are the same value loaded into register twice)
the last one wants to be loaded as an immediate value "mov     edx,3906250"
but, with the others, i have placed the constant on the stack frame, and can load them via "mov     edx,[ebp-20]" or similar
so - loading the other 4 constants as either immediates, or from the stack frame, yields wide and varied results
4 constants - 2 ways to load - 16 possible combinations
the snippet can take from ~40 to ~80 cycles, depending on how i load these variables
if i load them all as immediates, it is the 80 cycles
if i load them all from the stack, it is the 80 cycles
if i load 2 of them from the stack frame and 2 of them as immediates, i get the ~40
slightly better results are obtained if i load the two identical constants as an immediate one time and off the stack the other
other combinations aren't as good
i have also tried pushing them, as well as a few other methods of loading them
i isolated that one piece of code and selected the loads that yielded the best times
i also re-ordered several instructions several ways to try and get the best time
then - put the code back in the loop and got the worst time ever - lol
i feel like i have to be fricken Carnac the Magnificent to optimize code - lol


Pee Wee Herman, Michael Jackson, and Tom Cruise.......

(name two fruits and a vegetable)
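
The two load forms compared in this post (immediate versus stack frame) can be switched at assembly time, which makes it easier to step through the 16 combinations. The following is a rough sketch under stated assumptions, not the routine discussed above: LOADK, LOADMASK, IMM0, IMM1, KVAL and demo are hypothetical names, only two of the four constants are shown, and 3906250 (the one value quoted in the post) is reused as a placeholder for both. It assumes the MASM32 SDK for the include file and the print, ustr$ and exit macros.

```asm
; Sketch only: all names here are illustrative, not from the original routine.
    include \masm32\include\masm32rt.inc

KVAL     equ 3906250                ; placeholder constant (value quoted in the post)
LOADMASK equ 0101b                  ; bit n set = load constant n as an immediate
IMM0     equ LOADMASK AND 1
IMM1     equ (LOADMASK SHR 1) AND 1

; Emit either the immediate form or the stack-frame form of the load.
LOADK MACRO reg:REQ, useImm:REQ, slot:REQ
    IF useImm
        mov     reg, KVAL           ; immediate form
    ELSE
        mov     reg, slot           ; stack-frame form, e.g. [ebp-4]
    ENDIF
ENDM

    .code
demo proc
    LOCAL k0:DWORD, k1:DWORD        ; constants kept on the stack frame
    mov     k0, KVAL
    mov     k1, KVAL
    LOADK   eax, IMM0, k0           ; form chosen by bit 0 of LOADMASK
    LOADK   edx, IMM1, k1           ; form chosen by bit 1 of LOADMASK
    add     eax, edx                ; use both values so the loads matter
    ret
demo endp

start:
    call    demo
    print   ustr$(eax), 13, 10
    exit
end start
```

Extending the pattern to four constants and rebuilding with LOADMASK set to each value from 0 to 15 covers all 16 combinations, so each build can be timed the same way.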

dedndave

ok guys - i got a test that shows the issue on my machine...

CPU 0: Intel(R) Pentium(R) 4 CPU 3.00GHz MMX SSE3 Cores: 2

...CLD
49      clock cycles
49      clock cycles
49      clock cycles

CLD...CLD
104     clock cycles
104     clock cycles
104     clock cycles

STD...CLD
239     clock cycles
238     clock cycles
238     clock cycles

program and source attached...