Regarding Stack

Started by theunknownguy, June 23, 2010, 06:55:46 PM

theunknownguy

I was just thinking about the x64 calling convention, where arguments are moved into the new registers instead of being pushed (if I understand it correctly...).
So which is faster? (Of course we don't have the new registers in 32-bit code):

mov dword ptr [esp-4], Immed
sub esp, 4


or just:


push Immed


I think push is faster, but what happens if I move the args like this:


mov dword ptr [esp-4], Immed
mov dword ptr [esp-8], Immed2
mov dword ptr [esp-12], Immed3
sub esp, 0Ch


Against:

push Immed
push Immed2
push Immed3


I'm not trying to use this as a method for setting up arguments. It's just curiosity to see which is faster.

Thanks.



redskull

The PUSH would probably be faster; newer CPUs have dedicated hardware for stack manipulation since they do it so much, as well as special circuitry to protect against stalls for instructions after stack instructions.  If you do it manually, you bypass all those optimizations.  Besides, breaking it down into the MOV and the SUB is essentially how the CPU executes it internally anyway.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

theunknownguy

Yes, the MOV and SUB is how it is done internally, that's what I thought. But:

1 PUSH = 1 MOV + 1 SUB

100 PUSH = 100 MOV + 100 SUB

100 MANUAL MOV = 100 MOV + 1 SUB


I mean, I don't think I will ever use 100 pushes, but there could be some performance gain from avoiding the SUB after each push and doing a single SUB at the end.

Does that make sense?

PS: If there is protection and other built-in functionality when you do a PUSH, then why should a manual MOV be slower? The CPU should just be moving to a memory address and bypassing many of the checks (or probably not...)

dedndave

you might save some clock cycles if you were to load the stack with several values using REP MOVSD
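        mov     esi,offset SomeData     ; source buffer
        sub     esp,256                 ; reserve 64 dwords on the stack
        mov     ecx,64                  ; dword count
        mov     edi,esp                 ; destination = new top of stack
        rep     movsd                   ; copy (assumes the direction flag is clear)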

theunknownguy

That was my point. But could somebody do a speed test? (I work all day u.u)...

Also, it would be faster if you need to PUSH the same value many times, with just a single SUB at the end.

Instead of PUSH 0 100h times:

        lea     edi, [esp-400h]         ; lowest address of the block (100h dwords)
        mov     ecx, 100h               ; dword count
        xor     eax, eax                ; value to store
        rep     stosd
        sub     esp, 400h


But like I said, who does 100h pushes, no matter the value...

dedndave

the speed advantage will depend largely on how many dwords you intend to load onto the stack this way
if you are only moving a few, it would be faster to PUSH
but, at some size, the advantage of REP MOVSD will take over
this will also vary with different processors

then, at some larger size, there is a problem that may arise that you should look out for
the stack only has so many pages of memory committed to it
you may have to probe down the stack in order to activate more pages of memory
i think E^Cube made a macro for that someplace   :P
it should be worth it - if you are moving that much data onto the stack, it would seem that REP MOVSD would have an advantage
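
The probing mentioned above can be sketched roughly as follows. This is only an illustration, not the macro referred to in the post: it assumes 4096-byte pages, assumes ecx already holds the number of bytes to reserve (a multiple of 4), and uses made-up label names. The fragment belongs inside a procedure.

```asm
; Sketch: reserve ecx bytes of stack, touching one page at a time
; so the OS can commit each page as the guard page is hit.
probe_loop:
        cmp     ecx, 4096
        jb      probe_tail
        sub     esp, 4096               ; step down one page
        test    dword ptr [esp], eax    ; a read is enough to touch the page
        sub     ecx, 4096
        jmp     probe_loop
probe_tail:
        sub     esp, ecx                ; reserve the remainder (less than one page)
        test    dword ptr [esp], eax    ; touch the final location
```

After this runs, esp points at the bottom of the reserved block, so a REP MOVSD with edi = esp can fill it safely.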

theunknownguy

Thanks. Bad luck for me, I only use at most 6 arguments per procedure...

But for 6 arguments I guess I will have to do a speed test. A single PUSH will be faster, but I will try it against:

mov dword ptr [esp-XX], Immed
mov dword ptr [esp-XX], Immed2
etc...
sub esp, XX


I'll do the test when I have time... We should have a procedure for getting more time...

Thanks everybody for the answers.

clive

Quote from: theunknownguy
Does that make sense?

Only if you assume that the operations are serialized, whereas most of the reg-to-reg stuff occurs in parallel (or at the very least pipelined), and your ability to stuff data to memory is bounded by the depth of the write buffers and the memory sitting behind them.

As another note, you should really be decrementing the stack pointer before writing data into the space you have allocated.

Want to time it? Then add some RDTSCs; it would take a second to test, working or not.
It could be a random act of randomness. Those happen a lot as well.
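
A rough sketch of what "add some RDTSCs" could look like. It is a fragment for use inside a procedure and needs .586 or later. The three pushes are only a stand-in for the code under test, and the choice of esi/edi to hold the start count is arbitrary. CPUID is used here as a serializing instruction so earlier work does not leak into the measurement.

```asm
; Sketch: time a short sequence with RDTSC.
; Clobbers eax, ebx, ecx, edx, esi, edi; save them first if needed.
        xor     eax, eax
        cpuid                   ; serialize
        rdtsc                   ; edx:eax = start count
        mov     esi, eax
        mov     edi, edx

        push    3               ; code under test
        push    2
        push    1
        add     esp, 12         ; drop the three dwords again

        xor     eax, eax
        cpuid                   ; serialize before the second read
        rdtsc
        sub     eax, esi
        sbb     edx, edi        ; edx:eax = elapsed count, including overhead
```

A single run like this is noisy and includes the cost of CPUID and RDTSC themselves, so in practice the sequence would be repeated many times in a loop and the overhead of an empty run subtracted.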

dedndave

for smaller data sizes, PUSH is your friend
if you are only moving 6 dwords, i am pretty sure PUSH is faster
also - good practice to adjust ESP before moving data onto the stack, not after (Clive beat me - lol)
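
For illustration, the "adjust ESP first" ordering for three values might look like this. Immed, Immed2 and Immed3 stand for the placeholder constants from the original question, and the fragment is meant to leave the stack exactly as three pushes would.

```asm
; Sketch: same stack layout as
;     push Immed / push Immed2 / push Immed3
        sub     esp, 12                         ; reserve three dwords first
        mov     dword ptr [esp+8], Immed        ; first value pushed sits highest
        mov     dword ptr [esp+4], Immed2
        mov     dword ptr [esp],   Immed3       ; last value pushed is at [esp]
```

One difference from the PUSH version: SUB changes the flags and PUSH does not, which matters only if the surrounding code depends on the flags.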

theunknownguy

Thanks dedndave and clive, I didn't know about adjusting ESP before moving data. Why is that?

Also, in the x64 calling convention I can see RSP is fixed up after the procedure.

I will time it when I finish work, with the timing macros (but I know the answer already)...

redskull

It's just not as simple as "100 MOVs + 100 SUBs"; the stack engine keeps track of the "potential" esp value during the decoding process, so you end up with the same thing: 100 MOVs with an offset that's added during the calculation.  If we're oversimplifying things, you get either 100 MOVs using PUSH, or 100 MOVs and 1 SUB, plus any necessary synchronization ops it has to insert to keep the stack engine honest with the true value, using your way.  It's way, way, way more complicated than you might think.

-r

theunknownguy

Thanks redskull. Do you know why x64 uses the new registers instead of the stack? I always thought it was for speed...

dedndave

Quote: "...adjust ESP before moving data. Why is that?"

traditionally, the address space above the stack pointer (ESP) is "preserved" - the space below is not
now, there has been a lot of discussion in the forum about how this actually works under Win32 - lol
some will say it is ok to use space below ESP and some will say it is not
it's a good habit to only use stack space above ESP, no matter how Windows works
that way, if you start programming for Linux or some other OS, you will have the right habit

Quote: "Also, in the x64 calling convention I can see RSP is fixed up after the procedure"

i am guessing that has more to do with stack alignment
in the 64-bit world, the stack should always be 64-aligned
some procedures may not leave it that way, so adjustments are made

you now know about as much as i do about the stack - lol

qWord

Quote from: dedndave on June 23, 2010, 08:37:43 PM
in the 64-bit world, the stack should always be 64-aligned

no, 16-byte aligned (for APIs) :P
FPU in a trice: SmplMath
It's that simple!
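
To make the register passing and the 16-byte alignment concrete, here is a hedged sketch of a call under the Windows x64 convention in ML64-style syntax. SomeProc is a hypothetical four-argument procedure, and the fragment assumes rsp is 8 bytes off a 16-byte boundary at this point, as it is on entry to a function right after the return address has been pushed.

```asm
; Sketch: calling a hypothetical SomeProc(1, 2, 3, 4) on Windows x64.
        sub     rsp, 28h        ; 20h bytes of shadow space + 8 to realign rsp to 16
        mov     ecx, 1          ; 1st integer argument in rcx
        mov     edx, 2          ; 2nd in rdx
        mov     r8d, 3          ; 3rd in r8
        mov     r9d, 4          ; 4th in r9
        call    SomeProc
        add     rsp, 28h        ; release the reserved space
```

The first four integer arguments travel in registers, but the caller still reserves stack space for them, and a fifth or later argument would be stored on the stack just above that shadow space.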

theunknownguy

Thanks dedndave, great answer. I was trying to find some documents and papers to read about how the stack works internally, but, you know, Mr. Google found nothing...

Still, I would love to have the new registers in 32-bit code like in x64...  :(

If you have any document or paper that explains how the stack works internally, please don't hesitate to post it. Thanks again.

PS: The only good document I found: http://www.ece.cmu.edu/~koopman/stack_computers/sec1_2.html