Title: Regarding Stack
Post by: theunknownguy on June 23, 2010, 06:55:46 PM

Just thinking about the x64 calling convention, which moves the args into the new registers instead of pushing them (if I understand it well...)
So which is faster? (Of course we don't have the new regs on x32):

Code:

mov dword ptr [esp-4], Inmmend   ; PUSH stores at esp-4, not at esp
sub esp, 4

or just:

Code:

push Inmmend

I think push is faster, but what happens if I move args like this:

Code:

mov dword ptr [esp-4], Inmmend
mov dword ptr [esp-8], Inmmend2
mov dword ptr [esp-12], Inmmend3
sub esp, 0Ch

Against:

Code:

push Inmmend
push Inmmend2
push Inmmend3

I'm not trying to use this as a method for setting up arguments. It's just curiosity to see which is faster.

Thanks.

Title: Re: PUSH vs Manual Stack
Post by: redskull on June 23, 2010, 07:23:10 PM

The PUSH would probably be faster; new CPUs have dedicated hardware for doing stack manipulation since they do it so much, as well as special circuitry to protect against stalls for instructions after stack instructions. If you do it manually, you bypass all the optimizations. Besides, breaking it down into the MOV and the SUB is essentially how the CPU executes it internally anyway.

-r

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 07:27:38 PM

Quote from: redskull on June 23, 2010, 07:23:10 PM
The PUSH would probably be faster; new CPUs have dedicated hardware for doing stack manipulation since they do it so much, as well as special circuitry to protect against stalls for instructions after stack instructions. If you do it manually, you bypass all the optimizations. Besides, breaking it down into the MOV and the SUB is essentially how the CPU executes it internally anyway.

-r

Yes, the MOV and SUB is how it's done internally, that's what I thought. But:

Code:

1 PUSH = 1 MOV + 1 SUB

100 PUSH = 100 MOV + 100 SUB

100 MANUAL MOV = 100 MOV + 1 SUB

I mean, I will never use 100 pushes, I think, but there could be some performance gain from avoiding the SUB after each push and doing just one at the end.

Does that make sense? :dazzled:

PS: If there are protections and other integrated functions when you do a PUSH, then why should a manual MOV be slower? The CPU should just be moving to a memory address and bypassing many of the checks (or probably not...)

Title: Re: PUSH vs Manual Stack
Post by: dedndave on June 23, 2010, 07:32:06 PM

you might save some clock cycles if you were to load the stack with several values using REP MOVSD

Code:

        mov     esi,offset SomeData
        sub     esp,256
        mov     ecx,64
        mov     edi,esp
        rep     movsd

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 07:35:46 PM

Quote from: dedndave on June 23, 2010, 07:32:06 PM
you might save some clock cycles if you were to load the stack with several values using REP MOVSD
Code:
        mov     esi,offset SomeData
        sub     esp,256
        mov     ecx,64
        mov     edi,esp
        rep     movsd

That was my point. But somebody could do a speed test (I work all day u.u)...

Also, it would be faster if you need to PUSH the same value many times, just adding the SUB at the end.

Instead of PUSH 0 100h times:

Code:

	Lea Edi, [Esp-400h]	; lowest address of the 100h dwords
	Mov Ecx, 100h		; 100h dwords = 400h bytes
	Xor Eax, Eax
	Rep StoSD		; assumes the direction flag is clear
	Sub Esp, 400h

But like I say, who uses a push 100h times, no matter the value...

Title: Re: PUSH vs Manual Stack
Post by: dedndave on June 23, 2010, 07:44:50 PM

the speed advantage will depend largely on how many dwords you intend to load onto the stack this way
if you are only moving a few, it would be faster to PUSH
but, at some size, the advantage of REP MOVSD will take over
this will also vary with different processors

then, at some larger size, there is a problem that may arise that you should look out for
the stack only has so many pages of memory committed to it
you may have to probe down the stack in order to activate more pages of memory
i think E^Cube made a macro for that someplace :P
it should be worth it - if you are moving that much data onto the stack, it would seem that REP MOVSD would have an advantage
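
The page-probing concern can be sketched roughly like this. It is a small Python model, not Windows code: the 4096-byte page size and the helper name `probe_offsets` are assumptions made for illustration. It lists the distances below the current stack pointer that would each get one touch, nearest first, so that no page in the new region is skipped over.

```python
PAGE_SIZE = 4096  # assumed page size; typical for x86, not taken from the thread

def probe_offsets(alloc_bytes, page_size=PAGE_SIZE):
    """Return the distances below the stack pointer to touch, nearest first.

    One touch per page-sized step, plus a final touch at the end of the
    allocation, so no page in the new region is jumped over.
    """
    offsets = list(range(page_size, alloc_bytes + 1, page_size))
    if alloc_bytes > 0 and (not offsets or offsets[-1] != alloc_bytes):
        offsets.append(alloc_bytes)
    return offsets

# A 256-byte block is smaller than one page-sized step: a single touch.
print(probe_offsets(256))
# A 10000-byte block spans several steps and gets one touch per step.
print(probe_offsets(10000))
```

In real code the "touch" would be a read or write at each of those addresses before ESP is moved past them; how the OS reacts to it is platform-specific and not modelled here.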

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 07:48:21 PM

Quote from: dedndave on June 23, 2010, 07:44:50 PM
the speed advantage will depend largely on how many dwords you intend to load onto the stack this way
if you are only moving a few, it would be faster to PUSH
but, at some size, the advantage of REP MOVSD will take over
this will also vary with different processors

then, at some larger size, there is a problem that may arise that you should look out for
the stack only has so many pages of memory committed to it
you may have to probe down the stack in order to activate more pages of memory
i think E^Cube made a macro for that someplace :P
it should be worth it - if you are moving that much data onto the stack, it would seem that REP MOVSD would have an advantage

Thanks, bad luck for me i only use at most 6 arguments per procedure...

But for 6 arguments I guess I will have to do a speed test. 1 PUSH will be faster, but I will try against:

Code:

mov [esp+XX], Inmmend
mov [esp+XX], Inmmend2
etc...
Sub esp, XX

I'll do the test when I have time... We should have a procedure for getting more time...

Thanks everybody for the answers. :U

Title: Re: PUSH vs Manual Stack
Post by: clive on June 23, 2010, 07:56:39 PM

Quote from: theunknownguy
Does that make sense?

Only if you assume that the operations are serialized, whereas most of the reg-to-reg stuff occurs in parallel (or at the very least pipelined), and your ability to stuff data to memory is bounded by the depth of the write buffers, and the memory sitting behind them.

As another note, you should really be decrementing the stack pointer before writing data into the space you have allocated.

If you want to time it, add some RDTSCs; it would take a second to test, working or not.

Title: Re: PUSH vs Manual Stack
Post by: dedndave on June 23, 2010, 07:57:54 PM

for smaller data sizes, PUSH is your friend :bg
if you are only moving 6 dwords, i am pretty sure PUSH is faster
also - good practice to adjust ESP before moving data onto the stack, not after (Clive beat me - lol)

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 08:03:28 PM

Thanks dedndave and clive, I didn't know about adjusting ESP before moving data. Why is that?

Also, in the x64 calling convention I can see ESP is fixed up after the procedure.

I will time it when I finish work, with the clocker macros (but I know the answer already)...

Title: Re: PUSH vs Manual Stack
Post by: redskull on June 23, 2010, 08:10:18 PM

It's just not as simple as "100 MOVs + 100 SUBs"; the stack engine keeps track of the "potential" esp value during the decoding process, so you end up with the same thing: 100 MOVs with an offset that's added during the address calculation. If we're oversimplifying things, you get either 100 MOVs using PUSH, or, using your way, 100 MOVs and 1 SUB plus any synchronization ops it has to insert to keep the stack engine honest with the true value. It's way, way, way more complicated than you might think.

-r

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 08:14:21 PM

Quote from: redskull on June 23, 2010, 08:10:18 PM
It's just not as simple as "100 MOVs + 100 SUBs"; the stack engine keeps track of the "potential" esp value during the decoding process, so you end up with the same thing: 100 MOVs with an offset that's added during the address calculation. If we're oversimplifying things, you get either 100 MOVs using PUSH, or, using your way, 100 MOVs and 1 SUB plus any synchronization ops it has to insert to keep the stack engine honest with the true value. It's way, way, way more complicated than you might think.

-r

Thanks redskull. Do you know why x64 uses the new regs instead of the stack? I always thought it was speed related...

Title: Re: PUSH vs Manual Stack
Post by: dedndave on June 23, 2010, 08:37:43 PM

Quote: ...adjusting ESP before moving data. Why is that?

traditionally, the address space above the stack pointer (ESP) is "preserved" - the space below is not
now, there has been a lot of discussion in the forum about how this actually works under Win32 - lol
some will say it is ok to use space below ESP and some will say it is not
it's a good habit to only use stack space above ESP, no matter how windows works
that way, if you start programming for linux or some other OS, you will have the right habit :bg
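
A toy model of why the order matters, in Python rather than assembly. The big assumption here (mine, for illustration only) is that something outside our code, say an interrupt or signal handler, may push onto the same stack at any moment; whether that can actually happen on a given OS is exactly the debate mentioned above. The function name and the values are made up.

```python
def simulate(write_before_adjust):
    """Model a tiny stack where an asynchronous event pushes at ESP.

    Returns what our slot contains after the event has run.
    """
    memory = {}
    esp = 0x1000

    if write_before_adjust:
        memory[esp - 4] = "our value"   # stored below ESP, slot not reserved yet
    else:
        esp -= 4                        # reserve the slot first
        memory[esp] = "our value"       # the slot is now at ESP, so it is in use

    # Hypothetical asynchronous event: pushes one dword at the current ESP.
    memory[esp - 4] = "event data"

    if write_before_adjust:
        esp -= 4                        # the late SUB

    return memory[esp]

print(simulate(write_before_adjust=True))
print(simulate(write_before_adjust=False))
```

With the late SUB the event lands on the same address as our value; with the early SUB it lands one dword lower and our value is untouched.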

Quote: Also, in the x64 calling convention I can see ESP is fixed up after the procedure.

i am guessing that has more to do with stack alignment
in the 64-bit world, the stack should always be 64-aligned
some procedures may not leave it that way, so adjustments are made

:bg you now know about as much as i do about the stack - lol

Title: Re: PUSH vs Manual Stack
Post by: qWord on June 23, 2010, 08:41:18 PM

Quote from: dedndave on June 23, 2010, 08:37:43 PM
in the 64-bit world, the stack should always be 64-aligned

no, 16 Byte aligned (for API's) :P
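
For what the 16-byte rule means numerically, here is a minimal Python sketch. The stack pointer value is made up, and the helper names are mine. Rounding down is used because the stack grows toward lower addresses; in assembly the same masking is commonly written as an AND with -16.

```python
def align_down(address, alignment=16):
    """Round an address down to a multiple of `alignment` (a power of two)."""
    return address & ~(alignment - 1)

def is_aligned(address, alignment=16):
    """True if the address is a multiple of `alignment`."""
    return address % alignment == 0

rsp = 0x12FF58                  # made-up stack pointer value
aligned = align_down(rsp)
print(hex(rsp), is_aligned(rsp))
print(hex(aligned), is_aligned(aligned))
```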

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 08:44:11 PM

Quote from: dedndave on June 23, 2010, 08:37:43 PM
Quote: ...adjusting ESP before moving data. Why is that?

traditionally, the address space above the stack pointer (ESP) is "preserved" - the space below is not
now, there has been a lot of discussion in the forum about how this actually works under Win32 - lol
some will say it is ok to use space below ESP and some will say it is not
it's a good habit to only use stack space above ESP, no matter how windows works
that way, if you start programming for linux or some other OS, you will have the right habit :bg

Quote: Also, in the x64 calling convention I can see ESP is fixed up after the procedure.

i am guessing that has more to do with stack alignment
in the 64-bit world, the stack should always be 64-aligned
some procedures may not leave it that way, so adjustments are made

:bg you now know about as much as i do about the stack - lol

Thanks dedndave, great answer. I was trying to find some documents and papers to read about how the stack works internally, but, you know, Mr Google found nothing...

Still, I would love to have the new regs on x32 like in x64... :(

If you have any document or paper that explains how the stack works internally, please don't hesitate to post it. Thanks again.

PS: The only good document I found: http://www.ece.cmu.edu/~koopman/stack_computers/sec1_2.html

Title: Re: PUSH vs Manual Stack
Post by: qWord on June 23, 2010, 08:46:23 PM

you know this one:
x64 Software Conventions (http://msdn.microsoft.com/en-us/library/7kcdt6fy(v=VS.80).aspx)

Title: Re: PUSH vs Manual Stack
Post by: redskull on June 23, 2010, 08:51:34 PM

Quote from: theunknownguy on June 23, 2010, 08:44:11 PM
If you have any document or paper that explains how the stack works internally, please don't hesitate to post it.

"Internally" there is no stack; the stack is just an area in memory, the same as any other. All the CPU does is automatically adjust the stack pointer as a convenience to you. All the same circuitry is used, whether you MOV to memory or PUSH to it. If you are referring to how it keeps track of the pointer (i.e., the stack engine I mentioned), Agner Fog's stuff is the go-to guide.

-r

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 08:57:33 PM

Quote: you know this one:
x64 Software Conventions

Yeah, read it many times xD

Quote from: redskull
"Internally" there is no stack; the stack is just an area in memory, the same as any other. All the CPU does is automatically adjust the stack pointer as a convenience to you. All the same circuitry is used, whether you MOV to memory or PUSH to it. If you are referring to how it keeps track of the pointer (i.e., the stack engine I mentioned), Agner Fog's stuff is the go-to guide.

Yes, I meant details like how the operations are done in hardware and other in-depth stuff.

But I found this one, I think it explains everything very well:

http://www.ece.cmu.edu/~koopman/stack_computers/sec3_2.html

I dont want to fail again but... using registers is faster than using stack isnt? (For holding arguments)

Title: Re: PUSH vs Manual Stack
Post by: jj2007 on June 23, 2010, 08:58:00 PM

It seems not to matter a lot:

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1049    cycles for mov
1003    cycles for push

Code:

.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm			; get them from the Masm32 Laboratory: http://www.masm32.com/board/index.php?topic=770.0
LOOP_COUNT	= 100000		; 1000000 would be a typical value

.data
Src	db "This is a string, 100 characters long, that serves for a variety of purposes, such as testing algos.", 0

.data?
Dest	db 100 dup(?)

.code
start:
	push 1
	call ShowCpu	; print brand string and SSE level
	invoke Sleep, 100
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
	REPEAT 100
		sub esp, 40		; ten dwords
		mov dword ptr [esp+0], eax
		mov dword ptr [esp+4], ebx
		mov dword ptr [esp+8], ecx
		mov dword ptr [esp+12], edx
		mov dword ptr [esp+16], edi
		mov dword ptr [esp+20], esi
		mov dword ptr [esp+24], ebp
		mov dword ptr [esp+28], eax
		mov dword ptr [esp+32], ebx
		mov dword ptr [esp+36], ecx
		add esp, 40
	ENDM
	counter_end
	print str$(eax), 9, "cycles for mov reg", 13, 10

	invoke Sleep, 100
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
	REPEAT 100
		sub esp, 40		; ten dwords
		mov dword ptr [esp+0], 100
		mov dword ptr [esp+4], 100
		mov dword ptr [esp+8], 100
		mov dword ptr [esp+12], 100
		mov dword ptr [esp+16], 100
		mov dword ptr [esp+20], 100
		mov dword ptr [esp+24], 100
		mov dword ptr [esp+28], 100
		mov dword ptr [esp+32], 100
		mov dword ptr [esp+36], 100
		add esp, 40
	ENDM
	counter_end
	print str$(eax), 9, "cycles for mov 100", 13, 10

	invoke Sleep, 100
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
	REPEAT 100
		REPEAT 2
			push eax
			push ecx
			push edx
			push edi
			push esi
		ENDM
		add esp, 40
	ENDM
	counter_end
	print str$(eax), 9, "cycles for push reg", 13, 10

	invoke Sleep, 100
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
	REPEAT 100
		REPEAT 10
			push 100
		ENDM
		add esp, 40
	ENDM
	counter_end
	print str$(eax), 9, "cycles for push 100", 13, 10

	inkey chr$(13, 10, "--- ok ---", 13)
	exit

ShowCpu proc	; mode:DWORD
COMMENT @ Usage: 
  push 0, call ShowCpu	; simple, no printing, just returns SSE level
  push 1, call ShowCpu	; prints the brand string and returns SSE level@
  pushad
  sub esp, 80	; create a buffer for the brand string
  mov edi, esp		; point edi to it
  xor ebp, ebp
  .Repeat
  	lea eax, [ebp+80000002h]
	db 0Fh, 0A2h	; cpuid 80000002h-80000004h
	stosd
	mov eax, ebx
	stosd
	mov eax, ecx
	stosd
	mov eax, edx
	stosd
	inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h		; cpuid 1
  xor ebx, ebx		; CpuSSE
  xor esi, esi		; add zero plus the carry flag
  bt edx, 25		; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26		; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi		; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9			; ecx bit 9, SSSE3
  adc ebx, esi
  dec dword ptr [esp+4+32+80]	; dec mode in stack
  .if Zero?
	mov edi, esp	; restore pointer to brand string
  	.Repeat				
		.Break .if byte ptr [edi]!=32	; mode was 1, so show a string but skip leading blanks
		inc edi
	.Until 0
	.if byte ptr [edi]<32
		print chr$("pre-P4")
	.else
		print edi	; CpuBrand
	.endif
	.if ebx
		print chr$(32, 40, "SSE")	; info on SSE level, 40=(
		print str$(ebx), 41, 13, 10	; 41=)
	.endif
  .endif
  add esp, 80		; discard brand buffer (after printing!)
  mov [esp+32-4], ebx	; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
	call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

end start

Title: Re: PUSH vs Manual Stack
Post by: clive on June 23, 2010, 09:13:01 PM

Quote from: theunknownguy
I dont want to fail again but... using registers is faster than using stack isnt? (For holding arguments)

Yes, but there are fewer of them. The stacked data will ultimately make it to memory (as the write buffers flush, and the write-back through the caches propagates), but if the values are used quickly they are likely to be very close to the processor, either being forwarded directly to the unit requesting the data, or in the L1 cache.

As JJ notes there isn't much difference in speed, basically it is constrained by the write buffers flushing to the memory subsystem. You can always generate data faster than the memory can absorb it.

Title: Re: PUSH vs Manual Stack
Post by: dedndave on June 23, 2010, 09:18:27 PM

Quote: 16 Byte aligned (for API's)

oops - qWord got me on that one - i dunno what i was thinking - lol

as for speed - take a look at the size of the code generated
you may find the code that uses the stack is smaller
this can make a difference when writing larger loops
it is nice if you can keep a loop under 128 bytes so the branch at the end is SHORT instead of NEAR
the point is - code size may be more important sometimes
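
The SHORT vs NEAR limit can be checked with a bit of arithmetic. A rough Python sketch follows; the function name is mine. It relies on the short branch encoding being 2 bytes (opcode plus a signed 8-bit displacement) and on the displacement being counted from the end of the branch instruction, so the branch's own bytes add to the distance when jumping backward.

```python
SHORT_JUMP_SIZE = 2   # opcode byte + signed 8-bit displacement

def backward_branch_is_short(loop_body_bytes):
    """True if a branch placed right after the loop body can reach the top
    of the loop with a signed 8-bit displacement.
    """
    displacement = -(loop_body_bytes + SHORT_JUMP_SIZE)
    return -128 <= displacement <= 127

for size in (100, 126, 127):
    print(size, backward_branch_is_short(size))
```

So for a loop that ends in a backward branch, the body itself has to stay a couple of bytes under 128 for the SHORT form to fit.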

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 09:21:26 PM

Quote from: clive on June 23, 2010, 09:13:01 PM
Quote from: theunknownguy
I dont want to fail again but... using registers is faster than using stack isnt? (For holding arguments)

Yes, but there are fewer of them. The stacked data will ultimately make it to memory (as the write buffers flush, and the write-back through the caches propagates), but if the values are used quickly they are likely to be very close to the processor, either being forwarded directly to the unit requesting the data, or in the L1 cache.

As JJ notes there isn't much difference in speed, basically it is constrained by the write buffers flushing to the memory subsystem. You can always generate data faster than the memory can absorb it.

Does this same explanation go for the CALL opcode too?

Code:

PUSH RetnOff
PUSH Procedure
Ret
RetnOff:

Instead of just CALL Procedure

Quote: as for speed - take a look at the size of the code generated
you may find the code that uses the stack is smaller
this can make a difference when writing larger loops
it is nice if you can keep a loop under 128 bytes so the branch at the end is SHORT instead of NEAR
the point is - code size may be more important sometimes

Lol, you got me on that, I hadn't thought of it... :cheekygreen:

Title: Re: PUSH vs Manual Stack
Post by: dedndave on June 23, 2010, 09:24:05 PM

if pushing the return address and branching like that were more efficient, we'd have macros to do it for us :P
............ and we'd all be using them, too

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 09:27:49 PM

Quote from: dedndave on June 23, 2010, 09:24:05 PM
if pushing the return address and branching like that were more efficient, we'd have macros to do it for us :P
............ and we'd all be using them, too

Yeah, I knew CALL was faster, but I wanted to know if clive's explanation fits the CALL opcode, since from my point of view you can emulate it with PUSH and RET.

You know, I can't find many docs on how things like the CALL or PUSH opcodes work in depth (at hardware level), so I am just killing you guys with the questions... sorry.

Quote: it looks like you have a pretty good handle on it, already - lol
although, the micro-code probably goes something like this

Code:
        PUSH    RetnOff
        JMP     Procedure
RetnOff:

Got me again... :lol :lol Or it could be any conditional jump to avoid the JMP, if some flag was modified before...

Title: Re: PUSH vs Manual Stack
Post by: dedndave on June 23, 2010, 09:32:44 PM

Quote
Code:
PUSH RetnOff
PUSH Procedure
Ret
RetnOff:

it looks like you have a pretty good handle on it, already - lol
although, the micro-code probably goes something like this

Code:

        PUSH    RetnOff
        JMP     Procedure
RetnOff:

Title: Re: PUSH vs Manual Stack
Post by: clive on June 23, 2010, 09:39:49 PM

Quote from: theunknownguy
Does this same explanation go for the CALL opcode too?

Code:
PUSH RetnOff
PUSH Procedure
Ret
RetnOff:

Instead of just CALL Procedure

Well you have to be careful there, as redskull has hinted, there are architecture issues with that. The CALL/RET are easier for the branch prediction to follow. Whenever you cause a mis-predict you end up eating some 20-30 cycles, depending on the CPU, as it refills the execution pipeline. It is often quite easy to do.

There are valid reasons to use that construction, especially with segmented memory, or protected mode, or situations where the assembler/linker/loader can't handle dynamic run time behaviour.
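
One way to picture why CALL/RET pairs are easier to predict is a toy Python model in which the predictor keeps its own small stack of return addresses that only CALL feeds. That mechanism is my simplifying assumption for the sketch, not a description of any particular CPU, and the function name is made up. A RET reached via plain PUSHes then has nothing matching to predict from.

```python
def count_return_mispredicts(instructions):
    """Count RETs whose predicted target differs from the real one.

    Simplified model: the predictor stack is pushed by CALL and popped by
    RET; the real stack also sees every PUSH.
    """
    real_stack, predictor_stack, mispredicts = [], [], 0
    for op, value in instructions:
        if op == "call":            # value = return address
            real_stack.append(value)
            predictor_stack.append(value)
        elif op == "push":          # value = any dword, invisible to the predictor
            real_stack.append(value)
        elif op == "ret":
            actual = real_stack.pop()
            predicted = predictor_stack.pop() if predictor_stack else None
            if predicted != actual:
                mispredicts += 1
    return mispredicts

# CALL Procedure ... RET
normal = [("call", "RetnOff"), ("ret", None)]
# PUSH RetnOff / PUSH Procedure / RET (enters Procedure) ... RET (back to RetnOff)
emulated = [("push", "RetnOff"), ("push", "Procedure"), ("ret", None), ("ret", None)]

print(count_return_mispredicts(normal))
print(count_return_mispredicts(emulated))
```

In this model both RETs of the emulated sequence miss, while the ordinary CALL/RET pair does not miss at all.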

How about? oh crap dave's in my head

Code:

PUSH RetnOff
JMP Procedure
RetnOff:

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 09:45:38 PM

Just an off-topic question:

How old are you, dave and clive? :eek

And thanks for clarifying my questions

Title: Re: PUSH vs Manual Stack
Post by: clive on June 23, 2010, 09:48:39 PM

Here's some context switching code I wrote yesterday running some different FLAT memory code in some virtual space from another FLAT memory host. Fun and joy with segments in FLAT land, and using NEAR/FAR calls and segment overrides.

Code:

SysExec PROC near c public SelCode:DWORD, SelData:DWORD
        push    ebx
        push    esi
        push    edi

        push    ds
        push    es

        mov     ecx,SelCode
        mov     edx,SelData

        mov     ds,edx          ; DS = Data Segment
        mov     es,edx          ; ES = Data Segment

        mov     eax,ss
        mov     ebx,esp

        mov     dword ptr ds:[0200h],ebx ; Original ESP
        mov     dword ptr ds:[0204h],eax ; Original SS

        mov     eax,010000h     ; EIP
        mov     ss,edx          ; SS = Data Segment
        lea     esp,[eax - 4]   ; ESP within GHS arena

        push    ecx     ; Segment
        push    eax     ; Offset

        retf    ; Jump to Segment:Offset, setting CS:EIP

; Doesn't get here

SysExec ENDP

Title: Re: PUSH vs Manual Stack
Post by: qWord on June 23, 2010, 10:07:56 PM

just for the lucky x64 users: a small test bed.
For assembling you need jwasm and Japhet's Windows.inc (Win32Inc).

result on my c2d:

Code:

push: 510
mov: 489
push const: 505
mov const: 495
Press any key to continue ...

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 23, 2010, 10:09:53 PM

Quote from: qWord on June 23, 2010, 10:07:56 PM
just for the lucky x64 users: a small test bed.
For assembling you need jwasm and Japhet's Windows.inc (Win32Inc).

result on my c2d:
Code:
push: 510
mov: 489
push const: 505
mov const: 495
Press any key to continue ...

:eek :eek :eek :eek Damn... I want those regs on x32 :(

I can't switch to x64 yet, I need to finish my work on x32 first, but god, I would love to avoid the PUSH for security reasons...

Title: Re: PUSH vs Manual Stack
Post by: Rockoon on June 23, 2010, 10:25:19 PM

Phenom II x6 1055T @3.36ghz:

push: 236
mov: 236
push const: 236
mov const: 236
Press any key to continue ...
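
To put the three sets of numbers posted so far side by side, here is a small Python calculation of the relative differences. Only the cycle counts come from the thread; the labels and the helper name are mine.

```python
def percent_slower(a, b):
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100.0

# (slower count, faster count) pairs taken from the posted results
results = {
    "jj2007 Celeron M, mov vs push": (1049, 1003),
    "qWord C2D, push vs mov": (510, 489),
    "qWord C2D, push const vs mov const": (505, 495),
    "Rockoon Phenom II, push vs mov": (236, 236),
}

for label, (a, b) in results.items():
    print(f"{label}: {percent_slower(a, b):.1f}%")
```

On these figures the gap stays under 5% everywhere, and which variant comes out ahead differs between the machines.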

Title: Re: PUSH vs Manual Stack
Post by: redskull on June 23, 2010, 11:39:50 PM

Here is some info I wrote up on the stack engines; if anyone knows differently, please point it out.

First, some preliminaries:

Modern Intel chips (the P6 family, i.e. Pentium Pro and PII, and everything since) use a radically different approach than earlier-era chips; in short, while still a "CISC" chip on the outside, they operate as "RISC" chips under the hood. They do this by breaking up instructions into "micro ops", or just "uops" (where 'u' is supposed to be the Greek letter 'mu', the metric prefix for micro).
One "complicated" instruction is broken down (by the decoding stage) into several, simpler uops. For example:

CALL MyFunction

can be thought of (conceptually) as:

PUSH EIP
JMP MyFunction

which can be further broken down into

SUB ESP,4
MOV [ESP],EIP
MOV EIP, OFFSET MyFunction

and so on. All these uops from the instruction stream are then directed to the appropriate part of the chip, called an *Execution Unit*, or EU. Normally, Intel chips have around 4 or 5 different EU's, which each handle a different type of instruction (depending on your particular chip, most have more than one ALU):

1) Arithmetic Logic Unit (ALU)
2) Memory Reads
3) Address calculation
4) Memory Write

So, for example, the three hypothetical uops from above would be sent to the ALU (for the SUB), the Memory Writer (for the MOV), and again to the ALU (for the second reg-reg MOV). Each EU can operate independently of the other, so different uops from different instructions can execute "out of order"; this allows the CPU to work on other parts of other instructions, without having to wait while slow, unrelated ones finish.
There is much more to be said about this, and this barely scratches the surface. It's an extremely complex system which determines what uops can be executed, which ones have to wait for others to finish, and when an entire instruction is complete. The trick to *real* optimization is making sure that all EU's are filled with uops, all the time.
Anyway, onto the stack engine itself:

The "Stack Engine"

Older CPUs basically work like above: manipulating the stack pointer (ESP) with ALU uops, which perform the adding or subtracting. Newer CPUs have what's called a STACK ENGINE, which is special circuitry dedicated only to adjusting the stack.
The stack engine lives as part of the decoder (which generates uops). It exists to optimize just four different instructions: PUSH, POP, CALL, and RET (but *not* RET n). These have the unique property that they all adjust ESP by *exactly* 4 bytes, every time, no matter what.
It does this by keeping track of the "stack delta", which is just the relative difference between the stack pointer "now" and the stack pointer "later". Each time it detects one of these four special instructions, it alters its delta number up or down by four as needed.
The magic happens, though, when it comes time to generate the uops; instead of generating one for the stack pointer math and one for the move, it generates just the one for the write, *but inserts the delta number into the address*. Because all memory writes must go to through the address calculation, there is no performance loss, and an entire ALU uop is avoided.
For example, consider the above example, where our 'CALL' was turned into three (purely hypothetical) uops:

SUB ESP,4
MOV [ESP],EIP
MOV EIP, OFFSET MyFunction

We'll assume the current delta in the stack engine is 0. It notices that this is a "CALL", and adjusts its stack delta to -4. Then it removes the "SUB" uop entirely, and adds its current delta into the MOV instruction:

MOV [ESP-4],EIP ; Current Delta value inserted here
MOV EIP, OFFSET MyFunction

This cuts down an entire uop! Considering that most programming is made of CALLs, and most functions have PUSHed arguments, it's a non-trivial speedup. The stack engine is fast enough to keep pace with the decoder as well, so there is no bottleneck in doing this conversion.
To extend the example, imagine 3 PUSHes and 1 CALL; The first PUSH would use the delta of -4, the next of -8, then -12, then -16. RET and POP work opposite; they increase the delta, and add that value to the memory access uop.
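
The delta bookkeeping just described can be mimicked in a few lines of Python. This is only a model of the description above, with a function name of my own; the detail that POP/RET read at the current delta before raising it is my assumption about the ordering.

```python
def access_offsets(ops):
    """Return the ESP-relative offset each stack instruction's memory access uses.

    PUSH/CALL lower the delta by 4 and store at the new delta; POP/RET load
    at the current delta and then raise it by 4. The real ESP register is
    never touched in this model.
    """
    delta = 0
    offsets = []
    for op in ops:
        if op in ("push", "call"):
            delta -= 4
            offsets.append(delta)
        elif op in ("pop", "ret"):
            offsets.append(delta)
            delta += 4
        else:
            raise ValueError(f"not a stack-engine instruction: {op}")
    return offsets

print(access_offsets(["push", "push", "push", "call"]))
print(access_offsets(["push", "pop"]))
```

The first line reproduces the -4, -8, -12, -16 sequence from the 3 PUSH + 1 CALL example.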
The first problem, however, is that now the "real" value of ESP (inside the CPU) is no longer correct. We never actually modified it, so if another instruction wants to use it, it would be out of sync. Continuing with the above, imagine that after our CALL, our function sets up a stack frame:

MOV EBP,ESP

This is troublesome, because ESP was never changed! When the stack engine detects another instruction (not PUSH/POP/CALL/RET) is using the stack pointer in some way, and the delta is non-zero, it must "synchronize" the two. It does this by merely inserting a "synchronization uop", or just synch op. It does nothing but add the value of the delta to ESP.

ADD ESP,STACK_DELTA ; this uop corrects ESP (and the engine zeros the delta)
MOV EBP,ESP ; now ESP is correct, and can be used safely in this uop

The second problem, however, is that the stack engine delta is only an 8-bit signed integer, which rolls over at +/- 128 (that's 32 consecutive stack operations in one direction). To avoid this, when the stack delta gets high enough, the engine inserts the same style of synch op to reset the delta back to zero. So, in the rare case that you do 100 PUSHes in a row, you will get 3 extra synchronization uops inserted into the stream to prevent this rollover.
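
Both sources of synch ops (rollover and a direct use of ESP) can be counted with a small Python model. It follows the figures given above, takes the limit as 128 bytes, and only models pushes; the event names `push` and `use_esp` and the function name are made up for the sketch.

```python
DELTA_LIMIT = 128   # bytes; the 8-bit range quoted above

def count_sync_uops(ops):
    """Count synchronization uops for a stream of 'push' and 'use_esp' events.

    Each push lowers the delta by 4. A sync uop is inserted (and the delta
    zeroed) either when the next push would exceed the limit, or when an
    instruction reads ESP directly while the delta is non-zero.
    """
    delta = 0
    syncs = 0
    for op in ops:
        if op == "push":
            if abs(delta - 4) > DELTA_LIMIT:
                syncs += 1
                delta = 0
            delta -= 4
        elif op == "use_esp":
            if delta != 0:
                syncs += 1
                delta = 0
    return syncs

print(count_sync_uops(["push"] * 100))               # rollover syncs only
print(count_sync_uops(["push"] * 3 + ["use_esp"]))   # sync forced by the ESP use
```

For 100 pushes in a row the model arrives at the same 3 extra uops mentioned above.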

So, to sum up, if you use "MOV [ESP-n], reg", followed by a "SUB ESP,m", you merely do what the CPU is programmed to do automatically when you use PUSH. The only real difference is that you explicitly set the value of ESP at the end, which the CPU will delay from doing until necessary (when the first direct access to ESP occurs, via a synch uop). However, you will almost undoubtedly suffer another synch op at the *start* of the MOVs, because most likely the delta is not zero at that point. In the rare case that you have over 32 arguments to PUSH, you would actually save one or two synch ops, which would be needed to prevent rollover. However, in either case, the difference is only a couple uops either way, and certainly not anything other than negligible.

Hope this helps clear things up

-r

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 24, 2010, 12:13:42 AM

Just amazing info, redskull, that's what I was trying to understand by "in depth".

Can i know how you learn this stuff? i really like to understand everything the most detailed possible.

Quote: These have the unique property that they all adjust ESP by *exactly* 4 bytes, every time, no matter what.

What happen with the PUSH WORD?

Title: Re: PUSH vs Manual Stack
Post by: redskull on June 24, 2010, 01:02:52 AM

Quote from: theunknownguy on June 24, 2010, 12:13:42 AM
Can i know how you learn this stuff? i really like to understand everything the most detailed possible.

All that is from Agner Fog's simply amazing microarchitecture manuals. It's pretty dense, but well worth the read, and totally free.

Quote from: theunknownguy on June 24, 2010, 12:13:42 AM
What happen with the PUSH WORD?

Your Program crashes :8)

But really, I have no idea. I would presume it would just do it the "normal" way, by adjusting ESP via a normal ALU uop.

-r

Title: Re: PUSH vs Manual Stack
Post by: jMerliN on June 24, 2010, 01:05:57 AM

To clarify where this question came from and just why the OP asked anything about this here, I'll bring up the argument he made (and lost, clearly misunderstanding anything about what I said) and how he contorted the argument into something about push vs a high speed string instruction to move a large amount of data onto the stack, and for some reason or another he felt justified in talking about x64 calling conventions, when we're really only talking about a local stack frame.

On RZ, he comes asking a question of "high level minds," (for full disclosure, view the thread here: http://forum.ragezone.com/f144/testing-high-level-minds-669035/) wanting to know what we think certain things are. He gets the English horribly wrong, and then proceeds to bash any answer anyone gives as if he's some all knowing superior intellect (he hangs around the MASM32 forums, but this doesn't necessitate any level of knowledge of how things work, he's probably never even had a 10K project before).

So in arguing with another user on that forum, he posts this code, claiming this was an "implementation" of someone's abstract definition of a stack (which was for all intents and purposes, correct):

Code:

push ebp
mov ebp, esp
mov eax, NumberOfVariables  ;Number of variables you have
imul eax, eax, DWORD        ;Initialise each variable to 4 bytes
add esp, eax
mov [esp], 1                ;Lets put a variable with value 1
pop edx                     ;Restore it on EDX
mov [esp+4], 1              ;Next variable with value 1
pop eax                     ;Restore on EAX
add eax, edx                ;1+1 is so simple has knowing you are a noob
push eax                    ;Push it for save into stack

I reply to this, later in the thread while flaming the kid for what I can only interpret as utter stupidity, as in his example he USES a stack to set up a common stack frame with a dynamic number of local variables and then proceeds to do this stupid arithmetic.

Do note though, the errors present if we assume the traditional call model, he's adding to esp for local vars, not subbing, so he'd overwrite some value in the local space of the calling function (wut?). Further, when he "sets the variables" to value 1, pop is adding 4 intrinsically to esp each call, so [esp+4] in the second variable will be the original esp + 8, so he's skipping 4 bytes there. But loosely trying to interpret what he meant, we'll continue (that is, from his comments).

My response was if this was the result of a compiler producing ASM from C code ("high level" right, hence it's clearly vastly inferior to anything he can write), it would look something like.. this:

Code Select


int myfunc(){
  int one = 1;
  int one2 = 1;
  one2 = one2 + one;
  // ...
}

My argument was that any compiler worth its salt with a decent optimizer would produce assembly for this code, with only what we know about it, as "push 2" (in accordance with his last comment, "save into stack", and because the frame he sets up is unnecessary to the end goal, which is computing 1+1 statically). I claimed that this single instruction was more efficient and produced the same exact effect as the code he posted (assuming it does what he meant), and clearly, not to insult anyone's intelligence, it is.

His response to this was some nonsensical bullshit I can't comprehend:

Quote
Thats kind of sad really, so the mov [esp], XX its not faster than a push 2?...

Curious since push inside CPU would do:

Code Select

mov [esp], XX
add esp, 4

A little more lecture kid...

You see he completely misunderstood what I said. So this thread was to try to prove his precious "mov [esp], xx" is faster than "push 2". He further demonstrates what intrinsically happens with a push (as explained in a good bit of detail above), which I respond is faster because of optimizations made in hardware (which he apparently did not believe, so I had assumed he doesn't read much). The argument was never about whether pushing a large amount of data on the stack is less efficient than using a high speed string instruction to fill in a large chunk of the stack followed by adjusting esp, that was borne of his own mind and ignorance.

This very thread culminated in the flame posted here: http://forum.ragezone.com/f144/x32-64-push-vs-mov-test-671634/ .

I know you're not the minions of a 15 year old kid who barely knows English, and you don't intend to empower an astonishing level of ignorance on the internet, but I just thought I'd point out to you where this came from, and that this kid is just another leech.

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 24, 2010, 01:08:11 AM

Quote from: redskull on June 24, 2010, 01:02:52 AM
Quote from: theunknownguy on June 24, 2010, 12:13:42 AM
Can i know how you learn this stuff? i really like to understand everything the most detailed possible.

All that is from Agner Fog's simply amazing microarchitecture manuals. They're pretty dense, but well worth the read, and totally free.

Quote from: theunknownguy on June 24, 2010, 12:13:42 AM
What happen with the PUSH WORD?

Your Program crashes :8)

But really, I have no idea. I would presume it would just do it the "normal" way, by adjusting ESP via a normal ALU uop.

-r

push 3131
pop word []
push 3131
pop word []

I use this as a method to generate an obfuscated NULL on the stack without a crash; at first it messes up the stack viewer under debuggers...

I will read more of Agner Fog's manuals =D

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 24, 2010, 01:12:50 AM

Quote from: jMerliN on June 24, 2010, 01:05:57 AM
To clarify where this question came from and just why the OP asked anything about this here, I'll bring up the argument he made (and lost, clearly misunderstanding anything about what I said) and how he contorted the argument into something about push vs a high speed string instruction to move a large amount of data onto the stack, and for some reason or another he felt justified in talking about x64 calling conventions, when we're really only talking about a local stack frame.

On RZ, he comes asking a question of "high level minds," (for full disclosure, view the thread here: http://forum.ragezone.com/f144/testing-high-level-minds-669035/) wanting to know what we think certain things are. He gets the English horribly wrong, and then proceeds to bash any answer anyone gives as if he's some all knowing superior intellect (he hangs around the MASM32 forums, but this doesn't necessitate any level of knowledge of how things work, he's probably never even had a 10K project before).

So in arguing with another user on that forum, he posts this code, claiming this was an "implementation" of someone's abstract definition of a stack (which was for all intents and purposes, correct):

Code Select

push ebp
mov ebp, esp
mov eax, NumberOfVariables  ;Number of variables you have
imul eax, eax, DWORD        ;Initialise each variable to 4 bytes
add esp, eax
mov [esp], 1                ;Lets put a variable with value 1
pop edx                     ;Restore it on EDX
mov [esp+4], 1              ;Next variable with value 1
pop eax                     ;Restore on EAX
add eax, edx                ;1+1 is so simple has knowing you are a noob
push eax                    ;Push it for save into stack

I reply to this, later in the thread while flaming the kid for what I can only interpret as utter stupidity, as in his example he USES a stack to set up a common stack frame with a dynamic number of local variables and then proceeds to do this stupid arithmetic.

Do note though, the errors present if we assume the traditional call model, he's adding to esp for local vars, not subbing, so he'd overwrite some value in the local space of the calling function (wut?). Further, when he "sets the variables" to value 1, pop is adding 4 intrinsically to esp each call, so [esp+4] in the second variable will be the original esp + 8, so he's skipping 4 bytes there. But loosely trying to interpret what he meant, we'll continue (that is, from his comments).

My response was if this was the result of a compiler producing ASM from C code ("high level" right, hence it's clearly vastly inferior to anything he can write), it would look something like.. this:

Code Select

int myfunc(){
  int one = 1;
  int one2 = 1;
  one2 = one2 + one;
  // ...
}

My argument was that any compiler worth its salt with a decent optimizer would produce assembly for this code, with only what we know about it, as "push 2" (in accordance with his last comment, "save into stack", and because the frame he sets up is unnecessary to the end goal, which is computing 1+1 statically). I claimed that this single instruction was more efficient and produced the same exact effect as the code he posted (assuming it does what he meant), and clearly, not to insult anyone's intelligence, it is.

His response to this was some nonsensical bullshit I can't comprehend:

Quote
Thats kind of sad really, so the mov [esp], XX its not faster than a push 2?...

Curious since push inside CPU would do:

Code Select

mov [esp], XX
add esp, 4

A little more lecture kid...

You see he completely misunderstood what I said. So this thread was to try to prove his precious "mov [esp], xx" is faster than "push 2". He further demonstrates what intrinsically happens with a push (as explained in a good bit of detail above), which I respond is faster because of optimizations made in hardware (which he apparently did not believe, so I had assumed he doesn't read much). The argument was never about whether pushing a large amount of data on the stack is less efficient than using a high speed string instruction to fill in a large chunk of the stack followed by adjusting esp, that was borne of his own mind and ignorance.

This very thread culminated in the flame posted here: http://forum.ragezone.com/f144/x32-64-push-vs-mov-test-671634/ .

I know you're not the minions of a 15 year old kid who barely knows English, and you don't intend to empower an astonishing level of ignorance on the internet, but I just thought I'd point out to you where this came from, and that this kid is just another leech.

As you can read, this post is serious; the idea comes from x64 register usage, as I posted on the Ragezone forum (you just have a problem understanding me).

And don't dare use this forum as a subject of that discussion.

Still, as I say over and over, for large data on the stack using the "REP" prefix is more efficient than doing PUSH 2 tons of times.

And I still say a COMPILER can't do it.

PS: Also, as I argued in that thread, using direct MOVs to the stack could avoid the "stack frame" generation done by compilers like the C++ ones.

I think this is more like a high-level language fight against low level, which is pointless.

Title: Re: PUSH vs Manual Stack
Post by: redskull on June 24, 2010, 01:19:51 AM

Quote from: theunknownguy on June 24, 2010, 01:12:50 AM
Still, as I say over and over, for large data on the stack using the "REP" prefix is more efficient than doing PUSH 2 tons of times.

Be careful what you say; few things on Intel are as slow as REP instructions. Also, as a courtesy, please keep your other-forum flame wars in the other forums. Do like the rest of us and start your own here with our own members :toothy

-r

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 24, 2010, 01:23:44 AM

Quote from: redskull on June 24, 2010, 01:19:51 AM
Quote from: theunknownguy on June 24, 2010, 01:12:50 AM
Still, as I say over and over, for large data on the stack using the "REP" prefix is more efficient than doing PUSH 2 tons of times.

Be careful what you say; few things on Intel are as slow as REP instructions. Also, as a courtesy, please keep your other-forum flame wars in the other forums. Do like the rest of us and start your own here with our own members :toothy

-r

I ran a test that fills up a large space on the stack; you're right that "REP" is one of the slowest sh*ts, but it works fairly well.

Also, what about C++ compilers? Isn't it fair to think that using direct stack MOVs could avoid the stack frame creation (and save some clocks)?

And I'm not involving this forum in other forums' fights; if you read the thread carefully, I was making a survey that I need to present in another forum dedicated to the understanding of high-level logic.

PS: I can't do my survey here; I mean, this is a low-level coding forum... Though I could do some research on low-level logic abstraction with a few questions, but damn, I would be the one questioned... (on this forum) :toothy

Title: Re: PUSH vs Manual Stack
Post by: jMerliN on June 24, 2010, 01:29:28 AM

Quote from: theunknownguy on June 24, 2010, 01:12:50 AM
As you can read, this post is serious; the idea comes from x64 register usage, as I posted on the Ragezone forum (you just have a problem understanding me).

And don't dare use this forum as a subject of that discussion.

Still, as I say over and over, for large data on the stack using the "REP" prefix is more efficient than doing PUSH 2 tons of times.

And I still say a COMPILER can't do it.

PS: Also, as I argued in that thread, using direct MOVs to the stack could avoid the "stack frame" generation done by compilers like the C++ ones.

I think this is more like a high-level language fight against low level, which is pointless.

The discussion wasn't about the use of registers to pass data to function calls at all. You still don't understand, I see.

You have a horrible grasp of optimizations that can be made from a syntax slightly higher than one of a macro assembly. You should read http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886 , I suggest this only because of the price at which you can get a used one (practically free), the newer edition is quite expensive still, but may cover more recent optimization techniques. You don't seem to understand what exactly a compiler does, and how it can easily reduce code like that to highly optimized assembly with just as much (if not more) knowledge than you have of hand optimizing assembly.

Let me give you a demonstration, as this is not a "high level" vs "low level" fight, the point was just that even a "high level" language (which to you means completely inferior in every way) such as C could produce output that would do the same thing with far less work than your assembly was doing.

Code Select


#include <stdio.h>

int main(int argc, char** argv){
  int a = 1;
  int b = 1;
  b = a + b;
  printf("a+b = %d",b);
  return 0;
}

Compiling this then opening it with OllyDbg yields:

Code Select


01221001   6A 02            PUSH 2
01221003   68 F4202201      PUSH OFFSET test.??_C@_08CLODKBON@a?$CLb>; ASCII "a+b = %d"
01221008   FF15 A0202201    CALL DWORD PTR DS:[<&MSVCR100.printf>]   ; MSVCR100.printf
0122100E   83C4 08          ADD ESP,8
01221011   33C0             XOR EAX,EAX
01221013   C3               RETN

As you can see, my point has been proved. Thanks much for the disbelief.

Quote from: redskull on June 24, 2010, 01:19:51 AM
Quote from: theunknownguy on June 24, 2010, 01:12:50 AM
Still, as I say over and over, for large data on the stack using the "REP" prefix is more efficient than doing PUSH 2 tons of times.

Be careful what you say; few things on Intel are as slow as REP instructions. Also, as a courtesy, please keep your other-forum flame wars in the other forums. Do like the rest of us and start your own here with our own members :toothy

-r

I will gladly, if you'll keep your members from starting flamewars on other forums then coming back here for arguments.

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 24, 2010, 01:37:38 AM

Quote from: jMerliN on June 24, 2010, 01:29:28 AM
Quote from: theunknownguy on June 24, 2010, 01:12:50 AM
As you can read, this post is serious; the idea comes from x64 register usage, as I posted on the Ragezone forum (you just have a problem understanding me).

And don't dare use this forum as a subject of that discussion.

Still, as I say over and over, for large data on the stack using the "REP" prefix is more efficient than doing PUSH 2 tons of times.

And I still say a COMPILER can't do it.

PS: Also, as I argued in that thread, using direct MOVs to the stack could avoid the "stack frame" generation done by compilers like the C++ ones.

I think this is more like a high-level language fight against low level, which is pointless.

The discussion wasn't about the use of registers to pass data to function calls at all. You still don't understand, I see.

You have a horrible grasp of optimizations that can be made from a syntax slightly higher than one of a macro assembly. You should read http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886 , I suggest this only because of the price at which you can get a used one (practically free), the newer edition is quite expensive still, but may cover more recent optimization techniques. You don't seem to understand what exactly a compiler does, and how it can easily reduce code like that to highly optimized assembly with just as much (if not more) knowledge than you have of hand optimizing assembly.

Let me give you a demonstration, as this is not a "high level" vs "low level" fight, the point was just that even a "high level" language (which to you means completely inferior in every way) such as C could produce output that would do the same thing with far less work than your assembly was doing.

Code Select

#include <stdio.h>

int main(int argc, char** argv){
  int a = 1;
  int b = 1;
  b = a + b;
  printf("a+b = %d",b);
  return 0;
}

Compiling this then opening it with OllyDbg yields:

Code Select

01221001   6A 02            PUSH 2
01221003   68 F4202201      PUSH OFFSET test.??_C@_08CLODKBON@a?$CLb>; ASCII "a+b = %d"
01221008   FF15 A0202201    CALL DWORD PTR DS:[<&MSVCR100.printf>]   ; MSVCR100.printf
0122100E   83C4 08          ADD ESP,8
01221011   33C0             XOR EAX,EAX
01221013   C3               RETN

As you can see, my point has been proved. Thanks much for the disbelief.

I can't see how what you're writing relates to what I am saying.

I am saying that C++ compilers will never (by themselves) use the "REP" prefix when you want to fill the stack with some data...

My point is that by using the stack as "emulated" x64 regs you could avoid setting up the "stack frame", and in the end you'll save some clocks.
(Under x64 you can do this with the new regs)

Also, things like this in a high-level language like C++ I will never see without the _ASM directive.

I keep saying C++ compilers have their limitations on "optimising".

Title: Re: PUSH vs Manual Stack
Post by: jMerliN on June 25, 2010, 12:36:50 AM

Quote from: theunknownguy on June 24, 2010, 01:37:38 AM
I can't see how what you're writing relates to what I am saying.

I am saying that C++ compilers will never (by themselves) use the "REP" prefix when you want to fill the stack with some data...

This is not true. There are several things you can do to make the popular compilers use REP to fill in stack data in C/C++ if you take advantage of their optimization engines.

Quote
My point is that by using the stack as "emulated" x64 regs you could avoid setting up the "stack frame", and in the end you'll save some clocks.
(Under x64 you can do this with the new regs)

The stack isn't being used to emulate registers. It's used to hold data; that's its purpose. Even with the 16 general-purpose registers in x64, most functions will still need a stack frame for stack-allocated objects (rule #1: don't put on the heap what you can put on the stack). Call overhead will be reduced a great deal if you make good use of the new registers.

Quote
Also, things like this in a high-level language like C++ I will never see without the _ASM directive.

I keep saying C++ compilers have their limitations on "optimising".

You should reverse more C++ applications. I spend a lot of time reverse engineering the ASM output of the various C++ compilers on different optimization levels and there's very little I've seen that they won't produce. Despite what you think, the people writing compilers aren't complete morons, and they have access to all the information about writing efficient ASM programs that you do, so to believe they don't try to optimize their output is an insulting mistake.

I will say though, the people who wrote the VB compilers pre-.NET were stupid and should never be hired for serious programming again. By far the worst assembly generation I've ever seen in my life.

Title: Re: PUSH vs Manual Stack
Post by: theunknownguy on June 25, 2010, 12:42:28 AM

Quote from: jMerliN on June 25, 2010, 12:36:50 AM
Quote from: theunknownguy on June 24, 2010, 01:37:38 AM
I can't see how what you're writing relates to what I am saying.

I am saying that C++ compilers will never (by themselves) use the "REP" prefix when you want to fill the stack with some data...

This is not true. There are several things you can do to make the popular compilers use REP to fill in stack data in C/C++ if you take advantage of their optimization engines.

Quote
My point is that by using the stack as "emulated" x64 regs you could avoid setting up the "stack frame", and in the end you'll save some clocks.
(Under x64 you can do this with the new regs)

The stack isn't being used to emulate registers. It's used to hold data; that's its purpose. Even with the 16 general-purpose registers in x64, most functions will still need a stack frame for stack-allocated objects (rule #1: don't put on the heap what you can put on the stack). Call overhead will be reduced a great deal if you make good use of the new registers.

Quote
Also, things like this in a high-level language like C++ I will never see without the _ASM directive.

I keep saying C++ compilers have their limitations on "optimising".

You should reverse more C++ applications. I spend a lot of time reverse engineering the ASM output of the various C++ compilers on different optimization levels and there's very little I've seen that they won't produce. Despite what you think, the people writing compilers aren't complete morons, and they have access to all the information about writing efficient ASM programs that you do, so to believe they don't try to optimize their output is an insulting mistake.

I will say though, the people who wrote the VB compilers pre-.NET were stupid and should never be hired for serious programming again. By far the worst assembly generation I've ever seen in my life.

Not feeling like discussing today, just getting back from the office.

But this is what I meant by emulating registers on the stack:

From Agner's microarchitecture manual:

Code Select


It may be possible to avoid stack synchronization µops completely in a critical function if all
function parameters are transferred in registers and all local variables are stored in registers
or with PUSH and POP. This is most realistic with the calling conventions of 64-bit Linux. Any
necessary alignment of the stack can be done with a dummy PUSH instruction in this case.

In x64 you don't need them, since you already have the new regs...

And I don't want to discuss C++ compilers, it's just pointless; they can't even automatically align the code of a critical loop to 16 bytes...

PS: "This is most realistic with calling conventions of 64-bit Linux"... (And x64 microsoft?) ::)

The MASM Forum Archive 2004 to 2012

General Forums => The Workshop => Topic started by: theunknownguy on June 23, 2010, 06:55:46 PM