
Regarding Stack

Started by theunknownguy, June 23, 2010, 06:55:46 PM


qWord

FPU in a trice: SmplMath
It's that simple!

redskull

Quote from: theunknownguy on June 23, 2010, 08:44:11 PM
If you have any document or paper that explains how the stack works internally, please don't hesitate to post it.

"Internally" there is no stack; the stack is just an area in memory, the same as any other.  All the CPU does is automatically adjust the stack pointer as a convience to you.  All the same circuity is used, whether you MOV to memory or PUSH to it.  If you are referring to how it keeps track of the pointer (ie, the stack engine I mentioned), Anger Fogs stuff is the go-to-guide.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

theunknownguy

Quote
you know this one:
x64 Software Conventions

Yeah, I've read it many times xD

"Internally" there is no stack; the stack is just an area in memory, the same as any other.  All the CPU does is automatically adjust the stack pointer as a convience to you.  All the same circuity is used, whether you MOV to memory or PUSH to it.  If you are referring to how it keeps track of the pointer (ie, the stack engine I mentioned), Anger Fogs stuff is the go-to-guide.

Yes, I meant the details, like how the operations are done in hardware and other in-depth stuff.

But I found this one; I think it explains everything very well:

http://www.ece.cmu.edu/~koopman/stack_computers/sec3_2.html

I don't want to get it wrong again, but... using registers is faster than using the stack, isn't it? (For holding arguments)

jj2007

It doesn't seem to matter much:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1049    cycles for mov
1003    cycles for push


.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory: http://www.masm32.com/board/index.php?topic=770.0
LOOP_COUNT = 100000 ; 1000000 would be a typical value

.data
Src db "This is a string, 100 characters long, that serves for a variety of purposes, such as testing algos.", 0

.data?
Dest db 100 dup(?)

.code
start:
push 1
call ShowCpu ; print brand string and SSE level
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
sub esp, 40 ; ten dwords
mov dword ptr [esp+0], eax
mov dword ptr [esp+4], ebx
mov dword ptr [esp+8], ecx
mov dword ptr [esp+12], edx
mov dword ptr [esp+16], edi
mov dword ptr [esp+20], esi
mov dword ptr [esp+24], ebp
mov dword ptr [esp+28], eax
mov dword ptr [esp+32], ebx
mov dword ptr [esp+36], ecx
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for mov reg", 13, 10

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
sub esp, 40 ; ten dwords
mov dword ptr [esp+0], 100
mov dword ptr [esp+4], 100
mov dword ptr [esp+8], 100
mov dword ptr [esp+12], 100
mov dword ptr [esp+16], 100
mov dword ptr [esp+20], 100
mov dword ptr [esp+24], 100
mov dword ptr [esp+28], 100
mov dword ptr [esp+32], 100
mov dword ptr [esp+36], 100
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for mov 100", 13, 10

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
REPEAT 2
push eax
push ecx
push edx
push edi
push esi
ENDM
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for push reg", 13, 10

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
REPEAT 10
push 100
ENDM
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for push 100", 13, 10

inkey chr$(13, 10, "--- ok ---", 13)
exit

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu ; simple, no printing, just returns SSE level
  push 1, call ShowCpu ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
  lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSE4
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
  .Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
    call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

end start

clive

Quote from: theunknownguy
I don't want to get it wrong again, but... using registers is faster than using the stack, isn't it? (For holding arguments)

Yes, but there are fewer of them. The stacked data will ultimately make it to memory (as the write buffers flush and the write-back through the caches propagates), but if the values are used quickly they are likely to be very close to the processor, either being forwarded directly to the unit requesting the data, or sitting in the L1 cache.

As JJ notes, there isn't much difference in speed; basically it is constrained by the write buffers flushing to the memory subsystem. You can always generate data faster than the memory can absorb it.
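
As a rough sketch (MyProc and the literal values are made up for illustration, not anything from this thread), the two ways of handing arguments to a procedure look like this:

        ; arguments passed in registers (fastcall-style) - the callee reads them directly
        mov     ecx, 123
        mov     edx, 456
        call    MyProc

        ; arguments passed on the stack (stdcall-style) - the stores go through the
        ; write buffers and are usually picked up again from the L1 cache
        push    456
        push    123
        call    MyProc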
It could be a random act of randomness. Those happen a lot as well.

dedndave

Quote
16-byte aligned (for APIs)
oops - qWord got me on that one - i dunno what i was thinking - lol

as for speed - take a look at the size of the code generated
you may find the code that uses the stack is smaller
this can make a difference when writing larger loops
it is nice if you can keep a loop under 128 bytes so the branch at the end is SHORT instead of NEAR
the point is - code size may be more important sometimes
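
for example (hypothetical labels, byte counts taken from the instruction encodings):

        push    100                        ; 2 bytes  (6A 64)
        mov     dword ptr [esp+4], 100     ; 8 bytes  (C7 44 24 04 64 00 00 00)

        jnz     close_label                ; 2 bytes when the target is within -128..+127 (SHORT)
        jnz     distant_label              ; 6 bytes otherwise (NEAR, 0F 85 rel32)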

theunknownguy

Quote from: clive on June 23, 2010, 09:13:01 PM
Quote from: theunknownguy
I don't want to get it wrong again, but... using registers is faster than using the stack, isn't it? (For holding arguments)

Yes, but there are fewer of them. The stacked data will ultimately make it to memory (as the write buffers flush and the write-back through the caches propagates), but if the values are used quickly they are likely to be very close to the processor, either being forwarded directly to the unit requesting the data, or sitting in the L1 cache.

As JJ notes, there isn't much difference in speed; basically it is constrained by the write buffers flushing to the memory subsystem. You can always generate data faster than the memory can absorb it.

Does the same explanation apply to the CALL opcode too?

PUSH RetnOff
PUSH Procedure
Ret
RetnOff:


Instead of just CALL Procedure


Quote
as for speed - take a look at the size of the code generated
you may find the code that uses the stack is smaller
this can make a difference when writing larger loops
it is nice if you can keep a loop under 128 bytes so the branch at the end is SHORT instead of NEAR
the point is - code size may be more important sometimes

Lol, got me on that, I hadn't thought of it...  :cheekygreen:

dedndave

if pushing the return address and branching like that were more efficient, we'd have macros to do it for us   :P
............ and we'd all be using them, too

theunknownguy

Quote from: dedndave on June 23, 2010, 09:24:05 PM
if pushing the return address and branching like that were more efficient, we'd have macros to do it for us   :P
............ and we'd all be using them, too

Yeah, I knew CALL was faster, but I wanted to know if clive's explanation fits the CALL opcode, since from my point of view you can emulate it with PUSH and RET.

You know, I can't find many docs that explain in depth how the CALL or PUSH opcodes work (at the hardware level), so I am just killing you guys with questions... sorry.

Quote
it looks like you have a pretty good handle on it, already - lol
although, the micro-code probably goes something like this

        PUSH    RetnOff
        JMP     Procedure
RetnOff:


Got me again...  :lol :lol Or it could be any other conditional jump, to avoid the JMP if some flag was modified before...


dedndave

Quote
PUSH RetnOff
PUSH Procedure
Ret
RetnOff:

it looks like you have a pretty good handle on it, already - lol
although, the micro-code probably goes something like this
        PUSH    RetnOff
        JMP     Procedure
RetnOff:

clive

Quote from: theunknownguy
Does the same explanation apply to the CALL opcode too?

PUSH RetnOff
PUSH Procedure
Ret
RetnOff:


Instead of just CALL Procedure

Well, you have to be careful there; as redskull has hinted, there are architectural issues with that. CALL/RET pairs are easier for the branch predictor to follow. Whenever you cause a mispredict you end up eating some 20-30 cycles, depending on the CPU, as it refills the execution pipeline. It is often quite easy to do.

There are valid reasons to use that construction, especially with segmented memory, or protected mode, or situations where the assembler/linker/loader can't handle dynamic run-time behaviour.

How about this? Oh crap, dave's in my head.

PUSH RetnOff
JMP Procedure
RetnOff:
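
For comparison (same made-up labels), the plain form that keeps the return-address predictor in sync is simply:

        call    Procedure       ; pushes the return address and jumps in one go;
                                ; the matching RET inside Procedure is predicted
                                ; from the CPU's internal return-address stack
RetnOff: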

It could be a random act of randomness. Those happen a lot as well.

theunknownguy

Just an off-topic question:

How old are you, dave and clive?  :eek

And thanks for clarifying my questions.

clive

Here's some context-switching code I wrote yesterday, for running some different FLAT-memory code in a virtual space from another FLAT-memory host. Fun and joy with segments in FLAT land, using NEAR/FAR calls and segment overrides.

SysExec PROC near c public SelCode:DWORD, SelData:DWORD
       push    ebx
       push    esi
       push    edi

       push    ds
       push    es

       mov     ecx,SelCode
       mov     edx,SelData

       mov     ds,edx          ; DS = Data Segment
       mov     es,edx          ; ES = Data Segment

       mov     eax,ss
       mov     ebx,esp

       mov     dword ptr ds:[0200h],ebx ; Original ESP
       mov     dword ptr ds:[0204h],eax ; Original SS

       mov     eax,010000h     ; EIP
       mov     ss,edx          ; SS = Data Segment
       lea     esp,[eax - 4]   ; ESP within GHS arena

       push    ecx     ; Segment
       push    eax     ; Offset

       retf    ; Jump to Segment:Offset, setting CS:EIP

; Doesn't get here

SysExec ENDP
It could be a random act of randomness. Those happen a lot as well.

qWord

Just for the lucky x64 users: a small test bed.
For assembling it you need JWasm and Japheth's Windows.inc (Win32Inc).

result on my c2d:
push: 510
mov: 489
push const: 505
mov const: 495
Press any key to continue ...
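
The attachment isn't reproduced here; roughly speaking (this is just a sketch, not the actual test bed), the two sequences being timed look something like this in x64:

        ; mov variant - explicit pointer adjustment plus plain stores, five qwords
        sub     rsp, 40
        mov     qword ptr [rsp],    rax
        mov     qword ptr [rsp+8],  rcx
        mov     qword ptr [rsp+16], rdx
        mov     qword ptr [rsp+24], r8
        mov     qword ptr [rsp+32], r9
        add     rsp, 40

        ; push variant - the same five registers via PUSH
        push    rax
        push    rcx
        push    rdx
        push    r8
        push    r9
        add     rsp, 40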
FPU in a trice: SmplMath
It's that simple!

theunknownguy

Quote from: qWord on June 23, 2010, 10:07:56 PM
Just for the lucky x64 users: a small test bed.
For assembling it you need JWasm and Japheth's Windows.inc (Win32Inc).

result on my c2d:
push: 510
mov: 489
push const: 505
mov const: 495
Press any key to continue ...


:eek :eek :eek :eek Damn... I want those regs on x32  :(

Can't switch to x64 yet; I need to finish my work on x32 and move to x64 later, but god, I would love to avoid the PUSH for security reasons...