
Regarding Stack

Started by theunknownguy, June 23, 2010, 06:55:46 PM


qWord

FPU in a trice: SmplMath
It's that simple!

redskull

Quote from: theunknownguy on June 23, 2010, 08:44:11 PM
If you have any document or paper that explains how the stack works internally, please don't hesitate to post it.

"Internally" there is no stack; the stack is just an area in memory, the same as any other.  All the CPU does is automatically adjust the stack pointer as a convience to you.  All the same circuity is used, whether you MOV to memory or PUSH to it.  If you are referring to how it keeps track of the pointer (ie, the stack engine I mentioned), Anger Fogs stuff is the go-to-guide.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

theunknownguy

Quote
you know this one:
x64 Software Conventions

Yeah, I've read it many times xD

"Internally" there is no stack; the stack is just an area in memory, the same as any other.  All the CPU does is automatically adjust the stack pointer as a convience to you.  All the same circuity is used, whether you MOV to memory or PUSH to it.  If you are referring to how it keeps track of the pointer (ie, the stack engine I mentioned), Anger Fogs stuff is the go-to-guide.

Yes, I meant the details, like how the operations are done in hardware and other in-depth stuff.

But I found this one; I think it explains everything very well:

http://www.ece.cmu.edu/~koopman/stack_computers/sec3_2.html

I don't want to get it wrong again, but... using registers is faster than using the stack, isn't it? (For holding arguments)

jj2007

It doesn't seem to matter much:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1049    cycles for mov
1003    cycles for push


.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory: http://www.masm32.com/board/index.php?topic=770.0
LOOP_COUNT = 100000 ; 1000000 would be a typical value

.data
Src db "This is a string, 100 characters long, that serves for a variety of purposes, such as testing algos.", 0

.data?
Dest db 100 dup(?)

.code
start:
push 1
call ShowCpu ; print brand string and SSE level
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
sub esp, 40 ; ten dwords
mov dword ptr [esp+0], eax
mov dword ptr [esp+4], ebx
mov dword ptr [esp+8], ecx
mov dword ptr [esp+12], edx
mov dword ptr [esp+16], edi
mov dword ptr [esp+20], esi
mov dword ptr [esp+24], ebp
mov dword ptr [esp+28], eax
mov dword ptr [esp+32], ebx
mov dword ptr [esp+36], ecx
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for mov reg", 13, 10

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
sub esp, 40 ; ten dwords
mov dword ptr [esp+0], 100
mov dword ptr [esp+4], 100
mov dword ptr [esp+8], 100
mov dword ptr [esp+12], 100
mov dword ptr [esp+16], 100
mov dword ptr [esp+20], 100
mov dword ptr [esp+24], 100
mov dword ptr [esp+28], 100
mov dword ptr [esp+32], 100
mov dword ptr [esp+36], 100
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for mov 100", 13, 10

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
REPEAT 2
push eax
push ecx
push edx
push edi
push esi
ENDM
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for push reg", 13, 10

invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
REPEAT 10
push 100
ENDM
add esp, 40
ENDM
counter_end
print str$(eax), 9, "cycles for push 100", 13, 10

inkey chr$(13, 10, "--- ok ---", 13)
exit

ShowCpu proc ; mode:DWORD
COMMENT @ Usage:
  push 0, call ShowCpu ; simple, no printing, just returns SSE level
  push 1, call ShowCpu ; prints the brand string and returns SSE level@
  pushad
  sub esp, 80 ; create a buffer for the brand string
  mov edi, esp ; point edi to it
  xor ebp, ebp
  .Repeat
  lea eax, [ebp+80000002h]
db 0Fh, 0A2h ; cpuid 80000002h-80000004h
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
inc ebp
  .Until ebp>=3
  push 1
  pop eax
  db 0Fh, 0A2h ; cpuid 1
  xor ebx, ebx ; CpuSSE
  xor esi, esi ; add zero plus the carry flag
  bt edx, 25 ; edx bit 25, SSE1
  adc ebx, esi
  bt edx, 26 ; edx bit 26, SSE2
  adc ebx, esi
  bt ecx, esi ; ecx bit 0, SSE3
  adc ebx, esi
  bt ecx, 9 ; ecx bit 9, SSE4
  adc ebx, esi
  dec dword ptr [esp+4+32+80] ; dec mode in stack
  .if Zero?
mov edi, esp ; restore pointer to brand string
  .Repeat
.Break .if byte ptr [edi]!=32 ; mode was 1, so show a string but skip leading blanks
inc edi
.Until 0
.if byte ptr [edi]<32
print chr$("pre-P4")
.else
print edi ; CpuBrand
.endif
.if ebx
print chr$(32, 40, "SSE") ; info on SSE level, 40=(
print str$(ebx), 41, 13, 10 ; 41=)
.endif
  .endif
  add esp, 80 ; discard brand buffer (after printing!)
  mov [esp+32-4], ebx ; move ebx into eax stack position - returns eax to main for further use
  ifdef MbBufferInit
    call MbBufferInit
  endif
  popad
  ret 4
ShowCpu endp

end start

clive

Quote from: theunknownguy
I don't want to get it wrong again, but... using registers is faster than using the stack, isn't it? (For holding arguments)

Yes, but there are fewer of them. The stacked data will ultimately make it to memory (as the write buffers flush and the write-back through the caches propagates), but if the values are used quickly they are likely to be very close to the processor, either being forwarded directly to the unit requesting the data, or sitting in the L1 cache.

As JJ notes, there isn't much difference in speed; basically it is constrained by the write buffers flushing to the memory subsystem. You can always generate data faster than the memory can absorb it.
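
As a rough sketch (MyProc and the literal values are made up for illustration, not anything from this thread), the two ways of handing arguments to a procedure look like this:

        ; arguments passed in registers (fastcall-style) - the callee reads them directly
        mov     ecx, 123
        mov     edx, 456
        call    MyProc

        ; arguments passed on the stack (stdcall-style) - the stores go through the
        ; write buffers and are usually picked up again from the L1 cache
        push    456
        push    123
        call    MyProc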
It could be a random act of randomness. Those happen a lot as well.

dedndave

Quote
16-byte aligned (for APIs)
oops - qWord got me on that one - i dunno what i was thinking - lol

as for speed - take a look at the size of the code generated
you may find the code that uses the stack is smaller
this can make a difference when writing larger loops
it is nice if you can keep a loop under 128 bytes so the branch at the end is SHORT instead of NEAR
the point is - code size may be more important sometimes
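
for example (hypothetical labels, byte counts taken from the instruction encodings):

        push    100                        ; 2 bytes  (6A 64)
        mov     dword ptr [esp+4], 100     ; 8 bytes  (C7 44 24 04 64 00 00 00)

        jnz     close_label                ; 2 bytes when the target is within -128..+127 (SHORT)
        jnz     distant_label              ; 6 bytes otherwise (NEAR, 0F 85 rel32)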

theunknownguy

Quote from: clive on June 23, 2010, 09:13:01 PM
Quote from: theunknownguy
I don't want to get it wrong again, but... using registers is faster than using the stack, isn't it? (For holding arguments)

Yes, but there are fewer of them. The stacked data will ultimately make it to memory (as the write buffers flush and the write-back through the caches propagates), but if the values are used quickly they are likely to be very close to the processor, either being forwarded directly to the unit requesting the data, or sitting in the L1 cache.

As JJ notes, there isn't much difference in speed; basically it is constrained by the write buffers flushing to the memory subsystem. You can always generate data faster than the memory can absorb it.

Does the same explanation apply to the CALL opcode too?

PUSH RetnOff
PUSH Procedure
Ret
RetnOff:


Instead of just CALL Procedure


Quote
as for speed - take a look at the size of the code generated
you may find the code that uses the stack is smaller
this can make a difference when writing larger loops
it is nice if you can keep a loop under 128 bytes so the branch at the end is SHORT instead of NEAR
the point is - code size may be more important sometimes

Lol, got me on that, I hadn't thought of it...  :cheekygreen:

dedndave

if pushing the return address and branching like that were more efficient, we'd have macros to do it for us   :P
............ and we'd all be using them, too

theunknownguy

Quote from: dedndave on June 23, 2010, 09:24:05 PM
if pushing the return address and branching like that were more efficient, we'd have macros to do it for us   :P
............ and we'd all be using them, too

Yeah, I knew CALL was faster, but I wanted to know if clive's explanation fits the CALL opcode, since from my point of view you can emulate it with PUSH and RET.

You know, I can't find many docs that explain in depth how the CALL or PUSH opcodes work (at the hardware level), so I am just killing you guys with questions... sorry.

Quote
it looks like you have a pretty good handle on it, already - lol
although, the micro-code probably goes something like this

        PUSH    RetnOff
        JMP     Procedure
RetnOff:


Got me again...  :lol :lol Or it could be any other conditional jump, to avoid the JMP if some flag was modified before...


dedndave

Quote
PUSH RetnOff
PUSH Procedure
Ret
RetnOff:

it looks like you have a pretty good handle on it, already - lol
although, the micro-code probably goes something like this
        PUSH    RetnOff
        JMP     Procedure
RetnOff:

clive

Quote from: theunknownguy
Does the same explanation apply to the CALL opcode too?

PUSH RetnOff
PUSH Procedure
Ret
RetnOff:


Instead of just CALL Procedure

Well, you have to be careful there; as redskull has hinted, there are architectural issues with that. CALL/RET pairs are easier for the branch predictor to follow. Whenever you cause a mispredict you end up eating some 20-30 cycles, depending on the CPU, as it refills the execution pipeline. It is often quite easy to do.

There are valid reasons to use that construction, especially with segmented memory, or protected mode, or situations where the assembler/linker/loader can't handle dynamic run-time behaviour.

How about this? Oh crap, dave's in my head.

PUSH RetnOff
JMP Procedure
RetnOff:
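
For comparison (same made-up labels), the plain form that keeps the return-address predictor in sync is simply:

        call    Procedure       ; pushes the return address and jumps in one go;
                                ; the matching RET inside Procedure is predicted
                                ; from the CPU's internal return-address stack
RetnOff: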

It could be a random act of randomness. Those happen a lot as well.

theunknownguy

Just an off-topic question:

How old are you, dave and clive?  :eek

And thanks for clarifying my questions.

clive

Here's some context-switching code I wrote yesterday, for running some different FLAT-memory code in a virtual space from another FLAT-memory host. Fun and joy with segments in FLAT land, using NEAR/FAR calls and segment overrides.

SysExec PROC near c public SelCode:DWORD, SelData:DWORD
       push    ebx
       push    esi
       push    edi

       push    ds
       push    es

       mov     ecx,SelCode
       mov     edx,SelData

       mov     ds,edx          ; DS = Data Segment
       mov     es,edx          ; ES = Data Segment

       mov     eax,ss
       mov     ebx,esp

       mov     dword ptr ds:[0200h],ebx ; Original ESP
       mov     dword ptr ds:[0204h],eax ; Original SS

       mov     eax,010000h     ; EIP
       mov     ss,edx          ; SS = Data Segment
       lea     esp,[eax - 4]   ; ESP within GHS arena

       push    ecx     ; Segment
       push    eax     ; Offset

       retf    ; Jump to Segment:Offset, setting CS:EIP

; Doesn't get here

SysExec ENDP
It could be a random act of randomness. Those happen a lot as well.

qWord

Just for the lucky x64 users: a small test bed.
For assembling it you need JWasm and Japheth's Windows.inc (Win32Inc).

result on my c2d:
push: 510
mov: 489
push const: 505
mov const: 495
Press any key to continue ...
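
The attachment isn't reproduced here; roughly speaking (this is just a sketch, not the actual test bed), the two sequences being timed look something like this in x64:

        ; mov variant - explicit pointer adjustment plus plain stores, five qwords
        sub     rsp, 40
        mov     qword ptr [rsp],    rax
        mov     qword ptr [rsp+8],  rcx
        mov     qword ptr [rsp+16], rdx
        mov     qword ptr [rsp+24], r8
        mov     qword ptr [rsp+32], r9
        add     rsp, 40

        ; push variant - the same five registers via PUSH
        push    rax
        push    rcx
        push    rdx
        push    r8
        push    r9
        add     rsp, 40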
FPU in a trice: SmplMath
It's that simple!

theunknownguy

Quote from: qWord on June 23, 2010, 10:07:56 PM
Just for the lucky x64 users: a small test bed.
For assembling it you need JWasm and Japheth's Windows.inc (Win32Inc).

result on my c2d:
push: 510
mov: 489
push const: 505
mov const: 495
Press any key to continue ...


:eek :eek :eek :eek Damn... I want those regs on x32  :(

Can't switch to x64 yet; I need to finish my work on x32 and move to x64 later, but god, I would love to avoid the PUSH for security reasons...