we're always talking on here about avoiding stack frame setup as an optimization, and i wondered what you guys thought of this:
say we have a routine that needs arguments, as in:
PrintSum proc alpha:DWORD,beta:DWORD
...
PrintSum endp
alpha and beta are then addressed as offsets from ebp
but what if we had a macro that:
MyProc PrintSum,alpha:DWORD,beta:DWORD
would be interpreted as:
.data?
PrintSum_alpha dd ?
PrintSum_beta dd ?
alpha equ PrintSum_alpha
beta equ PrintSum_beta
.code
PrintSum:
...
then undefine alpha & beta at end of routine to release the namespace for other routines.
so now we have no need to set up a stack frame; instead we are using uninitialized data (.data?) to pass our params to the routine.
we could also have a new invoke macro that:
MyInvoke PrintSum,2,4
is interpreted as:
push 2
pop PrintSum_alpha
push 4
pop PrintSum_beta
I know this needs tweaking here and there, but what do you think in principle?
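For what it's worth, here is a hand-expanded sketch of what the two macros would have to generate. MyProc and MyInvoke themselves are not written here. The PrintSum body and the use of the masm32 runtime (masm32rt.inc, print, str$, exit) are assumptions for illustration, and plain mov is used instead of the push/pop pairs:

```asm
include \masm32\include\masm32rt.inc

.data?
PrintSum_alpha dd ?         ; static slot for the first parameter
PrintSum_beta  dd ?         ; static slot for the second parameter

.code
PrintSum proc               ; no arguments or locals declared, so MASM emits no frame
    mov eax, PrintSum_alpha
    add eax, PrintSum_beta
    print str$(eax), 13, 10 ; show the sum
    ret
PrintSum endp

start:
    mov PrintSum_alpha, 2   ; what "MyInvoke PrintSum,2,4" would boil down to
    mov PrintSum_beta, 4
    call PrintSum
    exit
end start
```

Built as a console app this should print 6, and nothing in PrintSum touches ebp.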
it sounds good, but i dunno how to "undefine" data labels - lol
this code assumes the labels are already declared
push 2
pop PrintSum_alpha
push 4
pop PrintSum_beta
for that matter, the values could just as well be permanently declared
but, i think the stack frame method turns out to be faster
i dunno if this is faster or not...
mov dword ptr PrintSum_alpha,2
mov dword ptr PrintSum_beta,4
on an 8088, it would be faster because there are fewer memory references, but that rule doesn't apply for pentiums, i guess
(well, for word-sized values, at least)
personally, i like to pass parms in register, provided there are only a couple (as in most cases)
but, i am a dinosaur programmer - i have dinosaur thoughts and i write dinosaur code - lol
these guys are used to procs that get INVOKEd
they all like to be C-compatible and, let's face it, windows seems to have been designed around C
as for me, i dislike C and that is why i write in assembler - lol
i was playing with another method that has some potential
you may find it interesting
http://www.masm32.com/board/index.php?topic=11671.msg87985#msg87985
as you can see, no one seems to be interested in my dinosaur ideas - lol
(http://hewlettroad.com/Animated%20Gifs/T%20rex%20walk.gif)
on a similar note, one of the other guys in here had a good idea (i forget who it was and can't locate the thread)
instead of using the EBP register to reference locals, use ESP directly and design the
assembler so that it keeps track of the PUSHes and POPs to calculate the offsets
this frees up the EBP register and is a little faster and smaller than the regular stack frame
AProc PROC
sub esp,8 ;2 local dword variables
mov dword ptr [esp+4],1 ;first local var
mov dword ptr [esp],2 ;second local var
.
.
.
push eax ;assembler maintains PUSH count
.
.
.
mov edx,[esp+8] ;first local var new offset
mov ecx,[esp+4] ;second local var new offset
.
.
.
pop eax
.
.
.
add esp,8
ret
AProc ENDP
Damos,
Using global memory in the .DATA or .DATA? section is an old trick from the days when stack space was very limited, but there is no reason not to use it today if it does what you want. Stack-based local variables have the advantage that you can call another proc from the current one and the values in the first will still be the same when the called proc returns. Global variables give no such guarantee, which limits nesting of procedures. In most instances this would not matter and you could handle it with a few different sets of variables, but you could not perform recursion by this method.
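A small sketch of the recursion problem, assuming the masm32 runtime for the output. Fact and Fact_n are made-up names: the parameter lives in one static slot, so the nested call overwrites the caller's copy.

```asm
include \masm32\include\masm32rt.inc

.data?
Fact_n dd ?                 ; the one and only slot for the parameter n

.code
; intended: eax = n!, with n passed in Fact_n
Fact proc
    mov eax, Fact_n
    cmp eax, 1
    jbe base                ; n <= 1: return 1
    dec eax
    mov Fact_n, eax         ; "pass" n-1 to the nested call: this destroys our own n
    call Fact               ; eax = result of the nested call
    mul Fact_n              ; meant n * fact(n-1), but Fact_n no longer holds n
    ret
base:
    mov eax, 1
    ret
Fact endp

start:
    mov Fact_n, 4
    call Fact
    print str$(eax), 13, 10
    exit
end start
```

With Fact_n = 4 this should print 1 rather than 24, because every level multiplies by the value the innermost call left behind. Saving Fact_n with push/pop around the call fixes it, but then the parameter is back on the stack.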
Sometimes global variables are good, sometimes they're bad.
Let's say you want to create a custom control and
use global variables as temporary variables for better speed.
That would be very unwise, 'cuz if you are to support
multiple instances of the control on the same window,
you gotta take into account concurrent reads/writes...
So stack-based variables suit better here.
Quote from: dedndave on July 03, 2009, 01:32:44 PM
on a similar note, one of the other guys in here had a good idea (i forget who it was and can't locate the thread)
instead of using the EBP register to reference locals, use ESP directly
On a P4, using ESP directly is about 5 cycles per call faster, but the code gets a bit longer with every local variable:
1891 cycles for 100*call stack_frame_on
1417 cycles for 100*call stack_frame_OFF
Code sizes:
Frame on: 42
Frame off: 46
Test yourself...
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory (http://www.masm32.com/board/index.php?topic=770.0)
LOOP_COUNT = 1000000
.code
start:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_on
ENDM
counter_end
print str$(eax), 9, "cycles for 100*call stack_frame_on", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_OFF
ENDM
counter_end
print str$(eax), 9, "cycles for 100*stack_frame_OFF", 13, 10, 10, "Code sizes:", 13, 10, "Frame on: ", 9
mov eax, stack_frame_on_END
sub eax, stack_frame_on
print str$(eax), 13, 10, "Frame off: ", 9
mov eax, stack_frame_OFF_END
sub eax, stack_frame_OFF
print str$(eax)
inkey chr$(13, 10, "--- ok ---", 13)
exit
stack_frame_on proc
LOCAL v1, v2, v3, v4, v5, v6
mov v1, eax
mov v2, eax
mov v3, 1234h
mov v4, 5678h
mov v5, 5555h
mov v6, 6666h
ret
stack_frame_on endp
stack_frame_on_END:
stack_frame_OFF proc
; LOCAL v1, v2, v3, v4, v5, v6
add esp, -4*6
mov [esp], eax ; v1
mov [esp+4], eax ; v2
mov dword ptr [esp+8], 1234h ; v3
mov dword ptr [esp+12], 5678h ; v4
mov dword ptr [esp+16], 5555h ; v5
mov dword ptr [esp+20], 6666h ; v6
sub esp, -4*6
ret
stack_frame_OFF endp
stack_frame_OFF_END:
end start
i think that's because the instruction encoding favors EBP
[ebp+4] encodes as a ModRM byte plus a byte offset
[esp+4] also uses a byte offset, but ESP as a base needs an extra SIB byte
of course, the LEAVE saves a couple bytes for you
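A sketch of the encodings behind those code sizes, assuming 32-bit code and offsets that fit in a signed byte. The label names are made up, and the fragment is only meant to be assembled with a listing (for example ml /c /coff /Fl) so the bytes can be checked:

```asm
.686
.model flat, stdcall
option casemap:none

.code
; never called, only assembled so the listing shows the opcode bytes
ebp_based:
    mov [ebp-4], eax                ; 89 45 FC                 3 bytes
    mov dword ptr [ebp-12], 1234h   ; C7 45 F4 34 12 00 00     7 bytes
esp_based:
    mov [esp], eax                  ; 89 04 24                 3 bytes (SIB, no offset)
    mov [esp+4], eax                ; 89 44 24 04              4 bytes (SIB + byte offset)
    mov dword ptr [esp+8], 1234h    ; C7 44 24 08 34 12 00 00  8 bytes
end
```

Both forms take a one-byte offset; the ESP form carries one extra SIB byte. In stack_frame_OFF that is five accesses with an offset, so five extra bytes, minus one byte saved on prologue and epilogue (assuming MASM generates the usual push ebp / mov ebp, esp / add esp, -24 ... leave / ret for the framed proc). That would account for 46 vs 42.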
quite a big difference in speed, don't you think?
Quote from: dedndave on July 03, 2009, 04:48:34 PM
quite a big difference in speed, don't you think?
About 5 cycles per call on a P4, we'll see on others. But the code becomes very difficult to read and maintain, unless you resort to a pair of macros
and do not touch esp yourself:
MyLocal MACRO args:VARARG
LOCAL tmp$
.if 1
MyLocEsp = 0
FOR arg, <args>
tmp$ CATSTR <arg>, < equ !<dword ptr [esp+>, %MyLocEsp, <]!>>
tmp$
MyLocEsp = MyLocEsp + 4
ENDM
tmp$ CATSTR <add esp, ->, %MyLocEsp
tmp$
ENDM
MyRet MACRO
tmp$ CATSTR <sub esp, ->, %MyLocEsp
tmp$
ret
.endif
ENDM
Usage (dwords only, names can be used only once because they are global):
stack_frame_OFF proc
MyLocal LocV1, LocV2, LocV3, LocV4, LocV5, LocV6
mov LocV1, eax
mov LocV2, eax
mov LocV3, 1234h
mov LocV4, 5678h
mov LocV5, 5555h
mov LocV6, 6666h
MyRet
stack_frame_OFF endp
i see over 400 cycles diff - am i lookin in the wrong spot Jochen ? - lol
Quote:
1891 cycles for 100*call stack_frame_on
1417 cycles for 100*call stack_frame_OFF
Code sizes:
Frame on: 42
Frame off: 46
Quote from: dedndave on July 03, 2009, 05:26:21 PM
i see over 400 cycles diff - am i lookin in the wrong spot Jochen ? - lol
Divide by 100 (the timings are for 100 calls)
Celeron M:
980 cycles for 100*call stack_frame_on
871 cycles for 100*stack_frame_OFF
i.e. about 1 (one) cycle per call faster
smokin ! - lol
Just for fun, here is a more complex example. On a Celeron M, the proc without a frame is about 0.7 cycles faster, a bit longer and definitely trickier - see the print str$(LocV2) line.
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
LOOP_COUNT = 200000
.code
start:
print "Test for correctness:", 13, 10
mov eax, 123456/2 ; magic number
call stack_frame_OFF
mov eax, 123456/2 ; magic number
call stack_frame_on
print chr$(13, 10, "Timings:", 13, 10)
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_on
ENDM
counter_end
print str$(eax), 9, "cycles for 100*call stack_frame_on", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_OFF
ENDM
counter_end
print str$(eax), 9, "cycles for 100*stack_frame_OFF", 13, 10, 10, "Code sizes:", 13, 10, "Frame on: ", 9
mov eax, stack_frame_on_END
sub eax, stack_frame_on
print str$(eax), 13, 10, "Frame off: ", 9
mov eax, stack_frame_OFF_END
sub eax, stack_frame_OFF
print str$(eax)
inkey chr$(13, 10, "--- ok ---", 13)
exit
MyPush MACRO arg
.if 1
push arg
MyBase = MyBase + 4
ENDM
MyPop MACRO arg
pop arg
MyBase = MyBase - 4
.endif
ENDM
MyLocal MACRO args:VARARG
LOCAL tmp$
.if 1
MyLocEsp = 0
MyBase = 0
FOR arg, <args>
tmp$ CATSTR <arg>, < equ !<dword ptr [esp+MyBase+>, %MyLocEsp, <]!>>
tmp$
MyLocEsp = MyLocEsp + 4
ENDM
tmp$ CATSTR <add esp, ->, %MyLocEsp
tmp$
ENDM
MyRet MACRO
tmp$ CATSTR <sub esp, ->, %MyLocEsp
tmp$
ret
.endif
ENDM
stack_frame_OFF proc
MyLocal LocV1, LocV2, LocV3, LocV4, LocV5, LocV6
mov LocV1, eax
add eax, eax
mov LocV2, eax
mov LocV3, 1234h
.if eax==123456
MyPush eax
MyPush eax
MyPush eax
MyPush eax
print chr$("Frame OFF: ")
mov ecx, LocV2
print str$(ecx), 9
print str$(LocV2), 13, 10 ; wrong variable because we are pushing [eSp+X]
MyPop ecx
MyPop ecx
MyPop ecx
MyPop ecx
.endif
mov LocV4, 5678h
mov LocV5, 5555h
mov LocV6, 6666h
MyRet
stack_frame_OFF endp
stack_frame_OFF_END:
stack_frame_on proc
LOCAL v1, v2, v3, v4, v5, v6
mov v1, eax
add eax, eax
mov v2, eax
mov v3, 1234h
.if eax==123456
Push eax
Push eax
Push eax
Push eax
print chr$("Frame ON: ")
mov ecx, v2
print str$(ecx), 9
print str$(v2), 13, 10 ; right variable because we are pushing [eBp+X]
Pop ecx
Pop ecx
Pop ecx
Pop ecx
.endif
mov v4, 5678h
mov v5, 5555h
mov v6, 6666h
ret
stack_frame_on endp
stack_frame_on_END:
end start
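The wrong value in the frameless proc comes from the pushes that invoke generates inside str$. A minimal sketch of the effect, assuming str$ boils down to invoke dwtoa, value, ADDR buffer (that is how the masm32 macro appears to work; buf and the two values are made up for illustration):

```asm
include \masm32\include\masm32rt.inc

.data?
buf db 20 dup(?)                ; receives the decimal string

.code
start:
    sub esp, 8                  ; two dword locals at [esp] and [esp+4]
    mov dword ptr [esp], 111    ; first local
    mov dword ptr [esp+4], 222  ; second local

    ; hand-written equivalent of "invoke dwtoa, dword ptr [esp+4], ADDR buf"
    push offset buf             ; last argument goes first, and esp drops by 4
    push dword ptr [esp+4]      ; so this now reads the FIRST local
    call dwtoa                  ; stdcall: dwtoa removes both arguments

    print ADDR buf, 13, 10
    add esp, 8                  ; release the locals
    exit
end start
```

This should print 111, not 222: by the time the value is pushed, esp has already moved down by 4. In the test above the same shift would make str$(LocV2) pick up LocV1, while mov ecx, LocV2 still reads the right slot because MyBase accounts for the four MyPush calls.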
Timers.asm in the Masm32 Laboratory (http://www.masm32.com/board/index.php?topic=770.0)