Here is a macro that can be used with OPTION PROLOGUE to enable stack probing (needed when LOCALs total 4 KB or more).
<EDITED>
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Allows a procedure to safely use LOCAL variables with a total size of 4kb or more,
; using an unrolled stack probing method by default.
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Usage:
;
; OPTION PROLOGUE:STACKPROBE
; MyProcedure PROC ; ...
; ; ...
; MyProcedure ENDP
; OPTION PROLOGUE:PROLOGUEDEF
;
; The ROLLED macro argument generates a loop rather than the default unrolled code:
;
; MyProcedure PROC <ROLLED> ; ...
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Notes:
; - When the total size of the LOCAL variables is less than 4kb, the code generated is
; identical to PROLOGUEDEF, so there is no drawback using this macro
; - See "OPTION PROLOGUE" and "PROC" topics in MASM32.HLP for the macro specifications
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Limitations compared to PROLOGUEDEF:
; - Stack probing is relevant only for Windows, ie FLAT model, so it won't accept other models
; - Due to the FLAT model restriction, LOADDS is not supported
; - FORCEFRAME argument doesn't generate a correct epilogue when no LOCAL variables are defined
; So it is not supported for now :(
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
I finally gave up on making this macro fully compatible with MASM's PROLOGUEDEF.
Stack probing is needed only on Windows (i.e. the FLAT model), so the LOADDS argument is not supported, as it concerns only 16-bit code.
Also, when the FORCEFRAME argument is used with EPILOGUEDEF, no epilogue (i.e. the leave instruction) is generated, although the pop instructions corresponding to the USES directive are. I really can't figure out where this problem comes from, so I also gave up trying to implement FORCEFRAME...
Enjoy!
At last, a usable version.
Unfortunately I don't have time to dig into the FORCEFRAME bug at the moment, nor to add the looped probing option (only unrolled probing for now...).
Anyway, here it is... (The first post has been updated.)
You still have to set up the stack frame when you have no locals but do have proc arguments.
So something like:
;; Set up stack frame
IF localbytes GT 0
...
ELSEIF argbytes GT 0
push ebp
mov ebp, esp
ENDIF
EDIT: Also, wouldn't it be better to use 'mov dword ptr [esp], eax' instead of 'mov byte ptr [esp], 0'?
It's one byte smaller, and faster.
Quote from: Petroizki on June 22, 2005, 04:45:05 PM
You still have to make the stack frame, when you have no locals but you do have proc arguments.
You're perfectly right, I forgot that!
And your proposed fix is perfect.
Quote from: Petroizki on June 22, 2005, 04:45:05 PM
Also, wouldn't it be better to use 'mov dword ptr [esp], eax' instead of 'mov byte ptr [esp], 0'?
It's one byte smaller, and faster.
You're perfectly right about that too!
Source code updated...
Thanks for pointing this out!
I added stack probing to my own prologue macros (http://www.masmforum.com/simple/index.php?topic=1063.0); it generates code like this at the beginning of the procedure:
push ebp
mov ebp, esp
sub esp, 3A98h ; reserve stack space (15000 bytes)
mov dword ptr [ebp-1000h], eax ; probe first page
mov dword ptr [ebp-2000h], eax ; probe second page
mov dword ptr [ebp-3000h], eax ; probe the last page
...
Seems to work, at least it produces less code..
You're right (again).
A few thoughts however, correct me if I'm wrong:
- I would adjust esp *after* probing the pages, just in case "something else" uses the stack before the last page is probed. Well, that's how VCToolkit's probing function works anyway. (OK, I should have looked at it before writing my macro, but well...)
- Shouldn't add esp, (-size) be slightly faster than sub? (I suppose it is, as both MASM and VC use it rather than sub. I haven't run any tests though.)
- In the example you give, I think [ebp-3000h] is not the last page. The last probed page should be [ebp-3A98h] (or [ebp-4000h] for instance, it doesn't change anything):
If the final esp lands on a page boundary (i.e. a multiple of 1000h), it will land on the last DWORD of the guard page. But when the next push is made, it will try to access uncommitted memory, and the app will be killed. I'm not sure about this as I haven't managed to produce the "crashing case", but that's how I understand the VCToolkit probing function:
msvcrt_probe proc ; argument : eax = localbytes
cmp eax, 1000h
jnb probe_stack
neg eax ; this part is for localbytes < 4k so it's not relevant in our case
add eax, esp
add eax, 4
test [eax], eax
xchg eax, esp
mov eax, [eax]
push eax
ret
probe_stack: ; the interesting part...
push ecx
lea ecx, [esp+8]
probepages:
sub ecx, 1000h
sub eax, 1000h
test [ecx], eax
cmp eax, 1000h
jnb probepages
probelastpage:
sub ecx, eax
mov eax, esp
test [ecx], eax
mov esp, ecx
mov ecx, [eax]
mov eax, [eax+4]
push eax
ret
msvcrt_probe endp
; ... in main() :
push ebp
mov ebp, esp
mov eax, 2328h
call msvcrt_probe
; ...
This has been generated from the following C code (statically linked) :
int main()
{
char test[9000];
// ...
}
Well, anyway I have updated the code in the first post.
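To make the probe_stack path of the listing above easier to follow, here is a small C model of it. This is only a sketch: the function name chkstk_probe_offsets and its out/max_out parameters are made up for illustration, and it only mirrors the arithmetic on eax and ecx, it does not touch any real stack. Offsets are counted downwards from the caller's esp at the time of the call, which is equal to ebp in the main() fragment. It assumes size is at least 1000h, as that is the only case that reaches probe_stack.

```c
#include <stddef.h>
#include <stdint.h>

#define PROBE_PAGE 0x1000u

/* Hypothetical helper: records where the probe_stack path would touch memory.
   Each stored value is an offset below the caller's esp.
   Returns the number of probes; at most max_out of them are stored.
   Precondition: size >= PROBE_PAGE. */
size_t chkstk_probe_offsets(uint32_t size, uint32_t *out, size_t max_out)
{
    uint32_t remaining = size; /* plays the role of eax */
    uint32_t offset = 0;       /* caller's esp minus ecx */
    size_t count = 0;

    do {                                          /* probepages: */
        offset += PROBE_PAGE;                     /* sub ecx, 1000h */
        remaining -= PROBE_PAGE;                  /* sub eax, 1000h */
        if (count < max_out) out[count] = offset; /* test [ecx], eax */
        count++;
    } while (remaining >= PROBE_PAGE);            /* cmp eax, 1000h / jnb probepages */

    offset += remaining;                          /* probelastpage: sub ecx, eax */
    if (count < max_out) out[count] = offset;     /* test [ecx], eax */
    count++;
    return count;
}
```

For the 2328h bytes of main() this gives probes at 1000h, 2000h and 2328h below ebp, and for the 3A98h example discussed above the final probe is at 3A98h, not 3000h.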
- What do you mean by "something else", a debugger? The probing could easily be done before adjusting esp, but you would have to use instructions that do not change any values at the negative offsets of esp (test, cmp, ...), which would probably make it slightly slower.
- I don't think add and sub differ in speed, at least according to the optimization guides I have. They are basically the same instruction, on the Pentium that is.
- I guess you're right, but I couldn't get it to GPF on Windows XP. Actually you can remove the last two probes, and it still works. I will do some testing on 9x later.
Quote from: Petroizki on June 24, 2005, 06:34:08 AM
- What would you mean by "something else", a debugger?
Indeed. I guess a user-mode debugger is the only thing that could tamper with the program's stack.
Quote from: Petroizki on June 24, 2005, 06:34:08 AM
but you would have to use instruction that would not change any values in the negative offsets of esp (test, cmp, ...)
I don't understand why?
In fact I simply meant swapping the probing mov instructions and the esp adjustment :
mov DWORD PTR [ebp-1000h], eax
...
mov DWORD PTR [ebp-4000h], eax
sub esp, 4000h
That seems to work fine.
It may not be safe to touch memory outside the stack (below esp); see http://board.win32asmcommunity.net/index.php?topic=20128.0.
At least debugging with Whidbey may cause a problem.
Ok, I understand now.
But in our case I guess we don't mind if the stack is overwritten by a debugger before esp is adjusted, as we are writing dummy values just to make sure each page is probed.
On the contrary it's more likely a problem could arise if we adjust esp before probing the stack, as a debugger could then hit a non probed, unguarded page, thus leading to a GPF.
Am I wrong?
You are a genius, thank you so much for this.
This was from the post I originally pointed you to:
http://board.win32asmcommunity.net/index.php?topic=19497.15
Code from KetilO:
MainDlgProc proc hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
LOCAL buffer[4096]:byte
LOCAL buffer2[256]:byte
LOCAL buffer3[256]:byte
LOCAL printout[4096]:byte
LOCAL pos:dword
LOCAL hdi:HD_ITEM
;Touching the stack frame
mov eax,ebp
.while eax>esp
mov dword ptr [eax],0
sub eax,4
.endw
push edx
push esi
push edi
I found that if you replace
sub eax, 4
with
sub eax, 4096
it works just as well and faster! Since we only need to touch each page and not each DWORD.
My point is, that the touching took place after all the stack adjustments were made.
The first problem in the above post was that, when the USES directive was used, the pushes of the "used" registers caused the guard page errors.
farrier
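To put a number on the difference between the two strides, here is a tiny C sketch. The function name touch_iterations is made up for illustration; it simply counts how many times the ".while eax>esp" loop body runs when eax starts at ebp, esp sits frame_bytes below ebp, and eax drops by stride on each pass.

```c
#include <stdint.h>

/* Hypothetical helper: number of passes through the touching loop.
   The loop keeps running while (passes * stride) < frame_bytes,
   so the count is frame_bytes / stride rounded up.
   Precondition: stride > 0. */
uint32_t touch_iterations(uint32_t frame_bytes, uint32_t stride)
{
    return (frame_bytes + stride - 1) / stride;
}
```

Assuming a frame of 8708 bytes (the four byte buffers plus pos from the listing above, leaving out hdi), a stride of 4 needs 2177 passes while a stride of 4096 needs only 3.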
farrier,
Compliments, that is a good technique.
Here is a first hack at a "probelogue" macro.
probelogue MACRO szProcName, flags, cbParams, cbLocals, rgRegs, rgUserParams
push ebp
mov ebp, esp
sub esp, cbLocals
mov eax, ebp
.while eax > esp
mov dword ptr [eax - 4], 0 ; touch just below eax, so the saved ebp at [ebp] is not overwritten
sub eax, 4096
.endw
FOR usesreg, rgRegs
push usesreg
ENDM
EXITM <0>
ENDM
It should be usable with the "OPTION PROLOGUE:probelogue" command.
I've not tested it though, and it will not deal with all the fiddly bits that the default prologue does (near, far, calling convention, and so on).
Mirno
New and improved (it produces better code in some cases):
probelogue MACRO szProcName, flags, cbParams, cbLocals, rgRegs, rgUserParams
LOCAL counter
LOCAL alignedLocals
LOCAL whileBias
alignedLocals = (cbLocals + 3) AND NOT(3)
whileBias = 2
IFNB <rgUserParams>
whileBias = rgUserParams
ENDIF
push ebp
mov ebp, esp
IF alignedLocals NE 0
sub esp, alignedLocals
ENDIF
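IF alignedLocals GT (4096 * whileBias)
.while ebp > esp
mov DWORD PTR [ebp - 4], 0 ; [ebp - 4], not [ebp], so the saved ebp is not overwritten
sub ebp, 4096
.endw
; the loop runs once per started page, so round up when restoring ebp
add ebp, (alignedLocals + 4096 - 1) AND NOT(4096 - 1)
ELSEIF alignedLocals GE 4096
counter = 0
WHILE counter LT alignedLocals
mov DWORD PTR [ebp - 4 - counter], 0 ; probe downwards, below ebp
counter = counter + 4096
ENDM
ENDIF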
FOR usesreg, rgRegs
push usesreg
ENDM
EXITM <0>
ENDM
Note that the while bias comes from the user parameters; the default value of 2 was chosen because it gives the smallest code.
.code
start:
option PROLOGUE:probelogue
blah PROC <8>, a:DWORD, b:DWORD
LOCAL zyx[4096]:BYTE
ret
blah ENDP
end start
The "<8>" overrides the default whileBias of 2, so unrolled stack probes are generated for locals of up to 32768 bytes (8 * 4096) rather than 8192.
Assembling with the default prologue works, but gives a warning about an unknown prologue user argument.
If someone has code they can test this on I'd be grateful. It would also help if you could test with the weird and wonderful combinations of near, far, public, private, uses, calling convention, and so on, as I've not had the chance (or the knowledge of how they should affect the assembly generated by the default prologue).
This is all untested, I've been looking at the list code generated by MASM so there will almost certainly be errors.
Mirno
You guys are reinventing what we have already figured out...
- Yes, each page only needs to be probed at one DWORD.
- Use 'mov dword ptr [ebp], eax' instead of 'mov dword ptr [ebp], 0', to make probing smaller and faster.
- Just reserve the local stack space at once (sub/add only once), and then probe the pages (or vice versa); it makes less code this way.
Quote from: chep on June 29, 2005, 06:17:22 PM
But in our case I guess we don't mind if the stack is overwritten by a debugger before esp is adjusted, as we are writing dummy values just to make sure each page is probed.
On the contrary it's more likely a problem could arise if we adjust esp before probing the stack, as a debugger could then hit a non probed, unguarded page, thus leading to a GPF.
Am I wrong?
I don't know. What if we overwrite some important value the debugger is currently using? It might be possible that both ways would crash on some debuggers. But I guess your way might be better.
Mirno,
All the tests I have done show that only USES has an effect on the generated code. NEAR/FAR/calling convention etc do not affect the generation of the stack frame.
Also, you don't need the alignedLocals thing as MASM automatically rounds up the value for you:
TestProc PROC
LOCAL odd[3]:BYTE
...
TestProc ENDP
generates the following stack frame:
push ebp
mov ebp, esp
add esp, 0FFFFFFFCh ; -4
(even when using a custom prologue, the localbytes argument is already rounded)
Quote from: Petroizki on July 04, 2005, 06:11:12 PM
What if we overwrite some important value the debugger is currently using?
My understanding here is that debuggers don't (or at least shouldn't) leave important values on the stack: as soon as the control is returned to the debugged program, the debugger should assume that the program will mess up the stack (after all, it's the program's stack, not the debugger's).
Well, anyway I guess we'll have a hard time really sorting this out... unless a Visual Studio team member shows up to clarify everything!
I finally added an option for looped probing, using a macro argument (ROLLED) :
OPTION PROLOGUE:STACKPROBE
TestProc PROC <ROLLED> USES esi edi prm:DWORD
; ...
TestProc ENDP
It generates the following code:
push ebp
mov ebp, esp
add ebp, (-max_probe) ; [1]
@rolled:
mov DWORD PTR [ebp], eax
add ebp, page_size
cmp ebp, esp
jne @rolled ; [2]
add esp, (-localbytes)
The loop itself (from [1] to [2] inclusive) takes 19 bytes, while the unrolled version takes 6 bytes *per page*. So the rolled version becomes the more space-efficient one starting at 4 probed pages, i.e. strictly more than 12 KB of LOCALs.
Q: maybe it could be useful to have FORCEUNROLLED / FORCEROLLED arguments, and by default let the macro decide of the most efficient version?
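If the macro were to pick the smaller version by itself, the comparison would boil down to something like the following C sketch. The 19 and 6 byte figures are the ones quoted above; the names ROLLED_BYTES, UNROLLED_BYTES_PER_PAGE, unrolled_probe_bytes and rolled_is_smaller are made up for illustration and are not part of the macro.

```c
#include <stdint.h>

enum {
    ROLLED_BYTES = 19,           /* the loop from [1] to [2], fixed size */
    UNROLLED_BYTES_PER_PAGE = 6  /* one mov DWORD PTR [ebp-disp32], eax per page */
};

/* Hypothetical helper: code size of the unrolled probes for a page count. */
uint32_t unrolled_probe_bytes(uint32_t pages)
{
    return pages * UNROLLED_BYTES_PER_PAGE;
}

/* Hypothetical helper: returns 1 when the rolled loop is strictly smaller. */
int rolled_is_smaller(uint32_t pages)
{
    return ROLLED_BYTES < unrolled_probe_bytes(pages);
}
```

With 3 probed pages the unrolled code is 18 bytes and still wins by one byte; from 4 pages on (24 bytes) the loop is smaller. This only compares code size, not speed.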