Hey all,
So I've started trying to port my code and work to 64-bit (felt it was finally time), especially considering I have some projects I'm working on that really could benefit from the extra RAM/registers etc.
Things started off pretty smoothly, did all the reading up and experimenting. I decided to switch to jwasm + wininc instead of ml64 because I really can't live without the high-level syntax etc.
I still use Visual Studio 2010 to debug as I've always done.
So here is where I've run into some weird issues and have a few questions around the calling convention.
According to my understanding the following is the case:
1) The caller is responsible for decrementing and incrementing the stack pre/post call.
2) fastcall calling convention will pass the first 4 integer/ptr arguments in rcx,rdx,r8,r9 and the first 4 float args in xmm0-xmm3.
3) Shadow space is reserved on the stack in accordance with the parameters passed (their sizes) + any local variables.
Here is where i start getting confused:
Why does the stack need to be 16 byte aligned? I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potentially break the stack? as RSP is pushed before switching the RBP.. and popped at the end of the call, now if RSP is changed with the AND that could break?
What is the point of fastcall if the space is a) reserved on the stack anyway and then b) the assembler copies rcx,rdx etc into the shadow space automatically??? It seems to me like this is slower now than stdcall and is just wasting those registers?
Based on that calling convention.. I would traditionally use ecx as a loop counter, suppose now I have a loop with a call/invoke in the middle... if rcx was my loop counter the call generation doesn't save rcx etc.. so now my registers are trashed?
Is the intention that I should simply not ever use rcx,rdx,r8,r9,xmm0-xmm3 ??? or must I manually save these around every call? seems ridiculous to me as now the calling convention doesn't live up to its name of fast at all.. plus you can't push an XMM register, so that means a lot of effort involved in saving them around a call.
What is the actual difference between a PROC and a PROC with FRAME specified? I know that it's 64bit SEH compliant in terms of the prolog/epilog.. but how does this affect the entry and exit code in the procedure?
So now for my other major issue.. I think my code is breaking because of the calling convention issues mentioned above, but what is strange is if I take for example win64_3e from the jwasm samples.. build it in debug mode and debug it under VS2010, I can see the local variables, but their values never update IE:
WinMain proc FRAME hInst:HINSTANCE, hPrevInst:HINSTANCE, CmdLine:LPSTR, CmdShow:UINT
LOCAL wc:WNDCLASSEXA
LOCAL msg:MSG
LOCAL hwnd:HWND
In VS2010 I can see all 7 locals.. however jwasm doesn't generate the automatic copy of the registers to shadow space?? The example code manually does mov hInst,rcx which doesn't update hInst in the locals view.
If I build my project/code and go into VS2010, I can see no locals AT ALL??
Any help would be hugely appreciated!
Thanks
John
I'm just having a look at the WinInc includes.. which are supposedly for 32 and 64bit... but the typedefs and structs don't look right to me for 64bit.. they're still full of ptr's as DWORDS ???
Quote from: johnsa on January 19, 2012, 08:17:45 AM
...
According to my understanding the following is the case:
1) The caller is responsible for decrementing and incrementing the stack pre/post call.
2) fastcall calling convention will pass the first 4 integer/ptr arguments in rcx,rdx,r8,r9 and the first 4 float args in xmm0-xmm3.
3) Shadow space is reserved on the stack in accordance with the parameters passed (their sizes) + any local variables.
Correct, local variables remain as always. Shadow space is always reserved for arguments (away/independent from local variables).
If I recall correctly USES have changed position because of stack alignment.
Quote
Here is where i start getting confused:
Why does the stack need to be 16 byte aligned?
Because it is looking "cool" :P. Well the CPU kind of needs 8 bytes alignment in 64 bits long mode but 16 bytes is required because the compiler sometimes uses SSE code to move XMM registers around. If the stack address is not 16 bytes aligned the SSE MOV would crash... (MOVAPS etc) and the compiler is not wise enough to know when to use unaligned SSE moves.
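To make the alignment arithmetic concrete, here is a minimal sketch, assuming ml64 or jwasm -win64 and plain PROCs with no generated prologue. Helper and Caller are made-up names, and there is no unwind info here, so treat it as an illustration only.

```asm
; Sketch only: Helper and Caller are hypothetical names.
.code

Helper proc
    ret
Helper endp

Caller proc
    ; On entry RSP = 16n + 8: our caller was 16-byte aligned,
    ; then CALL pushed the 8-byte return address.
    sub rsp, 28h        ; 20h shadow space + 8 bytes padding -> RSP = 16m again
    call Helper         ; Helper may now rely on 16-byte aligned stack slots
    add rsp, 28h        ; undo the reservation
    ret
Caller endp

end
```

The 8 bytes of padding are the whole trick: any reservation of the form 16k + 8 puts RSP back on a 16-byte boundary before the next CALL.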
Quote
I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potential break the stack? as RSP is pushed before switching the RBP.. and popped at the end of the call, now if RSP is changed with the AND that could break?
GoASM :D ?
Yeah it is also my "impression" that sometimes it is possible (but i might be wrong).
I would never do an AND RSP,something ... my assembler does not do it in 64 bits prologue/epilogue generation.
Quote
What is the point of fastcall if the space is a) reserved on the stack anyway and then b) the assembler copies rcx,rdx etc into the shadow space automatically??? It seems to me like this is slower now than stdcall and is just wasting those registers?
The point is that this convention was created by people that have NO CLUE about ASM programming BUT unfortunately decided about ABI and ASM stuff.
Apparently they are under the influence of old MS DOS days when API transferred params by registers and "this is faster" TM :P (and copy cat from UNIX 64 bits ELF "ways")
Yes, sometimes (in inner loops) it is faster but it is STUPID in API calling API calling API ... they just do not understand it...
Wrong decision they made... we will have to live with it ... unfortunately it is done... accept it young Jedi :D
Quote
Based on that calling convention.. I would traditionally use ecx as a loop counter, suppose now I have a loop with a call/invoke in the middle... if rcx was my loop counter the call generation doesn't save rcx etc.. so now my registers are trashed?
Unfortunately yes. But note that ECX was also trashed by API's with STDCALL.
Be "wise" and only use FASTCALL for API and try to still use STDCALL for your own functions :D if possible ...
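One common way around the trashed counter is to keep it in a non-volatile register instead of RCX. A minimal sketch, assuming ml64 or jwasm -win64 with no generated prologue; Work and LoopDemo are made-up names and no unwind info is emitted:

```asm
; Sketch only: Work and LoopDemo are hypothetical names.
.code

Work proc               ; stand-in callee; it is allowed to trash rcx, rdx, r8-r11
    xor ecx, ecx
    ret
Work endp

LoopDemo proc
    push rbx            ; save non-volatile RBX; RSP goes from 16n+8 to 16n
    sub rsp, 20h        ; shadow space for the calls below, RSP stays aligned
    mov ebx, 10         ; loop counter lives in RBX, which callees must preserve
again:
    call Work
    dec ebx
    jnz again
    add rsp, 20h
    pop rbx
    ret
LoopDemo endp

end
```

The save and restore of RBX happens once per procedure, not once per call, so the cost inside the loop is zero.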
Quote
Is the intention that i should simply not ever use rcx,rdx,r8,r9,xmm0-xmm3 ???
Yes, we add registers ONLY in order to LOSE them and trash them and have more code saving and restoring them BECAUSE we are so "cool" about wasting registers ;)
You are correct in your sad observations.
Quote
or must I manually save these around every call? seems ridiculous to me as now the calling convention doesn't live up to its name of fast at all.. plus you can't push an XMM register, so that means a lot of effort involved in saving them around a call.
Yes but the compiler will do this easily; it is just hard for humans, hence stop programming in ASM :D (irony)
Basically compilers will only do it once at start of PROC code leaving enough space for ALL invokes inside a PROC ;) There is no real need to do it before each API invoke.
Yes it is not "fast" at all. It just looks like it.
You cannot PUSH XMM but you can MOV them to [RSP] and now you see why RSP has to be 16 bytes aligned ;)
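A minimal sketch of that MOV-based save, assuming ml64 or jwasm -win64 with no generated prologue. Work and KeepXmm are made-up names, and the slot offset is just one possible layout:

```asm
; Sketch only: Work and KeepXmm are hypothetical names.
.code

Work proc               ; stand-in callee; may trash xmm0-xmm5
    ret
Work endp

KeepXmm proc
    sub rsp, 38h            ; 20h shadow + 10h save slot + 8 padding (entry RSP = 16n+8)
    movaps [rsp+20h], xmm0  ; RSP+20h is 16-byte aligned here, so MOVAPS is safe
    call Work
    movaps xmm0, [rsp+20h]  ; restore the volatile XMM0 after the call
    add rsp, 38h
    ret
KeepXmm endp

end
```

If the slot were not 16-byte aligned, MOVAPS would fault and MOVUPS would be needed instead.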
Quote
What is the actual difference between a PROC and a PROC with FRAME specified? I know that it's 64bit SEH compliant in terms of the prolog/epilog.. but how does this affect the entry and exit code in the procedure?
JWASM specific, sorry....
Basically the epilogue / prologue is fixed in order for the "unwind" code to recognize it because the "cool" guys lost the easy way to do this ;)
Info about each PROC with FRAME will be stored in a PE section/directory and this makes your code easy to reverse and unwind if an exception occurs in your PROC
Quote
So now for my other major issue.. I think my code is breaking because of the calling convention issues mentioned above, but what is strange is if I take for example win64_3e from the jwasm samples.. build it in debug mode and debug it under VS2010, I can see the local variables, but their values never update IE:
WinMain proc FRAME hInst:HINSTANCE, hPrevInst:HINSTANCE, CmdLine:LPSTR, CmdShow:UINT
LOCAL wc:WNDCLASSEXA
LOCAL msg:MSG
LOCAL hwnd:HWND
In VS2010 I can see all 7 locals.. however jwasm doesn't generate the automatic copy of the registers to shadow space?? The example code manually does mov hInst,rcx which doesn't update hInst in the locals view.
If I build my project/code and go into VS2010, I can see no locals AT ALL??
I assume from my experience with my own 64 bits SOL_Asm ...(but I might be wrong) that JWASM does not generate the PDB files directly (undocumented).
Instead it generates the old CodeView format (documented) and converts it to PDB.
Unfortunately in this case you lose / do not have full LOCALS and ARGS information for debug ... sorry :D
Quote
I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potential break the stack? as RSP is pushed before switching the RBP.. and popped at the end of the call, now if RSP is changed with the AND that could break?
if the epilogue uses a LEAVE instruction, or otherwise restores RSP from a stored value, this is ok
Quote
I'm just having a look at the WinInc includes.. which are supposedly for 32 and 64bit....
i am reasonably certain that you use different versions for 32 or 64 bit
I would disagree with you Bogdan about using stdcall for your own functions, it is easier to track RSP if only you change it.
Example, my stack is always para aligned so when my function starts it knows it is out by 8 bytes, a simple push rbx aligns it and makes rbx available.
I must admit that I use ml64 but it makes you think about where the stack is.
Another thing about the alignment is that you can load the registers (rcx/rdx/r8/r9) as you have to and then push them to make the shadow space, maybe with a superfluous push for odd numbered params.
You can also go up to r15, so rbx, rdi, rsi, rbp and r12-r15 are available: 8 registers that Windows APIs don't trash (r10 and r11 are volatile, like rax, rcx, rdx, r8 and r9).
The hardest part for me is structures, not knowing C and its love of re-defining everything it is hard to keep up with pointers/dwords and alignment...
I honestly cannot find a correct definition in WinInc for 64bit.. unless i'm being really slow today :)
Ok.. so from looking at the disassembly from jwasm for a PROC FRAME with frame:auto set.. we have
;move regs to shadow space
push rbp
mov rbp,rsp
; Here I'd do and rsp,-16
; which means further down things which access the shadow parameter space are all wrong in the disassembly assuming RSP was changed...
mov dword ptr [rsp+20h],0
it then does an add rsp,20h.. i don't see it restoring RSP from anything safe? .. even so the code in between that addresses the shadow space would be wrong after the AND..
to be honest .. This stack alignment thing makes no sense... even with an AND rsp,-16 (which will break references to shadow space).. there is no guarantee that the stack will align correctly ever... imagine:
xyz PROC a:QWORD, b:QWORD, c:QWORD, d:QWORD, e:BYTE, f:REAL4 ....
a,b,c,d will be loaded into rcx,rdx,r8,r9 ... byte E will be pushed to the stack.. and f would have to be a MOVSS [rsp+x],xmm0 ...
I can only imagine that parameters would be either REAL4 or REAL8 using MOVSS or MOVSD which don't require alignment like MOVAPS does... ?
I guess one could implement a NEW proc macro for your own code which reverts to stdcall, as there isn't a way in the assembler by default to have it use fastcall for one and not the other... and this would mean you'd need a new invoke too... :( :(
As for the PDB.. I really cannot work without proper locals/args debugging, I guess Japeth would need to confirm the status on this one...
you have declared the _WIN64-equate before including windows.inc?
UNICODE EQU 1
WIN32_LEAN_AND_MEAN EQU 1
_WIN64 EQU 1
include windows.inc
I have.. although I don't see that being used in WINGUI1.ASM example in WinInc for the 64bit gui sample app.. plus I cannot find anything inside windows.inc or its children where it actually changes the definitions depending on that equate.. IE: HWND = dword / qword.
On a side note... the PDB file format shouldn't be necessary, as I still link using MS link ??
Link generates the PDB file from the OBJ file which is all JWASM has to produce.. and that format is open?
Quote from: BogdanOntanu on January 19, 2012, 11:50:25 AM
Quote
I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potential break the stack? as RSP is pushed before switching the RBP.. and popped at the end of the call, now if RSP is changed with the AND that could break?
GoASM :D ?
Not GoAsm, it uses
OR SPL,8 to align the stack.
Invoke
push rsp
push q[rsp]
or spl,8
<params>
sub rsp,20h
call function
Parameters are optimized, for example a parameter of zero uses XOR instead of MOV if it is passed in a register. It also takes advantage of the zero extension behaviour when moving smaller numbers into registers to reduce code size.
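As a rough illustration of those two optimizations (written in MASM/JWASM syntax rather than GoAsm's own, with ArgDemo as a made-up name):

```asm
; Sketch only: ArgDemo is a hypothetical name.
.code

ArgDemo proc
    xor ecx, ecx            ; RCX = 0; writing ECX clears the upper 32 bits of RCX
    mov edx, 1              ; RDX = 1, with a shorter encoding than mov rdx, 1
    mov r8d, 0FFFFFFFFh     ; R8 = 00000000FFFFFFFFh (zero-extended, not -1)
    ret
ArgDemo endp

end
```

The last line shows the catch: zero extension is only a safe shortcut when the upper half of the 64-bit value is meant to be zero.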
Stack frame:
mov [rsp+8],rcx
mov [rsp+10h],rdx
mov [rsp+18h],r8
mov [rsp+20h],r9
mov rbp,rsp
So I've tried using jwlink instead now.. it doesn't seem to be able to link a file created by jwasm when the -Zi switch is on.... this is degenerating quickly.. i may be forced back to 32bit code :)
Ok.. so a small update... I stipulated /machine and /subsystem on link .. which wasn't there and it wasn't complaining.
I also installed jwasm 2.07 pre.
I then changed the prototype of my WinMain.. which didn't work at all, as it seems to be pre-defined, so I renamed it to WinMainX.
And now.. I can see args and locals in VS2010!!!
BUT..
They're not getting assigned the right values... lol
hInstance is null, hPrevInstance is getting what should be hInstance, and CmdLinePtr is getting SW_SHOWDEFAULT....
one step closer...
The first problem seems to be...
WinMainX proto :QWORD, :QWORD, :QWORD, :DWORD
000000013F9B10EE mov qword ptr [rsp+8],rcx
000000013F9B10F3 mov qword ptr [rsp+10h],rdx
000000013F9B10F8 mov qword ptr [rsp+18h],r8
000000013F9B10FD mov qword ptr [rsp+20h],r9
From VS2010 disasm... the last parameter is still being stored as r9 (qword) .. when the parameter passed should be r9d only...
The parameters in order should be :
hInstance, hPrevInstance, CmdLine, nShow
but in the VS locals view the order is
hInstance, CmdLine, hPrevInstance, nShow ....
I don't know if this is ok, but the values are not going into the right locals... even tho the above code seems to be putting them onto the stack correctly.
Narrowed it down....
in the disasm view.. the args and locals ONLY update correctly when the push rbp is executed, now they all line up.
If you assemble and the PROC FRAME is used, it seems to do the ordering in such a way that you can see the right values from disasm, but not from source view.. if you leave FRAME off.. then F10 in the proc heading brings it all into line as expected.
So i think this is definitely a bug in jwasm ?
Something about the code ordering when FRAME is specified causes VS debugger to not put the cursor on the procedure heading but rather the first instruction in the proc...
[EDIT]... it gets worse.. during execution of code inside WinMain as other procs are called RSP is adjusted, which totally buggers up the locals/args .... It would seem that VS2010 uses the current RSP to determine the values for locals which is moving around constantly...
It seems like the code generated by JWASM doesn't conform and to me just isn't correct.
Based on looking at the 64bit disasm of Visual C++ apps I draw the following conclusions:
1) There should be NO need to AND or align the stack pointer, the prolog should deal with all of this... especially considering you cannot pass xmmN as a param, the assembler would only allow a real4/real8 which would be moved using MOVSS/SD... no need for alignment. In addition if somewhere somehow you needed to mov a full XMM onto the stack I believe there was a new opcode (can't remember what it's called now) that will automatically handle between movaps/movups.
2) the generated prologue should not be modifying RBP?
3) RSP should be sub'ed/added to INSIDE the callee not the caller..
The code should look something like:
000000013F7D1040 mov dword ptr [rsp+20h],r9d
000000013F7D1045 mov qword ptr [rsp+18h],r8
000000013F7D104A mov qword ptr [rsp+10h],rdx
000000013F7D104F mov qword ptr [rsp+8],rcx
000000013F7D1054 push rdi
000000013F7D1055 sub rsp,70h
000000013F7D1059 mov rdi,rsp
....
000000013F7D1152 add rsp,70h
000000013F7D1156 pop rdi
000000013F7D1157 ret
NB> there is still that bug with jwasm prolog moving the full R9 to stack instead of just R9d
Does someone have a contact for Japeth so we could get resolution on this? Or at least a way to work around it.. perhaps a new set of macros to avoid the built-in prologue..invoke...?
Ok.. I think we can work around this problem... I posted in the laboratory earlier with a new typed struct to represent XMMWORD (_m128) ..
If we combine that with a new proc/endproc and invoke macro loosely based on the following:
option prologue:none
option epilogue:none
align 16
testproc PROC a:_mm128
LOCAL myVar:DWORD
push rbp
sub rsp,2ch ;28h 4 params + return addr
mov rbp,rsp
mov myVar,10
add rsp,2ch
pop rbp
ret
testproc ENDP
option prologue:PrologueDef
option epilogue:EpilogueDef
I think we could avoid the register->shadow space bug and fix the code generation by not breaking RSP outside of the callee... In addition we ONLY and RSP,-16 once at the beginning of an application.. then the new PROC macro ensures the stack stays aligned by correctly inserting padding and rolling out the params + locals to the stack in aligned increments....
The above seems to work perfectly in VS, just like the C++ code..
Ok.. anyone want to help with the macros? :)
If your proc doesn't call any windows functions there is no need to align the stack.
"sub rsp,2ch" will cause problems since the stack is usually used in qword chunks, so a dword local should be 'promoted' to 8 bytes, with the upper dword only used for alignment to 8.
also, it is common practice to PUSH the base pointer register, then load it with the value of the stack pointer
adjustments for local variables are then made to the stack pointer
when you exit the routine, the stack pointer may be restored by using the base pointer value
i would imagine that the LEAVE instruction works under 64-bit, and is the same as MOV RSP,RBP/POP RBP
if other registers are preserved, they are pushed before the base pointer and popped after the stack and base pointers have been restored
this seems like a logical sequence
OPTION PROLOGUE:None
OPTION EPILOGUE:None
TestProc PROC ParmA:_mm128
push rbx
push rsi
push rdi
push rbp
mov rbp,rsp
sub rsp,4 ;create a dword local at [rbp-4]
and rsp,-16 ;align the stack to create 16-aligned locals
sub rsp,32 ;create any simd locals
;
;
;
leave ;this restores RSP and RBP, discarding any locals
pop rdi
pop rsi
pop rbx
ret 16 ;return, discarding the _mm128 parm
TestProc ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
I tried getting the _m128 onto the stack, but jwasm always sends it by reference rather than value... it might be nice to somehow update invoke to cater for that, so it could BY VAL or BY REF.. so currently it would just be a ret 8(instead of a ret 16) I'm guessing as it's a ptr.
Quote from: sinsi on January 20, 2012, 11:37:01 AM
If your proc doesn't call any windows functions there is no need to align the stack.
"sub rsp,2ch" will cause problems since the stack is usually used in qword chunks, so a dword local should be 'promoted' to 8 bytes, with the upper dword only used for alignment to 8.
It makes sense, but VC++ disagrees :) It pushes R9d onto the stack instead of the full R9. Also then what would happen to a byte/word parameter?
Even under win32 stdcall you'd have stack alignment issues if your params were say dword, byte, dword... ? So not having a qword aligned as a parameter I guess wouldn't be worse than the old case. Alternatively I'm sure it would be possible for the prologue to correctly gauge the params and just do one large 16n subtract of RSP to ensure that all referenced locals are aligned natively.. even if it means wasting a few bytes of stack space here and there?
"ParmA:_mm128" should make space on the stack for the data, not a pointer
however, it is probably easier to pass a pointer whenever the data is larger than the machine width :P
Agreed :) But it would be nice to have control over it.. in any event jwasm's invoke automatically makes it a ptr... more reason for a serious update to proc/invoke :)
well, you created a new type with your _mm128 stuff
you can reference the pointer as a pVoid (PVOID) type (i guess that's the 64-bit hungarian)
i might add....
if you pass simd data as a parameter, you have to be concerned with stack alignment of the parm(s)
this could be tricky, maybe even very difficult for INVOKE to verify and implement
whereas, if you pass a pointer to the simd data, you don't have to worry about such things :U
To be honest.. I have never yet needed to pass an XMM as a parameter. I've lived for years in MASM32 only passing pointers to SIMD data.. So although nice, my last few suggestions are purely *WISH LIST* things.
Right now the only problem I have is getting JWASM's built code to work in the debugger... that's an absolute show stopper for me, to be able to use VS2010 for debugging and see args/locals.. For this I think we need a macro patch for PROC and INVOKE... ?
Hey,
After some discussion with Japeth we've worked out the problem, found specs around the API and how VC handles all of this correctly..
Basically what happens in VC is that it analyzes the call graph within a procedure by looking at every call that procedure makes and how much stack to allocate to accommodate this.
It appears to be something along the lines of sub RSP,32+(MAX_CALLED_PARAMS*16)+(MAX_PRIMITIVE_LOCALS*16)+ROUND_UP_TO_PARA(MAX_LOCAL_STRUCTS)... it then reserves this in the caller.
In addition VC doesn't write full QWORD registers to the stack as its way of promoting .. it simply aligns the slots to QWORDS and writes in the respective type byte,word,dword etc.
The ABI clearly states that structs which aren't sized the same as one of these simple types should be passed by reference/pointer and not value (so that idea I've shelved).
No 64bit assembler currently does this... and it has many benefits
1) Allows one to debug 64bit applications with Visual Studio and use tools like VTune correctly with locals/arguments.
2) Follows the ABI more closely
3) Improved performance of generated code for two reasons, lower call overhead and by keeping the stack fixed throughout the duration of a proc regardless of how many calls it makes would improve the likelihood of stack data being cached.
What this means for the assembler:
1) modify INVOKE code generation - remove add/sub RSP around the call..
2) Add the difficult part - tracks calls inside a proc when rolling-out invokes and their locals/parameters to plug in the prologue for the caller..
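Roughly what the generated code could look like after such a change, written by hand as a minimal sketch. It assumes ml64 or jwasm -win64 with no generated prologue; Six and Caller are made-up names, and the frame size is worked out manually for this one case:

```asm
; Sketch only: Six and Caller are hypothetical names.
.code

Six proc                        ; stand-in callee taking six integer arguments
    mov rax, [rsp+28h]          ; arg 5: above return address (8) + shadow (20h)
    add rax, [rsp+30h]          ; arg 6
    ret
Six endp

Caller proc
    sub rsp, 38h                ; once: 20h shadow + 10h for args 5-6 + 8 padding
    mov ecx, 1
    mov edx, 2
    mov r8d, 3
    mov r9d, 4
    mov qword ptr [rsp+20h], 5  ; arg 5 stored with MOV, no PUSH
    mov qword ptr [rsp+28h], 6  ; arg 6
    call Six                    ; RSP does not move around the call
    xor ecx, ecx                ; second call reuses the same reserved area
    xor edx, edx
    xor r8d, r8d
    xor r9d, r9d
    mov qword ptr [rsp+20h], 7
    mov qword ptr [rsp+28h], 8
    call Six
    add rsp, 38h                ; once, in the epilogue
    ret
Caller endp

end
```

Because RSP is fixed between prologue and epilogue, RSP-relative offsets to locals and arguments stay valid for the whole body, which is presumably what the debugger relies on.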
This is quite a bit of work to change, but I think the benefits are worth it.. in terms of JWASM it would make it THE ONLY choice for 64bit Windows asm development.. as it's already near perfect and fast.
Hopefully Japeth agrees :) I've even offered to put money towards the effort if required as it really is worth it to me to be able to use my full tool-set in 64bit and Visual Studio is my debugger of choice.
It's an incredible piece of software and deserves it... I'd rather spend money on helping get it 100% than paying more money to MS for any other tool!
John
I agree, I will try to implement this kind of stuff to my SOL_ASM also ;)
So I thought I should bash away at replacing the prologue,invoke etc to at least be able to use 64bit in the meantime... and i've run into more issues...
; A proper type definition for an XMMWORD that allows debugger to see sub-elements/types.
__mm128i struct
i0 DWORD ?
i1 DWORD ?
i2 DWORD ?
i3 DWORD ?
__mm128i ends
_mm128i typedef __mm128i
__mm128f struct
f0 real4 ?
f1 real4 ?
f2 real4 ?
f3 real4 ?
__mm128f ends
_mm128f typedef __mm128f
_mm128 union
i32 _mm128i <>
f32 _mm128f <>
_mm128 ends
BNT MACRO
db 2eh
ENDM
BTK MACRO
db 3eh
ENDM
option casemap:none
option win64:1
option frame:auto
.nolist
.nocref
WIN32_LEAN_AND_MEAN equ 1
_WIN64 EQU 1
include c:\jwasm\WinInc\Include\windows.inc
.list
.cref
includelib <kernel32.lib>
includelib <user32.lib>
;myproc proto a:DWORD, b:REAL4, cc:QWORD
.const
.data?
.data
.code
NewPrologue MACRO procname, flags, argbytes, localbytes, <reglist>, userparms:VARARG
;mov [rsp+8],rcx
;mov [rsp+16],rdx
;mov [rsp+24],r8
;mov [rsp+32],r9
ECHO localbytes
IF localbytes GT 0
push rbp
mov rbp,rsp
sub rsp,(8*16)
ELSE ; If there are no locals, simply reserve space for 16 parameters.
push rbp
mov rbp,rsp
sub rsp,(8*16)
ENDIF
IFNB <reglist>
FOR reg,reglist
push reg
ENDM
ENDIF
exitm <(8*16)>
endm
NewEpilogue MACRO procname, flags, argbytes, localbytes, <reglist>, userparms:VARARG
IFNB <reglist>
FOR reg,reglist
pop reg
ENDM
ENDIF
leave
retn 0
endm
main proc
;invoke myproc , 1 , 10.2 , 4
mov ecx,1
pxor xmm0,xmm0
mov rdx,4
call myproc
ret
main endp
mainCRTStartup proc
invoke main
invoke ExitProcess,0
mainCRTStartup endp
option PROLOGUE:NewPrologue
option EPILOGUE:NewEpilogue
myproc proc a:DWORD, b:REAL4, cc:QWORD
LOCAL var1:DWORD
LOCAL var2:BYTE
LOCAL var3:REAL4
LOCAL var4:QWORD
LOCAL var5:_mm128
mov eax,var1
ret
myproc endp
end mainCRTStartup
If you look at that code, the localbytes that is sent into the prologue macro is reporting itself as 40 bytes (28h) upon assembly... which is completely wrong...
it seems like it's just taking 5 locals * 8.. and isn't respecting the size of the struct type _mm128.
Furthermore there seems to be a problem in the actual ABI specification, which states that 4*8 must be reserved as a minimum for shadow space, but they're neglecting that there could be 4 integer register parameters AS WELL as 4 float params in xmm0-xmm3.
So really the minimum reservation should be 8*8 (64 bytes) not 32 to ensure that a proc can copy all params from reg to stack...
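A note on this point: as far as I can tell from the Microsoft x64 calling convention documentation, register assignment is positional. Argument 1 goes in RCX or XMM0, argument 2 in RDX or XMM1, and so on, so there are never more than four register arguments in total and 32 bytes of shadow space cover them. A minimal sketch for the myproc signature above, assuming ml64 or jwasm -win64 with no generated prologue; fval and CallMyProc are made-up names, and myproc is reduced to a stub:

```asm
; Sketch only: fval and CallMyProc are hypothetical names.
.data
fval real4 10.2

.code

myproc proc                 ; stand-in for myproc a:DWORD, b:REAL4, cc:QWORD
    ret
myproc endp

CallMyProc proc
    sub rsp, 28h            ; 20h shadow space + 8 bytes padding
    mov ecx, 1              ; slot 1 (a):  integer -> RCX, XMM0 is not used
    movss xmm1, fval        ; slot 2 (b):  float   -> XMM1, RDX is not used
    mov r8d, 4              ; slot 3 (cc): integer -> R8
    call myproc
    add rsp, 28h
    ret
CallMyProc endp

end
```

If that reading is right, the hand-written call in main above (float in xmm0, cc in rdx) would put b and cc in the wrong registers.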
I'm assembling with jwasm, going to see if ML64 reports the same incorrect localbytes value.
It appears ML64 also outputs 28h for local bytes...