Regarding Stack

Started by theunknownguy, June 23, 2010, 06:55:46 PM


Rockoon

Phenom II X6 1055T @ 3.36 GHz:

push: 236
mov: 236
push const: 236
mov const: 236
Press any key to continue ...
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

redskull

#31
Here is some info I wrote up on the stack engine; if anyone knows differently, please point it out.

First, some preliminaries:

    Modern Intel chips (anything after the PII, and the Itanium) use a radically different approach from earlier-era chips; in short, while still "CISC" chips on the outside, they operate as "RISC" chips under the hood.  They do this by breaking instructions up into "micro-ops", or just "uops" (where the 'u' is supposed to be the Greek letter 'mu', the metric prefix for micro).
    One "complicated" instruction is broken down (by the decoding stage) into several simpler uops.  For example:
   
    CALL MyFunction
   
    can be thought of (conceptually) as:
   
    PUSH EIP
    JMP  MyFunction
   
    which can be further broken down into

    SUB ESP,4
    MOV [ESP],EIP
    MOV EIP, OFFSET MyFunction
   
    and so on.  All these uops from the instruction stream are then directed to the appropriate part of the chip, called an *Execution Unit*, or EU.  Normally, Intel chips have around 4 or 5 different EUs, each of which handles a different type of instruction (depending on your particular chip; most have more than one ALU):
   
    1) Arithmetic Logic Unit (ALU)
    2) Memory Reads
    3) Address calculation
    4) Memory Write
   
    So, for example, the three hypothetical uops from above would be sent to the ALU (for the SUB), the Memory Write unit (for the MOV to [ESP]), and again to the ALU (for the final MOV into EIP).  Each EU can operate independently of the others, so different uops from different instructions can execute "out of order"; this allows the CPU to work on other parts of other instructions without having to wait while slow, unrelated ones finish.
    There is much more to be said about this, and this barely scratches the surface.  It's an extremely complex system which determines which uops can be executed, which ones have to wait for others to finish, and when an entire instruction is complete.  The trick to *real* optimization is making sure that all EUs are filled with uops, all the time.
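    For example, here is a small sketch (purely illustrative, not taken from any manual) of two independent chains of work that the EUs can overlap "out of order":
   
     mov  eax, [esi]      ; load  - memory-read unit (slow if it misses the cache)
     add  eax, 1          ; ALU   - must wait for the load above to finish
     mov  ebx, ecx        ; ALU   - independent; can execute while the load is still in flight
     mov  [edi], ebx      ; store - memory-write unit; depends only on EBX
   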
    Anyway, onto the stack engine itself:
   
The "Stack Engine"

    Older CPUs basically work like the above: manipulating the stack pointer (ESP) with ALU uops, which perform the adding or subtracting.  Newer CPUs have what's called a STACK ENGINE, which is special circuitry dedicated only to adjusting the stack.
    The stack engine lives as part of the decoder (which generates uops).  It exists to optimize just four different instructions: PUSH, POP, CALL, and RET (but *not* RET n).  These have the unique property that they all adjust ESP by *exactly* 4 bytes, every time, no matter what.
    It does this by keeping track of the "stack delta", which is just the relative difference between the stack pointer "now" and the stack pointer "later".  Each time it detects one of these four special instructions, it alters its delta number up or down by four as needed.
    The magic happens, though, when it comes time to generate the uops; instead of generating one for the stack pointer math and one for the move, it generates just the one for the write, *but inserts the delta number into the address*.  Because all memory writes must go through the address calculation, there is no performance loss, and an entire ALU uop is avoided.
    For example, consider the above example, where our 'CALL' was turned into three (purely hypothetical) uops:
   
    SUB ESP,4
    MOV [ESP],EIP
    MOV EIP, OFFSET MyFunction
   
   We'll assume the current delta in the stack engine is 0.  It notices that this is a "CALL", and adjusts its stack delta to -4.  Then it removes the "SUB" uop entirely, and adds its current delta into the MOV uop:
   
    MOV [ESP-4],EIP             ; Current Delta value inserted here
    MOV EIP, OFFSET MyFunction
   
    This cuts out an entire uop!  Considering that most programming is made of CALLs, and most functions have PUSHed arguments, it's a non-trivial speedup.  The stack engine is fast enough to keep pace with the decoder as well, so there is no bottleneck in doing this conversion.
    To extend the example, imagine 3 PUSHes and 1 CALL: the first PUSH would use a delta of -4, the next -8, then -12, then the CALL -16.  RET and POP work the opposite way; they increase the delta, and that value is added into the memory-access uop.
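    As a sketch (using the same hypothetical uop notation as above), the deltas for that sequence would track like this:
   
     push eax            ; delta -4,  emitted uop: mov [esp-4],  eax
     push ebx            ; delta -8,  emitted uop: mov [esp-8],  ebx
     push ecx            ; delta -12, emitted uop: mov [esp-12], ecx
     call MyFunction     ; delta -16, emitted uop: mov [esp-16], return address
                         ; no ALU uop touches ESP anywhere in this sequence
   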
    The first problem, however, is that now the "real" value of ESP (inside the CPU) is no longer correct.  We never actually modified it, so if another instruction wants to use it, it would be out of sync.  Continuing with the above, imagine that after our CALL, our function sets up a stack frame:
   
    MOV EBP,ESP
   
    This is troublesome, because ESP was never changed!  When the stack engine detects another instruction (not PUSH/POP/CALL/RET) is using the stack pointer in some way, and the delta is non-zero, it must "synchronize" the two.  It does this by merely inserting a "synchronization uop", or just synch op.  It does nothing but add the value of the delta to ESP.
   
    ADD ESP,STACK_DELTA ; this uop corrects ESP (and the engine zeros the delta)
    MOV EBP,ESP         ; now ESP is correct, and can be used safely in this uop
   
    The second problem, however, is that the stack engine delta is only an 8-bit signed integer, which rolls over at +/- 128 (that's 32 consecutive stack operations in one direction).  To avoid this, when the stack delta gets high enough, the engine inserts the same style of synch op to reset the delta back to zero.  So, in the rare case that you do 100 PUSHes in a row, you will get 3 extra synchronization uops inserted into the stream to prevent this rollover.
   
    So, to sum up, if you use "MOV [ESP-n], reg" followed by a "SUB ESP,m", you merely do by hand what the CPU is programmed to do automatically when you use PUSH.  The only real difference is that you explicitly set the value of ESP at the end, which the CPU will delay doing until necessary (when the first direct access to ESP occurs, via a synch uop).  However, you will almost undoubtedly suffer another synch op at the *start* of the MOVs, because most likely the delta is not zero at that point.  In the rare case that you have over 32 arguments to PUSH, you would actually save one or two synch ops, which would otherwise be needed to prevent rollover.  In either case, though, the difference is only a couple of uops either way, and certainly negligible.
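    To make that concrete, here is a sketch of the two equivalent sequences, annotated with where the stack engine would act (assuming a non-zero delta on entry):
   
     ; manual form
     mov  [esp-4], eax   ; reads the real ESP - a synch uop is inserted first if the delta is non-zero
     mov  [esp-8], ebx
     sub  esp, 8         ; explicit ALU uop updates ESP immediately
   
     ; PUSH form
     push eax            ; delta -4, no ALU uop, no synch needed
     push ebx            ; delta -8
                         ; ESP is only corrected later, when something other than
                         ; PUSH/POP/CALL/RET touches it
   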
   
    Hope this helps clear things up
   
    -r
   
Strange women, lying in ponds, distributing swords, is no basis for a system of government

theunknownguy

Just amazing info redskull, that's what I was trying to understand by "in depth".

Can I ask how you learned this stuff? I really like to understand everything in as much detail as possible.

Quote
These have the unique property that they all adjust ESP by *exactly* 4 bytes, every time, no matter what.

What happens with PUSH WORD?

redskull

Quote from: theunknownguy on June 24, 2010, 12:13:42 AM
Can I ask how you learned this stuff? I really like to understand everything in as much detail as possible.

All that is from Agner Fog's simply amazing microarchitecture manuals.  They're pretty dense, but well worth the read, and totally free.

Quote from: theunknownguy on June 24, 2010, 12:13:42 AM
What happens with PUSH WORD?

Your Program crashes  :8)

But really, I have no idea.  I would presume it would just do it the "normal" way, by adjusting ESP via a normal ALU uop.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

jMerliN

To clarify where this question came from and why the OP asked about this here, I'll bring up the argument he made (and lost, clearly misunderstanding anything about what I said), how he contorted it into something about PUSH versus a high-speed string instruction for moving a large amount of data onto the stack, and why, for some reason or another, he felt justified in talking about x64 calling conventions, when we're really only talking about a local stack frame.

On RZ, he comes asking a question of "high level minds," (for full disclosure, view the thread here: http://forum.ragezone.com/f144/testing-high-level-minds-669035/) wanting to know what we think certain things are.  He gets the English horribly wrong, and then proceeds to bash any answer anyone gives as if he's some all knowing superior intellect (he hangs around the MASM32 forums, but this doesn't necessitate any level of knowledge of how things work, he's probably never even had a 10K project before).

So in arguing with another user on that forum, he posts this code, claiming this was an "implementation" of someone's abstract definition of a stack (which was for all intents and purposes, correct):


push ebp
mov ebp, esp
mov eax, NumberOfVariables  ;Number of variables you have
imul eax, eax, DWORD        ;Initialise each variable to 4 bytes
add esp, eax
mov [esp], 1                ;Lets put a variable with value 1
pop edx                     ;Restore it on EDX
mov [esp+4], 1              ;Next variable with value 1
pop eax                     ;Restore on EAX
add eax, edx                ;1+1 is so simple has knowing you are a noob
push eax                    ;Push it for save into stack


I reply to this later in the thread, while flaming the kid for what I can only interpret as utter stupidity, since in his example he USES a stack to set up a common stack frame with a dynamic number of local variables and then proceeds to do this stupid arithmetic.

Do note, though, the errors present if we assume the traditional call model: he's adding to ESP for local vars, not subtracting, so he'd overwrite some value in the local space of the calling function (wut?).  Further, when he "sets the variables" to value 1, POP intrinsically adds 4 to ESP each time, so the [esp+4] for the second variable is the original ESP + 8, meaning he skips 4 bytes there.  But loosely trying to interpret what he meant (that is, from his comments), we'll continue.
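For reference, a guess at what the sequence was probably meant to do, with the sign and the offsets fixed (purely my interpretation of his comments, not his code):


push ebp
mov ebp, esp
sub esp, 8                  ;Room for two 4-byte locals (SUB, not ADD)
mov dword ptr [esp], 1      ;First variable with value 1
mov dword ptr [esp+4], 1    ;Second variable with value 1
mov edx, [esp]              ;Load them without disturbing ESP
mov eax, [esp+4]
add eax, edx                ;1+1
push eax                    ;Push it to save into stack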

My response was that if this were the result of a compiler producing ASM from C code ("high level", right, hence clearly vastly inferior to anything he can write), it would look something like this:


int myfunc(){
 int one = 1;
 int one2 = 1;
 one2 = one2 + one;
 // ...
}


My argument was that any compiler worth its salt with a decent optimizer would produce assembly for this code, with only what we know about it, as "push 2" (in accordance with his last comment, "save into stack", and because the frame he sets up is unnecessary to the end goal, which is computing 1+1 statically).  I claimed that this single instruction was more efficient and produced the same exact effect as the code he posted (assuming it does what he meant), and clearly, not to insult anyone's intelligence, it is.

His response to this was some nonsensical bullshit I can't comprehend:

Quote
Thats kind of sad really, so the mov [esp], XX its not faster than a push 2?...

Curious since push inside CPU would do:


mov [esp], XX
add esp, 4

A little more lecture kid...

You see, he completely misunderstood what I said.  So this thread was his attempt to prove that his precious "mov [esp], xx" is faster than "push 2".  He further demonstrates what intrinsically happens with a PUSH (as explained in a good bit of detail above), which I responded is faster because of optimizations made in hardware (which he apparently did not believe, so I had assumed he doesn't read much).  The argument was never about whether pushing a large amount of data onto the stack is less efficient than using a high-speed string instruction to fill in a large chunk of the stack followed by adjusting ESP; that was borne of his own mind and ignorance.

This very thread culminated in the flame posted here:  http://forum.ragezone.com/f144/x32-64-push-vs-mov-test-671634/ .

I know you're not the minions of a 15 year old kid who barely knows English, and you don't intend to empower an astonishing level of ignorance on the internet, but I just thought I'd point out to you where this came from, and that this kid is just another leech.

theunknownguy

Quote from: redskull on June 24, 2010, 01:02:52 AM
Your Program crashes  :8)

But really, I have no idea.  I would presume it would just do it the "normal" way, by adjusting ESP via a normal ALU uop.

push 3131
pop word []
push 3131
pop word []

I use this as a method to generate an obfuscated NULL on the stack without crashing; at first it messes up the stack viewer in debuggers...

I will read more of Agner Fog's manuals =D

theunknownguy

Quote from: jMerliN on June 24, 2010, 01:05:57 AM

As you can read, this post is serious; the idea comes from x64 reg usage, as I posted on the RageZone forum (you just have a problem understanding me).

And don't dare to use this forum as a discussion matter.

Still, putting a large amount of data on the stack, as I say over and over, is more efficient using the "REP" prefix than doing PUSH 2 tons of times.

And I still say the COMPILER can't do it.

PS: Also, as I argued on that thread, using direct MOVs to the stack could avoid the "stack frame" generation by compilers like the C++ ones.

I think this is more like a high-level language fight against low level, which is pointless.


redskull

Quote from: theunknownguy on June 24, 2010, 01:12:50 AM
Still, putting a large amount of data on the stack, as I say over and over, is more efficient using the "REP" prefix than doing PUSH 2 tons of times.

Be careful what you say; few things on Intel chips are as slow as REP instructions.  Also, as a courtesy, please keep your other-forum flame wars in the other forums.  Do like the rest of us and start your own here with our own members   :toothy

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

theunknownguy

Quote from: redskull on June 24, 2010, 01:19:51 AM
Be careful what you say; few things on Intel chips are as slow as REP instructions.

I ran a test to fill up a large space on the stack; though you're right that "REP" is one of the slowest sh*ts, it works well enough.

Also, what about C++ compilers? It's fair to think that using direct stack MOVs could avoid the stack frame creation, isn't it? (and save some clocks)

And I don't involve the forum in other forums' fights; if you look at the thread, I was making a survey I need to present in another forum dedicated to the understanding of high-level logic.

PS: I can't do my survey here; I mean, this is a low-level coding forum... Though I could do research on low-level logic abstraction with a few questions, but damn, I would be the one questioned... (on this forum)  :toothy

jMerliN

Quote from: theunknownguy on June 24, 2010, 01:12:50 AM
As you can read, this post is serious; the idea comes from x64 reg usage, as I posted on the RageZone forum (you just have a problem understanding me).

The discussion wasn't about the use of registers to pass data to function calls at all.  You still don't understand, I see.

You have a horrible grasp of the optimizations that can be made from a syntax only slightly higher-level than that of a macro assembler.  You should read http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886 ; I suggest this only because of the price at which you can get a used copy (practically free); the newer edition is still quite expensive, but may cover more recent optimization techniques.  You don't seem to understand what exactly a compiler does, and how it can easily reduce code like that to highly optimized assembly with just as much (if not more) knowledge than you have of hand-optimizing assembly.

Let me give you a demonstration, as this is not a "high level" vs "low level" fight; the point was just that even a "high level" language (which to you means completely inferior in every way) such as C could produce output that does the same thing with far less work than your assembly was doing.


#include <stdio.h>

int main(int argc, char** argv){
 int a = 1;
 int b = 1;
 b = a + b;
 printf("a+b = %d",b);
 return 0;
}


Compiling this then opening it with OllyDbg yields:


01221001   6A 02            PUSH 2
01221003   68 F4202201      PUSH OFFSET test.??_C@_08CLODKBON@a?$CLb>; ASCII "a+b = %d"
01221008   FF15 A0202201    CALL DWORD PTR DS:[<&MSVCR100.printf>]   ; MSVCR100.printf
0122100E   83C4 08          ADD ESP,8
01221011   33C0             XOR EAX,EAX
01221013   C3               RETN


As you can see, my point has been proved.  Thanks much for the disbelief.



Quote from: redskull on June 24, 2010, 01:19:51 AM
Also, as a courtesy, please keep your other-forum flame wars in the other forums.  Do like the rest of us and start your own here with our own members   :toothy

I will gladly, if you'll keep your members from starting flame wars on other forums and then coming back here for arguments.

theunknownguy

Quote from: jMerliN on June 24, 2010, 01:29:28 AM
As you can see, my point has been proved.  Thanks much for the disbelief.

I can't see how what you're writing relates to what I am saying.

I am saying that C++ compilers won't ever (by themselves) use the "REP" prefix in case you wanted to fill the stack with some data...

My point is that by using the stack as "emulated" regs of x64 you could avoid doing the "stack frame" and in the end you'll save some clocks.
(Under x64 you can do this with the new regs)

Also, things like this in a high-level language like C++, I will never see without the _ASM directive.

I keep saying C++ compilers have their limitations on "optimising".

jMerliN

Quote from: theunknownguy on June 24, 2010, 01:37:38 AM
I can't see how what you're writing relates to what I am saying.

I am saying that C++ compilers won't ever (by themselves) use the "REP" prefix in case you wanted to fill the stack with some data...

This is not true.  There are several things you can do to make the popular compilers use REP to fill in stack data in C/C++ if you take advantage of their optimization engines.
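
For instance, a large local array that gets zero-initialized (or an explicit memset) will often be lowered to a REP store, depending on the compiler, its flags, and the size involved.  A hand-written sketch of the kind of sequence that tends to come out (not actual output from any particular compiler):


sub esp, 400                ;reserve 400 bytes of stack space for the buffer
mov edi, esp                ;destination = start of the block (note EDI gets clobbered)
mov ecx, 100                ;100 dwords = 400 bytes
xor eax, eax                ;fill value 0
rep stosd                   ;store EAX to [EDI] and advance, ECX times
;... use the buffer, then release it:
add esp, 400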

Quote
My point is that by using the stack as "emulated" regs of x64 you could avoid doing the "stack frame" and in the end you'll save some clocks.
(Under x64 you can do this with the new regs)

The stack isn't being used to emulate registers.  It's used to hold data; that's its purpose.  Even with the 16 general-purpose registers in x64, most functions will still need a stack frame for stack-allocated objects (rule #1: don't put on the heap what you can put on the stack).  Call overhead will be reduced by a great deal if you make good use of the new registers.

Quote
Also, things like this in a high-level language like C++, I will never see without the _ASM directive.

I keep saying C++ compilers have their limitations on "optimising".

You should reverse more C++ applications.  I spend a lot of time reverse engineering the ASM output of the various C++ compilers at different optimization levels, and there's very little I've seen that they won't produce.  Despite what you think, the people writing compilers aren't complete morons; they have access to all the information about writing efficient ASM that you do, so to believe they don't try to optimize their output is an insulting mistake.

I will say though, the people who wrote the VB compilers pre-.NET were stupid and should never be hired for serious programming again.  By far the worst assembly generation I've ever seen in my life.

theunknownguy

Quote from: jMerliN on June 25, 2010, 12:36:50 AM
The stack isn't being used to emulate registers.  It's used to hold data; that's its purpose.

Not feeling like discussing today, just getting back from the office.

But this is what I meant about emulating registers on the stack:

From Agner's microarchitecture manual:


It may be possible to avoid stack synchronization µops completely in a critical function if all
function parameters are transferred in registers and all local variables are stored in registers
or with PUSH and POP. This is most realistic with the calling conventions of 64-bit Linux. Any
necessary alignment of the stack can be done with a dummy PUSH instruction in this case.
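
A minimal sketch of the pattern Agner describes (hypothetical, assuming the 64-bit Linux/System V registers; not taken from the manual):


my_leaf_func:
    push rbx                ;stack traffic only via PUSH/POP, so no stack-synch uop is needed
    mov  rbx, rdi           ;first parameter arrives in RDI; locals stay in registers
    lea  rax, [rbx+rbx*2]   ;do the work entirely in registers, result in RAX
    pop  rbx                ;restore; nothing ever reads RSP directly
    ret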


In x64 you don't need them, since you already have the new regs...

And I don't want to discuss the C++ compiler, it's just pointless; they can't even align code automatically to 16 bytes in a critical loop...


PS: "This is most realistic with the calling conventions of 64-bit Linux"... (And Microsoft x64?)  ::)