No-op sequences inserted by MASM for alignment

Started by MichaelW, May 13, 2005, 08:22:31 AM

Jimg

Quote:
I think it is not a fair comparison. We should compare the length of padding, for instance 4 nops with one lea  esp,[esp+00h], and so on...
Yeah, and compare a jump
jmp @f
db x dup (0)
@@:

MichaelW

roticv,

The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

I updated the previous AlignTiming attachment to include a test of jmp near ptr $+5. Here are the results for my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 95 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 95 cycles


I'd like to see the P4 timings.
eschew obfuscation

Jimg

Results on my AMD (Athlon):
1 byte,  nop                      : 29 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 197 cycles
4 bytes, lea  esp,[esp+00h]       : 197 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 901 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 197 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 197 cycles
7 bytes, lea  esp,[esp+00000000h] : 197 cycles

Athlons seem to be really sensitive to alignment, which is what we're testing. By repeating the 5-byte sequence over and over, you get every possible bad alignment there can be: 7 bad ones for every good one. The normal thing we would align for is to jmp to a byte at alignment 4, 8, or 16, because we want the code at that point aligned on one of these good locations. I added two tests where the destination was always on an 8-byte alignment.

   time "0EBh,06,00,00,00,00,00,00","8 bytes, jmp short $+8           "
   time "90h,0EBh,05,00,00,00,00,00","8 bytes, nop,jmp short $+7       "

6 bytes, lea  ebx,[ebx+00000000h] : 197 cycles
7 bytes, lea  esp,[esp+00000000h] : 196 cycles
8 bytes, jmp short $+8            : 198 cycles
8 bytes, nop,jmp short $+7        : 199 cycles

So the moral to this story is don't jump to oddly aligned addresses with an Athlon.  ;)
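
The 7-to-1 figure above can be checked with a little arithmetic. The following is only a sketch, written in Python rather than MASM, and it assumes the first copy of the filler starts on an 8-byte boundary; the helper name start_offsets is made up for illustration.

```python
def start_offsets(filler_len, align, copies):
    """Offset (modulo align) at which each back-to-back copy of the filler starts."""
    return [(i * filler_len) % align for i in range(copies)]

# Eight consecutive copies of a 5-byte filler, measured against 8-byte alignment.
# Each "jmp $+5" lands on the start of the next copy, so these are also the
# offsets of the jump targets.
offsets = start_offsets(filler_len=5, align=8, copies=8)
aligned = sum(1 for off in offsets if off == 0)

print(offsets)                           # start offsets modulo 8
print(aligned, len(offsets) - aligned)   # aligned vs. misaligned starts

# An 8-byte filler never drifts, so every jump target stays on the boundary.
print(start_offsets(filler_len=8, align=8, copies=4))
```

With a 5-byte filler the start offset walks through all eight residues before repeating, which is why only one copy in eight is aligned. The two 8-byte tests avoid that drift.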

MichaelW

To me, the moral is that a jmp of any form is not a reasonable choice for a 5-byte alignment filler, and for general-purpose use, probably not for any size filler.

Changed the tests to this:

    time "90h","1 byte,  nop                     "
    time "8Bh,0FFh","2 bytes, mov  edi,edi            "
    time "8Dh,49h,00","3 bytes, lea  ecx,[ecx]          "
    time "8Dh,64h,24h,00","4 bytes, lea  esp,[esp+00h]      "
    time "05h,00,00,00,00","5 bytes, add  eax,0              "
    time "0E9h,00,00,00,00","5 bytes, jmp  near ptr $+5       "
    time "36h,8Dh,64h,24h,00","5 bytes, lea  esp,ss:[esp+00h]   "
    time "0EBh,03,00,00,00","5 bytes, jmp  short $+5          "
    time "8Dh,9Bh,00,00,00,00","6 bytes, lea  ebx,[ebx+00000000h]"
    time "8Dh,0A4h,24h,00,00,00,00","7 bytes, lea  esp,[esp+00000000h]"
    time "0EBh,06,00,00,00,00,00,00","8 bytes, jmp  short $+8          "
    time "90h,90h,90h,0E9h,00,00,00,00","8 bytes, 3 nops, jmp near ptr $+5"

Results for my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
4 bytes, lea  esp,[esp+00h]       : 95 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 95 cycles
5 bytes, jmp  short $+5           : 205 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles
8 bytes, jmp  short $+8           : 197 cycles
8 bytes, 3 nops, jmp near ptr $+5 : 197 cycles

Edit:
Something that did not occur to me until after I posted: the last timing seems to indicate that the three nops are executing in parallel with the jmp.

roticv

Quote from: MichaelW on May 15, 2005, 09:52:24 PM
roticv,
The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

You can't just multiply like that, in my opinion. Some nops might pair.

Jimg

Michael-
Quote:
To me, the moral is that a jmp of any form is not a reasonable choice for a 5-byte alignment filler, and for general-purpose use, probably not for any size filler.
I agree in principle, especially on a P4. I was mostly saying that timing a jmp to a non-aligned address was not an equivalent comparison, since the whole purpose of using align is to land on an aligned address. 901 cycles is a ridiculous figure for the test.

I have found, however, that on my screwy Athlon, in the usual loop-many-times timing tests, jmps followed by zeros are often quicker. There is no reason they should be, but in actual testing they often seem to be. Hopefully this doesn't say anything about the validity of this type of 'loop a million times' testing.

MichaelW

Quote from: roticv on May 16, 2005, 12:46:47 PM
Quote from: MichaelW on May 15, 2005, 09:52:24 PM
roticv,
The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

You can't just multiply like that, in my opinion. Some nops might pair.

I agree. I failed to consider pairing.

More tests:

    time "90h","1 byte,  nop                     "
    time "8Bh,0FFh","2 bytes, mov  edi,edi            "
    time "90h,90h","2 bytes, nop nop                 "
    time "8Dh,49h,00","3 bytes, lea  ecx,[ecx]          "
    time "90h,90h,90h","3 bytes, nop nop nop             "
    time "8Dh,64h,24h,00","4 bytes, lea  esp,[esp+00h]      "
    time "90h,90h,90h,90h","4 bytes, nop nop nop nop         "
    time "90h,8Dh,49h,00","4 bytes, nop lea  ecx,[ecx]      "
    time "05h,00,00,00,00","5 bytes, add  eax,0              "
    time "0E9h,00,00,00,00","5 bytes, jmp  near ptr $+5       "
    time "36h,8Dh,64h,24h,00","5 bytes, lea  esp,ss:[esp+00h]   "
    time "0EBh,03,00,00,00","5 bytes, jmp  short $+5          "
    time "90h,90h,8Dh,49h,00","5 bytes, nop nop lea  ecx,[ecx]  "
    time "90h,8Dh,64h,24h,00","5 bytes, nop lea  esp,[esp+00h]  "
    time "8Dh,9Bh,00,00,00,00","6 bytes, lea  ebx,[ebx+00000000h]"
    time "90h,36h,8Dh,64h,24h,00","6 bytes, nop lea esp,ss:[esp+00h]"
    time "8Dh,0A4h,24h,00,00,00,00","7 bytes, lea  esp,[esp+00000000h]"
    time "0EBh,06,00,00,00,00,00,00","8 bytes, jmp  short $+8          "
    time "90h,90h,90h,0E9h,00,00,00,00","8 bytes, 3 nops, jmp near ptr $+5"

Results on my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
2 bytes, nop nop                  : 96 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
3 bytes, nop nop nop              : 146 cycles
4 bytes, lea  esp,[esp+00h]       : 95 cycles
4 bytes, nop nop nop nop          : 196 cycles
4 bytes, nop lea  ecx,[ecx]       : 95 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
5 bytes, jmp  short $+5           : 205 cycles
5 bytes, nop nop lea  ecx,[ecx]   : 146 cycles
5 bytes, nop lea  esp,[esp+00h]   : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
6 bytes, nop lea esp,ss:[esp+00h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles
8 bytes, jmp  short $+8           : 197 cycles
8 bytes, 3 nops, jmp near ptr $+5 : 197 cycles

AeroASM

Quote from: MichaelW on May 14, 2005, 09:27:36 AM
Hi MazeGen,

I didn't recognize the problem until Mirno pointed it out. I just tested 6.15.8803 and 7.00.9466, and both generated the same "add eax,0". Surely Microsoft knows about this ??

I found two 5-byte encodings that do not affect the flags, but both might be somewhat slow and ML will not encode the second (although this might be true for some of the other encodings).

jmp   near ptr $
lea   esp,ss:[esp+0]
db    36h,8dh,64h,24h,00h

00401001                    loc_00401001:
00401001 E9FBFFFFFF             jmp     loc_00401001
00401006 8D2424                 lea     esp,[esp]
00401009 368D642400             lea     esp,ss:[esp]




Shouldn't the first one be:

00401001 E900000000          jmp loc_00401006
00401006                       loc_00401006

MichaelW

Yes, it should be.

Correction:

No, it should not be. The value of the location counter ($) is the address of the current instruction. When the instruction executes, (E)IP will be set to the address of the next instruction, and the processor will make the jump by adding the encoded displacement to (E)IP. Since the destination, which is the address of the jmp instruction, is 5 less than the value (E)IP will have when the instruction executes, the encoded displacement must be -5.
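
The displacement arithmetic can be worked through in a few lines. This is a sketch in Python rather than MASM; the helper name encode_jmp_rel32 is made up for illustration, and the address is the one from the disassembly quoted earlier in the thread.

```python
import struct

def encode_jmp_rel32(instr_addr, target_addr):
    """Encode a 5-byte near jmp (opcode E9h) located at instr_addr."""
    next_ip = instr_addr + 5                  # (E)IP once the jmp has been fetched
    disp = target_addr - next_ip              # displacement is relative to next_ip
    return b"\xE9" + struct.pack("<i", disp)  # signed 32-bit, little-endian

here = 0x00401001

# jmp near ptr $   : the target is the jmp itself, so the displacement is -5
print(encode_jmp_rel32(here, here).hex().upper())      # E9FBFFFFFF

# jmp near ptr $+5 : the target is the next instruction, so the displacement is 0
print(encode_jmp_rel32(here, here + 5).hex().upper())  # E900000000
```

So E9FBFFFFFF is the correct encoding for jmp near ptr $, and E900000000 is what jmp near ptr $+5 produces.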

Mark_Larson

Quote from: MichaelW on May 16, 2005, 08:21:44 PM
Quote from: roticv on May 16, 2005, 12:46:47 PM
Quote from: MichaelW on May 15, 2005, 09:52:24 PM
roticv,
The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

You can't just multiply like that, in my opinion. Some nops might pair.

I agree. I failed to consider pairing.


I can get rid of the pairing on the three byte nops using black magic.  I learned a neat trick to make instructions longer, so I can make the nop 3 bytes (which is what you want), and only 1 instruction.  I learned this trick at Centaur.


db 66h,66h
nop


  The 66h bytes are operand-size prefixes, but since they don't affect the nop, nothing happens from them. The whole thing is a single 3-byte instruction, so you won't get pairing problems. Let me know if it works OK and whether it solves your problem.

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

MichaelW

Hi Mark,

Adding the prefixes doubles the execution time on my P3. Is this a pairing effect, or is it because the prefixes themselves add to the execution time?

1 byte,  nop                      : 46 cycles
2 bytes, mov  edi,edi             : 95 cycles
2 bytes, nop nop                  : 96 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
3 bytes, nop nop nop              : 146 cycles
3 bytes, 66h 66h nop              : 298 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
4 bytes, nop nop nop nop          : 196 cycles
4 bytes, nop lea  ecx,[ecx]       : 96 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
5 bytes, jmp  short $+5           : 204 cycles
5 bytes, nop nop lea  ecx,[ecx]   : 147 cycles
5 bytes, nop lea  esp,[esp+00h]   : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 95 cycles
6 bytes, nop lea esp,ss:[esp+00h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 95 cycles
8 bytes, jmp  short $+8           : 197 cycles
8 bytes, 3 nops, jmp near ptr $+5 : 197 cycles
8 bytes, 66h 66h nop, jmp near ptr $+5 : 297 cycles


So now you are going to add Black Magic to your Resume?

Mark_Larson

Quote from: MichaelW on May 20, 2005, 06:14:32 PM
Hi Mark,

Adding the prefixes doubles the execution time on my P3. Is this a pairing effect, or is it because the prefixes themselves add to the execution time?


Darn, I was hoping it wouldn't. I wonder whether adding the prefixes to another "do nothing" instruction would avoid such a poor execution time. I am guessing it's related to Intel having problems decoding it, since it technically isn't a valid instruction.



Quote from: MichaelW on May 20, 2005, 06:14:32 PM
So now you are going to add Black Magic to your Resume?


hehee.  I was feeling silly today, so I made the comment about black magic.  ;)  Yesterday a friend thought that when she logged out of her computer and the screen blanked, it was going into standby and saving power.  I had to explain to her that that is just the monitor blanking and the rest of the system is still running at full power.  So I showed her how to go into standby to save power.  I realized that a lot of the things that Windows XP and computers do are probably black magic to people like her.  Thus I made the comment this morning about black magic.  :)

  However I did think adding the prefixes is a neat trick.  Too bad it's so slow.  I wonder if there is a way to extend an instruction with a valid prefix for that instruction to get rid of pairing, while at the same time still having a valid instruction.

dioxin

Mark,
   <<I can get rid of the pairing on the three byte nops using black magic>>

   I thought (at least on the Athlon) that NOPs weren't "executed" anyway but were removed from the instruction stream before consuming resources.

From the Athlon Optimisation Guide:
Quote:
These instructions {NOPs} have an effective latency of that which is listed {zero}. They map to internal NOPs that can be executed at a rate of three per cycle and do not occupy execution resources.

   So pairing shouldn't be a problem for 1,2 or 3 byte NOPs. I don't know if Pentiums behave differently.

Paul.

Mark_Larson


  Dioxin, the Intel P4 optimization manual lists a NOP as taking 0.5 cycles (1 cycle on a Prescott).  It's great that AMD implemented it that way :)   The other issue is that even though NOPs are free on AMD (for up to 3 NOPs), ALIGN for the most part does not use NOPs.

MichaelW

For the P3, 100 nops take 45 cycles, so it would seem to be about 0.5 cycles per nop. And on a P3 there is no penalty for a single segment override prefix.
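
The same division can be applied to the longer nop runs to see whether the per-nop cost holds. This is a rough sketch in Python, not part of the test program: it assumes every timed block holds 100 copies of the filler (inferred from the "100 nops" remark), uses the P3 figures posted earlier in the thread, and the helper name is made up for illustration.

```python
REPS = 100  # assumption: each timed block holds 100 copies of the filler

# (instructions per filler, measured cycles) from the P3 results posted earlier
p3 = {
    "nop":               (1, 45),
    "nop nop":           (2, 96),
    "nop nop nop":       (3, 146),
    "nop nop nop nop":   (4, 196),
    "lea esp,[esp+00h]": (1, 95),
}

def cycles_per_instruction(count, cycles, reps=REPS):
    """Average cycles spent on each instruction in the timed block."""
    return cycles / (count * reps)

for name, (count, cycles) in p3.items():
    cpi = cycles_per_instruction(count, cycles)
    print(f"{name:18s} {cpi:.2f} cycles/instruction")
```

On these figures the nop cost stays a little under 0.5 cycles each as the run gets longer, while a single lea costs about as much as two nops.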

From the Intel P1 Developer's Manual, Volume 3:

The NOP instruction is an alias mnemonic for the XCHG (E)AX,(E)AX instruction.
