Title: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 13, 2005, 08:22:31 AM

Just to satisfy my curiosity.


; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; A crude app to determine the no-op sequences inserted by MASM
; for alignment, 1 to 15 byte lengths.
;
; Assemble and dis-assemble to see sequences.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .486                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive
 
    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm

    pad MACRO cnt
      REPEAT cnt
        clc
      ENDM
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: 
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    pad 1
    align 2
    stc
    pad 3
    align 4
    stc
    align 4
    stc
    pad 7
    align 8
    stc
    pad 2
    align 8
    stc
    pad 1
    align 8
    stc
    align 8
    stc
    pad 7
    align 16
    stc
    pad 6
    align 16
    stc
    pad 5
    align 16
    stc
    pad 4
    align 16
    stc
    pad 3
    align 16
    stc
    pad 2
    align 16
    stc
    pad 1
    align 16
    stc
    align 16
    stc    
    jz    @F
  @@:
    mov   eax,input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

; -------------------------------------------------------------
; No-op sequences inserted for align, MASM 6.14, 1 to 15 bytes:
; -------------------------------------------------------------

00401001 90                     nop

00401006 8BFF                   mov     edi,edi

00401009 8D4900                 lea     ecx,[ecx]

00401014 8D642400               lea     esp,[esp]

0040101B 0500000000             add     eax,0

00401022 8D9B00000000           lea     ebx,[ebx]

00401029 8DA42400000000         lea     esp,[esp]

00401038 8DA42400000000         lea     esp,[esp]
0040103F 90                     nop

00401047 8DA42400000000         lea     esp,[esp]
0040104E 8BFF                   mov     edi,edi

00401056 8DA42400000000         lea     esp,[esp]
0040105D 8D4900                 lea     ecx,[ecx]

00401065 8DA42400000000         lea     esp,[esp]
0040106C 8D642400               lea     esp,[esp]

00401074 8DA42400000000         lea     esp,[esp]
0040107B 0500000000             add     eax,0

00401083 8DA42400000000         lea     esp,[esp]
0040108A 8D9B00000000           lea     ebx,[ebx]

00401092 8DA42400000000         lea     esp,[esp]
00401099 8DA42400000000         lea     esp,[esp]

004010A1 8DA42400000000         lea     esp,[esp]
004010A8 8DA42400000000         lea     esp,[esp]
004010AF 90                     nop
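
The listing suggests a simple pattern: pads of 1 to 7 bytes each get a single instruction, and longer pads start with the 7-byte lea and finish with whatever covers the remainder. A minimal Python sketch of that pattern, assuming the rule holds exactly as the listing shows it; the names `FILLERS`, `masm_fill` and `touches_flags` are made up for illustration and are not part of MASM:

```python
# Filler encodings copied from the MASM 6.14 disassembly above (1 to 7 bytes).
FILLERS = {
    1: "90",              # nop
    2: "8BFF",            # mov edi,edi
    3: "8D4900",          # lea ecx,[ecx]      (8-bit zero displacement)
    4: "8D642400",        # lea esp,[esp]      (8-bit zero displacement)
    5: "0500000000",      # add eax,0
    6: "8D9B00000000",    # lea ebx,[ebx]      (32-bit zero displacement)
    7: "8DA42400000000",  # lea esp,[esp]      (32-bit zero displacement)
}


def masm_fill(pad_len):
    """Return the filler encodings the listing shows for a pad of pad_len bytes.

    Assumption: 7-byte fillers are emitted first, then one filler for the rest.
    """
    if not 0 <= pad_len <= 15:
        raise ValueError("the listing only covers pads of 0 to 15 bytes")
    parts = []
    while pad_len > 7:
        parts.append(FILLERS[7])
        pad_len -= 7
    if pad_len:
        parts.append(FILLERS[pad_len])
    return parts


def touches_flags(pad_len):
    """True when the pad contains add eax,0, the only filler that writes flags."""
    return FILLERS[5] in masm_fill(pad_len)


if __name__ == "__main__":
    for n in range(1, 16):
        print(n, " ".join(masm_fill(n)), "<- writes flags" if touches_flags(n) else "")
```

Under that assumption, only pads of 5 and 12 bytes contain the add eax,0 filler.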

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Mirno on May 13, 2005, 11:01:42 AM

The interesting one is "add eax, 0", of all the "NOP" instructions it's the only one which has an effect on the flags....
I guess it's worth remembering that the state may not be the same after an align...

Mirno

Title: Re: No-op sequences inserted by MASM for alignment
Post by: P1 on May 13, 2005, 01:24:02 PM

Quote from: Mirno on May 13, 2005, 11:01:42 AM
The interesting one is "add eax, 0", of all the "NOP" instructions it's the only one which has an effect on the flags....
I guess it's worth remembering that the state may not be the same after an align...

Mirno

Any chance this is a microprocessor-dependent bug? Have you done any testing for this behavior?

Regards, P1 :8)

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Mirno on May 13, 2005, 03:19:05 PM

It's not a processor bug, it's an assembler bug. NOPs are supposed to have no effect on execution, and the add instruction does have one (it affects the flags, if nothing else).

.486
.model flat, stdcall
option casemap :none 

include \masm32\include\windows.inc 
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc

includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib

pad MACRO cnt
  REPEAT cnt
    nop
  ENDM
ENDM

.data
msg db "You should see this", 0

.code
start:
  mov eax, 1
  cmp eax, 1

  pad 2 ; <- REPLACE WITH 3
ALIGN 16

jne @F
  invoke MessageBox, NULL, ADDR msg, ADDR msg, MB_OK
@@:

  invoke ExitProcess, 0
end start

If you change the pad line to 3, the message won't display, the only difference being that the "NOPs" inserted by the align directive are not entirely benign.
The code here is contrived, but I could imagine someone somewhere being bitten by this.

Imagine that you've got working code, and then change something (even in another function), that pushes the align a bit causing it to use the "add eax, 0" NOP and your code breaks. It'd be really nasty to find. Although writing code which has the compare & jump on opposite sides of an align directive is morally wrong!
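
To make the failure concrete, here is a small Python model of the zero flag across the test program. It is only a sketch: it tracks ZF and nothing else, and the helper names are invented for illustration. It also relies on the usual encodings (5 bytes for mov eax,1 and 3 bytes for cmp eax,1), which would put the align at offset 10 with "pad 2" (6 filler bytes) and at offset 11 with "pad 3" (5 filler bytes, so "add eax, 0").

```python
MASK32 = 0xFFFFFFFF


def zf_after_cmp(a, b):
    """ZF after `cmp a, b`: set when the 32-bit difference is zero."""
    return ((a - b) & MASK32) == 0


def zf_after_add(a, b):
    """ZF after `add a, b`: set when the 32-bit sum is zero."""
    return ((a + b) & MASK32) == 0


def message_shown(filler_is_add_eax_0):
    """Model the test program; return True when MessageBox would be reached."""
    eax = 1                        # mov eax, 1
    zf = zf_after_cmp(eax, 1)      # cmp eax, 1   -> ZF = 1
    if filler_is_add_eax_0:
        zf = zf_after_add(eax, 0)  # add eax, 0   -> sum is 1, so ZF = 0
    # jne @F jumps over the MessageBox call when ZF = 0
    return zf


if __name__ == "__main__":
    print("pad 2 (lea filler):", message_shown(False))  # True
    print("pad 3 (add filler):", message_shown(True))   # False
```

The lea and nop fillers leave ZF as cmp set it; add eax,0 recomputes ZF from eax itself, which is non-zero here.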

Mirno

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MazeGen on May 13, 2005, 09:32:48 PM

YAMLB!

Michael, thanks for reporting that hidden bug :thumbu

(Yet Another ML Bug!)

Did you think about reporting that to Microsoft? They could fix it in ML 8.0.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 14, 2005, 09:27:36 AM

Hi MazeGen,

I didn't recognize the problem until Mirno pointed it out. I just tested 6.15.8803 and 7.00.9466, and both generated the same "add eax,0". Surely Microsoft knows about this ??

I found two 5-byte encodings that do not affect the flags, but both might be somewhat slow and ML will not encode the second (although this might be true for some of the other encodings).

jmp   near ptr $
lea   esp,ss:[esp+0]
db    36h,8dh,64h,24h,00h

00401001                    loc_00401001:
00401001 E9FBFFFFFF             jmp     loc_00401001
00401006 8D2424                 lea     esp,[esp]
00401009 368D642400             lea     esp,ss:[esp]

Title: Re: No-op sequences inserted by MASM for alignment
Post by: roticv on May 14, 2005, 12:36:46 PM

Maybe masm is being pragmatic. When you use esp, the processor already knows you are referring to the segment ss, so 36h as a prefix has no meaning.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: hutch-- on May 14, 2005, 01:07:53 PM

It does the same with added displacements of 00000000h.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: zooba on May 15, 2005, 03:24:33 AM

What's wrong with padding with this?

lea esp, [esp]
nop

I'm guessing the longer lea command is faster than straight nops because the processor skips through it in one step, but a jump (even to nowhere) is going to flush stuff on newer processors, isn't it?

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 15, 2005, 06:47:27 AM

Investigating further, I coded an app that measures the cycle counts for the instructions that MASM uses for alignment, along with one of the 5-byte alternatives. Obviously, timing a series of jumps with a displacement of zero will not yield useful results (or at least it was obvious to me... after I tried it :green).

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive
 
    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    time MACRO definition,caption
      sz SIZESTR <definition>
      bytes SUBSTR <definition>,2,sz-2
      counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        REPEAT REPEAT_COUNT
          db bytes
        ENDM
      counter_end
      mov   ebx,eax
      print chr$(caption," : ")
      print ustr$(ebx)
      print chr$(" cycles",13,10)
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start: 
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    ; "+00h" indicates a one-byte displacement of zero.
    ; "+00000000h" indicates a four-byte displacement of zero.
    
    time "90h","1 byte,  nop                     "
    time "8Bh,0FFh","2 bytes, mov  edi,edi            "
    time "8Dh,49h,00","3 bytes, lea  ecx,[ecx]          "
    time "8Dh,64h,24h,00","4 bytes, lea  esp,[esp+00h]      "
    time "05h,00,00,00,00","5 bytes, add  eax,0              "
    time "36h,8Dh,64h,24h,00","5 bytes, lea  esp,ss:[esp+00h]   "
    time "8Dh,9Bh,00,00,00,00","6 bytes, lea  ebx,[ebx+00000000h]"
    time "8Dh,0A4h,24h,00,00,00,00","7 bytes, lea  esp,[esp+00000000h]"

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

The results for my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 95 cycles
4 bytes, lea  esp,[esp+00h]       : 95 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 95 cycles

[attachment deleted by admin]

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Jimg on May 15, 2005, 12:45:55 PM

Interesting. Results for an athlon xp 3000

1 byte,  nop                      : 29 cycles
2 bytes, mov  edi,edi             : 96 cycles
3 bytes, lea  ecx,[ecx]           : 198 cycles
4 bytes, lea  esp,[esp+00h]       : 198 cycles
5 bytes, add  eax,0               : 96 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 198 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 197 cycles
7 bytes, lea  esp,[esp+00000000h] : 198 cycles

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Mirno on May 15, 2005, 03:33:33 PM

XP3000+ AMD64.

1 byte,  nop                      : 27 cycles
2 bytes, mov  edi,edi             : 94 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
5 bytes, add  eax,0               : 94 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles

It'd be interesting to time a jump over the same region.

Mirno

Title: Re: No-op sequences inserted by MASM for alignment
Post by: AeroASM on May 15, 2005, 03:46:13 PM

Pentium M 1.5GHz

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Mark Jones on May 15, 2005, 04:53:21 PM

I get the same as Jim with an AMD XP 1800+

...so what's the moral of the story here, instructions involving brackets run much slower on AMD?

Title: Re: No-op sequences inserted by MASM for alignment
Post by: roticv on May 15, 2005, 05:51:16 PM

I think it is not a fair comparison. We should compare equal lengths of padding: for instance, compare 4 nops with one lea esp,[esp+00h], and so on...

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Jimg on May 15, 2005, 06:21:40 PM

Quote: I think it is not a fair comparison. We should compare the length of padding. Like for instance compare 4 nops with one lea esp,[esp+00h] and so on...

Yeah, and compare a jump
jmp @f
db x dup (0)
@@:

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 15, 2005, 09:52:24 PM

roticv,

The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

I updated the previous AlignTiming attachment to include a test of jmp near ptr $+5. Here are the results for my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 95 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 95 cycles

I'd like to see the P4 timings.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Jimg on May 16, 2005, 12:44:50 AM

amd
1 byte, nop : 29 cycles
2 bytes, mov edi,edi : 95 cycles
3 bytes, lea ecx,[ecx] : 197 cycles
4 bytes, lea esp,[esp+00h] : 197 cycles
5 bytes, add eax,0 : 95 cycles
5 bytes, jmp near ptr $+5 : 901 cycles
5 bytes, lea esp,ss:[esp+00h] : 197 cycles
6 bytes, lea ebx,[ebx+00000000h] : 197 cycles
7 bytes, lea esp,[esp+00000000h] : 197 cycles

Athlons seem to be really sensitive to alignment, which is what we're testing. By repeating the 5-byte sequence over and over you get every possible bad alignment of the jump destination: 7 bad ones for every good one. The normal reason to align is that we want the code we jmp to sitting on a 4, 8 or 16 byte boundary. I added two tests where the destination was always on an 8 byte alignment.

time "0EBh,06,00,00,00,00,00,00","8 bytes, jmp short $+8 "
time "90h,0EBh,05,00,00,00,00,00","8 bytes, nop,jmp short $+7 "

6 bytes, lea ebx,[ebx+00000000h] : 197 cycles
7 bytes, lea esp,[esp+00000000h] : 196 cycles
8 bytes, jmp short $+8 : 198 cycles
8 bytes, nop,jmp short $+7 : 199 cycles

So the moral to this story is don't jump to oddly aligned addresses with an Athlon. :wink

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 16, 2005, 05:59:13 AM

To me, the moral is that a jmp of any form is not a reasonable choice for a 5-byte alignment filler, and for general-purpose use, probably not for any size filler.

Changed the tests to this:

    time "90h","1 byte,  nop                     "
    time "8Bh,0FFh","2 bytes, mov  edi,edi            "
    time "8Dh,49h,00","3 bytes, lea  ecx,[ecx]          "
    time "8Dh,64h,24h,00","4 bytes, lea  esp,[esp+00h]      "
    time "05h,00,00,00,00","5 bytes, add  eax,0              "
    time "0E9h,00,00,00,00","5 bytes, jmp  near ptr $+5       "
    time "36h,8Dh,64h,24h,00","5 bytes, lea  esp,ss:[esp+00h]   "
    time "0EBh,03,00,00,00","5 bytes, jmp  short $+5          "
    time "8Dh,9Bh,00,00,00,00","6 bytes, lea  ebx,[ebx+00000000h]"
    time "8Dh,0A4h,24h,00,00,00,00","7 bytes, lea  esp,[esp+00000000h]"
    time "0EBh,06,00,00,00,00,00,00","8 bytes, jmp  short $+8          "
    time "90h,90h,90h,0E9h,00,00,00,000","8 bytes, 3 nops, jmp near ptr $+5"

Results for my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
4 bytes, lea  esp,[esp+00h]       : 95 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 95 cycles
5 bytes, jmp  short $+5           : 205 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles
8 bytes, jmp  short $+8           : 197 cycles
8 bytes, 3 nops, jmp near ptr $+5 : 197 cycles

Edit:
Something that did not occur to me until after I posted, the last timing seems to me to indicate that the three nops are executing in parallel with the jmp.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: roticv on May 16, 2005, 12:46:47 PM

Quote from: MichaelW on May 15, 2005, 09:52:24 PM
roticv,
The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

You can't just multiply like that in my opinion. Some nop might pair.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Jimg on May 16, 2005, 02:52:51 PM

Michael-

Quote: To me, the moral is that a jmp of any form is not a reasonable choice for a 5-byte alignment filler, and for general-purpose use, probably not for any size filler.

I agree in principle, especially on a P4. I was mostly saying that the jmp to a non-aligned address in the test was not an equivalent comparison, since aligning the destination is the whole purpose of using align. 901 cycles is a ridiculous figure for the test.

I have found, however, that on my screwy Athlon, in the usual loop-many-times timing tests, jmps followed by zeros are often quicker. There is no reason they should be, but in actual testing they often seem to be. Hopefully this doesn't say anything about the validity of this type of 'loop a million times' testing.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 16, 2005, 08:21:44 PM

Quote from: roticv on May 16, 2005, 12:46:47 PM
Quote from: MichaelW on May 15, 2005, 09:52:24 PM
roticv,
The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

You can't just multiply like that in my opinion. Some nop might pair.

I agree. I failed to consider pairing.

More tests:

    time "90h","1 byte,  nop                     "
    time "8Bh,0FFh","2 bytes, mov  edi,edi            "
    time "90h,90h","2 bytes, nop nop                 "
    time "8Dh,49h,00","3 bytes, lea  ecx,[ecx]          "
    time "90h,90h,90h","3 bytes, nop nop nop             "
    time "8Dh,64h,24h,00","4 bytes, lea  esp,[esp+00h]      "
    time "90h,90h,90h,90h","4 bytes, nop nop nop nop         "
    time "90h,8Dh,49h,00","4 bytes, nop lea  ecx,[ecx]      "
    time "05h,00,00,00,00","5 bytes, add  eax,0              "
    time "0E9h,00,00,00,00","5 bytes, jmp  near ptr $+5       "
    time "36h,8Dh,64h,24h,00","5 bytes, lea  esp,ss:[esp+00h]   "
    time "0EBh,03,00,00,00","5 bytes, jmp  short $+5          "
    time "90h,90h,8Dh,49h,00","5 bytes, nop nop lea  ecx,[ecx]  "
    time "90h,8Dh,64h,24h,00","5 bytes, nop lea  esp,[esp+00h]  "
    time "8Dh,9Bh,00,00,00,00","6 bytes, lea  ebx,[ebx+00000000h]"
    time "90h,36h,8Dh,64h,24h,00","6 bytes, nop lea esp,ss:[esp+00h]"
    time "8Dh,0A4h,24h,00,00,00,00","7 bytes, lea  esp,[esp+00000000h]"
    time "0EBh,06,00,00,00,00,00,00","8 bytes, jmp  short $+8          "
    time "90h,90h,90h,0E9h,00,00,00,000","8 bytes, 3 nops, jmp near ptr $+5"

Results on my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
2 bytes, nop nop                  : 96 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
3 bytes, nop nop nop              : 146 cycles
4 bytes, lea  esp,[esp+00h]       : 95 cycles
4 bytes, nop nop nop nop          : 196 cycles
4 bytes, nop lea  ecx,[ecx]       : 95 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
5 bytes, jmp  short $+5           : 205 cycles
5 bytes, nop nop lea  ecx,[ecx]   : 146 cycles
5 bytes, nop lea  esp,[esp+00h]   : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
6 bytes, nop lea esp,ss:[esp+00h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles
8 bytes, jmp  short $+8           : 197 cycles
8 bytes, 3 nops, jmp near ptr $+5 : 197 cycles

Title: Re: No-op sequences inserted by MASM for alignment
Post by: AeroASM on May 20, 2005, 07:29:45 AM

Quote from: MichaelW on May 14, 2005, 09:27:36 AM
Hi MazeGen,

I didn't recognize the problem until Mirno pointed it out. I just tested 6.15.8803 and 7.00.9466, and both generated the same "add eax,0". Surely Microsoft knows about this ??

I found two 5-byte encodings that do not affect the flags, but both might be somewhat slow and ML will not encode the second (although this might be true for some of the other encodings).

jmp   near ptr $
lea   esp,ss:[esp+0]
db    36h,8dh,64h,24h,00h

00401001                    loc_00401001:
00401001 E9FBFFFFFF             jmp     loc_00401001
00401006 8D2424                 lea     esp,[esp]
00401009 368D642400             lea     esp,ss:[esp]

Shouldn't the first one be:

00401001 E900000000 jmp loc_00401006
00401006 loc_00401006

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 20, 2005, 09:29:16 AM

Yes, it should be.

Correction:

No it should not be. The value of the location counter ($) is the address of the current instruction. When the instruction executes (E)IP will be set to the address of the next instruction, and the processor will make the jump by adding the encoded displacement to (E)IP. Since the destination, which is the address of the jmp instruction, is 5 less than the value (E)IP will have when the instruction executes, the encoded displacement must be -5.
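
The arithmetic can be checked with a few lines of Python. This is just a sketch of the rel32 rule described above (displacement = target minus the address of the next instruction); `jmp_rel32` is a made-up helper name, not something from MASM or any library.

```python
import struct


def jmp_rel32(instr_addr, target_addr):
    """Encode a near jmp (opcode E9) located at instr_addr that lands on target_addr."""
    next_ip = instr_addr + 5     # opcode byte + 4 displacement bytes
    disp = target_addr - next_ip
    return b"\xE9" + struct.pack("<i", disp)  # little-endian signed 32-bit


if __name__ == "__main__":
    # jmp near ptr $ : the target is the jmp itself, so the displacement is -5
    print(jmp_rel32(0x00401001, 0x00401001).hex().upper())  # E9FBFFFFFF
    # jmp near ptr $+5 : the target is the next instruction, displacement 0
    print(jmp_rel32(0x00401001, 0x00401006).hex().upper())  # E900000000
```

So E900000000 is the encoding for a jump to $+5, which is the form used in the later timing tests, while a jump to $ needs FBFFFFFF.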

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Mark_Larson on May 20, 2005, 03:30:28 PM

Quote from: MichaelW on May 16, 2005, 08:21:44 PM
Quote from: roticv on May 16, 2005, 12:46:47 PM
Quote from: MichaelW on May 15, 2005, 09:52:24 PM
roticv,
The loop is timing back to back executions of the instructions, so you can just multiply the cycle count by the number of nops. For a P3 or Pentium M the break-even point is ~2 nops, and for an Athlon XP ~4 nops.

You can't just multiply like that in my opinion. Some nop might pair.

I agree. I failed to consider pairing.

I can get rid of the pairing on the three byte nops using black magic. I learned a neat trick to make instructions longer, so I can make the nop 3 bytes (which is what you want), and only 1 instruction. I learned this trick at Centaur.

db 66h,66h
nop

The 66h are prefixes, but since they don't affect the nop nothing happens from them. The whole instruction is 3 bytes, so you won't get pairing problems. Let me know if it works ok, and if it solves your problem.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 20, 2005, 06:14:32 PM

Hi Mark,

Adding the prefixes doubles the execution time on my P3. Is this a pairing effect, or is it because the prefixes themselves add to the execution time?

1 byte,  nop                      : 46 cycles
2 bytes, mov  edi,edi             : 95 cycles
2 bytes, nop nop                  : 96 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
3 bytes, nop nop nop              : 146 cycles
3 bytes, 66h 66h nop              : 298 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
4 bytes, nop nop nop nop          : 196 cycles
4 bytes, nop lea  ecx,[ecx]       : 96 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, jmp  near ptr $+5        : 223 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
5 bytes, jmp  short $+5           : 204 cycles
5 bytes, nop nop lea  ecx,[ecx]   : 147 cycles
5 bytes, nop lea  esp,[esp+00h]   : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 95 cycles
6 bytes, nop lea esp,ss:[esp+00h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 95 cycles
8 bytes, jmp  short $+8           : 197 cycles
8 bytes, 3 nops, jmp near ptr $+5 : 197 cycles
8 bytes, 66h 66h nop, jmp near ptr $+5 : 297 cycles

So now you are going to add Black Magic to your Resume?

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Mark_Larson on May 20, 2005, 07:48:08 PM

Quote from: MichaelW on May 20, 2005, 06:14:32 PM
Hi Mark,

Adding the prefixes doubles the execution time on my P3. Is this a pairing effect, or is it because the prefixes themselves add to the execution time?

darn I was hoping it wouldn't. I wonder if adding it to another "do nothing" instruction wouldn't give such poor execution time. I am guessing it's related to the fact Intel has problems decoding it since it technically isn't a valid instruction.

Quote from: MichaelW on May 20, 2005, 06:14:32 PM
So now you are going to add Black Magic to your Resume?

hehee. I was feeling silly today, so made the comment about black magic. ;) I had a friend yesterday think that when she logged out of her computer and the screen blanked, it was going into standby and saving power. I had to explain to her that that is just the monitor blanking and the rest of the system is still running at full power. So I showed her how to go into standby to save power. I realized that a lot of the things that Windows XP and computers do are probably black magic to people like her. Thus I made the comment this morning about black magic. :)

However I did think adding the prefixes is a neat trick. Too bad it's so slow. I wonder if there is a way to extend an instruction with a valid prefix for that instruction to get rid of pairing, while at the same time still having a valid instruction.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: dioxin on May 20, 2005, 09:32:30 PM

Mark,
<<I can get rid of the pairing on the three byte nops using black magic>>

I thought (at least on the Athlon) that NOPs weren't "executed" anyway but were removed from the instruction stream before consuming resources.

From the Athlon Optimisation Guide:

Quote: These instructions {NOPs} have an effective latency of that which is listed {zero}. They map to internal NOPs that can be executed at a rate of three per cycle and do not occupy execution resources.

So pairing shouldn't be a problem for 1,2 or 3 byte NOPs. I don't know if Pentiums behave differently.

Paul.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: Mark_Larson on May 20, 2005, 09:46:49 PM

Dioxin, Intel P4 optimization manual lists a NOP as taking 0.5 cycles ( 1 cycle if you are a Prescott). That is great that AMD implemented it that way :) The other issue is even though NOPs are free on AMD ( for up to 3 NOPs), ALIGN for the most part does not use NOPs.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MichaelW on May 21, 2005, 07:37:03 AM

For the P3 100 nops take 45 cycles, so it would seem to be 0.5 cycles per nop. And on a P3 there is no penalty for a single segment override prefix.
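
For reference, the per-instruction figures fall out of the reported numbers by dividing by REPEAT_COUNT. The Python sketch below assumes, as the post above does, that each reported count covers one pass through the 100 back-to-back copies; the values are the P3 results posted earlier in the thread, and the variable names are only illustrative.

```python
REPEAT_COUNT = 100  # copies of each sequence per timed pass, as in the test app

# Reported P3 cycle counts from the thread (cycles per 100 copies).
p3_cycles = {
    "nop": 45,
    "nop nop nop": 146,
    "lea ecx,[ecx]": 94,
    "66h 66h nop": 298,
}


def cycles_per_copy(reported):
    """Average cycles for one copy of the timed sequence."""
    return reported / REPEAT_COUNT


if __name__ == "__main__":
    for name, cycles in p3_cycles.items():
        print(f"{name:15s} {cycles_per_copy(cycles):.2f} cycles per copy")
```

Read this way, three 1-byte nops cost about 1.46 cycles on the P3 against about 0.94 for the 3-byte lea, and the prefixed nop about 2.98.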

From the Intel P1 Developer's Manual, Volume 3:

The NOP instruction is an alias mnemonic for the XCHG (E)AX,(E)AX instruction.

Title: Re: No-op sequences inserted by MASM for alignment
Post by: MazeGen on May 21, 2005, 11:42:12 AM

BTW ;)

From the Intel Optimization Reference manual P4 (248966-011)

Quote
The one byte NOP, xchg EAX,EAX, has special hardware support.
Although it still consumes a µop and its accompanying resources, the
dependence upon the old value of EAX is removed. Therefore, this µop
can be executed at the earliest possible opportunity, reducing the
number of outstanding instructions. This is the lowest cost NOP
possible.

Title: Re: fwiw, MASM's ALIGN is broken in more than one way
Post by: nasm64developer on May 22, 2005, 09:14:33 PM

> The interesting one is "add eax, 0", of all
> the "NOP" instructions it's the only one which
> has an effect on the flags...

Actually, MASM's ALIGN directive is broken even
worse than that. For example, try this with good
old ml.exe version 6.15:

s segment para use16

mov ax,1

align 16
nop
align 1
hlt

align 16
nop
align 2
hlt

align 16
nop
align 4
hlt

align 16
nop
align 8
hlt

.386

mov eax,1

align 16
nop
align 1
hlt

align 16
nop
align 2
hlt

align 16
nop
align 4
hlt

align 16
nop
align 8
hlt

s ends

end

Before the .386 directive, MASM uses NOPs and
MOV AX,AX to pad, prefixed with CS: as needed.
Perfectly reasonable.

After the .386 directive, it silently decides
to emit 32-bit code, despite the fact that you
are still in a 16-bit segment. Execute any of
that code, and boom, you're dead.

The fact of the matter is that relying on the
ALIGN directive is not a good idea. Only the
programmer knows what padding bytes should be
emitted and why, period.

On top of that, MASM's ALIGN doesn't accept an
arbitrary alignment. By contrast, programmers
can easily write a macro that does.
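
The arithmetic such a macro needs is small. Here it is sketched in Python rather than as a MASM macro, purely to show the calculation; `pad_bytes` and `nop_fill` are hypothetical names, and filling with plain one-byte NOPs is one possible choice, not what MASM emits.

```python
def pad_bytes(offset, alignment):
    """Bytes needed to move offset up to the next multiple of alignment.

    Works for any positive alignment, not only powers of two.
    """
    if alignment < 1:
        raise ValueError("alignment must be at least 1")
    return (-offset) % alignment


def nop_fill(offset, alignment):
    """Pad with one-byte NOPs (90h), which leave registers and flags alone."""
    return b"\x90" * pad_bytes(offset, alignment)


if __name__ == "__main__":
    print(pad_bytes(10, 16))   # 6
    print(pad_bytes(11, 16))   # 5
    print(pad_bytes(7, 3))     # 2, a non-power-of-two alignment
    print(pad_bytes(32, 16))   # 0, already aligned
```

Whether a run of single NOPs is acceptable speed-wise is a separate question, as the timings earlier in the thread show.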

Title: Re: No-op sequences inserted by MASM for alignment
Post by: hutch-- on May 23, 2005, 12:46:05 AM

:bg

It may have something to do with the programmer knowing the use of alignment for the particular platform involved. Under 16 bit DOS, align by 2 bytes was the maximum required and was enabled with,

ALIGN EVEN

The basic distinction is if you are going to use a later version of ML that is a PE file (post 6.11d) align on the basis of the platform requirement. The only real problem for 32 bit code that I see is the 5 byte padding (add eax, 0) that sets a flag which could break some code if there was a flag test pending.

The following code demonstrates that alignment in 32 bit code is hardly a problem. It has been built with both ML 6.14 and 7.00.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

comment * -----------------------------------------------------
                        Build this  template with
                       "CONSOLE ASSEMBLE AND LINK"
        ----------------------------------------------------- *

    main PROTO :DWORD,:DWORD,:DWORD

    .code

start:
   
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    .data
      psrc db "This is a test of different alignments in MASM",0
      pdst db "                                                "
    .code

    invoke main,ADDR psrc,ADDR pdst,LENGTHOF psrc

    print ADDR pdst

    exit

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

main proc src:DWORD,dst:DWORD,cnt:DWORD

  align 1
    cld
  align 2
    mov esi, src
  align 4
    mov edi, dst
  align 8
    mov ecx, cnt
  align 16
    rep movsb
  align 8
    ret

main endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

end start

Title: Re: No-op sequences inserted by MASM for alignment
Post by: nasm64developer on May 23, 2005, 02:03:45 PM

> Under 16 bit DOS, align by 2 bytes was the maximum required

Are you confusing the 16-bit 8086/8088 with USE16?

While 16-bit DOS requires segments to be aligned
on 16-byte boundaries -- due to x86 real mode, of
course -- the contents of such segments may very
well desire/require alignments of 2, 4, 8, or 16.

For example, ALIGN EVEN won't do you any good if
you're trying to use 16-byte SSE data under DOS...

Nah... once you know that MASM's ALIGN is riddled
with problems, you just get over it and start to
use your own macro to do a better job. ;-)

Title: Re: No-op sequences inserted by MASM for alignment
Post by: hutch-- on May 23, 2005, 02:11:23 PM

:bg

I guess with so many variations, x86 REAL mode using SSE under DOS, you could end up with anything you like. Microsoft/IBM DOS was a 16 bit real mode non-reentrant monotasking OS that sneaked some access into extended memory for data apart from the A20 line just above 1 meg.

If you are crafting/running such a hybrid OS, I guess you could get the appropriate assembler from the vendor that would suit your requirements. :green

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: MichaelW on May 13, 2005, 08:22:31 AM