No-op sequences inserted by MASM for alignment

Started by MichaelW, May 13, 2005, 08:22:31 AM

MichaelW

Just to satisfy my curiosity.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
; A crude app to determine the no-op sequences inserted by MASM
; for alignment, 1 to 15 byte lengths.
;
; Assemble and disassemble to see the sequences.
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .486                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    include \masm32\macros\macros.asm

    pad MACRO cnt
      REPEAT cnt
        clc
      ENDM
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    pad 1
    align 2
    stc
    pad 3
    align 4
    stc
    align 4
    stc
    pad 7
    align 8
    stc
    pad 2
    align 8
    stc
    pad 1
    align 8
    stc
    align 8
    stc
    pad 7
    align 16
    stc
    pad 6
    align 16
    stc
    pad 5
    align 16
    stc
    pad 4
    align 16
    stc
    pad 3
    align 16
    stc
    pad 2
    align 16
    stc
    pad 1
    align 16
    stc
    align 16
    stc   
    jz    @F
  @@:
    mov   eax,input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

; -------------------------------------------------------------
; No-op sequences inserted for align, MASM 6.14, 1 to 15 bytes:
; -------------------------------------------------------------

00401001 90                     nop

00401006 8BFF                   mov     edi,edi

00401009 8D4900                 lea     ecx,[ecx]

00401014 8D642400               lea     esp,[esp]

0040101B 0500000000             add     eax,0

00401022 8D9B00000000           lea     ebx,[ebx]

00401029 8DA42400000000         lea     esp,[esp]

00401038 8DA42400000000         lea     esp,[esp]
0040103F 90                     nop

00401047 8DA42400000000         lea     esp,[esp]
0040104E 8BFF                   mov     edi,edi

00401056 8DA42400000000         lea     esp,[esp]
0040105D 8D4900                 lea     ecx,[ecx]

00401065 8DA42400000000         lea     esp,[esp]
0040106C 8D642400               lea     esp,[esp]

00401074 8DA42400000000         lea     esp,[esp]
0040107B 0500000000             add     eax,0

00401083 8DA42400000000         lea     esp,[esp]
0040108A 8D9B00000000           lea     ebx,[ebx]

00401092 8DA42400000000         lea     esp,[esp]
00401099 8DA42400000000         lea     esp,[esp]

004010A1 8DA42400000000         lea     esp,[esp]
004010A8 8DA42400000000         lea     esp,[esp]
004010AF 90                     nop


eschew obfuscation

Mirno

The interesting one is "add eax, 0": of all the "NOP" instructions, it's the only one that has an effect on the flags.
I guess it's worth remembering that the flag state may not be the same after an align.

Mirno

P1

Quote from: Mirno on May 13, 2005, 11:01:42 AM
The interesting one is "add eax, 0": of all the "NOP" instructions, it's the only one that has an effect on the flags.
I guess it's worth remembering that the flag state may not be the same after an align.

Mirno
Any chance this is a microprocessor-dependent bug?  Have you done any testing for this behavior?

Regards,  P1

Mirno

It's not a processor bug, it's an assembler bug. NOPs are supposed to have no effect on execution, and the add instruction does have one (it affects the flags, if nothing else).


.486
.model flat, stdcall
option casemap :none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc

includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib

pad MACRO cnt
  REPEAT cnt
    nop
  ENDM
ENDM

.data
msg db "You should see this", 0

.code
start:
  mov eax, 1
  cmp eax, 1

  pad 2 ; <- REPLACE WITH 3
ALIGN 16

jne @F
  invoke MessageBox, NULL, ADDR msg, ADDR msg, MB_OK
@@:

  invoke ExitProcess, 0
end start


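If you change the pad count to 3, the message won't display. The only difference is the padding chosen by the align directive: with 3, the ALIGN has 5 bytes to fill and uses "add eax, 0", which clears the zero flag that the cmp set (eax is 1, so the result is non-zero). The "NOPs" inserted by the align directive are not entirely benign.
The code here is contrived, but I could imagine someone somewhere being bitten by this.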

Imagine that you've got working code, and then change something (even in another function), that pushes the align a bit causing it to use the "add eax, 0" NOP and your code breaks. It'd be really nasty to find. Although writing code which has the compare & jump on opposite sides of an align directive is morally wrong!
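
One way to stay clear of the problem is to keep the compare and the jump on the same side of the align. The following is a minimal sketch of the test above rearranged that way, not a tested fix. It assumes that "start" sits on a 16-byte boundary and that "mov eax, 1" assembles to 5 bytes, so "pad 6" leaves exactly 5 bytes for ALIGN 16 to fill:

```asm
.486
.model flat, stdcall
option casemap :none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc

includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib

pad MACRO cnt
  REPEAT cnt
    nop
  ENDM
ENDM

.data
msg db "You should see this", 0

.code
start:
  mov eax, 1   ; assumed 5 bytes (B8 imm32)

  pad 6        ; 5 + 6 = 11 bytes so far, so the align must fill 5 bytes
ALIGN 16       ; may emit "add eax, 0"; no flags are live at this point

  cmp eax, 1   ; the flags are set after the padding...
  jne @F       ; ...and consumed before any further align
  invoke MessageBox, NULL, ADDR msg, ADDR msg, MB_OK
@@:

  invoke ExitProcess, 0
end start
```

Because "add eax, 0" leaves eax unchanged and the cmp runs after it, the message should display whichever padding sequence the assembler picks.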

Mirno

MazeGen

YAMLB!

Michael, thanks for reporting that hidden bug.

(Yet Another ML Bug!)

Did you think about reporting that to Microsoft? They could fix it in ML 8.0.

MichaelW

Hi MazeGen,

I didn't recognize the problem until Mirno pointed it out. I just tested 6.15.8803 and 7.00.9466, and both generated the same "add eax,0". Surely Microsoft knows about this ??

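I found two 5-byte encodings that do not affect the flags, but both might be somewhat slow, and ML will not emit the second from its mnemonic form, so it has to be coded with db (this might also be true for some of the other encodings).

jmp   near ptr $            ; 5 bytes (E9 rel32); usable as padding only with a zero displacement, this form targets itself
lea   esp,ss:[esp+0]        ; ML drops the prefix and the displacement: 3 bytes
db    36h,8dh,64h,24h,00h   ; the same lea forced to 5 bytes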

00401001                    loc_00401001:
00401001 E9FBFFFFFF             jmp     loc_00401001
00401006 8D2424                 lea     esp,[esp]
00401009 368D642400             lea     esp,ss:[esp]
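
As an illustration of how the prefixed lea could stand in for "add eax, 0", here is a sketch of Mirno's test with the 5 bytes of padding emitted by hand. The byte counts in the comments are assumptions (start on a 16-byte boundary, "mov eax, 1" as 5 bytes, "cmp eax, 1" as 3 bytes); if any of them is off, ALIGN 16 will insert its own padding again:

```asm
.486
.model flat, stdcall
option casemap :none

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc

includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib

.data
msg db "You should see this", 0

.code
start:
  mov eax, 1               ; assumed 5 bytes
  cmp eax, 1               ; assumed 3 bytes, sets ZF
  nop                      ; 3 x 1 byte, the same as "pad 3"
  nop
  nop
  db 36h,8Dh,64h,24h,00h   ; 5 bytes: lea esp,ss:[esp+00h], flags untouched
ALIGN 16                   ; offset is already 16 here, so nothing is inserted

  jne @F                   ; ZF from the cmp is still intact
  invoke MessageBox, NULL, ADDR msg, ADDR msg, MB_OK
@@:

  invoke ExitProcess, 0
end start
```

Hand-counted padding like this is fragile: any change to the code before it shifts the offset, so it is a way to demonstrate the encoding rather than something to rely on.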


roticv

Maybe MASM is being pragmatic. When you use esp, the processor already knows you are referring to the ss segment, so 36h as a prefix has no meaning.

hutch--

It does the same with added displacements of 00000000h.

zooba

What's wrong with padding with this?

lea esp, [esp]
nop


I'm guessing the longer lea instruction is faster than straight nops because the processor skips through it in one step, but a jump (even to nowhere) is going to flush stuff on newer processors, isn't it?

MichaelW

Investigating further, I coded an app that measures the cycle counts for the instructions that MASM uses for alignment, along with one of the 5-byte alternatives. Obviously, timing a series of jumps with a displacement of zero will not yield useful results (or at least it was obvious to me... after I tried it).

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    time MACRO definition,caption
      sz SIZESTR <definition>
      bytes SUBSTR <definition>,2,sz-2
      counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        REPEAT REPEAT_COUNT
          db bytes
        ENDM
      counter_end
      mov   ebx,eax
      print chr$(caption," : ")
      print ustr$(ebx)
      print chr$(" cycles",13,10)
    ENDM
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    ; "+00h" indicates a one-byte displacement of zero.
    ; "+00000000h" indicates a four-byte displacement of zero.
   
    time "90h","1 byte,  nop                     "
    time "8Bh,0FFh","2 bytes, mov  edi,edi            "
    time "8Dh,49h,00","3 bytes, lea  ecx,[ecx]          "
    time "8Dh,64h,24h,00","4 bytes, lea  esp,[esp+00h]      "
    time "05h,00,00,00,00","5 bytes, add  eax,0              "
    time "36h,8Dh,64h,24h,00","5 bytes, lea  esp,ss:[esp+00h]   "
    time "8Dh,9Bh,00,00,00,00","6 bytes, lea  ebx,[ebx+00000000h]"
    time "8Dh,0A4h,24h,00,00,00,00","7 bytes, lea  esp,[esp+00000000h]"

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

The results for my P3:

1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 95 cycles
4 bytes, lea  esp,[esp+00h]       : 95 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 95 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 95 cycles
7 bytes, lea  esp,[esp+00000000h] : 95 cycles



[attachment deleted by admin]

Jimg

Interesting.  Results for an Athlon XP 3000:

1 byte,  nop                      : 29 cycles
2 bytes, mov  edi,edi             : 96 cycles
3 bytes, lea  ecx,[ecx]           : 198 cycles
4 bytes, lea  esp,[esp+00h]       : 198 cycles
5 bytes, add  eax,0               : 96 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 198 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 197 cycles
7 bytes, lea  esp,[esp+00000000h] : 198 cycles

Mirno

XP3000+ AMD64.

1 byte,  nop                      : 27 cycles
2 bytes, mov  edi,edi             : 94 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
5 bytes, add  eax,0               : 94 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles


It'd be interesting to time a jump over the same region.

Mirno

AeroASM

Pentium M 1.5GHz


1 byte,  nop                      : 45 cycles
2 bytes, mov  edi,edi             : 95 cycles
3 bytes, lea  ecx,[ecx]           : 94 cycles
4 bytes, lea  esp,[esp+00h]       : 94 cycles
5 bytes, add  eax,0               : 95 cycles
5 bytes, lea  esp,ss:[esp+00h]    : 94 cycles
6 bytes, lea  ebx,[ebx+00000000h] : 94 cycles
7 bytes, lea  esp,[esp+00000000h] : 94 cycles

Mark Jones

I get the same as Jim with an AMD XP 1800+

...so what's the moral of the story here? Do instructions involving brackets run much slower on AMD?
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

roticv

I think it is not a fair comparison. We should compare equal lengths of padding: for instance, compare 4 nops with one lea  esp,[esp+00h], and so on.
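
A sketch of that equal-length comparison, reusing MichaelW's "time" macro unchanged. It assumes his timers.asm (which provides counter_begin and counter_end) is available in the build directory. The rows are illustrative choices, and the "lea + nop" row is zooba's 5-byte suggestion:

```asm
    .586
    .model flat, stdcall
    option casemap :none

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm         ; assumed: MichaelW's timing macros

    time MACRO definition,caption
      sz SIZESTR <definition>
      bytes SUBSTR <definition>,2,sz-2      ; strip the surrounding quotes
      counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
        REPEAT REPEAT_COUNT
          db bytes
        ENDM
      counter_end
      mov   ebx,eax
      print chr$(caption," : ")
      print ustr$(ebx)
      print chr$(" cycles",13,10)
    ENDM

    .data
    .code
start:
    LOOP_COUNT    EQU 10000000
    REPEAT_COUNT  EQU 100

    ; Each pair fills the same number of bytes in two different ways.
    time "90h,90h,90h,90h","4 bytes, 4 x nop                "
    time "8Dh,64h,24h,00","4 bytes, lea  esp,[esp+00h]     "
    time "90h,90h,90h,90h,90h","5 bytes, 5 x nop                "
    time "05h,00,00,00,00","5 bytes, add  eax,0             "
    time "8Dh,64h,24h,00,90h","5 bytes, lea  esp,[esp+00h], nop"

    mov   eax, input(13,10,"Press enter to exit...")
    exit
end start
```

Each row then covers the same number of padding bytes per repetition, so the cycle counts within a pair can be compared directly.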