Object Code Signatures and HLA v1.102

Randall Hyde · March 16, 2008, 11:02:30 PM

Though many people don't realize this, few x86 assemblers generate the same object code when given the same (well, equivalent) source input files. As it turns out, there are many ambiguities in the encoding of various x86 machine instructions (that is, certain instructions have multiple encodings) and different assemblers make different decisions concerning the choice of opcode values.

Some examples might help clear this up:

1) The "mov" instructions have special variants that will load a value into AL/AX/EAX and a generic version that will load a value into any general-purpose register. Originally (on the 8086) the "accumulator" variants were shorter, but with the advent of the 80386, the mov ax, <mem> instruction can be encoded two different ways, both the same length.

2) Most two-register instructions (those encoded with a "mod-reg-r/m" byte) offer two different encodings by setting or clearing the instruction's "direction bit" and swapping the reg and r/m bit fields.

3) Many of the immediate-mode instructions provide "short" forms that encode a value in the range -128..+127 as a single sign-extended byte value.
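
To make examples 2 and 3 concrete, here is a minimal Python sketch that builds the alternative encodings of a 32-bit ADD instruction by hand. The function names are made up for this illustration and are not part of HLA or any assembler; the opcode values (01h, 03h, 81h, 83h) are the standard x86 ones.

```python
# Register numbers used in the mod-reg-r/m byte (32-bit general-purpose registers).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm(mod, reg, rm):
    """Pack the three mod-reg-r/m fields into one byte."""
    return (mod << 6) | (reg << 3) | rm


def add_reg_reg(dest, src):
    """Return both encodings of ADD dest, src for 32-bit registers."""
    d, s = REG32[dest], REG32[src]
    # Opcode 01h: ADD r/m32, r32 (destination sits in the r/m field).
    store_form = bytes([0x01, modrm(3, s, d)])
    # Opcode 03h: ADD r32, r/m32 (direction bit set, reg and r/m swapped).
    load_form = bytes([0x03, modrm(3, d, s)])
    return store_form, load_form


def add_reg_imm(dest, value):
    """Return the 83h and 81h encodings of ADD dest, value that apply."""
    d = REG32[dest]
    imm32 = (value & 0xFFFFFFFF).to_bytes(4, "little")
    # Opcode 81h /0: ADD r/m32, imm32 (always usable).
    forms = [bytes([0x81, modrm(3, 0, d)]) + imm32]
    if -128 <= value <= 127:
        # Opcode 83h /0: ADD r/m32, imm8 (single sign-extended byte).
        forms.insert(0, bytes([0x83, modrm(3, 0, d), value & 0xFF]))
    return forms


if __name__ == "__main__":
    for form in add_reg_reg("eax", "ebx"):
        print(form.hex(" "))
    for form in add_reg_imm("ebx", 5):
        print(form.hex(" "))
```

For "add eax, ebx" the two forms are 01 D8 and 03 C3; for "add ebx, 5" they are 83 C3 05 and 81 C3 05 00 00 00. An assembler has to pick one of each pair, and that pick is what shows up in its signature.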

No two assemblers have ever made the exact same choice of opcode selection. Indeed, there have been papers written that describe an "object code signature" that can be used to identify the compiler or assembler used to produce some object code based on the particular opcodes appearing in a program.

Originally, HLA was at the mercy of the back-end assembler with respect to the object code signature it produced. Several versions back (I forget the version number, but it was somewhere around v1.98) I modified HLA to produce hexadecimal encodings for various instructions that MASM v6 did not support. In HLA v1.102, I've extended this encoding scheme to *all* instructions (except certain JMP, Jcc, and CALL instructions); that is, HLA v1.102 (by default) will now produce hex opcodes for most instructions rather than passing them down to the back-end assembler for conversion to binary. As a result, HLA v1.102 wound up with its own unique "object code signature".

Unfortunately, this completely destroyed the instruction opcode test suite that I developed for HLA. The opcode test suite worked by producing two sets of source files: one source file in HLA and one in some other assembler (MASM, FASM, or Gas). The test suite would compile the two files (HLA source and non-HLA source), disassemble the executables produced, and then compare the disassembled files. If they were equivalent, the test succeeded. Other than a few instructions where the back-end assemblers didn't assemble the code correctly (and there are several such defects in MASM6, MASM7, and even FASM v1.66), this was a great way to test the correctness of the HLA code generator algorithms.

Of course, once HLA began producing hexadecimal opcode output, and developed its own "object code signature", this testing scheme fell apart. HLA, for example, was generating different opcodes for TEST instructions than was (say) FASM. Though the instructions were equivalent, their disassemblies were not because of the different opcodes involved (the text strings were the same, but the hex opcodes listed were not).
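
The failure mode can be illustrated with a small Python sketch that compares two disassembly listings and separates "the instruction text differs" from "only the hex bytes differ". This is not the actual HLA test suite; the listing format ("address: hex bytes, then instruction text") and the function names are assumptions made for the example.

```python
import re

# Assumed (hypothetical) listing format: "<address>: <hex bytes>  <instruction text>"
LINE = re.compile(r"^\s*[0-9a-fA-F]+:\s+((?:[0-9a-fA-F]{2} )+)\s*(.+?)\s*$")


def parse(listing):
    """Return a list of (hex bytes, instruction text) for each recognized line."""
    rows = []
    for line in listing.splitlines():
        m = LINE.match(line)
        if m:
            hex_bytes = m.group(1).strip().lower()
            text = " ".join(m.group(2).split())  # collapse runs of whitespace
            rows.append((hex_bytes, text))
    return rows


def compare(listing_a, listing_b):
    """Classify instruction pairs by how they differ (pairs are matched by position)."""
    result = {"text_differs": [], "bytes_differ_only": []}
    pairs = zip(parse(listing_a), parse(listing_b))
    for i, ((bytes_a, text_a), (bytes_b, text_b)) in enumerate(pairs):
        if text_a != text_b:
            result["text_differs"].append(i)
        elif bytes_a != bytes_b:
            result["bytes_differ_only"].append(i)
    return result


if __name__ == "__main__":
    a = "00000000: 01 d8   add eax, ebx\n00000002: 83 c3 05   add ebx, 5\n"
    b = "00000000: 03 c3   add eax, ebx\n00000002: 83 c3 05   add ebx, 5\n"
    print(compare(a, b))
```

With the two sample listings, instruction 0 lands in "bytes_differ_only": the text is identical, but a plain file comparison would still report a mismatch.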

The solution I came up with was to modify HLA's code generator so that its "object code signature" matches (as much as possible) that of the current back-end assembler. For example, if you assemble an HLA program using FASM as the back-end assembler, the object code signature matches FASM as much as possible (that is, except for the incorrect conversions). Likewise, if you're using TASM as the back-end assembler, then HLA's object code signature matches TASM's. And so on.
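
One way to picture this is a table of encoding preferences that is consulted whenever an instruction has more than one valid encoding. The Python sketch below is only an illustration of that idea, not HLA's code generator. The profile names are deliberately generic placeholders, since which form each real assembler prefers is not listed here.

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

# Hypothetical signature profiles. The names and settings are placeholders,
# not the documented behavior of MASM, TASM, FASM, or Gas.
PROFILES = {
    "profile_a": {"reg_reg_uses_direction_bit": False},
    "profile_b": {"reg_reg_uses_direction_bit": True},
}


def encode_add(dest, src, profile):
    """Encode ADD dest, src (32-bit registers) using the profile's preferred form."""
    d, s = REG32[dest], REG32[src]
    if PROFILES[profile]["reg_reg_uses_direction_bit"]:
        # Opcode 03h: ADD r32, r/m32 (destination in the reg field).
        return bytes([0x03, 0xC0 | (d << 3) | s])
    # Opcode 01h: ADD r/m32, r32 (destination in the r/m field).
    return bytes([0x01, 0xC0 | (s << 3) | d])


if __name__ == "__main__":
    for name in PROFILES:
        print(name, encode_add("eax", "ebx", name).hex(" "))
```

The same source instruction comes out as 01 D8 under one profile and 03 C3 under the other, so choosing the profile to match the back end keeps the two disassemblies comparable.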

Note that a back-end assembler of some sort is still necessary in HLA v1.102 because HLA doesn't yet handle branch displacement optimization (i.e., the Jcc, Jmp, and Call instructions) and HLA doesn't produce COFF/ELF files. HLA, by default, produces hexadecimal opcodes, but the assembler emits "DB", "DW", and "DD" statements that the back-end assembler then converts to machine code. (At some point in the future I will probably take the next step, which is to produce the COFF/ELF files directly; right now, however, a back-end assembler is still necessary.)

One thing I've added over the past week is a new source code translator. Originally, HLA v1.102 was going to generate hex opcodes and that was it. But for testing purposes, I really wanted to provide an option to emit source code, so I've spent the last several days adding that feature. As with earlier versions, HLA v1.102 provides an option to produce MASM, TASM, Gas, or FASM source output files. The default is to produce hex output files, but if you specify the new "-sourcemode" command-line option, HLA will emit human-readable source rather than a bunch of DB, DW, and DD statements.

It's important to understand that "-sourcemode" is different from the "-sm", "-sf", "-st", and "-sg" options. "-sm", for example, tells HLA to produce an output file that can be processed by MASM. By default, that file will simply contain a bunch of DB, DW, and DD statements. If you want a human-readable source file, you should specify the "-sourcemode -sm" options together.

Note that the "-test" command-line option has been expanded in HLA v1.102. If you've specified "-test" and you've *not* also specified "-sourcemode", then HLA v1.102 will emit a *comment* listing the actual machine instruction immediately before the hexadecimal object code it emits. This is useful for debugging HLA output (which is the purpose of the "-test" command-line option).
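
As a rough picture of what "hex output preceded by a comment" could look like, the Python sketch below formats a byte sequence as a data statement, optionally with a comment line naming the instruction. The exact text HLA emits is not shown in this post, so the layout, the function name, and the per-assembler table are assumptions based on each assembler's usual syntax for byte data and comments.

```python
# Assumed directive, hex number format, and comment character per back end.
# These follow each assembler's usual syntax; they are not copied from HLA output.
STYLES = {
    "masm": ("db", "0{:02x}h", ";"),
    "tasm": ("db", "0{:02x}h", ";"),
    "fasm": ("db", "0x{:02x}", ";"),
    "gas": (".byte", "0x{:02x}", "#"),
}


def emit_bytes(code, style, comment=""):
    """Format machine-code bytes as a data statement, with an optional comment line."""
    directive, number_format, comment_char = STYLES[style]
    lines = []
    if comment:
        # The comment names the instruction the bytes encode.
        lines.append(f"{comment_char} {comment}")
    values = ", ".join(number_format.format(b) for b in code)
    lines.append(f"    {directive} {values}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_bytes(bytes([0x01, 0xD8]), "masm", "add eax, ebx"))
    print(emit_bytes(bytes([0x01, 0xD8]), "gas"))
```

The back-end assembler only ever sees the data statement; the comment is there purely for the person reading the file.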

Anyway, getting back to the object code signature idea -- HLA will automatically produce the same object code signature as whatever back-end assembler you choose. So if you run HLA with the "-xm" command-line option, the file it produces will be (mostly) object-code-signature compatible with MASM; the "-xt" option produces TASM object-code-signature output; and so on.

Though the production of object-code-signature compatible output may not be of tremendous interest to most people, for those who are producing "hacker-proof" code this is a useful facility. Also, the process of learning the signatures for all these different assemblers has taught me how to generate the *best* object code when I get around to writing HLA v2.0 (or when writing an HLA-specific back-end processor). *None* of the assemblers out there generate optimal code in all cases. While HLA might not ever do this either, having gone through the process of studying the output of all these assemblers, I've got a good idea how to select the best from all the assemblers out there.
hLater,
Randy Hyde
