Intel Syntax, revisited

Started by Randall Hyde, June 21, 2008, 05:51:27 PM


Randall Hyde

Hi All,
As I've been cleaning up and extending HLA recently, I've come across several peculiarities in Intel's syntax and instruction set design that have been driving me crazy. I'm also trying to resolve the differing behavior of certain instruction variants that cannot be selected through any syntax Intel has provided.

For example, consider the push( <segment register> ); instruction. All segment registers are 16 bits wide, so the standard version of this instruction pushes a 16-bit value onto the stack (misaligning the stack in a 32-bit application). The typical encoding for the "push( dseg );" instruction, for example, is $66, $1E. One might ask what happens if you take away the $66 operand-size prefix byte (which converts 32-bit operands to 16-bit operands). After all, the DSEG register is always 16 bits; there is no 32-bit variant. What happens is that the CPU actually pushes 32 bits of data. So what's in the H.O. word of those 32 bits? Well, for FSEG and GSEG the H.O. word is zero filled (at least, according to Intel's documentation). Intel's documentation doesn't say anything about the other segment registers, so it's probably safest to assume that the value of the H.O. word is undefined.

Now consider the "pop( dseg );" instruction. Once again, as a 16-bit instruction, a typical assembler (including HLA) will emit a $66 operand-size prefix byte before the $1F opcode. This pops a 16-bit value off the stack, matching the operation of the $66, $1E [push( dseg )] encoding. If we take the $66 prefix off the opcode, the instruction removes 32 bits from the stack, putting the L.O. 16 bits into DSEG and throwing away the H.O. 16 bits, matching what the $1E opcode does without the size prefix.

Now, granted, you don't really need to push or pop segment registers very often in 32-bit flat-model 80x86 programs. However, should you need to do so for some reason or another, you'll almost always want to push 32 bits (in order to keep the stack pointer aligned). With the 16-bit forms, this means pushing/popping an extra word on the stack (or adding/subtracting two to/from ESP). Why do this when the "32-bit version" of the instruction spares you the extra add/sub instruction *and* is one byte shorter as well?
Intel offers no clues with respect to syntax on how to handle this.
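To make the cost concrete, here is a minimal sketch, assuming (as described above) an assembler that emits the $66 prefix for segment-register pushes; the db lines force the prefix-free encodings by hand:

```asm
; 16-bit forms: keeping ESP dword-aligned costs extra instructions
sub  esp, 2         ; pad the 16-bit push out to a full dword
push ds             ; $66 $1E - pushes 16 bits
; ...
pop  ds             ; $66 $1F - pops 16 bits
add  esp, 2

; prefix-free forms: one byte shorter each, and no ESP fixup
db   1Eh            ; pushes 32 bits (H.O. word of the pushed value not guaranteed)
db   1Fh            ; pops 32 bits, discarding the H.O. word
```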

IIRC, some assemblers solve the problem by using the pushw and pushd mnemonics. Specifically, such assemblers make "push( dseg )" equivalent to "pushw( dseg )" (emitting the $66 operand-size prefix) and they have "pushd( dseg )" emit the 32-bit version of the instruction, without the size prefix. This is a workable solution for the push and pop segment register instructions, though I really don't like attaching size information to the mnemonic itself (I'll explain this problem in a moment). However, as the pushw and pushd mnemonics have long been a part of Intel syntax, this approach is acceptable.
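With those mnemonics, the choice of encoding is explicit. A sketch of the scheme just described (exact mnemonic support varies by assembler):

```asm
pushw ds            ; $66 $1E - 16-bit push, with the operand-size prefix
pushd ds            ; $1E     - 32-bit push, no prefix
popw  ds            ; $66 $1F - 16-bit pop
popd  ds            ; $1F     - 32-bit pop
```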

Push and pop aren't the only segment-register-related instructions. The mov instruction also operates on 16-bit segment registers (and memory locations). Consider the case where a mov instruction moves some data into a 16-bit segment register: by default, most assemblers go ahead and emit an unneeded $66 operand-size prefix byte, even though Intel's documentation clearly states that this is unnecessary and, in fact, makes the instruction run one clock slower on top of making it one byte longer. Intel suggests that "most assemblers" allow you to dispense with this by using a 32-bit general-purpose register as the source operand:

"When operating in 32-bit mode and moving data between a segment register and a  general-purpose register, the 32-bit IA-32 processors do not require the use of the 16-bit operand-size prefix (a byte with the value 66H) with this instruction, but most assemblers will insert it if the standard form of the instruction is used (for example, MOV DS, AX). The processor will execute this instruction correctly, but it will usually require an extra clock. With most assemblers, using the instruction form MOV DS, EAX will avoid this unneeded 66H prefix. When the processor executes the instruction with a 32-bit general-purpose register, it assumes that the 16 least-significant bits of the general-purpose register are the destination or source operand. If the register is a destination operand, the resulting value in the two high-order bytes of the register is implementation dependent. For the Pentium 4, Intel Xeon, and P6 family processors, the two high-order bytes are filled with zeros; for earlier 32-bit IA-32 processors, the two high order bytes are undefined."

That is to say, when loading a 16-bit general-purpose register into a segment register, the default case should be to avoid emitting the $66 operand-size prefix byte. When copying a 16-bit segment register into a general-purpose register, there is a real difference: one form wipes out the H.O. 16 bits of the GP register, the other leaves them unchanged. While I cannot imagine that preserving the H.O. bits of the 32-bit register matters very often, one can easily synthesize a case where those bits must be preserved. Therefore, it's important to have some syntactical variation on the instruction that lets the programmer choose whether they want the 32-bit equivalent or the 16-bit version.
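A sketch of the four mov forms in question, with encodings taken from Intel's opcode maps (the comments restate the behavior quoted above):

```asm
mov ds, ax          ; typically assembled as $66 $8E $D8 - unneeded prefix
mov ds, eax         ; $8E $D8 - same operation, one byte shorter, no extra clock
mov ax, ds          ; typically $66 $8C $D8 - leaves the H.O. word of EAX alone
mov eax, ds         ; $8C $D8 - H.O. word zeroed on P6 and later, undefined earlier
```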

The problem I'm running into here is that assemblers have wound up using two completely different syntactical devices for specifying the absence/presence of the $66 prefix byte. In one case (push/pop of segment registers) the mnemonic is modified to select the different opcodes. In a second case (mov), the operands are modified to signify the difference. Worse, the solution offered by these "assemblers" that Intel talks about is not general in the case of the mov instruction: that syntax only works if one of the operands is a 32-bit general-purpose register, and there is no syntax (that Intel discusses) that would work when moving data between segment registers and memory. So we have two inconsistent schemes for dealing with 16-bit segment registers in 32-bit applications, neither of which is complete.

Ideally, a *good* assembly language syntax will specify the operand size either in the instruction mnemonic (as was typical for 68000 assembly language and many older assembly languages) or via the operands themselves (as was the case with the original 8086 assembly language syntax). Intel's syntax started out very pure, using the operands' syntax and semantics to specify the operand size. Over the years, they bastardized it. First there were the movsb/movsw(/movsd) syntactical variants (rather than the original syntax of "movs es:dest, ds:source" that specified the size of the operand via the "phantom" memory references). Then we saw pushw/pushd added to handle immediate pushes in the 32-bit instruction set. Ultimately, the MMX and SSE instruction sets came along and Intel syntax became completely inconsistent. Now some instructions specify their size in the instruction mnemonic, and some specify their size in the operands' syntax. I'm not going to claim that one form is fundamentally better than the other, but I will argue that any syntax that mixes both is ugly, inconsistent, and ought to be avoided. Yet that is exactly where we are today with Intel's syntax.
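As a reminder of what was lost, the original sized-by-operands string move looked something like this (the exact operand spelling varies by assembler; some still accept the explicit form):

```asm
movs byte ptr es:[di], byte ptr ds:[si]   ; size taken from the "phantom" operands
movsb                                     ; size baked into the mnemonic instead
```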

Is there a better solution? No doubt. Do I have a solution to offer? No, not yet. But I have begun thinking about this idea because assembly language would be much easier to learn (particularly the MMX and SSE instruction sets) if the assembly language syntax were a bit more consistent.

So how to improve the situation? Well, let's consider the heritage of 80x86 assembly language syntax. Its roots are firmly embedded in "the operands specify the instruction size" camp, and attempts by assembler authors to specify the operand size via the instruction mnemonics have met with little success or acceptance. The integer instruction set syntax is too well entrenched at this point. Granted, every assembler out there has its own idea about 80x86 syntax, but bring up the subject of GAS (or one of the other variants that attach size to the mnemonics) and you'll find that most assembly programmers despise that approach. Granted, this is more a problem of familiarity than anything else, but it is something we have to live with. The bottom line is that we aren't going to get away with changing many mnemonics in the integer portion of the instruction set, which is what most people have already learned. Because the integer instructions do a decent job of sticking with "the operands specify the size" syntax (not a perfect job, but decent), it makes sense to stick with that approach elsewhere in the instruction set.

The place where Intel completely flipped is in the MMX and SSE instructions. Consider the SSE floating-point addition instructions. They have ADDSS, ADDSD, ADDPS, and ADDPD. Consider, for a moment, the variants that work on 128-bit SSE registers.  I argue that rather than having four times as many instructions for people to memorize, a better solution would have been to apply the integer instruction set approach and use different register names with the same mnemonic. For example, why not have "xmm0.s", "xmm0.ps", "xmm0.d", and "xmm0.pd" (or any other variation you please, to signify "scalar single", "packed single", "scalar double", or "packed double" operands)? Then you could specify your instructions thusly:

add xmm0.s, xmm1.s
add xmm1.ps, xmm2.ps
add xmm2.d, xmm3.d
add xmm3.pd, xmm4.pd

The memorization nightmare that the CVT..... instructions represent would be incredibly simplified by this approach.
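For instance, a handful of today's conversion mnemonics, next to what they might look like under the hypothetical typed-register scheme above (the .s/.d/.ps/.pd suffixes are, of course, invented):

```asm
cvtss2sd xmm0, xmm1     ; scalar single -> scalar double
cvtps2pd xmm0, xmm1     ; packed single -> packed double
cvtsi2ss xmm0, eax      ; 32-bit integer -> scalar single

; hypothetical equivalents: one mnemonic, typed registers
; cvt xmm0.d,  xmm1.s
; cvt xmm0.pd, xmm1.ps
; cvt xmm0.s,  eax
```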

Granted, this is not a perfect approach that will work in all cases, but I argue that it would vastly simplify the syntax of the MMX and SSE instruction sets.

Comments are welcome. Suggestions are welcome.
Cheers,
Randy Hyde



daydreamer

Quote from: Randall Hyde on June 21, 2008, 05:51:27 PM
The place where Intel completely flipped is in the MMX and SSE instructions. Consider the SSE floating-point addition instructions. They have ADDSS, ADDSD, ADDPS, and ADDPD. Consider, for a moment, the variants that work on 128-bit SSE registers.  I argue that rather than having four times as many instructions for people to memorize, a better solution would have been to apply the integer instruction set approach and use different register names with the same mnemonic. For example, why not have "xmm0.s", "xmm0.ps", "xmm0.d", and "xmm0.pd" (or any other variation you please, to signify "scalar single", "packed single", "scalar double", or "packed double" operands)? Then you could specify your instructions thusly:

add xmm0.s, xmm1.s
add xmm1.ps, xmm2.ps
add xmm2.d, xmm3.d
add xmm3.pd, xmm4.pd

The memorization nightmare that the CVT..... instructions represent would be incredibly simplified by this approach.

Granted, this is not a perfect approach that will work in all cases, but I argue that it would vastly simplify the syntax of the MMX and SSE instruction sets.

Comments are welcome. Suggestions are welcome.
Cheers,
Randy Hyde
Funny that you want to choose a syntax similar to the way the machine code is structured.
It could easily be written with equates: .s and .ps would map to the bit that controls the difference between a scalar and a packed instruction, and .d and .pd would extend it further by adding a 66h prefix byte.
Why not? Modern assembly should follow syntax like MOV EAX,EBX rather than resemble the old 6502 assembler's LDA, LDX, and LDY, and it is logical to have it like your suggestion because it is close to how the CPU works, which the newbie/student can already see when running his code through a debugger.

johnsa

I agree that the x86 instruction set is far from optimal by today's (being my) standards.
I believe that they just kept adding and adding and never took the time to clean it up. Obviously backwards compatibility needs to be preserved, but really.

For example, the stack-based FPU... why oh why?
Why not have the FPU work the same way as the general-purpose and integer instructions,
like the 68k did?

fmov f0,[mem]
fadd f0,f1
fsqrt f2,f1
fmov [mem],f0

sort of idea.
I know it's possible to use SSE instructions instead of the FPU, especially the SS (scalar) instructions. However, SIMD/SSE was not designed to replace the FPU but to complement the whole package. Most scientific calculation, and even compiler output, does (and will) still use the standard FPU; given that the above instruction-set style produces code that is about 20% shorter than the stack approach, it would IMHO provide a considerable performance increase for the FPU.
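For contrast, a rough stack-based x87 equivalent of the flat register-file example above (not a literal translation, just the flavor of the current syntax):

```asm
fld   qword ptr [mem]   ; push [mem] onto the FP stack; it becomes ST(0)
fadd  st(0), st(1)      ; ST(0) = ST(0) + ST(1)
fsqrt                   ; implied operand: ST(0) = sqrt(ST(0))
fstp  qword ptr [mem]   ; store ST(0) back to memory and pop the stack
```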

MichaelW

Regarding the stack-based FPU, I recall once reading a statement by Gordon Moore to the effect that Intel was really stretching when they designed the 8087, and the 'transistor budget' could not support more complex designs.
eschew obfuscation

johnsa

True, but given today's designs, one would think they could have kept the stack-based system for compatibility but added an improved FPU model going forward. In fact, if they'd done this from the early Pentiums on, the stack-based code could be all but forgotten by now.

While they're at it, I'm sure there is a lot of room for improvement in the actual general instruction set too, mainly in terms of either fixed-length opcodes or rearranging the opcodes to reduce the overall size and increase decoder efficiency.


Randall Hyde

Quote from: johnsa on June 22, 2008, 08:13:11 PM
True, but given today's designs, one would think they could have kept the stack-based system for compatibility but added an improved FPU model going forward. In fact, if they'd done this from the early Pentiums on, the stack-based code could be all but forgotten by now.

In the hardware, they did. Take a look at the SSE instruction set. The "Scalar Single" (SS) and "Scalar Double" (SD) instructions implement the floating-point operations using the XMM register set.

Quote
While they're at it I'm sure there is a lot of room for improvement in the the actual general instruction set too mainly in terms of either fixed length opcodes or re-arranging the opcodes to reduce the overall size and increase decoder efficiency.

It is a real shame, for example, that when AMD designed the x64 instruction set they didn't drop all the specialized encodings for instructions like XCHG, ALU immediate with accumulator, and dozens of other redundant instructions. They had already co-opted the INC and DEC instructions to get the REX prefixes, so preserving object-code compatibility was a moot point. Imagine what they could have done by grabbing back all those one-byte opcodes.  Oh well, there's still hope for when the x128 architecture rolls around.
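A few of the redundant short encodings in question, with byte values from Intel's opcode maps; each pair performs the same operation:

```asm
xchg eax, ecx       ; $91         - dedicated one-byte XCHG-with-EAX form
xchg ecx, eax       ; $87 $C1     - the general two-byte form
add  al, 5          ; $04 $05     - accumulator/immediate short form
add  bl, 5          ; $80 $C3 $05 - the general immediate form
```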
Cheers,
Randy Hyde


Randall Hyde

Quote from: johnsa on June 21, 2008, 09:39:05 PM
I agree that the x86 instruction set is far from optimal by today's (being my) standards.
I believe that they just kept adding and adding and never took the time to clean it up. Obviously backwards compatibility needs to be preserved, but really.

For example, the stack-based FPU... why oh why?
Why not have the FPU work the same way as the general-purpose and integer instructions,
like the 68k did?

fmov f0,[mem]
fadd f0,f1
fsqrt f2,f1
fmov [mem],f0

sort of idea.
I know it's possible to use SSE instructions instead of the FPU, especially the SS (scalar) instructions. However, SIMD/SSE was not designed to replace the FPU but to complement the whole package. Most scientific calculation, and even compiler output, does (and will) still use the standard FPU; given that the above instruction-set style produces code that is about 20% shorter than the stack approach, it would IMHO provide a considerable performance increase for the FPU.

The main reason for the stack orientation of the FPU was the fact that they had to encode opcode and addressing mode bits into the MOD-REG-R/M byte of the ESC opcodes ($D8..$DF). There just weren't enough bits to go around (to support two operands). So they did the common thing back in 1978 (when the 8087 was being planned) and they used an implied operand (ST0).
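A brief sketch of that encoding squeeze: the three low bits of the ESC opcode plus the REG field of the ModR/M byte select the FP operation, leaving only MOD and R/M to describe a single operand, so the other operand has to be implied:

```asm
fadd dword ptr [mem]    ; $D8 /0 - ST(0) = ST(0) + [mem]; ST(0) is implied
fmul dword ptr [mem]    ; $D8 /1 - same opcode byte, REG field selects multiply
fadd st(0), st(1)       ; $D8 $C1 - register forms reuse the MOD=11 encodings
```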

Cheers,
Randy Hyde


Neo

Quote from: Randall Hyde on June 22, 2008, 10:55:07 PM
It is a real shame, for example, that when AMD designed the x64 instruction set they didn't drop all the specialized encodings for instructions like XCHG, ALU immediate with accumulator, and dozens of other redundant instructions. They had already co-opted the INC and DEC instructions to get the REX prefixes, so preserving object-code compatibility was a moot point. Imagine what they could have done by grabbing back all those one-byte opcodes.
Finally, someone else who thinks so!  Imagine if they had taken the ever-growing number of instructions with 3- and 4-byte opcodes and just levelled everything out, so that the SSE instructions didn't have horrible encodings and the instructions that are inherently slow and rarely used got the worse encodings instead.  Imagine how much faster (and potentially simpler) the decoding would be in the CPU!  They could have done this, because they had little need to stay tied to legacy, and yet they chose not to because they were too lazy to think through how to do it. Now we're stuck with this until people move on from x86-based CPUs  :(

Plus, they shouldn't have kept things like MMX in 64-bit mode; there are no x64 CPUs that don't have SSE of some sort, so it's a complete waste of encodings and transistors.

Randall Hyde

Quote from: Neo on June 26, 2008, 05:10:13 AM


Plus, they shouldn't have kept things like MMX in 64-bit mode; there are no x64 CPUs that don't have SSE of some sort, so it's a complete waste of encodings and transistors.

Worse yet, there is the interference between MMX and the FPU (and despite the presence of floating point in the SSE instruction set, the FPU is still useful today). About the only argument for the MMX instruction set today is that you get an additional set of 64-bit registers if you don't use the FPU.
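The interference being referred to: the MMX registers alias the x87 register stack, so mixed code has to execute EMMS before returning to FPU instructions. A minimal sketch (src, dst, and x are placeholder memory operands):

```asm
movq  mm0, qword ptr [src]  ; any MMX use marks the x87 tag word as in-use
paddw mm0, mm1
movq  qword ptr [dst], mm0
emms                        ; reset the tag word before x87 code runs again
fld   qword ptr [x]         ; safe only after the EMMS above
```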
Cheers,
Randy Hyde


MazeGen

Additionally, there are a few documented opcodes without a mnemonic: both branch-hint prefixes (2E, 3E), AAD and AAM with an arbitrary number base, and the undefined instruction 0FB9 (which takes a ModR/M byte).

I can't see any reason why Intel didn't give them mnemonics.

Neo

You're right, though for those it's not too bad to make a macro to do it.  I've also named them in my IDE, so the branch-hint prefixes are "likely" and "unlikely", and AAD and AAM optionally take an immediate operand.  Mind you, I won't have the encoder implemented until August, but at least I've thought ahead, hehe.
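A minimal MASM-style sketch of that macro idea (the macro names here are invented; 2E and 3E are the hint prefixes mentioned above, and $D4 ib is the AAM encoding):

```asm
likely MACRO            ;; $3E "branch taken" hint, placed before a Jcc
    db 3Eh
ENDM
unlikely MACRO          ;; $2E "branch not taken" hint
    db 2Eh
ENDM
aam_base MACRO base     ;; AAM with an arbitrary base ($D4 ib)
    db 0D4h, base
ENDM

; usage:
;   unlikely
;   jz  error_path      ; hinted as rarely taken
;   aam_base 16         ; AAM, base 16
```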

Edit: Whoops, bad grammar  :red