Encoding and Operand Data

Started by Neo, June 09, 2008, 08:16:38 AM

Neo

Finally, after a ton of work, including an unhealthy amount of data entry, I've put together a file containing operand and encoding information for almost every encoding of every instruction in 16-bit real mode, 32-bit protected mode, and 64-bit mode.  I'll be adding more information later on, such as operation information, register information, indexing information, and documentation, but I'm curious to know who might be interested in this sort of thing.

There are a few known major errors in the encoding data, such as the positioning and type of REX prefixes, since the Intel Software Developer's Manual was extremely inconsistent and bad at explaining the patterns and special cases.  That means don't use this for encoding yet, unless you can do double-checks.  I'll generate a semi-exhaustive test case in a few weeks to compare and consolidate with results from ml and ml64.  I probably won't post the full test case, since it will probably be well over 100,000 lines of code, but I'll make it available if people are interested.

I've categorized instructions into various groups (e.g. common, fpu, mmx, sse).  One of note is the "bad" group, which consists of instructions and/or encodings that are effectively deprecated, or whose use is probably a mistake (I debated whether or not to label the entire mmx group as bad).  The reason for these groupings is for use in PwnIDE as a way to keep only instructions that the user cares about for a given project.

The special parts of the encodings are:
rm    - register,memory encoding (first operand uses register field)
mr    - memory,register encoding (first operand uses memory field)
m#    - memory encoding with # in register field
im    - immediate value
rx??? - REX prefix (of some sort, though probably wrong)
##+R  - add low 3 bits of register number to value of ##
##+cc - add condition code number to value of ##
(plus a few others I can't remember)
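
As a rough illustration of the list above, here is a minimal Python sketch that splits an encoding string such as "66 11 mr" into typed tokens and resolves the "##+R" and "##+cc" forms. The attachment is gone, so the exact token spelling (for example "50+R" or "m0") is an assumption, and the names `tokenize_encoding`, `opcode_plus_reg`, and `opcode_plus_cc` are hypothetical helpers, not anything from the original file.

```python
import re

# Token kinds follow the list in the post; the exact spelling used inside
# the (deleted) data file is an assumption made for this sketch.
_PATTERNS = [
    ("plus_reg", re.compile(r"^([0-9A-Fa-f]{2})\+R$")),    # ##+R
    ("plus_cc", re.compile(r"^([0-9A-Fa-f]{2})\+cc$")),    # ##+cc
    ("byte", re.compile(r"^([0-9A-Fa-f]{2})$")),           # literal hex byte
    ("modrm_digit", re.compile(r"^m([0-7])$")),            # m# (digit in reg field)
    ("modrm", re.compile(r"^(rm|mr)$")),                   # rm / mr
    ("imm", re.compile(r"^(im)$")),                        # immediate value
    ("rex", re.compile(r"^(rx.*)$")),                      # rx??? (REX of some sort)
]


def tokenize_encoding(encoding):
    """Split an encoding string into (kind, text) pairs."""
    tokens = []
    for word in encoding.split():
        for kind, pattern in _PATTERNS:
            match = pattern.match(word)
            if match:
                tokens.append((kind, match.group(1)))
                break
        else:
            raise ValueError(f"unrecognized encoding token: {word!r}")
    return tokens


def opcode_plus_reg(base, reg_number):
    """##+R: add the low 3 bits of the register number to the base opcode."""
    return base + (reg_number & 0b111)


def opcode_plus_cc(base, cc_number):
    """##+cc: add the 4-bit condition code number to the base opcode."""
    return base + (cc_number & 0b1111)


print(tokenize_encoding("66 11 mr"))
print(hex(opcode_plus_reg(0x50, 1)))  # PUSH with register number 1
print(hex(opcode_plus_cc(0x70, 4)))   # short Jcc with condition code 4
```

Only the low 3 bits of the register number go into the opcode byte; for registers 8 to 15 the remaining bit would have to come from a REX prefix, which this sketch does not model.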


Man, I'm glad I can move on now.  :bg

[attachment deleted by admin]

Neo

Oh yeah, some other notes I missed:

  • jcc, setcc, and cmovcc have an asterisk thrown in there to identify that those aren't their actual names and to use the condition codes on them.
  • Some instructions appear more than once because some operand sets are in one group and other operand sets are in a different group.
  • Prefixes are in the group "prefix", and their operands listed are the names of the instructions that can follow them.  Note that I added the "likely" and "unlikely" prefixes for conditional jumps since I don't know of a standard name for them yet.
  • The rampant "_32" and "_64" may seem confusing and/or overkill.  For example, "rg16_32" means that it's a 16-bit register valid in 32-bit mode; "rm16_64" means a 16-bit register or memory operand that can use registers unique to 64-bit mode (either as the operand or in indexing).  This was done because the patterns for REX prefixes seem too irregular to risk not specifying the prefix types, and because of the weirdness with 8-bit registers.
  • I didn't bother putting in info about address size overrides.
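
For anyone parsing the operand names described in the notes above, a minimal Python sketch follows. It assumes every name has the shape kind + size + "_" + mode suffix, based only on the examples given ("rg16_32", "rm16_64"); the real file may contain other kinds, and `OperandType`, `parse_operand_type`, and `may_need_rex` are hypothetical names.

```python
import re
from typing import NamedTuple


class OperandType(NamedTuple):
    kind: str   # "rg" = register, "rm" = register or memory
    bits: int   # operand size in bits
    mode: int   # 32 = valid in 32-bit mode, 64 = may use 64-bit-only registers


# Assumed name shape, inferred from the examples in the post.
_OPERAND_RE = re.compile(r"^(rg|rm)(\d+)_(32|64)$")


def parse_operand_type(name):
    """Parse a name such as 'rg16_32' into its three parts."""
    match = _OPERAND_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized operand type: {name!r}")
    kind, bits, mode = match.groups()
    return OperandType(kind, int(bits), int(mode))


def may_need_rex(operand):
    """A '_64' operand can involve registers unique to 64-bit mode."""
    return operand.mode == 64


print(parse_operand_type("rg16_32"))
print(may_need_rex(parse_operand_type("rm16_64")))
```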

Time for bed now  *yawn*


Neo

Very cool!  That's more the type of data for going the opposite direction (decoding), and it doesn't quite have all of the nitty gritty details that a full encoder will need.  However, I definitely have a use later on for the type of data you've got on the flags that get read and modified and other dependencies, if you're okay with me extracting, reformatting, and redistributing that data for use in my IDE.  :U

I've been really buckling down to try to get the Alpha 3 release out on July 1st (possibly the end of July 1st instead of the beginning), so I've got an updated version of the file attached (still without the encodings verified, though).  The name of the IDE will be changing from PwnIDE to Inventor IDE with this release.  (So far as I know, it should be okay just calling it Inventor, but I don't want to risk annoying Autodesk too much, even though there's really no risk of confusion.)



Edit: The new data I've added is register info and memory operand encoding data, plus a few minor fixes.

[attachment deleted by admin]

MazeGen

Quote from: Neo on June 29, 2008, 10:49:12 PM
Very cool!  That's more the type of data for going the opposite direction (decoding), and it doesn't quite have all of the nitty gritty details that a full encoder will need.

Well, it depends on your encoder's and XML design ;-) My design is more general - everything is generalized to make the reference reasonably small, yet easy to transform to other forms. For example, you have a huuuge list of all possible operands; I use generalized operand codes instead. For ADC rm32_32, rg32_32 you use encoding16="66 11 mr" encoding32="11 mr" encoding64="11 mr", which looks unnecessary to me - in 32-bit mode, the only way to force 16-bit operands is the 66 prefix. I use ADC Evqp, Gvqp - E means reg/mem, G means reg, v means word or doubleword, according to operand size, qp means promoted to quadword in 64-bit mode using REX.W. (You can see that my reference is not limited to 32-bit and 64-bit mode; it also works for 16-bit default operand and address sizes.)
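
To make the "vqp" rule concrete, here is a minimal Python sketch of how the effective operand size could be derived from the mode, the 66 prefix, and REX.W. The function name `operand_size_vqp` and the use of 16/32/64 as mode values are assumptions for illustration; this is not code from either reference.

```python
def operand_size_vqp(mode, prefix_66=False, rex_w=False):
    """Effective size in bits of a 'vqp' operand.

    mode is 16 or 32 for the default operand size outside 64-bit mode,
    or 64 for 64-bit mode (where the default operand size is 32 bits).
    """
    if mode not in (16, 32, 64):
        raise ValueError("mode must be 16, 32, or 64")
    if rex_w:
        if mode != 64:
            raise ValueError("REX.W exists only in 64-bit mode")
        return 64  # REX.W takes precedence over the 66 prefix
    default = 16 if mode == 16 else 32
    if prefix_66:
        # The 66 prefix toggles between word and doubleword.
        return 32 if default == 16 else 16
    return default


print(operand_size_vqp(16, prefix_66=True))  # 32-bit operand in 16-bit code
print(operand_size_vqp(32))                  # default in 32-bit code
print(operand_size_vqp(64, rex_w=True))      # promoted to quadword
```

This lines up with the three expanded attributes quoted above: a 32-bit operand needs the 66 prefix only when the default operand size is 16 bits.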

And yes, the XML is driven by opcode value, however, it is easy to sort it by mnemonic.

Quote from: Neo on June 29, 2008, 10:49:12 PM
However, I definitely have a use later on for the type of data you've got on the flags that get read and modified and other dependencies, if you're okay with me extracting, reformatting, and redistributing that data for use in my IDE.  :U

I am OK with that; however, you should send me the files connected to the reference. That's a part of the license.

Neo

The issues with respect to detail are mostly because of the numerous edge cases, such as instructions that don't use the 66 prefix to change operand size, differing use of REX prefixes between different instructions, instructions that support some operand types in some modes and not others, differing sizes of r/m operands (mostly in MMX/SSE instructions), etc.  One that is absolutely critical is knowing which operand is encoded into the reg field and which is in the r/m field in the case where the allowed operands are both registers.  That's the difference between "mr" and "rm" in my XML data.  Plus there's the craziness with 3 and 4 byte opcodes using prefix codes.
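
A minimal Python sketch of the "mr" versus "rm" distinction for the register-to-register case: the same instruction, ADC EAX, ECX, encoded once with opcode 11 (ADC r/m32, r32) and once with opcode 13 (ADC r32, r/m32). The helper names `modrm_reg_reg` and `encode_reg_reg` are hypothetical, and only the eight legacy 32-bit registers are covered.

```python
# Register numbers for the eight legacy 32-bit general-purpose registers.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm_reg_reg(first, second, layout):
    """Build a ModRM byte (mod=11) for two register operands.

    layout "mr": first operand goes in the r/m field, second in reg.
    layout "rm": first operand goes in the reg field, second in r/m.
    """
    a, b = REG32[first], REG32[second]
    if layout == "mr":
        rm, reg = a, b
    elif layout == "rm":
        reg, rm = a, b
    else:
        raise ValueError("layout must be 'mr' or 'rm'")
    return 0b11_000_000 | (reg << 3) | rm


def encode_reg_reg(opcode, first, second, layout):
    """Opcode byte followed by the register-register ModRM byte."""
    return bytes([opcode, modrm_reg_reg(first, second, layout)])


# Two valid encodings of ADC EAX, ECX:
print(encode_reg_reg(0x11, "eax", "ecx", "mr").hex())  # 11c8
print(encode_reg_reg(0x13, "eax", "ecx", "rm").hex())  # 13c1
```

Swapping the layout without swapping the opcode would silently encode ADC ECX, EAX instead, which is why the data has to record it per encoding.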

The file is actually generated from a file I made pretty much manually that's less than 1/3 the size, containing pattern information for encodings.  Having them expanded out just means that parsing them at run-time is easier.  I'll probably convert the whole shebang into a sort of binary XML format down the road to save disk loading time (since it will get much bigger when I add other data).

Thanks for the permission, and I'll gladly send the files along to you if/when the time comes.  :U

MazeGen

Let's take it as an exercise in how my reference is built :-)

Quote
instructions that don't use the 66 prefix to change operand size

Size of an operand is either fixed, or sensitive to 66 prefix, or to 67 prefix (size of counter in case of LOOP family and JrCXZ), or sign-extended to size of destination operand (ADD EAX, imm8; MOV RAX, imm32), or sign-extended to size of stack pointer (PUSH imm8), or fixed to 32 bits and sensitive to REX.W, or... uh, there are so many possibilities. All of them are encoded so any possibility can be retrieved.
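
The sign-extension cases mentioned here (ADD EAX, imm8; MOV RAX, imm32) can be sketched in a few lines of Python. `sign_extend` is a hypothetical helper written for this illustration; it works on raw bit patterns and returns the unsigned value of the widened result.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend the low from_bits of value to a to_bits-wide pattern."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    signed = (value ^ sign_bit) - sign_bit  # interpret as a signed integer
    return signed & ((1 << to_bits) - 1)    # re-wrap to the destination width


# imm8 widened to a 32-bit destination, as in ADD EAX, imm8
print(hex(sign_extend(0xFF, 8, 32)))
# imm32 widened to a 64-bit destination, as in MOV RAX, imm32
print(hex(sign_extend(0x80000000, 32, 64)))
```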

Quote
differing use of REX prefixes between different instructions

Are there any edge cases? Or exceptions?

Quote
instructions that support some operand types in some modes and not others

This is solved using multiple syntaxes, for example, SMSW Mw (m16) if Mod!=11, otherwise SMSW Rvqp (r16/32/64).

Quote
One that is absolutely critical is knowing which operand is encoded into the reg field and which is in the r/m field in the case where the allowed operands are both registers.

The entry@direction attribute specifies whether the opcode contains the direction bit, which indicates this. If it is not present, the operands are encoded the other way around. (For example, ARPL is encoded the other way.)

Quote
there's the craziness with 3 and 4 byte opcodes using prefix codes.

The entry/prefix element specifies a fixed prefix, and entry/sec_opcd a fixed secondary opcode.

Quote
The file is actually generated from a file I made pretty much manually that's less than 1/3 the size, containing pattern information for encodings.

I thought you made it all manually. Is the original file available somewhere as well?

Quote
Thanks for the permission, and I'll gladly send the files along to you if/when the time comes.  :U

Looking forward to your work :thumbu