Writing an assembler.

travism · July 02, 2009, 05:20:29 PM

Quote from: hutch-- on July 02, 2009, 04:41:08 PM
travis,

It's a bit to do with what target you have in mind in terms of assembler complexity. At its simplest, a bare mnemonic grinder of MASM 5 technology or lower is a far simpler task than one that adds any pseudo high-level capacity to it. A macro engine is another layer of complexity that collectively makes the result larger and a lot harder to code.

Yeah, exactly, that's why I'm trying to break down these sources into their simplest forms. And I want to start this project right, so that later when I decide to add a macro-type engine I can add it without rewriting the whole thing. I understand the parsing as a whole; it's just me trying to figure out the best way to store the tokens so the parser can read them well and then turn them into an AST and so forth, without getting halfway through and seeing I should have done it a different way. This will be a lot of reading :) Thanks for the input, hutch.

d0d0 · July 02, 2009, 07:16:39 PM

Travis, Big up mon!

I admire you mate. Here are some more links that might be useful:

Check out Randy Hyde's docs on lexers/parsers. I know they're about HLA, but you can learn something from them. There is also a link to a free book on compiler construction; the first few chapters cover the creation of a basic assembler, and it then covers multiple-pass assemblers and macro capabilities.
http://webster.cs.ucr.edu/AsmTools/RollYourOwn/index.html

Check out YASM as well, especially the libyasm library. Maybe you could start with a MASM frontend for YASM.

Good luck mate!

Respect

travism · July 02, 2009, 07:24:25 PM

Hey, thanks for that link d0d0, I'll check it out! Thank you again for everyone's help and support, I really wanna make this assembler right :)

travism · July 03, 2009, 09:42:39 PM

Well, I've got a fairly decent-sized document written on the syntax and design of the assembler. Also, on the note about multithreading: I was thinking that after the source is broken up into tokens and checked for invalid syntax and errors, it starts a thread, and while it's parsing the syntax tree it passes the results off to that thread to generate code, so both are working at the same time? Just an idea. Also, I'm still trying to figure out the best way to store the parse tree and AST. In C I can think of a linked list, so a structure breakdown for each line of source, but I'm wondering about some different ideas for assembly. Just some of my thoughts :) Any input?
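
To make the linked-list idea above concrete, here is a minimal sketch in C of one possible layout: a list node per source line, each holding its tokens in order. The names (Token, Line, tokenize_line, free_line) are hypothetical, and the tokenizer only knows about identifiers, numbers, single-character punctuation, and ';' comments; it is an illustration of the storage scheme, not how any particular assembler does it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: one node per token, one node per source line. */
typedef struct Token {
    char *text;              /* heap copy of the token's characters */
    struct Token *next;
} Token;

typedef struct Line {
    int number;              /* source line number, kept for error messages */
    Token *first, *last;     /* tokens in source order */
    struct Line *next;       /* next source line in the file */
} Line;

/* Append a copy of text[0..len) to the end of the line's token list. */
static void line_add_token(Line *line, const char *text, size_t len)
{
    assert(line != NULL && text != NULL);
    Token *t = malloc(sizeof *t);
    if (t == NULL) abort();
    t->text = malloc(len + 1);
    if (t->text == NULL) abort();
    memcpy(t->text, text, len);
    t->text[len] = '\0';
    t->next = NULL;
    if (line->last) line->last->next = t; else line->first = t;
    line->last = t;
}

/* Split one source line into tokens and return it as a fresh Line node. */
Line *tokenize_line(const char *src, int number)
{
    assert(src != NULL);
    Line *line = calloc(1, sizeof *line);
    if (line == NULL) abort();
    line->number = number;
    while (*src) {
        if (isspace((unsigned char)*src)) { src++; continue; }
        if (*src == ';') break;              /* comment runs to end of line */
        const char *start = src;
        if (isalnum((unsigned char)*src) || *src == '_') {
            while (isalnum((unsigned char)*src) || *src == '_') src++;
        } else {
            src++;                           /* ',' '[' ']' etc. stand alone */
        }
        line_add_token(line, start, (size_t)(src - start));
    }
    return line;
}

/* Release one Line node and all of its tokens. */
void free_line(Line *line)
{
    if (line == NULL) return;
    for (Token *t = line->first; t != NULL; ) {
        Token *next = t->next;
        free(t->text);
        free(t);
        t = next;
    }
    free(line);
}
```

Because every line gets its own node, tokenizing the next line never touches the previous one; the Line nodes can be chained through their next pointers and handed to the parser afterwards.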

Neo · July 04, 2009, 07:33:33 AM

In making the built-in assembling for Inventor IDE, I've put together a huge data file containing all of the instruction encodings expanded out. It took months of data entry to make a file detailed enough to use for assembling all 16-bit, 32-bit, and 64-bit versions of instructions, but it lets you easily generate huge test cases to exhaustively check against other assemblers (and find bugs in them :wink). If you'd like to use it, just let me know. It's in the download of Inventor IDE as ASM.xml.

You can find the code I use for compiling here in Encoder.java, asm/Encoder.java, and bits and pieces around there. It kind of depends on the code already being fully parsed, though (in ASMLine.java and ASMParse.java). However, the way I've set it up has the advantage that it's very easy to parallelize (most of it is already), and it makes other things easy too, like crawling through the code to include only things that are referenced from a specified entry point (for the automated performance test/analysis feature I'm working on for a demo on Tuesday :toothy).

For the record, I don't use a parse tree, since assembly without the high-level macros isn't very tree-like; I keep track of lines, since that's how they're edited anyway, but you may want to do things differently since you're not going to do editing. When I add support for C, I'll keep track of a parse tree (but just inside of functions, 'cause elsewhere in Inventor IDE, it's not necessary to parse much after loading). The only thing tree-like I do is parsing of global variables that are structure types, but that's just a simple recursive descent parsing whenever it needs to be evaluated instead of keeping it parsed.

Cheers! :U

travism · July 08, 2009, 03:38:16 AM

Hey, thanks for the information! I'm not very good with Java at all lol, I tried but failed. I'm having some trouble finding information on encoding the instructions, and I'm really not sure of the most efficient way. I really don't think it would be efficient entering all the hex opcodes and cmps and jmps lol. Anyone know of any docs on it? I've read the encoding part in the Intel manuals. :\
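
One common alternative to hand-writing a chain of compares and jumps per instruction is a lookup table. The following is a minimal sketch in C covering only the "r/m32, r32" register-to-register form of a few instructions in 32-bit mode. The table layout and the names (RegRegForm, FORMS, encode_reg_reg) are hypothetical; the opcode bytes and the ModRM layout (mod = 11, reg = source, r/m = destination) are the ones given in the Intel manuals for this form.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 32-bit register names, indexed by their ModRM register number. */
static const char *const REG32[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

/* Hypothetical table row: a mnemonic and the opcode of its "r/m32, r32" form. */
typedef struct {
    const char *mnemonic;
    unsigned char opcode;
} RegRegForm;

static const RegRegForm FORMS[] = {
    { "add", 0x01 }, { "or",  0x09 }, { "and", 0x21 }, { "sub", 0x29 },
    { "xor", 0x31 }, { "cmp", 0x39 }, { "mov", 0x89 },
};

/* Return the register number for a name, or -1 if it is not a 32-bit register. */
static int reg32_index(const char *name)
{
    for (int i = 0; i < 8; i++)
        if (strcmp(name, REG32[i]) == 0) return i;
    return -1;
}

/* Encode "mnemonic dst, src" with both operands 32-bit registers.
   Writes 2 bytes to out and returns 2, or returns 0 if nothing matches. */
size_t encode_reg_reg(const char *mnemonic, const char *dst, const char *src,
                      unsigned char out[2])
{
    assert(mnemonic != NULL && dst != NULL && src != NULL && out != NULL);
    int d = reg32_index(dst);
    int s = reg32_index(src);
    if (d < 0 || s < 0) return 0;
    for (size_t i = 0; i < sizeof FORMS / sizeof FORMS[0]; i++) {
        if (strcmp(mnemonic, FORMS[i].mnemonic) == 0) {
            out[0] = FORMS[i].opcode;
            /* ModRM: mod=11 (register direct), reg=source, r/m=destination */
            out[1] = (unsigned char)(0xC0 | (s << 3) | d);
            return 2;
        }
    }
    return 0;
}
```

With this table, "mov eax, ebx" comes out as the bytes 89 D8. A real encoder needs more columns per row (operand kinds, immediate sizes, prefixes), but adding an instruction stays a matter of adding data rather than code.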

bruce1948 · July 08, 2009, 11:08:04 PM

This might help

[attachment deleted by admin]

Neo · July 09, 2009, 09:49:02 AM

Quote from: travism on July 08, 2009, 03:38:16 AM
Hey, thanks for the information! I'm not very good with Java at all lol, I tried but failed. I'm having some trouble finding information on encoding the instructions, and I'm really not sure of the most efficient way. I really don't think it would be efficient entering all the hex opcodes and cmps and jmps lol. Anyone know of any docs on it? I've read the encoding part in the Intel manuals. :\

I didn't mean to imply that you should have to enter all that data too. :wink Here's the ASM.xml file that's included with Inventor IDE. It should have most everything you need that doesn't involve directives or macros, in a format that's not too hard to handle. Everything, that is, unless you want to implement a really fancy error/warning system (like tracking dependency chains to find uninitialized values or ignored values). I'd post the main huge test case too, but even compressed, it's 1MB, and you probably won't need it for a while. We can compare our results when the time comes.

Edit: whoops, forgot to attach it, hehe.

[attachment deleted by admin]

travism · July 09, 2009, 07:46:57 PM

Neo, thanks again for all the help. bruce, I have that downloaded and it is a vital source, thank you! What I'm trying to work on is this: I read in the file, break it into tokens and save them to a syntax tree, which is a structure of the grammar. But then when you move to the next instruction, it will overwrite the current instruction in the syntax tree, so that's why I thought it went through each phase with each instruction first and then moved on to the next... What is the proper way of using a syntax tree?

travism · July 13, 2009, 09:27:25 PM

Does anyone have any more sources? I have read PDF after PDF about lexical analysis and parsing, but nothing actually explains how it is used in programming, or even in the sense of writing a compiler... I might just have to put this project on hold since no one has really written too much about it.. :\

dedndave · July 13, 2009, 09:38:33 PM

in the days of DOS, i have written probably hundreds of command-line parsers
i realize that is nothing like what you are up against, but i have a feel for parsing
i think the first "beautifier" pass is what you are talking about
it needs to convert tabs to spaces, eliminate extra spaces, strip out comments, tokenize instructions and directives, and terminate lines
there is no magical formula for all that
in real-mode DOS, using lodsb, stosb, and loop worked fairly well
with the extended set of pentium instructions, there are probably better ways to do it - i have not learned all these instructions, yet
if i were in your shoes, i would take a good look at source code for other assemblers to get ideas and concepts
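
The clean-up steps listed in the post above (tabs to spaces, extra spaces removed, comments stripped, lines terminated) can be sketched in C roughly as follows. The function name clean_line is hypothetical, and the sketch assumes ';' always starts a comment, so it would mangle a ';' inside a quoted string; a real pass would have to track quotes.

```c
#include <assert.h>
#include <stddef.h>

/* Clean one source line in place and return its new length.
   - tabs and runs of spaces collapse to a single space
   - leading and trailing whitespace is dropped
   - everything from ';' onward (the comment) is dropped
   - the line is terminated at the first CR or LF
   Assumption: no string literals, so ';' is always a comment start. */
size_t clean_line(char *line)
{
    assert(line != NULL);
    size_t out = 0;
    int pending_space = 0;   /* whitespace seen, not yet written */

    for (size_t in = 0; line[in] != '\0'; in++) {
        char c = line[in];
        if (c == '\n' || c == '\r' || c == ';')
            break;
        if (c == ' ' || c == '\t') {
            pending_space = 1;
            continue;
        }
        /* Emit one space before this character, but never at line start. */
        if (pending_space && out > 0)
            line[out++] = ' ';
        pending_space = 0;
        line[out++] = c;
    }
    line[out] = '\0';
    return out;
}
```

Given the buffer "\tmov   eax,\tebx   ; copy\n", the function leaves "mov eax, ebx" in place and returns 12. Writing in place is safe here because the output index never runs ahead of the input index.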

travism · July 13, 2009, 09:44:25 PM

Yeah, that's what I have been trying to do, but so many of them are highly optimized and very advanced, and I'm just trying to find the basics. Basically the only part I'm having trouble understanding is this: if you have mov eax,ebx or whatnot, of course you would move that into a structure for your syntax tree, but as soon as you hit the next statement it would overwrite that structure with the new information. That's why I thought you actually go through the whole process, tokenize, parse and encode, one line at a time..

dedndave · July 13, 2009, 09:51:59 PM

try to dynamically allocate memory for the parsed output
(one of those functions lets you "grow" an allocation - don't recall which one)
managing memory throughout the assembly process is going to be tricky
this first pass is an example - as your output data grows, your requirement for input space diminishes
i dunno if i would try to validate any code on that pass or not (that is one thing i would look to others for example)
of course, if it isn't a valid instruction or directive, it must be a label - lol - i dunno
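
As an illustration of growing an allocation for the parsed output, here is a minimal sketch in C using the standard realloc function (the post above does not say which function was meant, so this is one option, not necessarily that one). The Statement fields and the names StatementList, list_push, and list_free are hypothetical. Each parsed statement is copied into its own slot, so parsing the next line does not overwrite the previous one.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one parsed two-operand statement. */
typedef struct {
    char mnemonic[8];
    char dst[8];
    char src[8];
    int  line;           /* source line number */
} Statement;

/* Growable array of statements; start it out zero-initialized. */
typedef struct {
    Statement *items;
    size_t count;
    size_t capacity;
} StatementList;

/* Copy *s onto the end of the list, doubling the buffer when it is full.
   Returns 1 on success, 0 if memory could not be obtained. */
int list_push(StatementList *list, const Statement *s)
{
    assert(list != NULL && s != NULL);
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 16;
        Statement *grown = realloc(list->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return 0;    /* the old buffer is still valid and unchanged */
        list->items = grown;
        list->capacity = new_cap;
    }
    list->items[list->count++] = *s;
    return 1;
}

/* Release the buffer and reset the list to empty. */
void list_free(StatementList *list)
{
    assert(list != NULL);
    free(list->items);
    list->items = NULL;
    list->count = 0;
    list->capacity = 0;
}
```

One caveat of this design: realloc may move the buffer, so other structures should refer to statements by index rather than by pointer.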

travism · July 13, 2009, 09:56:23 PM

Wow, I feel silly, I completely forgot I wrote down how I was going to tokenize the file: it saves each structure of grammar to memory and then parses it, checking it for errors etc... :| Thanks again for your help lol

dedndave · July 14, 2009, 05:18:42 AM

hiya Travis
i know you feel as though you have seen all the docs and pdf's you ever want - lol
here are a couple short ones that i found and thought you might appreciate
http://gec.di.uminho.pt/Discip/Lesi/AC10203/docs/P4ISAformat.pdf
http://webster.cs.ucr.edu/AoA/Windows/HTML/ISA.html
