gen - a simple lexer generator

Started by Sevag.K, March 12, 2006, 05:41:28 AM

Sevag.K

gen 0.3.001

By Sevag Krikorian
Released as opensource freeware, author takes no responsibility for
the use or misuse of this program.

Adapted from the lexer unit and rw.exe of HLA2.0 sources written
by Randall Hyde.

Description:
gen takes a simple input file that contains keywords and typeclasses,
creates a hash table and lexer sources, and writes two HLA files:
genLexer.hla   this file contains the lexer unit
genLexer.hhf   this file contains the constants and symbols

The generated lexer may then be compiled and linked with any program that
needs to scan for the keywords declared in the gen source file.


External symbols:

procedure lex ( Cursor:dword in esi); @external;

   This procedure is the main lexer.  It takes a pointer to the source
in ESI and begins scanning there.
Returns: EAX = token, EBX = type class
   
procedure extractID; @external;

   For internal use, but may be used externally.
This procedure copies the data at starting address in EDI
and ending address in ESI to two strings: genID (the data as is)
and genlcID (the data converted to lower case).

procedure CheckRW ( src:dword in esi ); @external;

   For internal use.  Checks to see if the cursor at address esi
points to one of the reserved keywords declared in the source.  Returns
the token in EAX and typeclass in EBX, EDI pointing to the start of the
lexeme and ESI pointing just beyond the lexeme.


lineNumber   :uns32; @external;
   
   The user must initialize this to a starting line number (usually 1) before
starting the lexer.  Initially, this value is zero.
   
   
hashValue   :dword; @external;      For internal use

genID      :string; @external;

   When the lexer returns gen_id in EAX and/or genID_tc in EBX,
this string contains the identifier.  Max length = 1024 bytes.

genlcID      :string; @external;
   When the lexer returns gen_id in EAX and/or genID_tc in EBX,
this string contains the identifier converted to lowercase.
Max length = 1024 bytes.

GoodID      :cset; @external;
   For internal use, this character set determines what is a legal
identifier.

genEOF      :dword; @external;
   The user must load this variable with the address of the end of
the source file to make sure the lexer functions properly.


Internal constants:
   gen has default constants it returns when it encounters characters
that are not part of identifiers.

EBX will be zero when one of these constants is returned in EAX.

   gen_error      := $0;      // illegal character encountered
   gen_eof         := $1;      // end of file reached
   gen_id         := $2;      // identifier returned
   gen_number      := $3;      // unused/reserved
   gen_backslash   := $4;      // '\' char
   gen_grave      := $5;      // '`' char
   gen_del         := $6;      // del char
   gen_question   := $7;      // '?' char
   gen_minus      := $8;      // '-' char
   gen_plus      := $9;      // '+' char
   gen_asterisk   := $10;      // '*' char
   gen_lparen      := $11;      // '(' char
   gen_semicolon   := $12;      // ';' char
   gen_lbracket   := $13;      // '[' char
   gen_lbrace      := $14;      // '{' char
   gen_comma      := $15;      // ',' char
   gen_dollar      := $16;      // '$' char
   gen_percent      := $17;      // '%' char
   gen_slash      := $18;      // '/' char
   gen_vertbar      := $19;      // '|' char
   gen_caret      := $20;      // '^' char
   gen_amper      := $21;      // '&' char
   gen_colon      := $22;      // ':' char
   gen_period      := $23;      // '.' char
   gen_equal      := $24;      // '=' char
   gen_exclaim      := $25;      // '!' char
   gen_greater      := $26;      // '>' char
   gen_less      := $27;      // '<' char
   gen_rbrace      := $28;      // '}' char
   gen_rparen      := $29;      // ')' char
   gen_underscore   := $30;      // '_' char
   gen_tilde      := $31;      // '~' char
   gen_atsign      := $32;      // '@' char
   gen_pound      := $33;      // '#' char
   gen_quote      := $34;      // '"' char
   gen_apost      := $35;      // ''' char
   gen_rbracket   := $36;      // ']' char


Reserved typeclass:
   genID (genID_tc)
This tc is returned in EBX when the lexer encounters
an identifier.

   gen (gen_tc)
This tc is returned in EBX if the user has not
declared a typeclass.

At this time, gen expects to see a file called 'gensrc' (no extensions)
in the same directory as gen.  It will process this file to create
the lexer unit and hash tables.

The parser is very simple and contains only a couple of commands.

Use a semicolon ';' for comment lines.

To declare keywords, use:

.keywords { ... }

To declare typeclass use inside the .keywords section:
.typeclass typeclass

That's all there is to it.  The maximum keyword length is 12; the minimum
is, of course, 1.  The maximum number of keywords that may be declared is
1024.

Sample:

; this is a sample gensrc file
.keywords
{
   .typeclass reg32
   eax
   ebx
   ecx
   
   .typeclass instr
   mov
   div
   add
   
   .typeclass stmnt
   if
   else
   endif
   
}


Feed this to gen and it will create a lexer that looks for these
keywords in the hash table.  gen will also create a header file
with tokens and typeclasses as declared in the source.

Type class constants appear with a "_tc" postfix.
Token constants appear with a "tkn_" prefix.

E.g., for the sample above, the constants returned in EAX/EBX will be:

EAX          EBX
tkn_eax      reg32_tc
tkn_ebx      reg32_tc
tkn_ecx      reg32_tc
tkn_mov      instr_tc
tkn_div      instr_tc
tkn_add      instr_tc
tkn_if       stmnt_tc
tkn_else     stmnt_tc
tkn_endif    stmnt_tc

If no .typeclass is declared, gen uses the default "gen_tc".
So, for generation purposes, "gen" and "genID" may not be used
as typeclasses.  There is no error checking for this (at this time)
so beware!
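
The naming rules above can be modelled in a few lines of Python.  This is
only an illustrative sketch, not gen's own code: the function name
token_table is made up, and it assumes keywords are separated by white space
and that comment lines start with ';'.

```python
SAMPLE = """\
; this is a sample gensrc file
.keywords
{
   .typeclass reg32
   eax
   ebx
   ecx

   .typeclass instr
   mov
   div
   add

   .typeclass stmnt
   if
   else
   endif
}
"""


def token_table(gensrc_text):
    """Return (token constant, typeclass constant) pairs for gensrc-style text."""
    words = []
    for line in gensrc_text.splitlines():
        if line.strip().startswith(";"):      # skip comment lines
            continue
        # Make the braces stand-alone words so "{ eax }" on one line also works.
        words.extend(line.replace("{", " { ").replace("}", " } ").split())

    pairs = []
    typeclass = "gen"                         # default when no .typeclass is declared
    inside = False                            # True between '{' and '}'
    it = iter(words)
    for word in it:
        if word == ".keywords":
            continue
        if word == "{":
            inside = True
        elif word == "}":
            inside = False
        elif not inside:
            continue
        elif word == ".typeclass":
            typeclass = next(it)              # the name following .typeclass
        else:
            pairs.append(("tkn_" + word, typeclass + "_tc"))
    return pairs


if __name__ == "__main__":
    for token, tc in token_table(SAMPLE):
        print(token, tc)
```

Feeding it a .keywords block with no .typeclass line pairs every keyword
with "gen_tc", which is why "gen" should not be reused as a typeclass name.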

It is highly advised that genLexer.hhf and genLexer.hla are not modified
by hand unless it is fully determined that gen will no longer be
used to update the lexer sources, since any changes made will be lost
the next time gen is run.
If the user wishes to add constants, the last constant has the id
"tkn_endTknList".
New tokens can then be added in another header file, for example:
const
   tkn_mytoken    := tkn_endTknList + 0;
   tkn_mytoken2   := tkn_endTknList + 1;
   ...

Note that digits in the source are treated as identifier characters as
far as the lexer is concerned.


How To Use Gen:

-write a source file with all the desired typeclasses and keywords
and save it as a text file 'gensrc' (no extensions) in the same
directory as gen

-run gen.  This will generate genLexer.hla and genLexer.hhf

-compile the lexer module to an object file:
hla -c genLexer.hla

This will generate genLexer.obj (or genLexer.o in Linux).

-to use this object in your programs, include the header file:
#include ("genLexer.hhf")

This contains all the constants and external declarations.

-compile your program normally and link with genLexer.obj (or genLexer.o)

hla myProgram.hla genLexer.obj


To use the lexer in your program:
-Load the file to be scanned into memory (or use a memory mapped file)
and load ESI with a pointer to where you wish to begin scanning.
-Load the external variable genEOF with the address of where
you wish to stop scanning.
-Load the external variable lineNumber with a number of your choice;
1 is a good place to start if you are scanning from the start of the
file.  If not changed, gen begins with lineNumber at 0.  The lexer
increments this every time it encounters an LF.
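
As a rough model of that calling contract (a start cursor, a stop address,
and a line counter bumped on every LF), here is a small Python sketch.  It
is not the generated lexer: the name scan is made up, and treating only
ASCII letters and digits as identifier characters is an assumption, since
the real GoodID set is not listed here.

```python
def scan(source, start, eof, line_number=1):
    """Yield (line_number, identifier) for identifiers in source[start:eof].

    start plays the role of ESI, eof the role of genEOF, and line_number
    the role of lineNumber.
    """
    def is_id_char(ch):
        # Assumption: letters and digits only (digits count as identifier chars).
        return ch.isascii() and ch.isalnum()

    cursor = start
    while cursor < eof:                       # never read at or past the stop address
        ch = source[cursor]
        if ch == "\n":                        # LF: bump the line counter
            line_number += 1
            cursor += 1
        elif is_id_char(ch):
            begin = cursor
            while cursor < eof and is_id_char(source[cursor]):
                cursor += 1
            yield line_number, source[begin:cursor]
        else:
            cursor += 1                       # skip anything else in this model


if __name__ == "__main__":
    text = "mov eax\nadd ebx"
    for line, ident in scan(text, 0, len(text)):
        print(line, ident)
```

Passing a smaller eof stops the scan early, which is the job genEOF does
for the real lexer.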


That's all there is to it.  Don't expect great things from this program;
it will not replace flex/bison, nor is it intended to.  Its primary
purpose is to generate a quick lexer program for use with HLA or any
programming language that supports object linking and external symbols.
The native support is for HLA.  To use it with other languages, the constants
in the header must be converted by hand (or modify the source to output
an additional file with the headers of your choice).


Making Changes To Gen:
The gen sources are available and will compile under Win32, and soon under
Linux.  The only thing the Linux version needs is a port of the latest
version of hidelib (for the test suite).
The Win32 version is set up as a HIDE project.  If you have HIDE, download
the latest hidelib.lib from the aoaprogramming files section.  If you have
HIDE 1.1.201+ (not yet released at the time of this writing) this will
not be necessary, as it will already contain the updated files.



Updates:
Gen can now also produce any number of binary search trees on a procedural basis
with implicit calls (that can be automated).  There is also a new .code section
for rewriting some default lexer behaviors.



Randall Hyde

I see you made the maxlength a constant.
Good job.  I really should have done that in RW.HLA.
Cheers,
Randy Hyde

Sevag.K

You did!

Thanks for the input.  I'm also considering combining funcs.hla to give gen the ability to compose binary search trees on a procedural basis.  Those too, I think, can come in handy and aren't quite as big as hash tables.

Sevag.K

New version of gen.

-produce any number of binary search trees in their own namespace
-change default lexer code generation

Sevag.K

New Version of gen

This version of gen adds support for filenames, and a new independent binary search procedure.

The binary search procedure produces individual HLA files containing a single procedure and necessary constants (without the lexer code) ready for including into your programs.

You can now also specify a filename other than the default 'gensrc' on the commandline.


To generate a binary search tree, eg:

.binsearch {
    .prefix prefix
    .funcname procedurename
    keywords list
}

Here is what one file might look like.

.binsearch {
    .prefix tkn
    .funcname binSearch
    list
    of
    words
    that
    will
    go
    into
    the
    binary
    search
    tree
}


The word list can be in any order, one to a line, or many on one line, as long as each is separated by at least one white space.

Running this through gen will produce one file named after the .funcname value plus ".hla", in this case binSearch.hla, ready for including into your programs.  The created procedure, in this case "binSearch", expects one parameter, an HLA string, and preserves all registers except EAX, which holds the return code.

binSearch ( src:string );

Pass a string and binSearch will look it up in the tree and return either 0 (not found) or a token representing the string that was found.
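
The return convention can be modelled in Python as follows.  This is an
illustrative sketch only: gen emits an HLA procedure, not this code, and the
token numbering used here (1-based position in the sorted word list) is an
assumption made so that 0 can mean "not found".

```python
import bisect

# The word list from the .binsearch example, sorted for binary search.
WORDS = sorted([
    "list", "of", "words", "that", "will", "go",
    "into", "the", "binary", "search", "tree",
])


def bin_search(src):
    """Return 0 if src is not in WORDS, otherwise a non-zero token."""
    i = bisect.bisect_left(WORDS, src)        # leftmost position where src could go
    if i < len(WORDS) and WORDS[i] == src:
        return i + 1                          # assumed numbering: never 0 for a hit
    return 0


if __name__ == "__main__":
    for word in ("word", "tree"):
        found = "found" if bin_search(word) else "not found"
        print(word, found)
```

Note that "word" is not in the list (only "words" is), so a lookup for it
returns 0.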

To use in a program:

program hasBinTree;

    #include ("stdlib.hhf")
    #include ("binSearch.hla")

begin hasBinTree;

    if ( binSearch ("word")) then
       stdout.put ( " 'word' is in the binary tree" nl);
    else
       stdout.put (" 'word' not found in binary tree" nl);
    endif;

    if ( binSearch ("tree")) then
       stdout.put ( " 'tree' is in the binary tree" nl);
    else
       stdout.put (" 'tree' not found in binary tree" nl);
    endif;

end hasBinTree;