RFC: Lexor / Parser...

Started by James Ladd, October 15, 2006, 12:30:42 AM


James Ladd

Please describe how you would go about building a very fast lexer/parser where
processing speed is more important than memory.
If you need to know a language in order to make some design decisions, then consider
the lexer/parser to be for the language LISP.

I'm interested in how different people would approach this problem.

Rgs, james.

tenkey

First, optimize I/O. Memory access in any form is much, much faster than disk access. The more bytes you can load into RAM without causing paging, the faster your lexer/parser. Minimize the overhead of API calls.
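As a rough C sketch of that first point (the helper name and file handling here are my own, not from the thread): read the whole source file into one heap buffer with a single read, so the lexer afterwards only ever touches memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into one malloc'd, NUL-terminated buffer.
   Returns NULL on failure; the caller frees the buffer.
   *out_len receives the byte count. */
char *slurp(const char *path, size_t *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long n = ftell(f);              /* file size in bytes */
    if (n < 0) { fclose(f); return NULL; }
    fseek(f, 0, SEEK_SET);
    char *buf = malloc((size_t)n + 1);
    if (buf && fread(buf, 1, (size_t)n, f) == (size_t)n) {
        buf[n] = '\0';              /* lets the lexer scan without a length check */
        *out_len = (size_t)n;
    } else {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    return buf;
}
```

One read call instead of one per token keeps the API overhead tenkey mentions down to a single round trip.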

Second, optimize table insertion and searching. Minimize cache misses by reducing search depth or by clustering.

Finally, see what tradeoffs there are in the different lexing and parsing strategies.
A programming language is low level when its programs require attention to the irrelevant.
Alan Perlis, Epigram #8

Synfire

To expand on what tenkey has said, I suggest using a structure similar to the one below (only larger if need be). Store a pointer to your string in the tok_entry.string item. Hash your string using some simple hashing algo and save that as the tok_entry.hash dword. Then obtain the length of the string and save it in the tok_entry.len byte.

This way you have the option of searching your tokens by length first; once an entry has the same length, you can test its hash against the stored hash to ensure it's the right token. This will increase searching times. You store the string pointer in case you need access to the actual string (since a hash is generally difficult, if not impossible, to revert back to a string). If you want even more speed, I suggest making tok_entry a node of a BST (binary search tree), so that the search itself proceeds logarithmically rather than linearly.

tok_entry STRUCT
hash DWORD ?   ; 32-bit hash of the token text
len BYTE ?     ; length of the token text, checked first
string DWORD ? ; pointer to the token text itself
tok_entry ENDS
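The search order Synfire describes (cheap fields first: length, then hash, touching the string only when the caller needs the text) might look like this in C; the struct layout mirrors the MASM one above, while the function name and linear table are my own illustration:

```c
#include <stddef.h>

typedef struct tok_entry {
    unsigned int hash;    /* 32-bit hash of the token text       */
    unsigned char len;    /* token length, compared first        */
    const char *string;   /* pointer back to the raw token text  */
} tok_entry;

/* Search a table of entries: reject on length, then on hash.
   Returns the index of the matching entry, or -1 if not found. */
int tok_find(const tok_entry *tab, size_t count,
             unsigned int hash, unsigned char len)
{
    for (size_t i = 0; i < count; i++)
        if (tab[i].len == len && tab[i].hash == hash)
            return (int)i;
    return -1;
}
```

The length test is a one-byte compare, so most non-matching entries are rejected without even looking at the hash.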


Also, here is a simple hashing function; it should work fine for what you want:

SimpleHashGen:
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::
; ESI - String Address
; ECX - String Length
; Returns a DWORD hash of the string in EAX.
; Preserves EBX and ECX. EDX is destroyed by MUL and
; ESI is left pointing past the end of the string.
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::
push EBX
xor EAX, EAX
xor EBX, EBX
push ECX
add ECX, ESI           ; ECX = end-of-string address
@@: cmp ECX, ESI
je @F
mov BL, BYTE PTR [ESI] ; fetch next character
shl EAX, 1
add EAX, EBX           ; hash = hash * 2 + char
mov EBX, 0123h         ; any odd constant will do
mul EBX                ; hash = hash * 123h (clobbers EDX)
inc ESI
xor EBX, EBX
jmp @B
@@: pop ECX
add EAX, ECX           ; mix the string length into the hash
pop EBX
ret
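For reference, here is the same rolling hash transcribed into C (my translation, not from the thread); unsigned 32-bit arithmetic wraps on overflow, matching the behaviour of the EAX register:

```c
#include <stdint.h>
#include <stddef.h>

/* hash = ((hash * 2 + char) * 0x123) per byte, plus the length at the end */
uint32_t simple_hash(const char *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = (h * 2u + (uint8_t)s[i]) * 0x123u; /* shl eax,1 / add / mul */
    return h + (uint32_t)len;                  /* add eax, ecx */
}
```

A C version like this is handy for generating the token hashes offline, at table-build time, with the assembly routine doing the same computation at run time.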


Regards,
Bryant Keller

James Ladd

thanks guys.

Synfire wrote:
Quote: This will increase searching times.
Do you mean this, or do you mean decrease times?

hutch--

I have moved this topic to the Workshop, as it was hardly a beginner's question.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

gabor

Hi!

First of all, I hope I haven't misunderstood what this topic is about. My post refers to parsers. I believe that the biggest speed-up comes from a good concept and design, so I'll write about theory only.

As far as I could see, only the storage of the input tokens has been discussed. I think storing and applying the rules is just as important as storing the tokens. I'd like to list three points of my idea for creating a parser:
1. The elements of the language, the tokens, must be stored, and a fast way of accessing them has to be assigned to them.
2. The rules of the language can be implemented either as many IF-THEN-ELSE clauses or, a better approach, as a decision table (like the switch statement in HLLs).
3. The rules should be capable of executing actions, to provide entry points for the user, since the parser itself can be responsible only for the syntax, and the parsed document generally stores information that has to be processed too.

Some more details:
1. There are two major solutions for storage: hashes and retrieval trees (tries). A hash is fast but consumes more memory; trees can be slower but are more space-saving. Of course, a combination of the two is quite a good idea.
Example: to store 10,000 strings
- Create a bucket hash with 100 entries. On average, each bucket will come to hold about 100 colliding strings (strings that receive a hash key that has already been given away).
- The strings that collide are stored in a linked list (in the bucket). The linked list has a tree-like indexing; I would not restrict myself to binary trees...
I am positive that there is no document so huge that it could not be parsed with the parser data held mainly in memory. (I may be wrong...)
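A rough C sketch of that bucket-hash idea (the sizes, names, and the toy bucket hash here are my own assumptions; gabor also suggests tree-indexing inside each bucket, which this sketch simplifies to a plain linked list):

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 100

typedef struct node {
    char *key;
    struct node *next;          /* chain of colliding keys in this bucket */
} node;

typedef struct { node *bucket[NBUCKETS]; } table;

/* Toy bucket hash for the sketch: any string hash mod NBUCKETS works. */
static unsigned buck(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Insert key if absent; returns the node holding it, or NULL on OOM. */
node *tbl_insert(table *t, const char *key)
{
    unsigned b = buck(key);
    for (node *n = t->bucket[b]; n; n = n->next)
        if (strcmp(n->key, key) == 0)
            return n;                   /* already present */
    node *n = malloc(sizeof *n);
    if (!n) return NULL;
    size_t klen = strlen(key) + 1;
    n->key = malloc(klen);
    if (!n->key) { free(n); return NULL; }
    memcpy(n->key, key, klen);
    n->next = t->bucket[b];
    t->bucket[b] = n;                   /* push onto the bucket's chain */
    return n;
}
```

With 10,000 strings spread over 100 buckets, each lookup scans roughly 100 chained entries; replacing the list with a tree index, as gabor suggests, cuts that further.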

2. For this part there is a vast theory again. The best known and most used are state machines (I might be translating the terms wrong, sorry for that). In my XML parser I used an extended finite state machine concept. I wrote about the FSM, the EFSM and the parser in this thread:
http://www.masm32.com/board/index.php?topic=2173.0
There you can find automaton.zip as an attachment; the zip contains source and an explanation too.

3. The EFSM provides four actions: an entry action, an exit action, and two actions associated with transitions (rules).
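A minimal table-driven state machine in C, in the spirit of gabor's description (the states, input classes, and the token-counting action here are invented for illustration; a real EFSM would also carry the entry/exit actions he lists):

```c
typedef enum { S_OUT, S_IN, S_NSTATES } state;     /* outside / inside a token */
typedef enum { C_SPACE, C_CHAR, C_NCLASSES } cls;  /* input character classes  */

typedef struct {
    state next;
    void (*action)(char c);    /* action fired on the transition */
} trans;

static int tokens;                                  /* tokens seen so far */
static void tok_start(char c) { (void)c; tokens++; }
static void no_op(char c)     { (void)c; }

/* The decision table: fsm[current state][input class] -> next state + action */
static const trans fsm[S_NSTATES][C_NCLASSES] = {
    /* S_OUT */ { { S_OUT, no_op }, { S_IN, tok_start } },
    /* S_IN  */ { { S_OUT, no_op }, { S_IN, no_op     } },
};

/* Run the machine over a string; returns the number of tokens found. */
int count_tokens(const char *s)
{
    tokens = 0;
    state st = S_OUT;
    for (; *s; s++) {
        cls c = (*s == ' ') ? C_SPACE : C_CHAR;
        fsm[st][c].action(*s);
        st = fsm[st][c].next;
    }
    return tokens;
}
```

This is exactly the decision-table alternative to IF-THEN-ELSE chains from point 2: adding a rule means adding a table row, not restructuring code.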

I hope this will help a bit  :U

Greets, Gábor

Synfire

heh, yea, decrease.. my bad mate.  :bg