XML parser lib

gabor · July 06, 2005, 09:25:22 AM

Hello!

I've been planning to develop an XML library for some time, I bumped into needing it over and over again. I read some topics about compilers and even about an XML parser for HLA, but none of them was trying to achieve what I have in mind.
This are my suggestions about a usable XML parser
- easy to use
- fast
- easy to update/extend (supports the changes W3C might introduce)
- FREE

From the infos I gather so far two different type of parsers emerged:
1. traditional:
The parsing is event-driven, events are reading a starting/ending tag and reading CDATA.
The corresponding handlers would be:

Code Select



int StartTag(tag String) - when the parser processes a <tag> string calls StartTag(tag)
int CData(data String) - when the parser processes text between tags calls CData(text)
int EndTag(tag String) - when the parser processes a </tag> string calls EndTag(tag)

The int return value can be the error code.

2. C# like:
The whole XML document is loaded into a variable, fully dynamically. The variable is a memory buffer, that can be accessed via casting a pointer to the start address. This way an overlay is applied to the buffer.

Code Select


DWORD ImportXML(lpBuffer,lpFILE XMLdoc)

MyStruct    STRUCT
            main db 40 dup(0)
            tag1 dd 0
            tag2 dd 0
MyStruct    ENDS

ASSUME esi:PTR MyStruct
mov    esi,lpBuffer

mov  eax,[esi].tag1
...

The return value can be the error code again.

What do you say to this ideas? How sane are they? Please give your comments!

Greets, Gábor

Tedd · July 06, 2005, 10:18:05 AM

In many ways I would prefer the first method. There are lots of situations where the entire xml document isn't actually available - it's streamed - meaning the second method is entirely inappropriate.
Another idea would be to have it represent the document in a tree structure, but in such a way that new branches are added as the rest of the document is received; if the entire document is available at once, then this is no problem anyway. So as you add data to the document, it's automatically parsed and the tags are used to create the tree structure. This also means that editing operations become much simpler (if you need that.)

I have had the intention to write one at some point too, so if you want a partner message me :wink

Eóin · July 06, 2005, 11:15:07 AM

The first way (called SAX I believe) is generally considered to most efficient and is clearly to the which would suit asm the best. Maybe you've already done so, but check out the code for TinyXML, its a good and simple example of the second way you mention (the DOM way) of doing things.

tenkey · July 07, 2005, 03:23:15 AM

I think that both streaming and document modes are needed.

The streaming mode is for efficiency when processing can be serialized for streaming.

The document mode is needed for random access.

gabor · July 07, 2005, 09:37:45 AM

Hi!

I've considered your posts, many thanks for them, and I've decided to create the procedures for the first version. There are two reasons for this beside you preference:
- the parsing is mainly based on finite state machines, and I've learnt much about them
- this is a basic solution, based on it the 2nd version can be implemented
So, first I'll create a method to parse a really simple XML, no namespaces, no encoding check yet, just the skeleton of the event-driven functionality.

I'll write a short description about my design and post it here later to get your comments and opinion.

Greets, Gábor

Rifleman · July 07, 2005, 05:58:05 PM

gabor,
As a Software Engineer, I have some interest in FSM as a modeller and I am curious. Which of the four methods do you plan to implement? Probably Exit Action?

greetz,
PBrennick

Rifleman · July 07, 2005, 10:01:30 PM

Gabor,
A finite state machine (FSM) is a model of behaviour composed of states, transitions and actions. A state stores information about the past, i.e. it reflects the input changes from the system start to the present moment. A transition indicates a state change and is described by a condition that would need to be fulfilled to enable the transition. An action is a description of an activity that is to be performed at a given moment. There are several action types:

Entry action - execute the action when entering the state
Exit action - execute the action when exiting the state
Input action - execute the action dependant on present state and input conditions
Transition action - execute the action when performing a certain transition

What I am referring to is a model based upon the action you wish your automaton (the program) to impose upon the real world (the user). I guess this is more highbrow talk and less about coding but it is good to understand what a finite state machine is all about. It is more a discussion on the 'user interface.' Most people unknowingly use 'Input Action.' If your code is a library, there is no 'Input Action' because the user is another automaton (another program). This means, if your library is going to work similar to the way an overlay works (as a sequential extension to the calling automaton, then you are using an 'Exit action.' If your library is going to be an accessable series of procedures, then it would be a 'Transition action.' A library would never use an 'Entry action' unless the calling automaton is asking it to function as a series of APIs, that would be an 'Entry action.' Did I make this understandable?

Write your code, it will work I bet!
PBrennick

gabor · July 10, 2005, 10:54:06 PM

Hi folks!

I created a small program to test and demonstrate my idea about finite state machines. I've already designed an extended version of the FSM, I'll write a test applicatioin for it too.
The FSM in the code does not use actions, just steps into a state if a rule exists which allows that step.
In the zip there are all files needed. The automaton.inc and automaton.asm do the whole thing, I will update these files with new code for other automaton types.

Finally I would like to share my 3 ideas about how to extend this simple modell:
- introduce actions (like what Riflman wrote about)
- use local transition tables one for each state instead of one big global table,
this local tables are specific for each state,
the state can be identified by the address of the transition table, the traditional state code may become obsolite.
- 3 different type of transition rules
normal: this is the traditional rule, the condition for the input is a single symbol - if (input == 'a') State=A
enum: enumeration of symbols allowing the transition - if (input in {'1','2','4','8')) State=pow2
interval: the input must be within the interval - if (input in [a,z]) State=lowerCase

The EFSM demo is ready, the attachment now contains both, FSMtest and EFSMtest.
I rewrited the EFSM codes, now the newest version is available in the zip!
The EFSM is ready featuring normal, interval and enum input matching.

Greets, Gábor

[attachment deleted by admin]

gabor · July 18, 2005, 09:16:41 AM

Hello friends!

"I proudly announce" that my XML parser is finally completed!

The zip contains all source and the make files. I've made a small application to test the parser.

The parsing is done by an instance of the XMLParser class. So, first a new parser instance must be created by calling
XML_createParser(lpStartHandler:DWORD,lpCDHandler:DWORD,lpEndHandler:DWORD)
the parameters are pointers to procedures, the procedures must be defined like this:
- StartHandler(DWORD lpTAG,DWORD lpAttributeList)
- CDataHandler(DWORD lpCData)
- EndHandler(DWORD lpTAG)

The AttributeList has the following form:

Code Select


1st attribute name,1st attribute value,2nd attribute name...nth attribute name,nth attribute value,DWORD PTR 0

The members of the list are unicode strings.

The XML_parse(lpData:DWORD) method of the parser instance can be used for parsing.
The parameter lpData is a pointer to the XML document.

For every opening tag the StartTagHandler is called with the appropiate parameters.
For every encapsulated character data (CData) the CDataHandler is called with the appropiate parameter.
For every closing tag the EndTagHandler is called with the appropiate parameter.

The user receives the tags, aatributes and char data in unicode strings.

The test application is simple, it creates MessageBoxes displaying this strings in every handler procedure.
Important that the parser uses unicode input only. Every document that uses different encoding must be converted!
The test application makes only ascii->unicode conversation.
A big handicap exists in this version, it uses a small 64kB buffer for the unicode text. This means 32k characters only. I am working on a reflexive version...
Still to do:
- handling different encodings (this version supports unicode)
- more strict tests (please help testing)
- parsing from a buffer that is filled only with a portion of the document, when the buffer is completly processed a new read is initiated until end_of_file.

I will continue this project after every known lacks are eliminated, so please help me to find the bugs!

Greets, Gábor

PBrennick · August 26, 2005, 03:19:45 PM

gabor,
Very nicely done and I appreciate your nice comments (in a PM) about my help when I was the Rifleman. Keep up the good work here and also in your SGN topic.

Paul

gabor · November 28, 2005, 01:43:36 PM

Hello!

I found that my former parser had some serious errors and it lacked some important features. I considered the necessary modifications and decided to rewrite the whole library.
Now I've reached a milestone, here are the features working:
- buffered reading of xml input
- handling of ASCII and Unicode encoding
- supports event driven parsing, every XML event (StartTag, CDATA, EndTag) has a callback function that the user must supply

Missing:
- creating the attribute list of start tags and passing it to the callback function
- command handling (command tags start with ? and don't have end tags, like <?xml...?>)
- checking that there is only one top element

About command handling I have the following idea:
- the parser should have an extra function for commands
- when the parser reads a command tag (a tag in the form of '?xxx' - xxx is a command name, like xml or php) passes the command name to the command handler, that must return the address of a callback function related to the command. When no function is associated with the command the handler must return NULL.
To this callback function every input characteris passed until the end of the command the '?>' sequence is reached.
Example:

Code Select


<?myCommand command1, parameter11,parameter12
command2, parameter21,parameter22 ?>

First, myCommand is passed to the command handler. The command handler puts the address of a function (that interprets command1 and command2) into the lpCommand attribute. The character stream starting with command1 and ending with parameter22 is passed to this function, so it can process the command's body.

The address of the actual callback function is stored in the lpCommand attribute of the XML parser instance.

What do you think? Is this a good concept? Or is it too comlicated? Do you have any suggestions?

I add the missing stuff this week, and I would like to have some help in testing. After I am fully finished I'll try to write a brief and usefull doc (mainly a well commented inlcude file will do).

I appreciate all of your comments!

Greets, Gábor

sluggy · November 29, 2005, 12:46:00 AM

gabor,
good work :)

I would suggest you use NUnit for the testing. With this, you can write a bunch of C# or C++ functions that call your library, you can dictate the input, then test that the output is what was expected. It should only take you an hour or so to "learn" how to write an NUnit test module and get it going. This sort of testing is excellent for picking up errors, can be run in an automated mode, and is excellent for ensuring that new code doesn't break what was already there. It is also a valuable addition to relying on humans to do testing :lol

gabor · November 30, 2005, 08:36:32 AM

Hi!

2 of the 3 missing features are covered, hopefully without errors:
- attribute handling
- command handling

2005.12.21: I created a document that describes the usage of this lib. I also made some changes to the code.

2006.02: Next step is to include namespace handling (won't be difficult).

Greets, Gábor

gabor · July 28, 2006, 11:26:19 AM

Hello!

First of all: to stop the confusion about the versions of my code I detached the zip files from every posts except this last post. The current version is of 2006.07.29.

After several month I pulled out my XML parser code. I corrected some major bugs, caused by design mistakes. I added the missing comment handling too. Now, the parser is more stable and is nearer to perfection :bg
Still missing:
- XML command handling (the processing of the <?xml tag)
- namespaces

I am sure there are still errors lurking in my code, so please help me testing it!

The readme docu has't changed, I hope it gives enough explanations. The executable is just a wrapper application to test the parser functions. The xml.asm source in the zip is the most important part of the tester application. It contains the 3 XML event handler showing an example how the parser can be used.

I await your feed backs!

Greets, Gábor

ps: I am working on a DOM like XML processing as well. This subproject needs graph algos wich proved to be a bigger project... Comes later!

gabor · September 06, 2006, 03:52:12 PM

Hello!

I made a few minor modifications, added a second example (stats.exe) and modified the readme docu.
This example is a step to the DOM type of parsing. It gathers information about the nodes (objects described by the tags) of the XML document. After collecting the infos it is possible to build up the graph representation via static methods.

The last version is attached here.

Greets, Gábor

[attachment deleted by admin]

News:

XML parser lib

Eóin

Rifleman

Rifleman