Writing a parser

DarkWolf · April 24, 2005, 01:20:56 AM

I have an interest in writing an XML parser.
Any suggestions on good resources for this, or for parsers in general?

<edit lastmodified="7/27/2005 9:12:37 PM">
I'll try to post the most recent version of the project here....
Fair warning: you may or may not be able to actually compile anything.

Generic makefile added, though that doesn't mean anything will actually work :)
Batch file too.
Still looking for some info on Unicode support (UTF-8/16).
</edit>

[attachment deleted by admin]

bozo · April 25, 2005, 02:02:01 AM

Would looking at how web browser source code parses XML be helpful?
I'm thinking about the open source browser Firefox.

Eóin · April 25, 2005, 10:59:16 AM

I would think this could be a very helpful project for the Asm community (assuming you share it :wink ). Looking at other open source examples could certainly help. In theory you should watch out for license issues, but honestly I think that if you're only getting basic ideas then even GPL doesn't mind.

More importantly perhaps, you should get an idea of how other parsers work on the outside. What do they offer? How do they present the data? This will be much more useful to you.

Some links to check out: TinyXML, Libxml2, eXpat, and a search of SourceForge.

DarkWolf · April 25, 2005, 06:39:56 PM

Not sure if browser source code will help.

Despite all the hype W3C gives regarding their 'recommendations', software vendors are still going their own way.

None of the browsers support XML as much as they claim.
Opera won't even touch XSL-FO.
Microsoft went their own way with XML Data instead of XML Schema.
And Mozilla, even though they are a little better, want to use their own syntax, XUL, on top of RDF on top of XML.

Though Eoin's suggestion of looking at libxml sounds good.

DarkWolf · April 28, 2005, 01:34:05 AM

Well that was mostly a rant.

Thanks for the links, Eoin.
I am going to be looking at these more closely when I can.

gabor · April 28, 2005, 10:07:40 AM

Hi!

I have a question that can possibly help the work:

Which parser would be more usable? (Or maybe both could come in handy.)

1. An older PHP style parser with three event handlers:

start_element() char_data() end_element()
or
2. The newer XML import/export like in .NET: an XML document can be imported into a variable, and this variable will then have the structure described in the XML. A variable can be exported the same way.
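
To make option 1 concrete, here is a minimal sketch of the three-handler style. It uses Python's standard `xml.parsers.expat` module purely as a stand-in for illustration; the helper name `collect_events` is hypothetical and not part of any library discussed in this thread.

```python
import xml.parsers.expat

def collect_events(xml_text):
    """Parse xml_text and return the events seen by the three handlers."""
    events = []

    def start_element(name, attrs):
        # Called when an opening tag has been read.
        events.append(("start", name, dict(attrs)))

    def char_data(data):
        # Called for text between tags; whitespace-only runs are skipped here.
        if data.strip():
            events.append(("text", data))

    def end_element(name):
        # Called when a closing tag has been read.
        events.append(("end", name))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end_element
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return events
```

The parser never builds a tree: the caller only sees a stream of start, text, and end events and decides what to keep.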

Since an XML parser lib would be a really great thing to have in the MASM community, please support this work and give ideas and tips!

Greets, gábor

James Ladd · April 28, 2005, 08:49:25 PM

DarkWolf,
I would look at the Java language and sites for how they handle parsing XML.
I'd say the majority of use and focus is there at present.
They support two styles of parser, and both are good for different reasons.
One does callbacks, so as each element is completed you get called back with the new element,
and another parses the whole document before you can deal with it.
Hopefully both will be provided by your parser. I'd supply the callback-per-element parser if
you did not want to do both up front.
Have a look at things like JAXP, JAXM and Xerces for XML parsing information and styles.
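
As a rough illustration of the second style (parse the whole document, then work with it), here is a small sketch. It uses Python's standard `xml.dom.minidom` rather than the Java APIs named above, and the function name `titles_from_tree` and the `title` element are made up for the example.

```python
from xml.dom import minidom

def titles_from_tree(xml_text):
    """Build the whole document tree first, then query it."""
    doc = minidom.parseString(xml_text)  # the entire document is parsed here
    # Only after parsing finishes can the tree be walked or searched.
    return [node.firstChild.data for node in doc.getElementsByTagName("title")]
```

The cost is that the full tree sits in memory; the benefit is that any part of the document can be revisited in any order.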

DarkWolf · April 30, 2005, 03:56:48 AM

striker:

You are talking about SAX vs. DOM, correct?
As far as an iterator vs. a treewalker goes, I don't really care, though I am leaning towards the iterator so that only the portion you are working with is parsed instead of the whole document, which is a waste of memory if you're not working on the whole document.

But my idea was a library of code so that a programmer could choose whichever style they preferred.
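
A minimal sketch of the iterator idea, assuming Python's standard `xml.etree.ElementTree.iterparse` as a stand-in; `first_match` is a hypothetical helper name. The caller pulls events one at a time and stops asking as soon as it has what it needs.

```python
import io
import xml.etree.ElementTree as ET

def first_match(xml_bytes, tag):
    """Return the text of the first element named tag, or None if absent."""
    source = io.BytesIO(xml_bytes)
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Stop here: no further events are requested from the parser.
            return elem.text
    return None
```

In this sketch the rest of the document is never turned into events once a match is found, which is the memory argument made above.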

gabor:

I am not familiar with PHP or .NET, but you were suggesting something similar, right?
*****************

I got the sources to libxml, tinyxml and xerces.
I'll be looking at them to see how to go about this.
The idea was to make a library like the HLA standard library but for XML.

James Ladd · April 30, 2005, 11:07:35 PM

DarkWolf,
Yes, I am speaking of SAX vs DOM.
There are situations where you want to have the whole document, and times when you do not
because you are just looking for a fragment.
I would need a library that did both if I needed a library.

DarkWolf · May 04, 2005, 12:19:53 AM

Yes, I am considering both as well.

Use whatever method is needed for the task at hand.
It seems a lot of other projects either use mostly DOM or both methods.
Might as well implement both ourselves.

I have been going over the tinyxml code and the XML 1.1 recommendation.
Been looking to see how to implement each of the "productions" though I admit I haven't come up with much yet.

DarkWolf · May 14, 2005, 05:39:01 AM

I have started to consider how to best write functions to support the W3C Recommendation.

The main idea is to have a library similar to the HLA stdlib: I want to have the headers and libs so that you can 'assemble' whatever parser you want. There are a few things I am trying to wrap my head around, such as how to tell a program to read the string following '<?' to see if it's a valid prolog. Simple idea, but it's stumping me :(

I have heard from one person about working on a project. Anyone else?
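
One way to approach the '<?' question is to scan the name that follows it and compare that name with `xml`. The sketch below is written in Python only to show the logic; `pi_target` and `starts_with_xml_decl` are hypothetical names, and it checks just the opening of the prolog, not the attributes inside it.

```python
def pi_target(text, pos=0):
    """Return the name right after '<?' at pos, or None if '<?' is not there."""
    if not text.startswith("<?", pos):
        return None
    start = pos + 2
    end = start
    # The name runs until white space or the '?' of a closing '?>'.
    while end < len(text) and text[end] not in " \t\r\n?":
        end += 1
    return text[start:end]

def starts_with_xml_decl(text):
    """True only when the document opens with '<?xml' as a complete name."""
    return pi_target(text) == "xml"
```

Comparing the whole name matters: `<?xml-stylesheet ...?>` also begins with `<?xml` but is an ordinary processing instruction, not the declaration.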

Sevag.K · May 14, 2005, 10:32:00 PM

What exactly are you trying to do? I've done quite a bit of work on writing source scanners.

The most important concern is the format of your source. If it's standard ASCII, you'll have
the full power of HLA pattern matching and string functions.

gabor · May 15, 2005, 07:48:49 AM

Hi!

The format of the source varies; this is what is described in the "header" of an XML document.

For instance,
<?xml version="1.0" encoding="ISO-8859-2"?>
is for the ISO Central European character set.

Greets, gábor

Sevag.K · May 15, 2005, 04:29:46 PM

I don't know much XML, so bear with me.

I see three keywords:
xml
version
encoding

Are these all the possible keywords or is the number of keywords unknown?
Can there be data without keywords?
Are the values of keys always in double quotes ("")?

Also, each keyword is followed by '='; is this always the case?
Can there be white space between a keyword and the '='?
Is the data always encapsulated between <? and ?>?

Answering all these questions will narrow down what needs to be done to parse the prolog.

chep · May 15, 2005, 05:33:59 PM

Many parsers use a single Unicode encoding internally (UTF-16 in Xerces, UTF-8 in libxml2), so you might want to convert the file as soon as you have extracted the encoding.
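
A minimal sketch of that conversion step, assuming the encoding name has already been pulled out of the declaration. It is written in Python for brevity, `to_utf16` is a hypothetical helper name, and little-endian UTF-16 without a byte order mark is an arbitrary choice made for the example.

```python
def to_utf16(raw_bytes, declared_encoding):
    """Re-encode a document from its declared encoding to UTF-16 (little-endian)."""
    # Step 1: interpret the raw bytes using the encoding named in the declaration.
    text = raw_bytes.decode(declared_encoding)
    # Step 2: write the same characters out in the parser's internal encoding.
    return text.encode("utf-16-le")
```

For example, the byte 0xF5 in ISO-8859-2 is the letter ő (U+0151), which becomes the two bytes 0x51 0x01 after conversion.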

@Sevag.K:

Taken from the spec:

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
S    ::=    (#x20 | #x9 | #xD | #xA)+ 

XMLDecl      ::=    '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>' 
VersionInfo  ::=    S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"') 
Eq           ::=    S? '=' S? 
VersionNum   ::=    '1.0'
EncodingDecl ::=    S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )  
EncName      ::=    [A-Za-z] ([A-Za-z0-9._] | '-')*                             /* Encoding name contains only Latin characters */
SDDecl       ::=    S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))

Hopefully there is no need to handle entities here so you're done with the optional declaration :P
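
The productions above translate fairly directly into a pattern. Here is a hedged sketch using Python's `re` module; `quoted`, `XML_DECL`, and `parse_xml_decl` are names invented for this example, and it only handles text that has already been decoded to characters.

```python
import re

S = r"[ \t\r\n]+"                 # S  ::= (#x20 | #x9 | #xD | #xA)+
EQ = r"[ \t\r\n]*=[ \t\r\n]*"     # Eq ::= S? '=' S?

def quoted(group, inner):
    """Match 'inner' or "inner"; the closing quote must equal the opening one."""
    return rf"""(?P<{group}_q>['"])(?P<{group}>{inner})(?P={group}_q)"""

XML_DECL = re.compile(
    r"<\?xml"
    + S + "version" + EQ + quoted("version", r"1\.0")
    + "(?:" + S + "encoding" + EQ
    + quoted("encoding", r"[A-Za-z][A-Za-z0-9._\-]*") + ")?"
    + "(?:" + S + "standalone" + EQ + quoted("standalone", "yes|no") + ")?"
    + "(?:" + S + ")?"
    + r"\?>"
)

def parse_xml_decl(text):
    """Return the declaration's fields as a dict, or None if it does not match."""
    m = XML_DECL.match(text)
    if m is None:
        return None
    return {
        "version": m.group("version"),
        "encoding": m.group("encoding"),      # None when omitted
        "standalone": m.group("standalone"),  # None when omitted
    }
```

Read this way, the grammar answers the earlier questions: the keywords are fixed and ordered, only `version` is required, white space may surround the '=', and either quote style is allowed as long as the pair matches.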

The full XML 1.0 specification: http://www.w3.org/TR/REC-xml/
