The MASM Forum Archive 2004 to 2012

Project Support Forums => HLA Forum => Topic started by: DarkWolf on April 24, 2005, 01:20:56 AM

Title: Writing a parser
Post by: DarkWolf on April 24, 2005, 01:20:56 AM
I have an interest in writing an XML parser.
Any suggestions on good resources for this or parsers in general.

<edit lastmodified="7/27/2005 9:12:37 PM">
I'll try to post the most recent version of the project here....
Fair warning: you may or may not be able to actually compile anything.

Generic makefile added; doesn't mean anything will actually work  : )
batch file too
Still looking for some info on Unicode support ( UTF-8/16 )
</edit>

[attachment deleted by admin]
Title: Re: Writing a parser
Post by: bozo on April 25, 2005, 02:02:01 AM
Would looking at web browser source code that parses XML be helpful?
I'm thinking about the open source browser Firefox..
Title: Re: Writing a parser
Post by: Eóin on April 25, 2005, 10:59:16 AM
I would think this could be a very helpful project for the Asm community (assuming you share it :wink ).  Looking at other open source examples could certainly help. In theory you should watch out for licence issues, but honestly I think that if you're only taking basic ideas then even the GPL doesn't mind.

More importantly perhaps, you should get an idea of how other parsers work on the outside: what do they offer? How do they present the data? etc... This will be much more useful to you.

Some links to check out: TinyXML (http://www.grinninglizard.com/tinyxml/), Libxml2 (http://xmlsoft.org/), eXpat (http://expat.sourceforge.net/) and search sourceforge (http://sourceforge.net/search/).
Title: Re: Writing a parser
Post by: DarkWolf on April 25, 2005, 06:39:56 PM
Not sure if browser source code will help.

Despite all the hype W3C gives regarding their 'recommendations', software vendors are still going their own way.

None of the browsers support XML as much as they claim.
Opera won't even touch XSL-FO .
Microsoft went their own way with XML Data instead of XML Schema.
And Mozilla, even though they are a little better, want to layer their own XUL syntax on top of RDF on top of XML.

Though Eoin's suggestion of looking at libxml sounds good.
Title: Re: Writing a parser
Post by: DarkWolf on April 28, 2005, 01:34:05 AM
Well that was mostly a rant.

Thanks for the links Eoin
I am going to be looking at these more closely when I can.
Title: Re: Writing a parser
Post by: gabor on April 28, 2005, 10:07:40 AM
Hi!


I have a question that can possibly help the work:

which parser would be more usable? (or maybe both could come in handy)

1. An older PHP style parser with three event handlers: start_element() char_data() end_element()
or
2. The newer XML import/export like in .NET: an XML can be imported into a variable, this variable will have the structure described in the XML. A variable can be exported the same way...

Since an XML parser lib would really be a great thing to have in the masm community, please support this work and give ideas and tips!

Greets, gábor
Title: Re: Writing a parser
Post by: James Ladd on April 28, 2005, 08:49:25 PM
DarkWolf,
I would look at the Java language and sites for how they handle parsing XML.
I'd say the majority of use and focus is there at present.
They support two styles of parser, and both are good for different reasons.
One does callbacks, so as each element is completed you get called back with the new element,
and the other parses the whole document before you can deal with it.
Hopefully both will be provided by your parser. I'd supply the callback-per-element parser first if
you did not want to do both up front.
Have a look at things like JAXP, JAXM and Xerces for xml parsing information and styles.
Title: Re: Writing a parser
Post by: DarkWolf on April 30, 2005, 03:56:48 AM
striker:

You are talking about SAX vs. DOM, correct ?
As far as an iterator vs. a treewalker, I don't really care, though I am leaning towards the iterator so that only the portion you are working with is parsed instead of the whole document, which is a waste of memory if you're not working on the whole document.

But my idea was a library of code so that a programmer could choose whichever style they preferred.

gabor:

I am not familiar with PHP or .NET, but you were suggesting something similar, right ?
*****************

I got the sources to libxml, tinyxml and xerces.
I'll be looking at them to see how to go about this.
The idea was to make a library like the HLA standard library but for XML.
Title: Re: Writing a parser
Post by: James Ladd on April 30, 2005, 11:07:35 PM
DarkWolf,
Yes, I am speaking of SAX vs DOM.
There are situations where you want to have the whole document and times when you do not,
as you are just looking for a fragment.
I would need a library that did both, if I needed a library.
Title: Re: Writing a parser
Post by: DarkWolf on May 04, 2005, 12:19:53 AM
Yes, I am considering both as well.

Use whatever method is needed for the task at hand.
It seems a lot of other projects either use mostly DOM or both methods.
Might as well implement both ourselves.

I have been going over the tinyxml code and the XML 1.1 recommendation,
looking to see how to implement each of the "productions", though I admit I haven't come up with much yet.
Title: Re: Writing a parser
Post by: DarkWolf on May 14, 2005, 05:39:01 AM
I have started to consider how to best write functions to support the W3C Recommendation.

The main idea is to have a library similar to the HLA stdlib: I want to have the headers and libs so that you can 'assemble' whatever parser you want. There are a few things I am trying to wrap my head around, such as how to tell a program to read the string following '<?' to see if it's a valid prolog; a simple idea but it's stumping me : (

I have heard from one person about working on the project, anyone else ?
Title: Re: Writing a parser
Post by: Sevag.K on May 14, 2005, 10:32:00 PM
What exactly are you trying to do?  I've done quite a bit of work on writing source scanners.

The most important concern is the format of your source.  If it's standard ASCII, you'll have
the full power of HLA pattern matching and string functions.

Title: Re: Writing a parser
Post by: gabor on May 15, 2005, 07:48:49 AM
Hi!

The format of the source varies, this is what is described in the "header" of an XML:

For instance
<?xml version="1.0" encoding="ISO-8859-2"?>
is for ISO central european codes.

Greets, gábor

Title: Re: Writing a parser
Post by: Sevag.K on May 15, 2005, 04:29:46 PM
I don't know much XML, so bear with me.

I see three keywords:
xml
version
encoding

Are these all the possible keywords or is the number of keywords unknown?
Can there be data without keywords?
The values of keys are always in double quotes ""?

Also, each keyword is followed by '=': is this always the case?
Can there be white space between the keyword and the '='?
Is the data always encapsulated between <? and ?>?

Answering all these questions will narrow down what needs to be done to parse the prolog.
Title: Re: Writing a parser
Post by: chep on May 15, 2005, 05:33:59 PM
Most parsers use UTF-16 internally so you might want to convert the file as soon as you have extracted the encoding.



@Sevag.K:

Taken from the specs :

S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
S    ::=    (#x20 | #x9 | #xD | #xA)+

XMLDecl      ::=    '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
VersionInfo  ::=    S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
Eq           ::=    S? '=' S?
VersionNum   ::=    '1.0'
EncodingDecl ::=    S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" ) 
EncName      ::=    [A-Za-z] ([A-Za-z0-9._] | '-')*                             /* Encoding name contains only Latin characters */
SDDecl       ::=    S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))

Hopefully there is no need to handle entities here so you're done with the optional declaration :P

The full XML 1.0 specification : http://www.w3.org/TR/REC-xml/
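The productions above map almost mechanically onto a regular expression. As a hypothetical sketch (Python rather than HLA, purely for illustration; the names mirror the grammar):

```python
import re

# Transcription of the XMLDecl productions quoted above; each piece of
# the regex corresponds to one production of the grammar.
S = r"[\x20\x09\x0D\x0A]+"                    # S  ::= (#x20|#x9|#xD|#xA)+
Eq = rf"(?:{S})?=(?:{S})?"                    # Eq ::= S? '=' S?
VersionInfo = rf"{S}version{Eq}('1\.0'|\"1\.0\")"
EncName = r"[A-Za-z][A-Za-z0-9._-]*"          # Latin characters only
EncodingDecl = rf"{S}encoding{Eq}('{EncName}'|\"{EncName}\")"
SDDecl = rf"{S}standalone{Eq}('(?:yes|no)'|\"(?:yes|no)\")"
XMLDECL = re.compile(
    rf"<\?xml{VersionInfo}(?:{EncodingDecl})?(?:{SDDecl})?(?:{S})?\?>"
)

def is_xmldecl(s: str) -> bool:
    """True if s is a complete XML declaration per the grammar above."""
    return XMLDECL.fullmatch(s) is not None
```

Note how the optional EncodingDecl and SDDecl become optional groups, and VersionInfo alone is mandatory, so a declaration without version= is rejected.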
Title: Re: Writing a parser
Post by: DarkWolf on May 16, 2005, 12:15:28 AM
Just so you know, chep, I want to make an XML 1.1 parser, which has to handle both 1.0 and 1.1.
Not sure where you got the idea of UTF-16; by default XML is assumed to be UTF-8, but it's a moot point since HTTP servers transmit it as us-ascii anyway ( don't ask me, I don't know why ).

The encoding attribute is optional by default all XML docs are treated as UTF-8 ( which I just saw that Randy has started work on ) .

I am not going to bore Kain with a lot of technical terms he may not be aware of, but the full XML prolog looks like this

<?xml version="1.1" encoding="put character-set here" standalone="no" ?>

If you're familiar with HTML, then calling version, encoding and standalone attributes should make sense.
version is mandatory and always a number.
encoding (string value) and standalone (boolean value) are optional; if not included, UTF-8 and no are assumed.
Values in attributes are surrounded by either double or single quotes, which is handy if you need to use one or the other in the value.
Yes, the attribute takes the form string="value". Whitespace can be in the value but not newline characters; whitespace/newline can be between the = and the first quote, and between attributes. PIs and tags/elements can take up more than one line.

<? ?> delimits a PI ( won't bore you with complete definitions ); it is an instruction passed on and not displayed after parsing.
PIs all take the form <?string attributes?>, where the string xml is only allowed in the prolog; it's a reserved name and cannot be used anywhere else.

The XML recommendations are somewhat hard to read; a good cheap book might be a good place to start. I miss the bookstore near where I live that had PC books for around a dollar : )  But it was one of those local mom and pop places and it closed down eventually : (

I am trying to find out whether standalone defaults to 'no' or not, but Explorer (local, not Internet Explorer) is giving me fits and won't open the file. Oh well, I think it defaults to 'no'.

I have made XML docs before just never a parser to read them.
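The either-quote rule described above is easy to scan for. A minimal hypothetical sketch (Python for illustration; the helper name is mine, and it ignores details like forbidding '<' and '&' inside the value):

```python
def scan_attvalue(text: str, pos: int) -> tuple[str, int]:
    """Scan one quoted attribute value starting at text[pos].

    XML allows either quote character around a value, which is handy
    when the value itself contains the other kind of quote.
    Returns the value and the position just past the closing quote.
    """
    quote = text[pos]
    if quote not in ("'", '"'):
        raise ValueError("attribute value must start with a quote")
    end = text.index(quote, pos + 1)   # closing quote of the same kind
    return text[pos + 1:end], end + 1

# The opening quote character decides which character terminates the value.
value, nxt = scan_attvalue("standalone='no'", 11)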
Title: Re: Writing a parser
Post by: Sevag.K on May 16, 2005, 01:32:49 AM
Well, with the information presented:
If the file you are parsing is ASCII, then it's pretty easy as most of the routines you need are either already available or easy to implement.  I can help with this one.
If it's UTF-xx, then you pretty much have to start from ground zero and create a whole new set of routines that deal with UTF-xx character arrays.  I'm not too familiar with the formats so I can't be much help for now.

Title: Re: Writing a parser
Post by: chep on May 16, 2005, 02:27:17 AM
DarkWolf,

I never wrote an XML parser either; I've only used Xerces-C, MSXML, and a Java thing whose name I can't remember... so I never needed anything else than the W3C's BNF grammars :wink
Anyway, you managed to translate the grammar into plain English, well done!! :U (not sure I could have done that, as English is not my native language :toothy).


Concerning the UTF-16 remark, I only wanted to say that most parsers use UTF-16 as their internal string representation (from the API point of view). I guess it's because UTF-16 surrogates are far easier to handle than multibyte UTF-8 characters. Of course you could as well choose UTF-8 or even UCS-4 :eek if you wish; it's just a trade-off between easy implementation and memory footprint. IMHO UTF-16 is clearly the best choice.


Concerning the encoding, I understand the specs as follows:
- if the encoding declaration is missing: either you have a Byte Order Mark indicating the UTF encoding, or it is assumed to be UTF-8.
- if a transport protocol explicitly specifies an encoding, it overrides the encoding declaration in the document :dazzled: (if I were you I'd assign this the lowest priority :red).
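The priority order in these two bullets can be sketched as a small helper (hypothetical Python, for illustration; UTF-32 BOMs are ignored for brevity):

```python
def sniff_encoding(data: bytes, declared: str = "UTF-8") -> str:
    """Pick an encoding in the order described above: a Byte Order
    Mark wins, then the document's declared encoding, then the
    UTF-8 default."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"                 # UTF-8 BOM
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"              # big-endian UTF-16 BOM
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"              # little-endian UTF-16 BOM
    return declared                    # fall back to the declaration
```

A transport-level override, as the second bullet says, would simply be checked before all of this.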


Concerning the standalone part :
Quote
If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value "no" is assumed.
Clear enough :toothy

Cheers
Title: Automatons
Post by: gabor on May 17, 2005, 12:03:20 PM
Hi folks!


Maybe I've already mentioned the automatons that can be used to "recognize" XML docs. They can be the base of a parser, as they are in some parsers... In most cases finite state machines (FSM) can be used, but since XML is a recursive structure/language, the appropriate automaton here is a stack (pushdown) state machine.
I have created a web page on this topic: http://members.chello.hu/nemeth.gabor1/automaton/index.html. It is still a work in progress, but there is a small code sample showing the basics of the theory.
I post the code here as well, for more comfort.  :bg
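The idea can be shown in miniature (a hypothetical Python sketch, not the attached code): a finite state machine alone cannot check arbitrarily deep nesting, but adding a stack can.

```python
def check_nesting(tags: list[str]) -> bool:
    """Minimal pushdown-automaton sketch: input is a pre-tokenized
    tag stream like ["a", "b", "/b", "/a"], where a leading '/' marks
    an end tag. The stack remembers which elements are still open."""
    stack = []
    for tag in tags:
        if tag.startswith("/"):
            # end tag must match the most recently opened element
            if not stack or stack.pop() != tag[1:]:
                return False
        else:
            stack.append(tag)          # start tag: push, await its end
    return not stack                   # everything opened was closed
```

With a plain FSM the depth would have to be bounded in advance; the stack is what handles the recursion.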

Greets, gábor



[attachment deleted by admin]
Title: Re: Writing a parser
Post by: Randall Hyde on May 18, 2005, 12:13:04 AM
Quote from: chep on May 16, 2005, 02:27:17 AM


Concerning the UTF-16 remark, I only wanted to say that most parsers use UTF-16 as their internal string representation (from the API point of view). I guess it's because UTF-16 surrogates are far easier to handle than multibyte UTF-8 characters. Of course you could as well choose UTF-8 or even UCS-4 :eek if you wish; it's just a trade-off between easy implementation and memory footprint. IMHO UTF-16 is clearly the best choice.


UTF-16 is not a whole lot easier to handle than UTF-8, unfortunately. The original 16-bit Unicode standard *was* easier to handle, but once everyone figured out that 16 bits was insufficient, UTF-16 was created (which is a "multi-word" character set, with all the problems of UTF-8 with respect to easy processing).

UCS-4 (UTF-32) is the only way to go if you want each character cell to consume the same amount of memory. But four bytes per character is a heavy price to pay for that convenience.
Cheers,
Randy Hyde
Title: Re: Writing a parser
Post by: chep on May 18, 2005, 03:09:42 AM
Quote from: Randall Hyde on May 18, 2005, 12:13:04 AM... UTF-16 was created (which is a "multi-word" character set, with all the problems of UTF-8 with respect to easy processing). ...

The advantage of UTF-16 over UTF-8 is that a UTF-16 character can be at most 2 words long (when a word is between 0xD800 and 0xDBFF it is the leading high surrogate of a pair, the following low surrogate being between 0xDC00 and 0xDFFF), whereas a UTF-8 character can be up to several bytes long.

Of course you can always tell the character length by looking at the first byte (0xxxxxxx = 1 byte, 110xxxxx = lead byte of a 2 byte char, and so on), but obviously the code needed to handle UTF-16 is simpler than for UTF-8.

Moreover Windows has native support for UTF-16 so there's no need to convert your in-memory UTF-16 strings before displaying them.
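Both rules can be sketched as small helpers (hypothetical Python, purely to illustrate the two tests being compared):

```python
def utf8_char_len(lead: int) -> int:
    """Sequence length from the UTF-8 lead byte, as described above:
    0xxxxxxx = 1 byte, 110xxxxx = lead of 2, 1110xxxx = 3, 11110xxx = 4."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def utf16_char_len(unit: int) -> int:
    """UTF-16 needs only one test: a high surrogate (0xD800-0xDBFF)
    starts a two-unit pair; anything else is a single unit."""
    return 2 if 0xD800 <= unit <= 0xDBFF else 1
```

The single range check versus the cascade of bit tests is the simplicity difference being argued here.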


Randall I'm pretty sure you already knew that but I just wanted to explain why IMO UTF-16 is still easier to handle than UTF-8. :wink

I have to admit that I'm very reluctant to use anything else but UTF-16 in memory and UTF-8 externally (files/network) since we had to internationalize an application. I can assure you that's not a task I want to have to perform again!! :boohoo:


BTW don't expect me to post anything soon as I'm leaving in a few hours (holidays at last!). :8) :dance:
Title: Re: Writing a parser
Post by: Randall Hyde on May 18, 2005, 09:04:04 PM
Quote from: chep on May 18, 2005, 03:09:42 AM
The advantage of UTF-16 over UTF-8 is that an UTF-16 character can be at most 2 words long (when a word is between 0xD800 and 0xDBFF then this is the leading high surrogate of a pair, the following low surrogate being between 0xDC00 and 0xDFFF) whereas an UTF-8 character can be several bytes long.
Granted, the code to parse multiple characters versus two words is slightly more complex, but in the end, the fact that you cannot assume that each character consumes an equal amount of space is the real killer. That is, you *have* to scan the string, character by character, to determine its length. You cannot assume that the number of bytes (or words) reserved equals the number of characters. Once you cannot make that assumption, the software gets *far* more complex.

Quote
Moreover Windows has native support for UTF-16 so there's no need to convert your in-memory UTF-16 strings before displaying them.
This is *not* what I'm hearing from people. Windows supports 16-bit Unicode *only*, not UTF-16 (i.e., no multi-word sequences). Correct me if I'm wrong, but this is what people are saying to me.

Cheers,
Randy Hyde
Title: Re: Writing a parser
Post by: DarkWolf on May 19, 2005, 02:19:03 AM
gabor, I don't recall the automatons before; maybe that was another forum on the site.
But I downloaded that archive and will take a look at the website.

However, a bit of a question: which character set do I support ?

What I know is:
1. HTTP servers transmit text as us-ascii
2. XML by default is to be parsed as UTF-8
3. Authors of XML docs can specify encodings

Given that, it would seem us-ascii is the one to use, except that if the parser is used locally, not over the net, or by means other than HTTP, then UTF-8 or the author's encoding would have to be supported. I could let the OS handle the character encodings, but then the parser would need to be written in several different forms, one for each OS.

So what sort of approach should I take here ?
Title: Re: Writing a parser
Post by: Sevag.K on May 19, 2005, 03:51:57 AM

I guess that depends on how the parser will be used.  If you can't predict the origin, then I suppose you'll have to support all possible formats (yikes!).
Title: Re: Writing a parser
Post by: gabor on May 19, 2005, 09:49:37 AM
Hi!

1. HTTP servers transmit text as us-ascii. Are you sure about this? For example, in HTML there is an option in the META tag to define the used encoding. In my language there are some characters that are not in the US set...
2. There must be a default if no encoding was given. This can be UTF-8, I cannot argue with that; I must look on www.w3c.org (there is everything about XML there, since they created and developed it; the XML whitepaper is about 40 pages).
3. In XML the used encoding can be specified. So there is no way to avoid handling variable character formats and sets.

My conclusion is, that a very precise and correct parser would handle all main cases.

In my studies I met this concept:
1. Lowest level of the semantic network: character encoding is Unicode. - level of reading text files, character streams
2. Based on the universal encoding, parsers provide syntax checking - level of tags, words/tokens
3. Based on correct syntax and semantic/conceptual models, the input documents can be interpreted and the information processed - level of agents

My first, quick idea would be to create a converter from every used encoding into unicode. This could be a task for preprocessing, before parsing, or could be done along parsing. Well I hope I helped a bit.
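The preprocessing idea can be sketched like this (hypothetical Python; the function name is mine, and Python's codecs stand in for whatever converter the library would ship):

```python
def to_internal(data: bytes, declared: str = "utf-8") -> str:
    """Preprocessing step suggested above: decode whatever encoding
    the document declares into one internal Unicode representation,
    so the parser proper never has to care about the source encoding."""
    return data.decode(declared)

# The parser sees the same internal text whatever the file encoding was.
text = to_internal("<?xml version='1.0'?>".encode("iso-8859-2"), "iso-8859-2")
```

Whether this runs as a separate pass or interleaved with parsing is the open design choice gabor mentions.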

Greets, gábor

Title: Re: Writing a parser
Post by: DarkWolf on May 29, 2005, 06:44:52 PM
I have been going through the Xerces and libxml source code, and admittedly I don't understand the UTF-8/16 code,
which is the default, must-have encoding for XML parsers.

Oh well, I'll start with ascii and maybe Unicode can be added later.
I have been looking through the HLA stdlib reference; I can think of a few ways to extract the information from a string, but I am trying to figure out how to retain the structure of the parsed file.

I have been trying to think of a way to do it one step at a time, starting with the prolog and working my way through the file.
But eventually you come to a fork in the road, and how to add to what has already been parsed is what I am trying to figure out.
Title: Re: Writing a parser
Post by: Sevag.K on May 29, 2005, 11:14:42 PM
I don't understand what you want to do, can you give an example?
Title: Re: Writing a parser
Post by: chep on May 30, 2005, 02:41:23 PM
Hi,

Quote from: Randall Hyde on May 18, 2005, 09:04:04 PMYou cannot assume that the number of bytes reserved (or words) equals the number of characters. Once you cannot make that assumption, the software gets *far* more complex.
You're right that's the problem with any multi byte/word character set.

However, the main (only?) use I can see for character-counted strings is input validation (i.e. to check that the string is between x and y characters long). So in an XML parser I guess this will only take place at the schema-checking level; AFAIK every other routine should be happy with the buffer size.

Anyway, I see 2 cases :

1. You use raw null terminated strings without any additional values. In this case you have to scan a string every time you need to manipulate it.

2. My preferred method: you store additional values along with the raw string, like its size (in bytes or words) and the buffer maximum size (hmm... sounds like a UNICODE_STRING structure). Then nothing prevents you from adding another field containing the string length (in characters), and populating it along with the other fields during the initial string scan. Eg:
UTF16_STRING STRUCT
    dwLengthInChars DWORD ?
    usString        UNICODE_STRING<?>
UTF16_STRING ENDS



Quote from: Randall Hyde on May 18, 2005, 09:04:04 PM
Quote from: chep on May 18, 2005, 03:09:42 AMMoreover Windows has native support for UTF-16 so there's no need to convert your in-memory UTF-16 strings before displaying them.
This is *not* what I'm hearing from people. Windows supports 16-bit Unicode *only*, not UTF-16 (i.e., no multi-word sequences). Correct me if I'm wrong, but this is what people are saying to me.
Sorry I gave incomplete/erroneous information... I'm used to Win2k+ so I forgot to think before I posted. :red

Win9x and NT4 don't support surrogates.
In Windows 2000 and later, UTF-16 surrogates are disabled *by default* unless you install an IME that needs surrogates, or you enable them yourself in the registry.

See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp for more info.
Title: Re: Writing a parser
Post by: gabor on May 31, 2005, 05:16:03 AM
Hi

I think the topic has turned a bit away from the original idea. Believe me, processing the text input is not the most difficult part of a parser. As I wrote before, there is a general concept that the international text representation should be Unicode/UTF-16. Since the header of an XML can specify the encoding format, I think it is quite clear what to do. The Windows API supports UTF-16 (the MultiByteToWideChar proc) and, as I've already written in another post, converters should be used to this format, so the parser can work independently of the format.

I suspect, and this is more than a plain suspicion, that parsing is the important and tough part. XML makes it possible to define recursive languages, and recursive languages are a bit harder to interpret, requiring additional logic or complex stack state machines.

Greets,Gábor

Title: Re: Writing a parser
Post by: DarkWolf on June 04, 2005, 03:11:34 AM
Can I get some pointers ( no pun intended ) to some resources on how to program Unicode support ( UTF-8/16 )?
I would like to write a parser that is compliant with the XML 1.1 specs if I can.
Otherwise I'll just have to use only us-ascii .

I have been to the Unicode website but did not find any material on writing programs to support Unicode.
Of course I did expect not to find any.
Title: Re: Writing a parser
Post by: DarkWolf on June 18, 2005, 02:31:06 AM
I have been trying to convert the notations in the XML specs into HLA
The first few I did are listed below. I know I made some mistakes, I'll change document to a string.
But overall, are there any problems that anyone can see ?

// [1] document      ::= prolog element Misc*
// [2] Char               ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
            /* any unicode character excluding surrogate blocks, FFFE, and FFFF */
// [3] S         ::= (#x20 | #x9 | #xD | #xA)+
            /* whitespace is space, tab, cr, lf */
// [4] NameChar      ::= Letter | Digit | '.' | '_' | ':' | CombiningChar | Extender
// [5] Name          ::= (Letter | '_' | ':') (NameChar)*

static
document:   cset := prolog + element + Misc
Char:               cset := {u#$9,u#$A,u#$D,[u#$20..u#$D7FF],[u#$E000..u#$FFFD],[u#$10000..u#$10FFFF]}
S:                cset := {u#$20,u#$9,u#$D,u#$A}
NameChar:   cset := Letter + Digit + CombiningChar + Extender + {'.','_',':'}
Name:             string;

procedure checkName;   
begin checkName;
  pat.match( Name );
   pat.zeroOrMoreCset( Letter + {'_',':'} );
   pat.oneOrMoreCset( NameChar );
   pat.EOS

  pat.if_failure
   stdout.put( "Not a valid name" );
  pat.endmatch;
end checkName;
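For comparison, the quoted productions [4] and [5] can be cross-checked with a hypothetical Python regex (illustration only; the real Letter, Digit, CombiningChar and Extender classes span many Unicode ranges, collapsed here to ASCII):

```python
import re

# ASCII approximation of the productions quoted above:
#   Name     ::= (Letter | '_' | ':') (NameChar)*
#   NameChar ::= Letter | Digit | '.' | '_' | ':' | ...
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:]*")

def check_name(s: str) -> bool:
    """True if s matches the (ASCII-approximated) Name production."""
    return NAME.fullmatch(s) is not None
```

The first character class admits exactly one character and the second repeats zero or more times, mirroring the oneOrMore-then-zeroOrMore structure the HLA version needs.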
Title: Re: Writing a parser
Post by: tenkey on June 18, 2005, 03:57:26 AM
The definition for Name doesn't look right.

According to the grammar you posted, an XML name must start with a letter, '_' or ':'. "Zero or more" does not sound like it matches exactly one character.
Title: Re: Writing a parser
Post by: DarkWolf on June 19, 2005, 08:35:14 PM
I reversed them ( oops! )

the first one should be pat.oneOrMoreCset and the second is pat.zeroOrMoreCset

But I need to exclude 'xml'; no matter what case, the string 'xml' is reserved. I am thinking using pat.alternate may be a good solution.
Title: Re: Writing a parser
Post by: DarkWolf on June 19, 2005, 08:38:23 PM
Just remembered

I know I can do a case insensitive pattern match for characters but what about a case insensitive match for strings ?
Title: Re: Writing a parser
Post by: Randall Hyde on June 26, 2005, 03:02:21 PM
Quote from: DarkWolf on June 19, 2005, 08:38:23 PM
Just remembered

I know I can do a case insensitive pattern match for characters but what about a case insensitive match for strings ?
Try matchistr. IIRC, it should be present.
Cheers,
Randy Hyde
Title: Re: Writing a parser
Post by: Randall Hyde on June 26, 2005, 03:02:44 PM
Quote from: DarkWolf on June 19, 2005, 08:35:14 PM
I reversed them ( oops! )

the first one should be pat.oneOrMoreCset and the second is pat.zeroOrMoreCset

But I need to exclude 'xml' , no matter what case the string 'xml' is reserved. I am thinking using pat.alternate may be a good solution.

Yes. First, try to match "xml" and then, as an alternate, match your identifiers.  It is important, however, to attempt the match on xml first.
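The ordering can be sketched like so (hypothetical Python; the helper name and the ASCII name pattern are mine):

```python
import re

def classify(token: str) -> str:
    """Order matters: test the reserved word first. If the general
    name pattern ran first, it would swallow 'xml' as an ordinary
    name and the reserved case could never be caught."""
    if re.fullmatch("xml", token, re.IGNORECASE):
        return "reserved"
    if re.fullmatch(r"[A-Za-z_:][A-Za-z0-9._:]*", token):
        return "name"
    return "invalid"
```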
Cheers,
Randy Hyde
Title: Re: Writing a parser
Post by: DarkWolf on June 27, 2005, 09:14:38 PM
Thanks,

Figures I would miss that ' i ' in there on my first read through.

Other questions though:

// [1] document      ::= (prolog element Misc*) - (Char* RestrictedChar Char*)
// [10] AttValue   ::=  '"' ([^<&"] | Reference )* '"' | "'" ([^<&'] | Reference )* "'"

Anyone familiar with Backus-Naur ? There are two things above I am not sure about; otherwise I understand them.
Does ^ in 10 mean that the characters that follow are some sort of excluded set ?
In 1 RestrictedChar is excluded from document but why is it preceded and followed by Char which is included ?
Title: Re: Writing a parser
Post by: DarkWolf on June 30, 2005, 02:17:16 AM
Below should be attached what I have worked on so far.
( Not as much as I would have liked )

Maybe that will help in case my questions have not been descriptive enough.
I organized the project to also show what I intend to do with it.

[attachment deleted by admin]
Title: Re: Writing a parser
Post by: Sevag.K on June 30, 2005, 06:07:20 AM
Haven't looked too in depth yet, but I do have one comment to make on your use of HIDE.

Apparently, you used "add existing..." which copies the entire path of the file as-is, which means that
it's practically unusable on somebody else's computer (where they may not have the same
folder setup as you).

I've remedied this problem for the next version of HIDE by adding an 'import' option which will
make a copy of an existing file.  I'll release it as soon as I finish the update (which should be in
the next couple of days).

Title: Re: Writing a parser
Post by: Randall Hyde on June 30, 2005, 01:49:09 PM
Quote from: DarkWolf on June 27, 2005, 09:14:38 PM
Thanks,

Figures I would miss that ' i ' in there on my first read through.

Other questions though:

// [1] document      ::= (prolog element Misc*) - (Char* RestrictedChar Char*)
// [10] AttValue   ::=  '"' ([^<&"] | Reference )* '"' | "'" ([^<&'] | Reference )* "'"

Anyone familiar with Backus-Naur ? There are two things above I am not sure about; otherwise I understand them.
Does ^ in 10 mean that the characters that follow are some sort of excluded set ?
Yes, though this is UNIX regular expression syntax rather than BNF.

Quote
In 1 RestrictedChar is excluded from document but why is it preceded and followed by Char which is included ?
Probably because they're trying to say that *any word* containing the restricted character is to be ignored.
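One common reading of the subtraction in production [1] is that a document containing a RestrictedChar anywhere is simply not well-formed. A hypothetical Python sketch (the ranges are copied from the XML 1.1 RestrictedChar production; the interpretation, as discussed above, is a judgement call):

```python
# RestrictedChar per XML 1.1: control characters a document may not
# contain directly. Tab (0x09), LF (0x0A) and CR (0x0D) stay legal.
RESTRICTED = (
    set(range(0x01, 0x09)) | {0x0B, 0x0C} | set(range(0x0E, 0x20))
    | set(range(0x7F, 0x85)) | set(range(0x86, 0xA0))
)

def has_restricted_char(doc: str) -> bool:
    """True if any character of doc falls in a restricted range."""
    return any(ord(c) in RESTRICTED for c in doc)
```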
Cheers,
Randy Hyde
Title: Re: Writing a parser
Post by: DarkWolf on June 30, 2005, 04:20:40 PM
Thanks Kain, didn't know using "add existing" would have been a problem.
Actually, I was thinking of throwing a generic makefile in there for those not using an IDE or HIDE.
Question, can there be additional subfolders in the "src" directory like src/someotherdir ?

To Randy:

Ah, figures. I am not familiar with Unix expressions; no wonder it didn't really make any sense.
W3C has got some screwed up ways of doing things; they shouldn't have mixed two notation styles.
Title: Re: Writing a parser
Post by: Sevag.K on July 01, 2005, 01:15:06 AM
Quote from: DarkWolf on June 30, 2005, 04:20:40 PM
Thanks Kain, didn't know using "add existing" would have been a problem.
Actually, I was thinking of throwing a generic makefile in there for those not using an IDE or HIDE.
Question, can there be additional subfolders in the "src" directory like src/someotherdir ?

Unfortunately no, the project files are set up as a 'relative' system and the 'src' folder is hardcoded in
the source.  It's a design implementation I made early on that won't be easy to change.  It's something
to consider for HIDE 2.0.

You did throw in a good idea for me though, I'll see if I can whip up a tool that will convert a HIDE
project to a makefile with Borland make compatibility.


Title: Re: Writing a parser
Post by: DarkWolf on July 09, 2005, 07:06:32 PM
Edited the first post, will try to keep the most recently worked on files up there.

The project is by no means anywhere near something useful, so keep that in mind.
Title: Re: Writing a parser
Post by: DarkWolf on July 28, 2005, 02:07:29 AM
I think I am making progress. ( maybe backwards : )  )
There are things I have been stumbling on, and those are noted somewhere in the source or extra docs.

Still want to get some unicode support but I don't know how to code that. Anyone know ?

I have been thinking about the parser's API; there should be notes on that in there too.
Right now I still want to finish the productions from the XML specs.