
Writing a parser

Started by DarkWolf, April 24, 2005, 01:20:56 AM


DarkWolf

Just so you know chep, I want to make an XML 1.1 parser, which has to handle both 1.0 and 1.1.
Not sure where you got the idea of UTF-16; by default XML is assumed to be UTF-8, but it's a moot point since HTTP servers transmit it as us-ascii anyway (don't ask me why, I don't know).

The encoding attribute is optional; by default all XML docs are treated as UTF-8 (which I just saw Randy has started work on).

I am not going to bore kain with a lot of technical terms he may not be aware of, but the full XML prolog looks like this:

<?xml version="1.1" encoding="put character-set here" standalone="no" ?>

If you're familiar with HTML, then calling version, encoding and standalone attributes should make sense.
version is mandatory and is always a number.
encoding (a string value) and standalone (a boolean value) are optional; if they are not included, UTF-8 and "no" are assumed.
Values in attributes are surrounded by either double or single quotes, which is handy if you need to use one or the other inside the value.
Yes, the attribute takes the form string="value". Whitespace can appear in the value but not newline characters; whitespace/newlines can appear between the = and the first quote, and between attributes. PIs and tags/elements can take up more than one line.

<? ?> marks a PI (I won't bore you with the complete definition); it is an instruction passed on to the application and not displayed after parsing.
They all take the form <?target attributes?>, where the target xml can only be used in the prolog; it's a reserved name and cannot be used anywhere else.
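
To make the quoting and whitespace rules concrete, all of the following should be accepted by a conforming parser (the stylesheet PI is just a common example of a non-xml target):

<?xml version="1.0"?>
<?xml version='1.1' encoding="UTF-8" standalone='yes'?>
<?xml version = "1.0"
      encoding = 'ISO-8859-1' ?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>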

The XML recommendations are somewhat hard to read; a good cheap book might be a better place to start. I miss the bookstore near where I live, which had PC books for around a dollar :) But it was one of those locally run mom-and-pop places and it eventually closed down :(

I am trying to find out whether standalone defaults to 'no', but Explorer (the local file explorer, not Internet Explorer) is giving me fits and won't open the file. Oh well, I think it defaults to 'no'.

I have made XML docs before just never a parser to read them.
--
Where there's smoke, There are mirrors.
Give me Free as in Freedom not Speech or Beer.
Thank You and Welcome to the Internet.

Sevag.K

Well, with the information presented:
If the file you are parsing is ASCII, then it's pretty easy as most of the routines you need are either already available or easy to implement.  I can help with this one.
If it's UTF-xx, then you pretty much have to start from ground zero and create a whole new set of routines that deal with UTF-xx character arrays.  I'm not too familiar with the formats so I can't be much help for now.


chep

DarkWolf,

I've never written an XML parser either; I've only used Xerces-C, MSXML, and a Java one whose name I can't remember... so I never needed anything other than the W3C's BNF grammars :wink
Anyway, you managed to translate the grammar into plain English, well done!! :U (not sure I could have done that, as English is not my native language :toothy).


Concerning the UTF-16 remark, I only wanted to say that most parsers use UTF-16 as their internal string representation (from the API point of view). I guess it's because UTF-16 surrogates are far easier to handle than multibyte UTF-8 characters. Of course you could just as well choose UTF-8 or even UCS-4 :eek if you wish; it's just a trade-off between ease of implementation and memory footprint. IMHO UTF-16 is clearly the best choice.


Concerning the encoding, I understand the specs as follows:
- if the encoding declaration is missing: either there is a Byte Order Mark indicating the UTF encoding, or the document is assumed to be UTF-8 (a minimal BOM sniffer is sketched below);
- if a transport protocol explicitly specifies an encoding, it overrides the encoding declaration in the document :dazzled: (if I were you I'd assign this the lowest priority :red).
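
For what it's worth, BOM sniffing is only a handful of byte compares. A minimal sketch in MASM32-style assembly (the procedure and label names are made up; note that UTF-32 must be tested before UTF-16, because the UTF-32LE BOM begins with the UTF-16LE one):

; Classify a possible Byte Order Mark at the start of a raw buffer.
; In:  pBuf = address of the bytes, cb = number of bytes available
; Out: eax = 0 none (assume UTF-8), 1 UTF-8, 2 UTF-16LE, 3 UTF-16BE,
;            4 UTF-32LE, 5 UTF-32BE
SniffBOM proc uses esi pBuf:DWORD, cb:DWORD
    mov esi, pBuf
    cmp cb, 4
    jb try8                          ; too short for a UTF-32 BOM
    cmp dword ptr [esi], 0000FEFFh   ; bytes FF FE 00 00
    je found32le
    cmp dword ptr [esi], 0FFFE0000h  ; bytes 00 00 FE FF
    je found32be
try8:
    cmp cb, 3
    jb try16
    cmp byte ptr [esi], 0EFh         ; UTF-8 BOM: EF BB BF
    jne try16
    cmp byte ptr [esi+1], 0BBh
    jne try16
    cmp byte ptr [esi+2], 0BFh
    jne try16
    mov eax, 1
    ret
try16:
    cmp cb, 2
    jb none
    cmp word ptr [esi], 0FEFFh       ; bytes FF FE
    je found16le
    cmp word ptr [esi], 0FFFEh       ; bytes FE FF
    je found16be
none:
    xor eax, eax
    ret
found16le:
    mov eax, 2
    ret
found16be:
    mov eax, 3
    ret
found32le:
    mov eax, 4
    ret
found32be:
    mov eax, 5
    ret
SniffBOM endp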


Concerning the standalone part :
Quote
If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value "no" is assumed.
Clear enough :toothy

Cheers

gabor

Hi folks!


Maybe I've already mentioned the automata that can be used to "recognize" XML docs. They can form the base of a parser, as they do in some parsers... In most cases finite state machines (FSMs) can be used, but in this case, since XML is a recursive structure/language, the appropriate automaton is a stack (pushdown) machine.
I have created a web page on this topic: http://members.chello.hu/nemeth.gabor1/automaton/index.html. It is still a work in progress, but there is a small piece of code showing the basics of the theory.
I post the code here as well, for more comfort.  :bg

Greets, gábor



[attachment deleted by admin]
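
Since the attachment is gone, here is a hedged sketch of the stack discipline gabor describes (not his code; the names are made up, tokenizing is left out, and it assumes the masm32 includes for lstrcmp): push on every start-tag, pop and compare on every end-tag, and the document is well-nested only if the stack is empty at the end of input.

.data?
tagStack  dd 64 dup(?)   ; each slot holds a pointer to a tag-name string
stackTop  dd ?           ; index of the next free slot; 0 = empty

.code
; Call for every start-tag <name>. Returns 0 in eax, or -1 if nested too deep.
PushTag proc pName:DWORD
    mov eax, stackTop
    cmp eax, 64
    jae toodeep
    mov edx, pName
    mov tagStack[eax*4], edx
    inc stackTop
    xor eax, eax
    ret
toodeep:
    mov eax, -1
    ret
PushTag endp

; Call for every end-tag </name>. Returns 0 in eax if it matches the
; innermost open element, nonzero on a mismatch or on </x> with nothing open.
; lstrcmp (the ANSI version) is enough for this byte-oriented sketch.
PopTag proc pName:DWORD
    mov eax, stackTop
    test eax, eax
    jz underflow
    dec eax
    mov stackTop, eax
    invoke lstrcmp, tagStack[eax*4], pName
    ret
underflow:
    mov eax, -1
    ret
PopTag endp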

Randall Hyde

Quote from: chep on May 16, 2005, 02:27:17 AM


Concerning the UTF-16 remark, I only wanted to say that most parsers use UTF-16 as their internal string representation (from the API point of view). I guess it's because UTF-16 surrogates are far easier to handle than multibyte UTF-8 characters. Of course you could just as well choose UTF-8 or even UCS-4 :eek if you wish; it's just a trade-off between ease of implementation and memory footprint. IMHO UTF-16 is clearly the best choice.


UTF-16 is not a whole lot easier to handle than UTF-8, unfortunately. The original 16-bit Unicode standard *was* easier to handle, but once everyone figured out that 16 bits was insufficient, UTF-16 was created (which is a "multi-word" character set, with all the problems of UTF-8 with respect to easy processing).

UCS-4 (UTF-32) is the only way to go if you want each character cell to consume the same amount of memory. But four bytes per character is a heavy price to pay for that convenience.
Cheers,
Randy Hyde

chep

Quote from: Randall Hyde on May 18, 2005, 12:13:04 AM... UTF-16 was created (which is a "multi-word" character set, with all the problems of UTF-8 with respect to easy processing). ...

The advantage of UTF-16 over UTF-8 is that a UTF-16 character can be at most 2 words long (when a word is between 0xD800 and 0xDBFF it is the leading high surrogate of a pair, the following low surrogate being between 0xDC00 and 0xDFFF), whereas a UTF-8 character can be several bytes long.

Of course you can always tell the character length by looking at the first byte (0xxxxxxx = 1 byte, 110xxxxx = lead byte of a 2-byte char, and so on), but the code needed to handle UTF-16 is still simpler than for UTF-8.
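
To illustrate the lead-byte rule in MASM-style code (a sketch; the name is made up), here is the length classification for UTF-8. The UTF-16 equivalent is a single range check on the first word, and a pair then decodes as scalar = 10000h + ((hi - 0D800h) shl 10) + (lo - 0DC00h).

; How many bytes does the UTF-8 sequence starting with this byte occupy?
; Returns 1..4 in eax, or 0 for an invalid lead byte (a continuation
; byte 10xxxxxx, or the illegal lead bytes 0F8h..0FFh).
Utf8SeqLen proc leadByte:DWORD
    movzx eax, byte ptr leadByte
    cmp eax, 80h
    jb len1              ; 0xxxxxxx -> 1 byte (plain ASCII)
    cmp eax, 0C0h
    jb bad               ; 10xxxxxx -> continuation, not a lead byte
    cmp eax, 0E0h
    jb len2              ; 110xxxxx -> 2 bytes
    cmp eax, 0F0h
    jb len3              ; 1110xxxx -> 3 bytes
    cmp eax, 0F8h
    jb len4              ; 11110xxx -> 4 bytes
bad:
    xor eax, eax
    ret
len1:
    mov eax, 1
    ret
len2:
    mov eax, 2
    ret
len3:
    mov eax, 3
    ret
len4:
    mov eax, 4
    ret
Utf8SeqLen endp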

Moreover Windows has native support for UTF-16 so there's no need to convert your in-memory UTF-16 strings before displaying them.


Randall I'm pretty sure you already knew that but I just wanted to explain why IMO UTF-16 is still easier to handle than UTF-8. :wink

I have to admit that I'm very reluctant to use anything other than UTF-16 in memory and UTF-8 externally (files/network) ever since we had to internationalize an application. I can assure you that's not a task I want to have to perform again!! :boohoo:


BTW don't expect me to post anything soon as I'm leaving in a few hours (holidays at last!). :8) :dance:

Randall Hyde

Quote from: chep on May 18, 2005, 03:09:42 AM
The advantage of UTF-16 over UTF-8 is that a UTF-16 character can be at most 2 words long (when a word is between 0xD800 and 0xDBFF it is the leading high surrogate of a pair, the following low surrogate being between 0xDC00 and 0xDFFF), whereas a UTF-8 character can be several bytes long.
Granted, the code to parse multiple bytes versus two words is slightly more complex, but in the end, the fact that you cannot assume that each character consumes an equal amount of space is the real killer. That is, you *have* to scan the string, character by character, to determine its length. You cannot assume that the number of bytes reserved (or words) equals the number of characters. Once you cannot make that assumption, the software gets *far* more complex.

Quote
Moreover Windows has native support for UTF-16 so there's no need to convert your in-memory UTF-16 strings before displaying them.
This is *not* what I'm hearing from people. Windows supports 16-bit Unicode *only*, not UTF-16 (i.e., no multi-word sequences). Correct me if I'm wrong, but this is what people are saying to me.

Cheers,
Randy Hyde

DarkWolf

gabor, I don't recall the automata coming up before; maybe that was in another forum on the site.
But I downloaded that archive and will take a look at the website.

However, a bit of a question: which character set do I support?

What I know is:
1. HTTP servers transmit text as us-ascii
2. XML by default is to be parsed as UTF-8
3. Authors of XML docs can specify encodings

Given that, it would seem best to use us-ascii, except that if the parser is used locally rather than over the net, or by means other than HTTP, then UTF-8 or the author's encoding would have to be supported. I could let the OS handle the character encodings, but then the parser would need to be written in several different forms, one for each OS.

So what sort of approach should I take here?
--
Where there's smoke, There are mirrors.
Give me Free as in Freedom not Speech or Beer.
Thank You and Welcome to the Internet.

Sevag.K


I guess that depends on how the parser will be used. If you can't predict the origin, then I suppose you'll have to support all possible formats (yikes!).

gabor

Hi!

1. HTTP servers transmit text as us-ascii. Are you sure about this? For example, in HTML there is an option in the META tag to define the encoding used. In my language there are some characters that are not in the us-ascii set...
2. There must be a default if no encoding was given. This may well be UTF-8, I cannot argue with that; I must look on www.w3c.org (everything about XML is there, since they created and developed it; the XML whitepaper is about 40 pages).
3. In XML the encoding used can be specified. So there is no way to avoid dealing with variable character formats and sets.

My conclusion is that a truly precise and correct parser would have to handle all the main cases.

In my studies I came across this concept:
1. Lowest level of the semantic network: the character encoding is Unicode - the level of reading text files and character streams.
2. Based on the universal encoding, parsers provide syntax checking - the level of tags and words/tokens.
3. Based on correct syntax and on semantic/conceptual models, the input documents can be interpreted and the information processed - the level of agents.

My first, quick idea would be to create a converter from every encoding in use into Unicode. This could be a preprocessing task, done before parsing, or it could be done during parsing. Well, I hope I helped a bit.

Greets, gábor


DarkWolf

I have been going through the Xerces and libxml source code, and admittedly I don't understand the UTF-8/16 code - which handles the default, must-have encodings for XML parsers.

Oh well, I'll start with ascii, and maybe Unicode can be added later.
I have been looking through the HLA stdlib reference; I can think of a few ways to extract the information from a string, but I am trying to figure out how to retain the structure of the parsed file.

I have been trying to think of a way to do it one step at a time, starting with the prolog and working my way through the file.
But eventually you come to a fork in the road, so what I am trying to work out is how to add to what has already been parsed (see the sketch below).
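
One common way to keep the structure (a sketch, not the only design) is a first-child/next-sibling tree plus a "current open element" pointer: each start-tag allocates a node, hangs it under the current element and makes it current; each end-tag moves current back to its parent. In MASM-style declarations:

XMLNODE STRUCT
    pName       DWORD ?   ; -> element name, zero terminated
    pText       DWORD ?   ; -> character data, or 0
    pParent     DWORD ?   ; -> enclosing element's XMLNODE, 0 for the root
    pFirstChild DWORD ?   ; -> first child XMLNODE, or 0
    pNextSib    DWORD ?   ; -> next sibling XMLNODE, or 0
XMLNODE ENDS

Walking pFirstChild and then pNextSib recursively reproduces document order, and the "fork in the road" is handled by appending each new node at the end of the current element's child chain.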
--
Where there's smoke, There are mirrors.
Give me Free as in Freedom not Speech or Beer.
Thank You and Welcome to the Internet.

Sevag.K

I don't understand what you want to do; can you give an example?

chep

Hi,

Quote from: Randall Hyde on May 18, 2005, 09:04:04 PM
You cannot assume that the number of bytes reserved (or words) equals the number of characters. Once you cannot make that assumption, the software gets *far* more complex.
You're right, that's the problem with any multi-byte/word character set.

However, the main (only?) use I can see for character-counted strings is input validation (i.e. checking that a string is between x and y characters long). So in an XML parser I guess this will only take place at the schema-checking level; AFAIK every other routine should be happy with the buffer size.

Anyway, I see 2 cases:

1. You use raw null-terminated strings without any additional values. In this case you have to scan a string every time you need to manipulate it.

2. My preferred method: you store additional values along with the raw string, like its size (in bytes or words) and the buffer's maximum size (hmm... sounds like a UNICODE_STRING structure). Then nothing prevents you from adding another field containing the string length (in characters), and populating it along with the other fields during the initial string scan. E.g.:
UTF16_STRING STRUCT
    dwLengthInChars DWORD ?            ; length in characters (a surrogate pair counts once)
    usString        UNICODE_STRING<?>  ; Length/MaximumLength in bytes, plus Buffer
UTF16_STRING ENDS
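
A hedged sketch of the initial scan that would populate dwLengthInChars (it takes the raw buffer and code-unit count rather than the structure, counts a high surrogate plus the following unit as one character, and does not validate the pair):

; In:  pBuf -> UTF-16 code units, cUnits = number of 16-bit units
; Out: eax = number of characters (a surrogate pair counts once)
CountUtf16Chars proc uses esi pBuf:DWORD, cUnits:DWORD
    mov esi, pBuf
    mov ecx, cUnits
    xor eax, eax
nextChar:
    jecxz done
    movzx edx, word ptr [esi]
    add esi, 2
    dec ecx
    inc eax                     ; one more character...
    cmp dx, 0D800h
    jb nextChar
    cmp dx, 0DBFFh
    ja nextChar                 ; not a high surrogate
    jecxz done                  ; truncated pair at the end of the buffer
    add esi, 2                  ; ...swallow its low surrogate too
    dec ecx
    jmp nextChar
done:
    ret
CountUtf16Chars endp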



Quote from: Randall Hyde on May 18, 2005, 09:04:04 PM
Quote from: chep on May 18, 2005, 03:09:42 AM
Moreover Windows has native support for UTF-16 so there's no need to convert your in-memory UTF-16 strings before displaying them.
This is *not* what I'm hearing from people. Windows supports 16-bit Unicode *only*, not UTF-16 (i.e., no multi-word sequences). Correct me if I'm wrong, but this is what people are saying to me.
Sorry I gave incomplete/erroneous information... I'm used to Win2k+ so I forgot to think before I posted. :red

Win9x and NT4 don't support surrogates.
In Windows 2000 and later, UTF-16 surrogates are disabled *by default* unless you install an IME that needs surrogates, or you enable them yourself in the registry.

See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp for more info.

gabor

Hi

I think the topic has drifted a bit away from the original idea. Believe me, processing the text input is not the most difficult part of a parser. As I wrote before, there is a general concept that the international text representation should be Unicode/UTF-16. Since the header of an XML document can specify the encoding format, I think what to do is quite clear: the Windows API supports UTF-16 (the MultiByteToWideChar proc) and, as I've already written in another post, converters should be used to reach this format, so the parser can work independently of the source encoding.
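
As a concrete illustration (a sketch assuming the masm32 includes; the buffer names are made up), converting a UTF-8 buffer to UTF-16 before parsing is a single call:

.data
srcUtf8  db "<?xml version='1.0'?><doc/>",0
.data?
dstWide  dw 128 dup(?)

.code
    ; somewhere inside your conversion routine:
    ; CP_UTF8 = 65001. Passing -1 as the source length makes the API scan
    ; up to the terminating zero and include it in the returned count.
    invoke MultiByteToWideChar, CP_UTF8, 0, addr srcUtf8, -1, addr dstWide, 128
    ; eax = number of WCHARs written, or 0 on failure (see GetLastError)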

I suspect, and this is more than a plain suspicion, that parsing is the important and tough part. XML makes it possible to define recursive languages, and recursive languages are a bit harder to interpret; they require additional logic, or complex stack state machines.

Greets, Gábor


DarkWolf

Can I get some pointers (no pun intended) to resources on how to program Unicode support (UTF-8/16)?
I would like to write a parser that is compliant with the XML 1.1 specs if I can.
Otherwise I'll just have to use only us-ascii.

I have been to the Unicode website but did not find any material on writing programs to support Unicode.
Of course, I did expect not to find any.
--
Where there's smoke, There are mirrors.
Give me Free as in Freedom not Speech or Beer.
Thank You and Welcome to the Internet.