UTF-8 Support in HLA v2.0

Randall Hyde · May 12, 2005, 04:47:09 PM

Hi All,

Well, I've spent the past two days hacking UTF-8 support into HLA v2.0. I'm sure I'll be tracking down defects for the next several months (heck, years!), but a quick and casual test demonstrates that things work decently thus far.

I added a new string type, "utf8", to the HLA language to support this. Within the assembler, utf8 strings are quite similar to standard strings. When I get around to redoing the HLA Standard Library for HLA v2.0, I'll have to add a "utf" module to the library because we'll need all new string handling functions. Given the multi-byte character format, I'm afraid that UTF-8 string processing will be slower than processing standard ANSI strings, but such is the price one has to pay...

I'm currently thinking of the following format for UTF-8 strings in the run-time library:

utf8: record := -12;
    numChars :dword;
    maxLen   :dword;
    length   :dword;
    charData :char[ maxLen+1 ];
endrecord;

The "numChars" field would contain the length of the string in characters, the "length" field will contain the length in bytes (which is not necessarily the same as "numChars"). Actually, I'll probably use the names "length" and "byteLength" rather than "numChars" and "length", but the naming scheme above maintains the names inherited from standard strings.
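
To make the difference between the two length fields concrete, here is a rough sketch in Python (not HLA, and not the actual run-time library). The helper name utf8_header and its max_len parameter are made up for illustration, and the trailing zero byte is an assumption based on the maxLen+1 size of charData. It counts characters by counting every byte that is not a UTF-8 continuation byte (10xxxxxx).

```python
# Illustrative sketch only: utf8_header is a hypothetical helper,
# not part of HLA or the HLA Standard Library.

def utf8_header(text, max_len=None):
    """Return the fields of the proposed record for a given string."""
    data = text.encode("utf-8")
    # Each character has exactly one byte that is NOT a continuation
    # byte (continuation bytes look like 10xxxxxx), so counting those
    # bytes gives the character count.
    num_chars = sum(1 for b in data if (b & 0xC0) != 0x80)
    if max_len is None:
        max_len = len(data)
    if len(data) > max_len:
        raise ValueError("string does not fit in max_len bytes")
    return {
        "numChars": num_chars,        # length in characters
        "maxLen": max_len,            # capacity in bytes
        "length": len(data),          # length in bytes
        "charData": data + b"\x00",   # assumed zero terminator
    }


if __name__ == "__main__":
    header = utf8_header("naïve €")
    print(header["numChars"], header["length"])
```

For pure ASCII text the two fields come out equal; they only diverge once a character needs more than one byte.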

It would be *really* nice if I could figure out a fast way to locate each character's index in the string (without having to scan the entire string). But that optimization has escaped me to date. That's okay, lots of time to work on it.
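
For reference, the scan that one would like to avoid looks roughly like the following Python sketch. The function name byte_offset_of_char is hypothetical; it simply walks the bytes and counts character starts until it reaches the requested character, so the cost grows with the position of that character.

```python
# Illustrative sketch only: the baseline linear scan, with a made-up name.

def byte_offset_of_char(data, char_index):
    """Return the byte offset at which character number char_index starts."""
    if char_index < 0:
        raise IndexError("character index must be non-negative")
    seen = 0
    for offset, b in enumerate(data):
        # ASCII bytes and UTF-8 lead bytes start a character;
        # continuation bytes (10xxxxxx) do not.
        if (b & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


if __name__ == "__main__":
    data = "aé€b".encode("utf-8")
    print([byte_offset_of_char(data, i) for i in range(4)])
```
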
Cheers,
Randy Hyde

Sevag.K · May 13, 2005, 12:17:36 AM

I seem to have built up a disdain for anything utf lately, but as it is something many people request, I suppose it makes sense to include it!

Quote
The "numChars" field would contain the length of the string in characters, the "length" field will contain the length in bytes (which is not necessarily the same as "numChars"). Actually, I'll probably use the names "length" and "byteLength" rather than "numChars" and "length", but the naming scheme above maintains the names inherited from standard strings.

"numBytes" would be my choice.

Quote
It would be *really* nice if I could figure out a fast way to locate each character's index in the string (without having to scan the entire string). But that optimization has escaped me to date. That's okay, lots of time to work on it.
Cheers,

I've been doing some work on filesystems, and one thing that works well there is a bitmap. Have you thought of this kind of approach?

Eg:
Each bit represents 1 byte of data.
Each 1 bit represents a character.

110111001

bit #.....char index (in bytes)
0.........0
1.........1
2.........
3.........3
4.........4
5.........5
6.........
7.........
8.........8
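
A rough Python sketch of this bitmap idea, under the assumption that a 1 bit marks the first byte of each character. The function names are made up for illustration, and a Python integer stands in for the packed bit array, with bit i describing byte i of the string. Finding character n then means finding the n-th set bit.

```python
# Illustrative sketch only: all names here are hypothetical.

def build_start_bitmap(data):
    """Set bit i when byte i of data starts a character."""
    bits = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # not a continuation byte
            bits |= 1 << i
    return bits


def nth_char_offset(bits, n):
    """Return the byte offset of character n (the n-th set bit)."""
    if n < 0:
        raise IndexError("character index must be non-negative")
    for _ in range(n):
        bits &= bits - 1            # clear the lowest set bit
    if bits == 0:
        raise IndexError("character index out of range")
    return (bits & -bits).bit_length() - 1   # position of lowest set bit


def bitmap_text(bits, nbytes):
    """Render the bitmap with bit 0 first, as in the example above."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(nbytes))


if __name__ == "__main__":
    data = "aébc€d".encode("utf-8")   # 9 bytes, 6 characters
    bits = build_start_bitmap(data)
    print(bitmap_text(bits, len(data)))
    print([nth_char_offset(bits, n) for n in range(6)])
```

The string "aébc€d" happens to reproduce the 110111001 pattern from the example. Note that this version still walks the set bits one at a time; a real implementation would presumably count bits a whole word at a time, and the map itself costs one extra bit per byte of character data.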

Randall Hyde · May 13, 2005, 11:21:52 PM

Quote from: Sevag.K on May 13, 2005, 12:17:36 AM
I seem to have built up a disdain for anything utf lately, but as it is something many people request, I suppose it makes sense to include it!

I agree. However, it's pretty clear that Unicode/UTF support is going to be absolutely necessary to remain competitive. Fortunately, writing an assembler that processes UTF-8 is a couple of orders of magnitude easier than writing an editor to process UTF-8. I don't envy you at all.

Quote
I've been doing some work on filesystems and one thing that works well there are bitmaps. Have you thought of this kind of approach?

Sure. It's what I use for HLA character sets, for example. There is still the issue of the extra space (even if it's 1/8th the length of the character data). Of course, processing the bit map might not be cheap (i.e., it may be cheaper just to process the UTF-8 data).
Cheers,
Randy Hyde
