UTF-16LE Doesn't seem to be working

Started by Tapejara, January 12, 2010, 12:49:54 AM


Tapejara

Hi All,

I just downloaded the latest GoAsm (0.56.8) and am trying it out.  I converted the
Hello64World example to a UTF-16LE file and tried to assemble it.  It complained as follows: 

Error!
No code or data found in assembler source file
Obj file not made

It assembles fine from an ASCII file, and the executable prints normally.
The file in UTF-8 format builds without complaint, but every other character in the
output is a space.  I am sure this is because the assembler detected a Unicode
input file, but since the command prompt cannot print Unicode, I am wondering why
you have to alter the console output routines just because you want to process
Unicode characters internally.

Thanks.

donkey

Well, I assume that UTF-16LE isn't read properly because it has no BOM (Byte Order Mark) prepended, since the byte order is implicit in the scheme (LE = Little Endian). The GoAsm manual only mentions support for UTF-8 and UTF-16. Since UTF-16 has a prepended BOM, GoAsm is probably looking for that to identify the encoding type and byte order. The first bytes of a UTF-16LE file would normally not be a BOM (either FE FF or FF FE), so the file would be treated as ASCII and would look like it contains only a single character: one valid byte followed by a NULL. At any rate, I would be wary of using an encoding scheme that is not explicitly supported. Stick with UTF-16, UTF-8 or ANSI.
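To illustrate the guess above, here is a minimal Python sketch of BOM sniffing. It is not GoAsm's actual detection logic, which the thread does not show; the names `BOMS` and `sniff_encoding` are made up for the example, and "ansi" simply stands for "no BOM found".

```python
import codecs

# BOM byte sequences mapped to the encoding they announce.
# UTF-32 entries come first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(data: bytes, default: str = "ansi") -> str:
    """Hypothetical helper: name the encoding announced by a leading BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM: a reader relying on BOMs falls back to single-byte text.
    return default

# UTF-16LE text without a BOM: every second byte is 00.
no_bom = "Hi".encode("utf-16-le")
with_bom = codecs.BOM_UTF16_LE + no_bom

print(no_bom, sniff_encoding(no_bom))
print(with_bom, sniff_encoding(with_bom))
```

Under that assumption, a reader that treats the BOM-less file as single-byte text would stop at the first 00 byte, which matches the "single character followed by a NULL" picture.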

For the console issue, use WriteConsole rather than WriteFile to write text to the console. You can call WriteConsoleW explicitly or, if you are using my headers, just define the UNICODE switch.

#define UNICODE
STRINGS UNICODE

#DEFINE WINVER NTDDI_WINXP
#DEFINE LINKFILES
#DEFINE WIN32_LEAN_AND_MEAN
#include Windows.h

.data
hConsole HANDLE ?
cbWritten DWORD32 ?
Chars CHAR "Hello"

.code
START:

invoke GetStdHandle, STD_OUTPUT_HANDLE
mov [hConsole],eax

invoke WriteConsole,[hConsole],offset Chars,5,offset cbWritten,NULL

invoke Sleep,5000

invoke ExitProcess,0
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

Tapejara

Since I spent most of the day working on my editor to handle BOMs, that one didn't get by me.  But I hadn't used UTF-16 or UTF-32 much yet and the carriage returns were missing before every linefeed.  Once I got that fixed, it started assembling okay. 
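For anyone hitting the same line-ending problem, here is a small Python sketch of the repair described above: turning every bare LF in UTF-16LE text into CR LF. The function name `to_crlf_utf16le` is hypothetical, and the sketch assumes the whole file fits in memory.

```python
import re

def to_crlf_utf16le(raw: bytes) -> bytes:
    """Return UTF-16LE bytes with every bare LF turned into CR LF."""
    # A leading BOM decodes to U+FEFF and is re-encoded unchanged.
    text = raw.decode("utf-16-le")
    # Match only an LF that is not already preceded by a CR.
    fixed = re.sub(r"(?<!\r)\n", "\r\n", text)
    return fixed.encode("utf-16-le")

sample = "a\nb\r\nc".encode("utf-16-le")
print(to_crlf_utf16le(sample))
```

Decoding before the replacement matters: searching the raw bytes for 0A would also hit code units that merely contain that byte.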

As for the console issue, I'll have to study it further. 

Thanks for the help.

donkey

Quote from: Tapejara on January 12, 2010, 03:59:39 AM
Since I spent most of the day working on my editor to handle BOMs, that one didn't get by me.  But I hadn't used UTF-16 or UTF-32 much yet and the carriage returns were missing before every linefeed.  Once I got that fixed, it started assembling okay. 

As for the console issue, I'll have to study it further. 

Thanks for the help.

Hi Tapejara,

Sorry if I sounded pedantic, but without a source file I had to take a guess at the problem, and since I didn't know your level of competence it's better to explain completely rather than assume that you know what a BOM is.

Tapejara

Quote from: donkey on January 12, 2010, 04:05:41 AM
Hi Tapejara,

Sorry if I sounded pedantic, but without a source file I had to take a guess at the problem, and since I didn't know your level of competence it's better to explain completely rather than assume that you know what a BOM is.

No problem.  I was in a hurry so I didn't have time to include much information. 
Thanks again.

Tapejara

Quote from: donkey on January 12, 2010, 03:12:03 AM
For the console issue, use WriteConsole to write text to the console not WriteFile, you can specify WriteConsoleW or if you are using my headers just define the UNICODE switch.

I used WriteConsoleW and altered the Message to read the following to see how Unicode characters show up:

Message DB 'Hello «λ» World'

Running this has the output:

Message DB 'Hello «?» World'

Notice that the double angle quotes are there but the 'λ' is turned into '?'

This should not have worked because DB defines Bytes not Words.  However, DW gives the same results.  Can you tell me what is going on here? 

Thanks

donkey

I am pretty sure that when STRINGS UNICODE is defined, all quoted strings are converted to Unicode whether you specify DB or not. For this reason my headers define the CHAR data type, which I find a bit clearer (just as they define HANDLE, which is dynamically resized based on the WIN64 switch). To force a string to ANSI you can prepend an A to the string; similarly, prepending an L will force a string to Unicode:

AnsiStr DB A"hello",0
WideStr DB L"hello",0

When you assemble a file encoded as Unicode, the STRINGS UNICODE switch is automatically defined. The ? character is the substitution character Windows uses when a conversion error is encountered or a character cannot be mapped; AFAIK it is not related to GoAsm. You can change this character in the WideCharToMultiByte function (its lpDefaultChar parameter), but the WriteConsole function does not provide that option. I should note that the STRINGS UNICODE conversion behavior was extensively discussed before it was implemented.
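The substitution can be reproduced outside Windows with a short Python sketch. It assumes the console is using OEM code page 437, which the thread never states, and the helper name `through_code_page` is made up for the example.

```python
def through_code_page(text: str, code_page: str = "cp437") -> str:
    """Hypothetical helper: round-trip text through a legacy code page.

    Characters the code page cannot represent are replaced with '?',
    much like the default substitution character on Windows.
    """
    narrowed = text.encode(code_page, errors="replace")
    return narrowed.decode(code_page)

print(through_code_page("Hello «λ» World"))
```

Code page 437 contains the double angle quotes but not the Greek small letter lambda, so under that assumption only the λ is lost.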

Edgar

Yuri

I can see the «λ» if I change the console font to 'Lucida Console'.

Tapejara

Quote from: Yuri on January 13, 2010, 04:52:03 AM
I can see the «λ» if I change the console font to 'Lucida Console'.
So how do you select a font for the console? 

Tapejara

Quote from: donkey on January 12, 2010, 11:21:39 PM
I am pretty sure that when STRINGS UNICODE is defined all quoted strings are converted to Unicode regardless of whether you specify DB or not...

This is very troubling.  I am used to defining bytes and then knowing that the assembler will produce bytes, not something else.  I can conceive of writing a program that reports to the console in Unicode, but whose job is to process ASCII text, not Unicode.  If some Unicode switch makes a global change in the definition of DB, then I am in trouble.  What is produced if I write:

DB 'Unicode Text',04Ah,' more Unicode Text'

Is there a byte in there sandwiched between two strings of UTF-16?  If so, this is a problem that needs to be fixed.  If it were me, I would make DB ALWAYS mean define byte.  If you want UTF-16, you say DW instead.  If you want UTF-32, you say DD instead, regardless of whether the source text is Unicode or not.  Granted, you can write text in DB strings that is not ASCII.  The assembler should either truncate, print a warning, or consider it an error if any of the characters within quotes are not ASCII (i.e. the upper nine bits are not zero).

As for the Unicode that is supported, is it only the 16-bit BMP, or the entire 21-bit range encoded in UTF-8, UTF-16 and UTF-32?

Yuri

Quote from: Tapejara
So how do you select a font for the console?

Well, actually I did it manually, through the properties, and wrote that just to point out where the problem seems to be.

About selecting it programmatically, here is what I've found so far: How to change the console font programmaticaly (to support unicode)

donkey

You can override the automatic STRINGS UNICODE by specifying STRINGS ANSI in your source; that will force ANSI encoding regardless of the format of the file. The DB/DW issue is by design: quoted strings are automatically encoded as either Unicode or ANSI depending on the STRINGS directive, but immediate (non-quoted) data is not. If you wish to mix immediate and quoted characters, you can use DW for Unicode and DB for ANSI. However, if you use DB, immediate data will not be converted to WORDs in a Unicode application, for obvious reasons.
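To make the layout question concrete, here is a Python model of the rule as described in this post: under DB, quoted strings follow the STRINGS setting while immediate values stay single bytes. It is only a model of that description, not verified against GoAsm output, and `emit_db` is a hypothetical name.

```python
def emit_db(items, wide_strings: bool) -> bytes:
    """Hypothetical model of a DB line mixing quoted strings and immediates."""
    out = bytearray()
    for item in items:
        if isinstance(item, str):
            # Quoted string: encoding follows the STRINGS setting.
            out += item.encode("utf-16-le" if wide_strings else "cp1252")
        else:
            # Immediate value: stays one byte under DB in either mode.
            out.append(item)
    return bytes(out)

line = ["AB", 0x4A, "C"]
print(emit_db(line, wide_strings=True))   # models STRINGS UNICODE
print(emit_db(line, wide_strings=False))  # models STRINGS ANSI
```

In the wide case the lone byte leaves the following string at an odd offset, which is the sandwich Tapejara asked about and the reason for pairing immediates with DW in Unicode data.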

There are trade-offs in every feature. GoAsm provides a rich set of automated conversions to give you flexibility in assembling your source, and for every one of these features there is an override directive that will disable it if your code cannot use it. This transparent conversion is one of the things I like in GoAsm, and it is constantly being expanded to provide more seamless cross assembly of 32/64 bit and Unicode/ANSI applications. My hope is that I can eventually write a program and assemble it under WIN32 ANSI and all of the other variants without any source level editing.