UTF-16LE Doesn't seem to be working

Tapejara · January 12, 2010, 12:49:54 AM

Hi All,

I just downloaded the latest GoAsm (0.56.8) and am trying it out. I converted the
Hello64World example to a UTF-16LE file and tried to assemble it. It complained as follows:

Error!
No code or data found in assembler source file
Obj file not made

It does the assembly okay with an ASCII file and the executable prints normally.
The file in UTF-8 format builds without complaint, but every other character is a
space in the output. I am sure that this is because the assembler detected a Unicode
input file, but since the command prompt cannot print Unicode, I am wondering why
you have to alter the console output routines just because you want to process
Unicode characters internally.

Thanks.

donkey · January 12, 2010, 03:12:03 AM

Well, I assume that UTF-16LE isn't read properly because it has no BOM (Byte Order Mark) prepended, since the byte order is implicit in the scheme (LE = Little Endian). The GoAsm manual only mentions support for UTF-8 and UTF-16; since UTF-16 has a prepended BOM, GoAsm is probably looking for that to identify the encoding type and byte order. Since the first bytes in a UTF-16LE file would normally not be a BOM (either FE FF or FF FE), the file would look like it contains only a single character: it would be seen as an ASCII file containing a single valid byte followed by a NULL. At any rate, I would be wary of using an encoding scheme that is not explicitly supported; stick with UTF-16, UTF-8 or ANSI.
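
As a rough illustration of the BOM point above, here is a minimal Python sketch. It is not GoAsm's actual detection logic (which is not shown in this thread); `sniff_encoding` is a hypothetical helper that only looks at the leading bytes, and the fallback to "ansi" is an assumption about how a BOM-less file might be treated.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess an encoding from a leading BOM (hypothetical helper, not GoAsm's logic)."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    return "ansi"                               # assumption: no BOM means single-byte text

source = ".code\r\n"
no_bom = source.encode("utf-16-le")             # UTF-16LE without a BOM
with_bom = codecs.BOM_UTF16_LE + no_bom         # same text with FF FE prepended

print(sniff_encoding(no_bom))     # ansi
print(sniff_encoding(with_bom))   # utf-16-le
print(no_bom[:4])                 # b'.\x00c\x00'
```

The last line shows why a BOM-less UTF-16LE file read as single-byte text appears to end after one character: the second byte is already a NULL.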

For the console issue, use WriteConsole rather than WriteFile to write text to the console. You can specify WriteConsoleW or, if you are using my headers, just define the UNICODE switch.

#define UNICODE
STRINGS UNICODE

#DEFINE WINVER NTDDI_WINXP
#DEFINE LINKFILES
#DEFINE WIN32_LEAN_AND_MEAN
#include Windows.h

.data
	hConsole	HANDLE	?
	cbWritten	DWORD32	?
	Chars		CHAR	"Hello"

.code
START:

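; Get the handle for standard output (the console)
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov [hConsole],eax

; With UNICODE defined this resolves to WriteConsoleW; 5 is the length in characters
invoke WriteConsole,[hConsole],offset Chars,5,offset cbWritten,NULL

; Keep the console window open for 5 seconds before exiting
invoke Sleep,5000

invoke ExitProcess,0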

Tapejara · January 12, 2010, 03:59:39 AM

Since I spent most of the day working on my editor to handle BOMs, that one didn't get by me. But I hadn't used UTF-16 or UTF-32 much yet and the carriage returns were missing before every linefeed. Once I got that fixed, it started assembling okay.

As for the console issue, I'll have to study it further.

Thanks for the help.

donkey · January 12, 2010, 04:05:41 AM

Quote from: Tapejara on January 12, 2010, 03:59:39 AM
Since I spent most of the day working on my editor to handle BOMs, that one didn't get by me. But I hadn't used UTF-16 or UTF-32 much yet and the carriage returns were missing before every linefeed. Once I got that fixed, it started assembling okay.

As for the console issue, I'll have to study it further.

Thanks for the help.

Hi Tapejara,

Sorry if I sounded pedantic, but without a source file I had to take a guess at the problem, and since I didn't know your level of competence, it's better to explain completely rather than assume that you know what a BOM is.

Tapejara · January 12, 2010, 02:05:59 PM

Quote from: donkey on January 12, 2010, 04:05:41 AM
Hi Tapejara,

Sorry if I sounded pedantic, but without a source file I had to take a guess at the problem, and since I didn't know your level of competence, it's better to explain completely rather than assume that you know what a BOM is.

No problem. I was in a hurry so I didn't have time to include much information.
Thanks again.

Tapejara · January 12, 2010, 10:18:19 PM

Quote from: donkey on January 12, 2010, 03:12:03 AM
For the console issue, use WriteConsole rather than WriteFile to write text to the console. You can specify WriteConsoleW or, if you are using my headers, just define the UNICODE switch.

I used WriteConsoleW and altered the Message to read the following to see how Unicode characters show up:

Message DB 'Hello «λ» World'

Running this has the output:

Message DB 'Hello «?» World'

Notice that the double angle quotes are there but the 'λ' is turned into '?'

This should not have worked because DB defines Bytes not Words. However, DW gives the same results. Can you tell me what is going on here?

Thanks

donkey · January 12, 2010, 11:21:39 PM

I am pretty sure that when STRINGS UNICODE is defined all quoted strings are converted to Unicode regardless of whether you specify DB or not. For this reason my headers define the CHAR data type, which I find a bit clearer (much as HANDLE is dynamically resized based on the WIN64 switch). To force a string to ANSI you can prepend an A to the string; similarly, prepending an L will force a string to Unicode...

AnsiStr DB A"hello",0
WideStr DB L"hello",0

When you assemble a file encoded as Unicode, the STRINGS UNICODE switch is automatically defined. The ? character is the substitution character Windows uses when an error in conversion is encountered or a character cannot be mapped; AFAIK it is not related to GoAsm. You can change this character in the WideCharToMultiByte function (via its lpDefaultChar parameter); the WriteConsole function does not provide that option. I should note that the STRINGS UNICODE conversion behavior was extensively discussed before it was implemented.
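
To make the substitution visible, here is a small Python sketch. It assumes the console is effectively rendering through OEM code page 437, which is only a guess about Tapejara's setup; the point is just that « and » exist in that code page while λ does not.

```python
text = "Hello «λ» World"

# Assumption: the console maps output to code page 437, which has « and » but no λ.
# errors="replace" substitutes '?' for any character the code page cannot represent.
oem = text.encode("cp437", errors="replace")
print(oem.decode("cp437"))   # Hello «?» World

# The UTF-16LE data itself still carries λ (U+03BB) intact at character index 7.
wide = text.encode("utf-16-le")
print(wide[14:16].hex())     # bb03
```

If this is what is happening, the string data is fine and the loss occurs only at display time.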

Edgar

Yuri · January 13, 2010, 04:52:03 AM

I can see the «λ» if I change the console font to 'Lucida Console'.

Tapejara · January 17, 2010, 03:49:32 AM

Quote from: Yuri on January 13, 2010, 04:52:03 AM
I can see the «λ» if I change the console font to 'Lucida Console'.

So how do you select a font for the console?

Tapejara · January 17, 2010, 04:06:43 AM

Quote from: donkey on January 12, 2010, 11:21:39 PM
I am pretty sure that when STRINGS UNICODE is defined all quoted strings are converted to Unicode regardless of whether you specify DB or not...

This is very troubling. I am used to defining bytes and then knowing that the assembler will produce bytes, not something else. I can conceive of writing a program that reports to the console in Unicode, but whose job is to process ASCII text, not Unicode. If some Unicode switch makes a global change in the definition of DB, then I am in trouble. What is produced if I write:

DB 'Unicode Text',04Ah,' more Unicode Text'

Is there a byte in there sandwiched in between two strings of UTF-16? If so, this is a problem that needs to be fixed. If it were me, I would make DB ALWAYS mean define byte. If you want UTF-16, then you say DW instead. If you want UTF-32, then you say DD instead, regardless of whether the source text is Unicode or not. Granted, you can write text in the DB strings that is not ASCII. The assembler should either truncate, print a warning, or consider it to be an error if any of the characters within quotes are not ASCII (i.e. the upper nine bits are not zero).

As for the Unicode that is supported, is it only the 16-bit BMP or the entire 21-bit range encoded in UTF-8, UTF-16 and UTF-32?

Yuri · January 17, 2010, 07:20:20 AM

Quote from: Tapejara
So how do you select a font for the console?

Well, actually I did it manually, through the properties, and wrote that just to point out where the problem seems to be.

About selecting it programmatically, here is what I've found so far: How to change the console font programmaticaly (to support unicode)

donkey · January 17, 2010, 07:27:30 AM

You can override the automatic STRINGS UNICODE by specifying STRINGS ANSI in your source; that will force ANSI encoding regardless of the format of the file. The DB/DW issue is by design: quoted strings are automatically encoded as either Unicode or ANSI depending on the STRINGS directive, but immediate (non-quoted) data is not. If you wish to mix immediate and quoted characters, you can use DW for Unicode and DB for ANSI. However, if you use DB, immediate data will not be converted to WORDs in a Unicode application, for obvious reasons.
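
The rule described above can be modelled in a few lines of Python. This is only a sketch of the behaviour as stated in this post, not output captured from GoAsm; `emit` is a hypothetical helper, and cp1252 stands in for "ANSI" as an assumption.

```python
def emit(items, unit_size, strings_unicode=True):
    """Model: quoted strings follow STRINGS, immediates follow DB (1) or DW (2)."""
    out = bytearray()
    for item in items:
        if isinstance(item, str):       # quoted string
            out += item.encode("utf-16-le" if strings_unicode else "cp1252")
        else:                           # immediate value such as 04Ah
            out += item.to_bytes(unit_size, "little")
    return bytes(out)

line = ["Hi", 0x4A, "!"]                # stands in for: 'Hi',04Ah,'!'

print(emit(line, unit_size=1).hex(" "))                         # 48 00 69 00 4a 21 00
print(emit(line, unit_size=2).hex(" "))                         # 48 00 69 00 4a 00 21 00
print(emit(line, unit_size=1, strings_unicode=False).hex(" "))  # 48 69 4a 21
```

Under this model the first line (DB with STRINGS UNICODE) does leave a lone byte between two UTF-16 runs, which is why DW is the suggested choice when mixing immediates into Unicode strings.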

There are trade-offs in every feature. GoAsm provides a rich set of automated conversions to give you flexibility in assembling your source, and for every one of these features there is an override directive that will disable it if your code cannot use it. This transparent conversion is one of the things I like in GoAsm, and it is constantly being expanded to provide more seamless cross assembly of 32/64 bit and Unicode/ANSI applications. My hope is that I can eventually write a program and assemble it under WIN32 ANSI and all of the other variants without any source-level editing.
