How to convert Unicode character to small case?

Started by Igor, April 24, 2008, 03:58:54 PM

Igor

With ASCII I just had to check whether the character code is between 65 and 90 and add 32.

How do I convert a Unicode character to lowercase?
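
For reference, the ASCII trick from the question can be sketched as follows. This is Python rather than MASM, purely so the behaviour is easy to check, and `ascii_lower` is a hypothetical helper name, not part of any library. The range test still works on wide characters for A to Z, but it leaves accented Latin and Cyrillic capitals untouched.

```python
def ascii_lower(text: str) -> str:
    """Lowercase only A..Z (codes 65..90) by adding 32; copy everything else as-is."""
    out = []
    for ch in text:
        code = ord(ch)
        if 65 <= code <= 90:      # same range test as the ASCII version
            code += 32            # 'A' (65) becomes 'a' (97)
        out.append(chr(code))
    return "".join(out)


sample = "aBcDeFg ÄÖ ЖЂ"
print(ascii_lower("aBcDeFg"))                  # abcdefg
# Capitals outside A..Z are not converted, so the result differs from a
# Unicode-aware lowercase of the same string.
print(ascii_lower(sample) == sample.lower())   # False
```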

MichaelW

There are multiple methods. One simple one would be to use the ucLower procedure from the MASM32 library.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      ascii db "aBcDeFg",0
      wide  dw 100 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    ; ----------------------------------
    ; Start by converting ascii to wide.
    ; ----------------------------------

    invoke MultiByteToWideChar,CP_ACP,
                               MB_PRECOMPOSED,
                               ADDR ascii,
                               -1,
                               ADDR wide,
                               LENGTHOF ascii

    invoke crt_printf, chr$("%S%c"), ADDR wide, 10

    invoke ucLower, ADDR wide

    invoke crt_printf, chr$("%S%c"), ADDR wide, 10

    invoke ucUpper, ADDR wide

    invoke crt_printf, chr$("%S%c"), ADDR wide, 10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


aBcDeFg
abcdefg
ABCDEFG

You could probably use a method similar to the one you describe, but I suspect that getting reliable results would require some more or less complex code.
eschew obfuscation

donkey

Case conversion on Unicode strings requires highly specialized character mapping, and in many character sets the map is unique to that particular implementation. For example, there are 139 lower case letters in Unicode that have no upper case equivalent (sharp S (ß) is one of them). To read more, you might want to visit http://www.unicode.org/, which has links to various technical articles. Alternatively, if you are expecting English only, you might just convert to ANSI using the WideCharToMultiByte API, do your case conversion and then convert back. This would not guarantee much in the way of accuracy, since not all characters are properly translated by the API, but for English it's not a problem.
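
The ANSI round trip described above can be illustrated with a small sketch. It is Python rather than a WideCharToMultiByte call, and it assumes code page 1252 as the ANSI code page (the real CP_ACP depends on the system). The name `ansi_round_trip_lower` is hypothetical. Characters that do not exist in the code page are replaced with `?` on the way out, which is the kind of accuracy loss mentioned above.

```python
def ansi_round_trip_lower(text: str, codepage: str = "cp1252") -> str:
    """Encode to a single-byte code page, lowercase A..Z in the bytes, decode back."""
    narrow = text.encode(codepage, errors="replace")  # unmappable characters become b'?'
    return narrow.lower().decode(codepage)            # bytes.lower() only touches ASCII A..Z


print(ansi_round_trip_lower("Hello World"))           # hello world
# Cyrillic text has no representation in code page 1252, so it does not survive.
print(ansi_round_trip_lower("Жива") == "????")        # True
```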

EDIT: MichaelW and I posted at the same time so there is some redundancy in the info.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

Igor

Thanks for your replies. I guess I should just stick to the Windows API when manipulating Unicode strings.

It's going to be much slower, but I have to go with Unicode because it's a multi-language app.

donkey

No Problem Igor,

If you want to tackle Unicode character translation (the way it should be done and not a kludge), you can use the unicode.org map file as a reference. It is rather large and complex, but it does map out all of the case conversion rules...

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
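
A minimal sketch of reading that file, in Python for easy checking. Each record has 15 semicolon-separated fields, and field 13 (counting from 0) holds the simple lowercase mapping. The four sample records and the helper names are illustrative assumptions, so check them against the real file. UnicodeData.txt only carries one-to-one mappings; the multi-character ones are listed separately in SpecialCasing.txt.

```python
# Sample records in the UnicodeData.txt layout (semicolon-separated, 15 fields).
# They are quoted here for illustration; verify them against the real file.
SAMPLE_RECORDS = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
0416;CYRILLIC CAPITAL LETTER ZHE;Lu;0;L;;;;;N;;;;0436;
"""


def build_lowercase_table(data: str) -> dict:
    """Map code point -> simple lowercase code point for every record that has one."""
    table = {}
    for line in data.splitlines():
        fields = line.split(";")
        if len(fields) < 15 or not fields[13]:
            continue                      # field 13 is the simple lowercase mapping
        table[int(fields[0], 16)] = int(fields[13], 16)
    return table


def simple_lower(text: str, table: dict) -> str:
    """Apply the one-to-one lowercase table; unmapped characters pass through."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


LOWER = build_lowercase_table(SAMPLE_RECORDS)
print(sorted(hex(cp) for cp in LOWER))    # ['0x41', '0x416']
print(simple_lower("AAA", LOWER))         # aaa
```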

Donkey

jj2007

Tried CharLowerBuff?

The CharLowerBuff function converts uppercase characters in a buffer to lowercase characters. The function converts the characters in place. The function supersedes the AnsiLowerBuff function.

DWORD CharLowerBuff(

    LPTSTR lpsz,   // pointer to buffer containing characters to process
    DWORD cchLength    // number of bytes or characters to process 
   );   


Parameters

lpsz

Pointer to a buffer containing one or more characters to process.

cchLength

Specifies the size, in bytes (ANSI version) or characters (Unicode version), of the buffer pointed to by lpsz.
The function examines each character, and converts uppercase characters to lowercase characters. The function examines the number of bytes or characters indicated by cchLength, even if one or more characters are null characters.

Return Values

If the function succeeds, the return value is the number of bytes (ANSI version) or characters (Unicode version) processed.

MichaelW

Perhaps I should have mentioned that ucLower calls CharLowerBuffW, and ucUpper calls CharUpperBuffW.

donkey

Quote from: jj2007 on April 25, 2008, 01:53:55 AM
Tried CharLowerBuff?

CharLowerBuff does not correctly map all characters when using a non-English character set. If you would like to use the API, you should use LCMapString, specifying LCMAP_LOWERCASE/LCMAP_UPPERCASE in conjunction with LCMAP_LINGUISTIC_CASING, which will do the mapping in compliance with the spec file I posted. I am assuming here that you are using either the Cyrillic or the Serbian Latin alphabet because of your location in Serbia, so you will need locale-specific mapping.

Donkey

zooba

And I'll contribute a link to Michael Kaplan's blog. Michael works for Microsoft in internationalisation and blogs about (among other things) issues and functions involved in this area. I'm sure you'll find plenty of useful information in there.

(For the record, I don't use ASCII strings for anything anymore. Using UTF-16 ties me well enough to Windows (*nix preferred is UTF-8) that I can easily use their functions for most of my string work.)

Cheers,

Zooba :U

Igor

I used LCMapString for characters higher than 255 and standard ASCII conversion for characters lower than 256, which should be faster than calling LCMapString for every char.


cmp cx, 255 ; code unit in 0..255 (ASCII plus Latin-1)?
jbe ascii ; unsigned compare, so units of 8000h and above are not treated as negative
unicode:
invoke LCMapString, LOCALE_USER_DEFAULT, LCMAP_LOWERCASE, rdx, 1, rdx, 1
jmp _end
ascii:
;.if (cx >= 65) && (cx <= 90)
cmp cx, 65 ; below 'A'?
jb _endif1
cmp cx, 90 ; above 'Z'?
ja _endif1
add cx, 32 ; small case
mov word ptr [rdx], cx
_endif1:
_end:


EDIT: I see now that without setting the LCMAP_LINGUISTIC_CASING flag everything is the same as with CharLowerBuff. And if I use LCMAP_LINGUISTIC_CASING I can expect one char to become two chars or vice versa, so I will have to rethink my approach on this :)

EDIT2: I got it wrong, it will never convert one char to two....
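
Whether a case mapping can change the length of a string depends on which mapping is applied. As a point of comparison only, Python's `str.upper()` and `str.lower()` apply Unicode's full case mappings, where one character can become two. This says nothing about what LCMapString itself returns for the same input, which should be checked on the target system.

```python
# Each entry: (description, original, converted). Lengths are in code points.
cases = [
    ("sharp s to upper", "\u00df", "\u00df".upper()),            # becomes "SS"
    ("dotted capital I to lower", "\u0130", "\u0130".lower()),   # becomes i + U+0307
    ("plain A to lower", "A", "A".lower()),                      # stays one character
]

for name, before, after in cases:
    print(f"{name}: {len(before)} -> {len(after)} code point(s)")
```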

jj2007

No particular problems with non-English characters - but test yourself for non-Latin characters...

include \masm32\include\masm32rt.inc

.data?
buffer db 1024 dup (?)

.code

AppName db "T",0,"e",0,"s",0,"t", 0, 0, 0
Test$ db "This is ä fünny éxèrcise", 0

start:
; the last argument is the output size in wide characters: the 1024-byte buffer holds 512 of them
invoke MultiByteToWideChar, CP_ACP, 0, addr Test$, sizeof Test$, addr buffer, 512
; invoke MessageBoxW, NULL, addr buffer, addr AppName, MB_OK
invoke CharLowerBuffW, addr buffer, sizeof Test$
invoke MessageBoxW, NULL, addr buffer, addr AppName, MB_OK
invoke CharUpperBuffW, addr buffer, sizeof Test$
invoke MessageBoxW, NULL, addr buffer, addr AppName, MB_OK
invoke ExitProcess, 0
end start

donkey

Quote from: jj2007 on April 25, 2008, 01:36:04 PM
No particular problems with non-English characters - but test yourself for non-Latin characters...

I should have said non-western, just anglophone arrogance I suppose :)

Edgar

zooba

Quote from: Igor on April 25, 2008, 11:34:44 AM
I used LCMapString for characters higher than 255 and standard ASCII conversion for characters lower than 256, which should be faster than calling LCMapString for every char.

I doubt it. It's more likely to be faster to call LCMapString once for the entire string. And there are a few cases where some Latin characters don't always map as they do in English (I believe Turkish is an example of this).

Quote from: Igor on April 25, 2008, 11:34:44 AM
EDIT2: I got it wrong, it will never convert one char to two....

Never is an incredibly bad word to use when talking about languages :bg . Governments seem to like changing the rules all the time and Microsoft works quite hard to keep up to date (or face being banned in that country... it's been threatened). It is possible that now or in the future a single-word uppercase character maps to a double-word lowercase character. Using LCMapString for the entire string effectively solves this problem, and also ensures that your program will work with future updates to the languages.

Cheers,

Zooba :U
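
The Turkish case mentioned above is the usual example of why the locale matters: in Turkish, capital I lowercases to dotless ı and capital İ lowercases to plain i. Below is a minimal sketch of a locale-aware wrapper, in Python for easy checking. `lower_for_language` and its `lang` parameter are hypothetical and only model the idea of passing a locale together with the whole string, as LCMapString does with its Locale argument; it is not a reimplementation of that API.

```python
# Turkish and Azerbaijani tailoring for the two I letters.
# All other letters fall through to the default mapping.
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}   # I -> dotless i, dotted capital I -> i


def lower_for_language(text: str, lang: str) -> str:
    """Lowercase a whole string, applying the Turkic I rules when lang is 'tr' or 'az'."""
    if lang in ("tr", "az"):
        text = "".join(TURKIC_LOWER.get(ch, ch) for ch in text)
    return text.lower()


print(lower_for_language("TITLE", "en"))                   # title
print(lower_for_language("TITLE", "tr") == "t\u0131tle")   # True
```

Converting the whole string in one call keeps the locale decision in a single place, which is the design point made above.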

Mark Jones

"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08