How to convert Unicode character to small case?

Igor · April 24, 2008, 03:58:54 PM

With ASCII I just had to check whether the character code is between 65 and 90 and, if so, add 32.

How to convert Unicode character to small case?
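
For reference, the ASCII rule described in this post can be sketched in a few lines. This is only an illustration in Python, not code from the thread, and the function name `ascii_lower` is made up for the example. It shows why the rule stops working as soon as a character falls outside the range 65 to 90.

```python
def ascii_lower(code_unit: int) -> int:
    """Lowercase one character code using the ASCII rule only (hypothetical helper)."""
    if 65 <= code_unit <= 90:       # 'A'..'Z'
        return code_unit + 32       # 'a'..'z'
    return code_unit                # everything else is left untouched


print(chr(ascii_lower(ord("B"))))   # prints: b
print(chr(ascii_lower(ord("Ä"))))   # prints: Ä (outside 65..90, so not converted)
```

The second call is the interesting one: "Ä" is an uppercase letter, but the range check knows nothing about it.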

MichaelW · April 24, 2008, 04:48:25 PM

There are multiple methods. A simple one is to use the ucLower procedure from the MASM32 library.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      ascii db "aBcDeFg",0
      wide  dw 100 dup(0)
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    ; ----------------------------------
    ; Start by converting ascii to wide.
    ; ----------------------------------

    invoke MultiByteToWideChar,CP_ACP,
                               MB_PRECOMPOSED,
                               ADDR ascii,
                               -1,
                               ADDR wide,
                               LENGTHOF ascii

    invoke crt_printf, chr$("%S%c"), ADDR wide, 10

    invoke ucLower, ADDR wide

    invoke crt_printf, chr$("%S%c"), ADDR wide, 10

    invoke ucUpper, ADDR wide

    invoke crt_printf, chr$("%S%c"), ADDR wide, 10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

aBcDeFg
abcdefg
ABCDEFG

You could probably use a method similar to the one you describe, but I suspect that getting reliable results would require some more or less complex code.

donkey · April 24, 2008, 04:49:05 PM

Case conversion on Unicode strings requires highly specialized character mapping, and in many character sets the map is unique to that particular implementation. For example, there are 139 lower case letters in Unicode that have no upper case equivalent, such as sharp S (ß). To read more you might want to visit http://www.unicode.org/ which has links to various technical articles. Alternatively, if you are expecting English only, you could convert to ANSI using the WideCharToMultiByte API, do your case conversion, and then convert back. This would not guarantee much in the way of accuracy, since not all characters are properly translated by the API, but for English it's not a problem.

EDIT: MichaelW and I posted at the same time so there is some redundancy in the info.
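
To illustrate the sharp S point from the post above, here is a small Python sketch. It relies on Python's built-in `str.upper`, which follows the Unicode default case mapping. That is an assumption about Python only and says nothing about what any particular Windows API returns.

```python
import unicodedata

sharp_s = "\u00df"                      # ß, LATIN SMALL LETTER SHARP S

# The general category "Ll" means "lowercase letter".
print(unicodedata.category(sharp_s))    # prints: Ll

# There is no single uppercase code point to map to, so the default
# Unicode uppercasing expands it to two letters.
upper = sharp_s.upper()
print(upper, len(upper))                # prints: SS 2
```

A fixed "add or subtract a constant" rule cannot express a mapping like this, because the result is longer than the input.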

Igor · April 24, 2008, 05:55:49 PM

Thanks for your reply. I guess I should just stick to the Windows API when manipulating Unicode strings.

It's going to be much slower, but I have to go with Unicode because it's a multi-language app.

donkey · April 24, 2008, 08:04:45 PM

No Problem Igor,

If you want to tackle Unicode character translation (the way it should be done, not as a kludge), you can use the unicode.org map file as a reference. It is rather large and complex, but it does map out all of the case conversion rules...

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Donkey
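
As a rough idea of how the UnicodeData.txt file linked above can be used: each line is a semicolon-separated record, and the simple lowercase mapping sits in the fourteenth field (index 13 when counting from zero). The Python sketch below parses two records copied from that file. The helper name `simple_lowercase` is hypothetical, and the sketch covers only the simple one-to-one mappings stored in this file.

```python
# Two records as they appear in UnicodeData.txt.
UPPER_A = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
LOWER_A = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"


def simple_lowercase(line: str):
    """Return (code point, simple lowercase mapping or None) for one record."""
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)     # field 0: the code point in hex
    lower = fields[13]                  # field 13: simple lowercase mapping
    return code_point, (int(lower, 16) if lower else None)


print(simple_lowercase(UPPER_A))        # prints: (65, 97)
print(simple_lowercase(LOWER_A))        # prints: (97, None)
```

An empty field means the character has no simple lowercase mapping, which is why the second record yields `None`.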

jj2007 · April 25, 2008, 01:53:55 AM

Tried CharLowerBuff?

The CharLowerBuff function converts uppercase characters in a buffer to lowercase characters. The function converts the characters in place. The function supersedes the AnsiLowerBuff function.

DWORD CharLowerBuff(

LPTSTR lpsz,   // pointer to buffer containing characters to process
DWORD cchLength    // number of bytes or characters to process
);

Parameters

lpsz

Pointer to a buffer containing one or more characters to process.

cchLength

Specifies the size, in bytes (ANSI version) or characters (Unicode version), of the buffer pointed to by lpsz.
The function examines each character, and converts uppercase characters to lowercase characters. The function examines the number of bytes or characters indicated by cchLength, even if one or more characters are null characters.

Return Values

If the function succeeds, the return value is the number of bytes (ANSI version) or characters (Unicode version) processed.

MichaelW · April 25, 2008, 02:48:33 AM

Perhaps I should have mentioned that ucLower calls CharLowerBuffW, and ucUpper calls CharUpperBuffW.

donkey · April 25, 2008, 05:04:10 AM

Quote from: jj2007 on April 25, 2008, 01:53:55 AM
Tried CharLowerBuff?

CharLowerBuff does not correctly map all characters when using a non-English character set. If you would like to use the API, you should use LCMapString, specifying LCMAP_LOWERCASE/LCMAP_UPPERCASE in conjunction with LCMAP_LINGUISTIC_CASING, which will do the mapping in compliance with the spec file I posted. I am assuming here that you are using either the Cyrillic or Serbian Latin alphabet because of your location in Serbia, so you will need locale-specific mapping.

Donkey

zooba · April 25, 2008, 08:51:36 AM

And I'll contribute a link to Michael Kaplan's blog. Michael works for Microsoft in internationalisation and blogs about (among other things) issues and functions involved in this area. I'm sure you'll find plenty of useful information in there.

(For the record, I don't use ASCII strings for anything anymore. Using UTF-16 ties me well enough to Windows (*nix preferred is UTF-8) that I can easily use their functions for most of my string work.)

Cheers,

Zooba :U

Igor · April 25, 2008, 11:34:44 AM

I used LCMapString for characters higher than 255 and standard ASCII conversion for characters lower than 256, which should be faster than calling LCMapString for every char.

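	cmp cx, 255						; code unit in the range 0..255?
	jbe ascii						; unsigned compare, so units >= 8000h are not treated as negative
	unicode:
		invoke LCMapString, LOCALE_USER_DEFAULT, LCMAP_LOWERCASE, rdx, 1, rdx, 1
		jmp _end
	ascii:
		;.if (cx >= 65) && (cx <= 90)
		cmp cx, 65
		jb _endif1					; below 'A': leave unchanged
		cmp cx, 90
		ja _endif1					; above 'Z': leave unchanged
			add cx, 32				; small case
			mov word ptr [rdx], cx
		_endif1:
	_end: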

EDIT: I see now that without setting the LCMAP_LINGUISTIC_CASING flag everything is the same as CharLowerBuff. And if I use LCMAP_LINGUISTIC_CASING I can expect one char to become two chars or vice versa, so I will have to rethink my approach on this :)

EDIT2: I got it wrong, it will never convert one char to two....

jj2007 · April 25, 2008, 01:36:04 PM

No particular problems with non-English characters - but test yourself for non-Latin characters...

include \masm32\include\masm32rt.inc

.data?
buffer	db 1024 dup (?)

.code

AppName	db "T",0,"e",0,"s",0,"t", 0, 0, 0
Test$	db "This is ä fünny éxèrcise", 0

start:
	invoke MultiByteToWideChar, CP_ACP, 0, addr Test$, sizeof Test$, addr buffer, 512	; last argument counts wide characters: 1024 bytes = 512 chars
	; invoke MessageBoxW, NULL, addr buffer, addr AppName, MB_OK
	invoke CharLowerBuffW, addr buffer, sizeof Test$
	invoke MessageBoxW, NULL, addr buffer, addr AppName, MB_OK
	invoke CharUpperBuffW, addr buffer, sizeof Test$
	invoke MessageBoxW, NULL, addr buffer, addr AppName, MB_OK
	invoke ExitProcess, 0
end start

donkey · April 25, 2008, 03:47:31 PM

Quote from: jj2007 on April 25, 2008, 01:36:04 PM
No particular problems with non-English characters - but test yourself for non-Latin characters...

I should have said non-western, just anglophone arrogance I suppose :)

Edgar

zooba · April 25, 2008, 11:42:59 PM

Quote from: Igor on April 25, 2008, 11:34:44 AM
I used LCMapString for characters higher than 255 and standard ASCII conversion for characters lower than 256, which should be faster than calling LCMapString for every char.

I doubt it. It's more likely to be faster to call LCMapString once for the entire string. And there are a few cases where some Latin characters don't always map as they do in English (I believe Turkish is an example of this).

Quote from: Igor on April 25, 2008, 11:34:44 AM
EDIT2: I got it wrong, it will never convert one char to two....

Never is an incredibly bad word to use when talking about languages :bg . Governments seem to like changing the rules all the time and Microsoft works quite hard to keep up to date (or face being banned in that country... it's been threatened). It is possible that now or in the future a single-word uppercase character maps to a double-word lowercase character. Using LCMapString for the entire string effectively solves this problem, and also ensures that your program will work with future updates to the languages.

Cheers,

Zooba :U
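
The "one char to two" case discussed in the post above already exists in Unicode's default rules. The Python sketch below shows it with the Turkish capital dotted I. It uses Python's built-in `str.lower`, which applies the locale-independent Unicode mapping, so this is an illustration of the data and not a statement about what LCMapString returns for a given locale.

```python
capital_dotted_i = "\u0130"             # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = capital_dotted_i.lower()

# One code point goes in, two come out: "i" plus COMBINING DOT ABOVE.
print(len(capital_dotted_i), len(lowered))          # prints: 1 2
print([f"U+{ord(ch):04X}" for ch in lowered])       # prints: ['U+0069', 'U+0307']

# The locale-independent mapping of plain "I" is "i". Turkish text expects
# dotless "ı" (U+0131) instead, which is why locale-aware mapping matters.
print("I".lower() == "i")                           # prints: True
```

A routine that converts in place, one character at a time, has no room for the extra code point. That is one practical reason to map a whole string into a separate destination buffer.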

Mark Jones · July 13, 2008, 04:05:22 PM

Quote from: donkey on April 24, 2008, 08:04:45 PM
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

OMG... unicode is a MESS! :lol
