News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Arabic Dictionary

Started by Farabi, December 30, 2009, 05:18:44 AM

Previous topic - Next topic

Farabi

I was thinkin is someone out there can make a string searching faster than mine.


Have a look for this function

TextBoxProc proc hWnd:dword,uMsg:dword,wParam:dword,lParam:dword
LOCAL lptext:dword
LOCAL buff[256]:dword
LOCAL hFile:dword
LOCAL nIndex:dword

.if uMsg==WM_KEYUP
invoke mAlloc,1024
mov lptext,eax
xor ecx,ecx
mov ecx,0fffeh
mov [eax],ecx
invoke SendMessageW,hEdit,EM_GETLINE,0,lptext
invoke memfill,addr buff,1024,0
invoke WideCharToMultiByte,CP_ACP,0,lptext,38,addr buff,MAX_PATH,0,0

invoke SendMessageW,hEdit,EM_LINELENGTH,0,0
.if eax!=0

invoke SearchString,fFile,addr buff
mov nIndex,eax

.endif

dec nIndex
.if nIndex!=-1
push nIndex
pop f_ind
.endif

invoke SendMessage,lb,LB_SETCURSEL,f_ind,0

invoke GlobalFree,lptext
.endif
invoke CallWindowProc,tbp,hWnd,uMsg,wParam,lParam
ret
TextBoxProc endp
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

hutch--

Farabi,

Tell us what character set you are using or more correctly what code page you are using as you should be able to use nearly any search algo to find words that you select. It looks like you are converting from Unicode to a multibyte format and if we know what you are doing, someone may be able to help.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Farabi,
I've tried to get your app assembled but it does not contain hardcoded include paths, so I gave up.
Without testing, the only section that is probably slow is

Quoteinvoke GetWord,lpFile,ecx,1
         mov eax,l_str
      ;   .if [edx].frbWordStruct.nSize>=eax
            invoke szCmpi,[edx].frbWordStruct.lpAddress,lpString,l_str
No idea what GetWord is, it's not found in your *.asm code, but szCmpi is probably the bottleneck. Try replacing it with crt_strstri. Have a look at the strstri docu.

Farabi

Quote from: hutch-- on December 30, 2009, 08:49:29 AM
Farabi,

Tell us what character set you are using or more correctly what code page you are using as you should be able to use nearly any search algo to find words that you select. It looks like you are converting from Unicode to a multibyte format and if we know what you are doing, someone may be able to help.

Im using the conversion only because the Text box is unicode. So the character set should be the standard
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

Farabi

Quote from: jj2007 on December 30, 2009, 09:28:04 AM
Farabi,
I've tried to get your app assembled but it does not contain hardcoded include paths, so I gave up.
Without testing, the only section that is probably slow is

Quoteinvoke GetWord,lpFile,ecx,1
         mov eax,l_str
      ;   .if [edx].frbWordStruct.nSize>=eax
            invoke szCmpi,[edx].frbWordStruct.lpAddress,lpString,l_str
No idea what GetWord is, it's not found in your *.asm code, but szCmpi is probably the bottleneck. Try replacing it with crt_strstri. Have a look at the strstri docu.

I think I miss uploaded it.
I used my template and some of the component is not on your.

This should work on your system. GetWord is on hNullString
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

donkey

Hi Farabi,

I haven't checked the code yet, I will though, but from a look at the posts I have to ask why you would convert back and forth to Unicode ? Since the NT API is natively Unicode for the most part, definitely the string functions, you're wasting an enormous amount of time in conversions. I have not written an application in ANSI for quite some time, the overhead for conversion is too costly for any string APIs as well as anything that uses COM. Since your target character set is Arabic, you need to convert back and forth to Unicode to search strings, R2L reading is used and the sorting would be affected since it is a non-roman alphabet. Trying to handle all of these in ANSI would be a chore and would definitely slow things down, again I haven't examined the code yet but switching your application to native Unicode can't do anything but help. Remember that most of the string search algos you'll find here are designed for L2R reading and a roman alphabet and would have to be adapted or discarded as unworkable in your target character set.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

daydreamer

I also worked with trying to make a arabic dictionary/translator, it might work, after all arabic unicodes hibyte is same 06 all the time ala all arabic letters are in the shape 06xxh, but maybe you need to reverse the alphabetically search on arabic, after all my small book dictionary is backwards and so is all arabic text, reversed compared to latin letters
all it requires is change the code that handles unicode arabic buffers from this
MOV al,[ebx+buffer] ;ebx holds char number
to
MOV al,[ebx*2+buffer]
you see I mean just change to scale by 2 on all places you access the buffer in a sort algo or whatever algo you need to use there

jj2007

Quote from: daydreamer on December 30, 2009, 07:35:40 PM
all arabic letters are in the shape 06xxh
If speed is an issue, that opens another road: Translate your source file into a single byte file without the 06 bytes, and use standard strstr or whatever. For displaying, you need to add the 06h again, of course.

Farabi

Quote from: donkey on December 30, 2009, 04:00:26 PM
Hi Farabi,

I haven't checked the code yet, I will though, but from a look at the posts I have to ask why you would convert back and forth to Unicode ? Since the NT API is natively Unicode for the most part, definitely the string functions, you're wasting an enormous amount of time in conversions. I have not written an application in ANSI for quite some time, the overhead for conversion is too costly for any string APIs as well as anything that uses COM. Since your target character set is Arabic, you need to convert back and forth to Unicode to search strings, R2L reading is used and the sorting would be affected since it is a non-roman alphabet. Trying to handle all of these in ANSI would be a chore and would definitely slow things down, again I haven't examined the code yet but switching your application to native Unicode can't do anything but help. Remember that most of the string search algos you'll find here are designed for L2R reading and a roman alphabet and would have to be adapted or discarded as unworkable in your target character set.

No Im not going to make an arabic string searching, thats why I convert the string to an ANSI.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

akane

Farabi, to make it fast, you have several options. Here are two of them:
1. Load the dictionary to memory, allocate sorted (important) string pointer array and fill it with string pointers (to each keyword). Use then binary search to find a string in up to log2(stringcount) iterations: max 16 string-compare calls for 64k keywords.
2. If the dict is too large, divide it into blocks, for example each block for word-start letters (A-Z in english). Save block file offset and length somewhere. Now before searching a string, pick the first letter, load the corresponding group into memory as in first step, and again do a binary search.
blockindex = pszKeywordToFind[0] - firstletter_ascii  ;  english: pszKeywordToFind[0] - 'A'
loadBlock(hFile, blockOffsets[blockindex], blockSizes[blockindex])


To search without carrying of the letters case, convert your dict and the keyword to upper or lower case.

Binary search:
min = 0
max = NumberOfKeywords-1

while (min <= max)
{
pos = min + ((max - min) / 2)
cmpr = strcmp(keywords[pos], pszKeywordToFind)
if (cmpr == 0) then return pos            ; found
if (cmpr < 0) then min = pos+1
else max = pos-1
}
return -1 ;not found


Note: string compare function MUST be the same in search and sort subroutines: strcmp, wcscmp, lstrcmp, ...

Farabi

Well, the above code is the best search function I had.
I just wonder if anyone beat it.
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

jj2007

Quote from: Farabi on January 02, 2010, 03:13:26 AM
Well, the above code is the best search function I had.
I just wonder if anyone beat it.

We (Hutch, Lingo, ...) can beat any search algo, provided we get a testbed that meets some basic standards, i.e.
1) is written in ml or Jwasm cpmpatible syntax
2) has Masm32 include paths, i.e. include \masm32\include\masm32rt.inc etc
3) contains all necessary libraries and includes.

Larry Hammick

Quote from: akane on December 31, 2009, 12:16:38 AM
To search without carrying of the letters case, convert your dict and the keyword to upper or lower case.
Good news: In Arabic there is only one case. :green