Arabic Dictionary

Farabi · December 30, 2009, 05:18:44 AM

I was thinkin is someone out there can make a string searching faster than mine.

Have a look for this function


TextBoxProc proc hWnd:dword,uMsg:dword,wParam:dword,lParam:dword
	LOCAL lptext:dword
	LOCAL buff[256]:dword
	LOCAL hFile:dword 
	LOCAL nIndex:dword
	
	.if uMsg==WM_KEYUP
			invoke mAlloc,1024
			mov lptext,eax
			xor ecx,ecx
			mov ecx,0fffeh
			mov [eax],ecx
			invoke SendMessageW,hEdit,EM_GETLINE,0,lptext
			invoke memfill,addr buff,1024,0
			invoke WideCharToMultiByte,CP_ACP,0,lptext,38,addr buff,MAX_PATH,0,0
			
			invoke SendMessageW,hEdit,EM_LINELENGTH,0,0
			.if eax!=0
				
				invoke SearchString,fFile,addr buff
				mov nIndex,eax
				
			.endif
			
			dec nIndex
			.if nIndex!=-1
				push nIndex
				pop f_ind
			.endif
			
			invoke SendMessage,lb,LB_SETCURSEL,f_ind,0
			
			invoke GlobalFree,lptext
	.endif
	invoke CallWindowProc,tbp,hWnd,uMsg,wParam,lParam
	ret
TextBoxProc endp

hutch-- · December 30, 2009, 08:49:29 AM

Farabi,

Tell us what character set you are using or more correctly what code page you are using as you should be able to use nearly any search algo to find words that you select. It looks like you are converting from Unicode to a multibyte format and if we know what you are doing, someone may be able to help.

jj2007 · December 30, 2009, 09:28:04 AM

Farabi,
I've tried to get your app assembled but it does not contain hardcoded include paths, so I gave up.
Without testing, the only section that is probably slow is

Quoteinvoke GetWord,lpFile,ecx,1
         mov eax,l_str
      ;   .if [edx].frbWordStruct.nSize>=eax
            invoke szCmpi,[edx].frbWordStruct.lpAddress,lpString,l_str

No idea what GetWord is, it's not found in your *.asm code, but szCmpi is probably the bottleneck. Try replacing it with crt_strstri. Have a look at the strstri docu.

Farabi · December 30, 2009, 02:29:40 PM

Quote from: hutch-- on December 30, 2009, 08:49:29 AM
Farabi,

Tell us what character set you are using or more correctly what code page you are using as you should be able to use nearly any search algo to find words that you select. It looks like you are converting from Unicode to a multibyte format and if we know what you are doing, someone may be able to help.

Im using the conversion only because the Text box is unicode. So the character set should be the standard

Farabi · December 30, 2009, 02:34:41 PM

Quote from: jj2007 on December 30, 2009, 09:28:04 AM
Farabi,
I've tried to get your app assembled but it does not contain hardcoded include paths, so I gave up.
Without testing, the only section that is probably slow is

Quoteinvoke GetWord,lpFile,ecx,1
         mov eax,l_str
      ;   .if [edx].frbWordStruct.nSize>=eax
            invoke szCmpi,[edx].frbWordStruct.lpAddress,lpString,l_str
No idea what GetWord is, it's not found in your *.asm code, but szCmpi is probably the bottleneck. Try replacing it with crt_strstri. Have a look at the strstri docu.

I think I miss uploaded it.
I used my template and some of the component is not on your.

This should work on your system. GetWord is on hNullString

donkey · December 30, 2009, 04:00:26 PM

Hi Farabi,

I haven't checked the code yet, I will though, but from a look at the posts I have to ask why you would convert back and forth to Unicode ? Since the NT API is natively Unicode for the most part, definitely the string functions, you're wasting an enormous amount of time in conversions. I have not written an application in ANSI for quite some time, the overhead for conversion is too costly for any string APIs as well as anything that uses COM. Since your target character set is Arabic, you need to convert back and forth to Unicode to search strings, R2L reading is used and the sorting would be affected since it is a non-roman alphabet. Trying to handle all of these in ANSI would be a chore and would definitely slow things down, again I haven't examined the code yet but switching your application to native Unicode can't do anything but help. Remember that most of the string search algos you'll find here are designed for L2R reading and a roman alphabet and would have to be adapted or discarded as unworkable in your target character set.

daydreamer · December 30, 2009, 07:35:40 PM

I also worked with trying to make a arabic dictionary/translator, it might work, after all arabic unicodes hibyte is same 06 all the time ala all arabic letters are in the shape 06xxh, but maybe you need to reverse the alphabetically search on arabic, after all my small book dictionary is backwards and so is all arabic text, reversed compared to latin letters
all it requires is change the code that handles unicode arabic buffers from this
MOV al,[ebx+buffer] ;ebx holds char number
to
MOV al,[ebx*2+buffer]
you see I mean just change to scale by 2 on all places you access the buffer in a sort algo or whatever algo you need to use there

jj2007 · December 30, 2009, 08:38:49 PM

Quote from: daydreamer on December 30, 2009, 07:35:40 PM
all arabic letters are in the shape 06xxh

If speed is an issue, that opens another road: Translate your source file into a single byte file without the 06 bytes, and use standard strstr or whatever. For displaying, you need to add the 06h again, of course.

Farabi · December 30, 2009, 11:43:45 PM

Quote from: donkey on December 30, 2009, 04:00:26 PM
Hi Farabi,

I haven't checked the code yet, I will though, but from a look at the posts I have to ask why you would convert back and forth to Unicode ? Since the NT API is natively Unicode for the most part, definitely the string functions, you're wasting an enormous amount of time in conversions. I have not written an application in ANSI for quite some time, the overhead for conversion is too costly for any string APIs as well as anything that uses COM. Since your target character set is Arabic, you need to convert back and forth to Unicode to search strings, R2L reading is used and the sorting would be affected since it is a non-roman alphabet. Trying to handle all of these in ANSI would be a chore and would definitely slow things down, again I haven't examined the code yet but switching your application to native Unicode can't do anything but help. Remember that most of the string search algos you'll find here are designed for L2R reading and a roman alphabet and would have to be adapted or discarded as unworkable in your target character set.

No Im not going to make an arabic string searching, thats why I convert the string to an ANSI.

akane · December 31, 2009, 12:16:38 AM

Farabi, to make it fast, you have several options. Here are two of them:
1. Load the dictionary to memory, allocate sorted (important) string pointer array and fill it with string pointers (to each keyword). Use then binary search to find a string in up to log2(stringcount) iterations: max 16 string-compare calls for 64k keywords.
2. If the dict is too large, divide it into blocks, for example each block for word-start letters (A-Z in english). Save block file offset and length somewhere. Now before searching a string, pick the first letter, load the corresponding group into memory as in first step, and again do a binary search.

Code Select

blockindex = pszKeywordToFind[0] - firstletter_ascii  ;  english: pszKeywordToFind[0] - 'A'
loadBlock(hFile, blockOffsets[blockindex], blockSizes[blockindex])

To search without carrying of the letters case, convert your dict and the keyword to upper or lower case.

Binary search:

Code Select

min = 0
max = NumberOfKeywords-1

while (min <= max)
{
	pos = min + ((max - min) / 2)
	cmpr = strcmp(keywords[pos], pszKeywordToFind)
	if (cmpr == 0) then return pos            ; found
	if (cmpr < 0) then min = pos+1
	else max = pos-1
}
return -1 ;not found

Note: string compare function MUST be the same in search and sort subroutines: strcmp, wcscmp, lstrcmp, ...

Farabi · January 02, 2010, 03:13:26 AM

Well, the above code is the best search function I had.
I just wonder if anyone beat it.

jj2007 · January 02, 2010, 09:29:52 AM

Quote from: Farabi on January 02, 2010, 03:13:26 AM
Well, the above code is the best search function I had.
I just wonder if anyone beat it.

We (Hutch, Lingo, ...) can beat any search algo, provided we get a testbed that meets some basic standards, i.e.
1) is written in ml or Jwasm cpmpatible syntax
2) has Masm32 include paths, i.e. include \masm32\include\masm32rt.inc etc
3) contains all necessary libraries and includes.

Larry Hammick · January 04, 2010, 03:44:56 AM

Quote from: akane on December 31, 2009, 12:16:38 AM
To search without carrying of the letters case, convert your dict and the keyword to upper or lower case.

Good news: In Arabic there is only one case. :green

News:

Arabic Dictionary

Larry Hammick