News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Project: Dictionary

Started by CoR, January 08, 2007, 01:19:43 PM

Previous topic - Next topic

CoR

First, hello to all :)

Though I have some general understanding of programming and little background in VB, C/C++ I was always keen to learn ASM.
I am reading Iczelion's tutorial which is awesome. But I really have to change background and font color in .chm file ;) Examples are great and what is more amazing they work!

I set myself a goal to make a dictionary. Completely in asm. I have one year of my life in disposal for that.

In this forum I read a thread "Delimited File vs. Fixed-Field" Is is good but it does not explain database in detail i need. 
Now on to my questions...
How are files really-really written on hdd? If I understand from C and Icz tut files look like long filament or hail strand that begins in one point and ends after 2 or 3 kilometers of data. Figure speech. And only way to locate data you need is by string search or by knowing sequenced position of data i need.
Did I understand this well or there are some other file structures I am not familiar with?

So far I have "mental picture" of delimited file structure for dictionary with index. With flaw :(
I do not know how to CHANGE data in it. Or delete or add some stuff in the middle of DB. If I delete single byte from DB index table will point to wrong string. Any idea who to solve this problem without reindexing whole DB every time I make single change.

I do not want to use mySQL or similar DB. I want to create my own  :green  In asm  :cheekygreen:

sluggy

Quote from: CoR on January 08, 2007, 01:19:43 PM
I do not know how to CHANGE data in it. Or delete or add some stuff in the middle of DB. If I delete single byte from DB index table will point to wrong string. Any idea who to solve this problem without reindexing whole DB every time I make single change.

That is easy - you reserve some space at the end of each word, think of it as a little bit of padding. For instance, you could reserve another 8 bytes on the end of each word, then that word could grow by up to 8 characters without affecting the words around it.

The trick for this is to make your index file accessing simple, and your datafile scanning fast. Use the index file to make the simple binary decision whether a word is spelt correctly or not*, then use the data file to find words of a similar spelling to suggest as alternatives.

*perform a mathematical equation on the word (called 'hashing'), then use the resulting value as an offset into the index file, if there is a byte containing a certain value at the location pointed to by the index offset then you know the word is spelt correctly. All you need is a hashing algorithmn that is suitable fast and produces suitably unique results.


Tedd

How files are written on the surface of the hard-disk is largely down to the filesystem's format, as these can vary, there's little point depending on it. In general they tend to be saved in blocks of 512 bytes, and each of the blocks are placed in order where possible, but can't always be (and it's not necessary, just usually faster to read.)
Your understanding of files seems good enough - they're (usually, depending on the filesystem!) laid out in one 'long line' of data. Any extra information for indexing inside the file is generally down to you.
How I would approach the structure of the file would be to have an index at the start which gives the offsets of the start of each word, and then following this comes the list of words itself. To allow for modifcations, the 'list' is actually an array of 'slots' - each slot is, say, 8 bytes (make a decision based on average word length) - and each word takes up as many slots as needed -- 'a' would take 1 slot, 'something' would need 2. Modifying one word then just means changing it within its own slot, and the other words stay as they are, except in the case where it needs another slot and then you'll have to shift everything down by one slot-size (and adjust the index accordingly.) Inserting new words is simply a matter of inserting enough new slots. Deleting can be as simple as removing it from the index, or you can clean up as well and shift the slots following it back one slot.
Reads are going to be much more frequent than writes, so aim to keep them more efficient. Of course you'd like writes to be efficient too, but not at the expense of reads.
On top of the word index, you could build another index for something like where 'a...' starts, where 'b...' starts, etc. Or any other fancy scheme you might think of.

Don't aim to create a database (unless that is actually what you're aiming to do!) as they are meant for so much more and therefore do so much more - which does mean sooooooooooooo much more work :bdg
No snowflake in an avalanche feels responsible.

gabor

Hello!

I too though of creating a DB backbone, a system capable of storing and retrieving data. So I was thinking about real hard how to merge the fast response with the storage space saving methods. (Finally I got to XML and its possible binary representation.  :8) )

In the case of words, I would not use padding because there is a big scale of word lengths. I would rather use a linked list and a powerfull indexing upon it.
In my design there is a huge heap of data (the words in this case) with different record sizes. There are also several indexings representing a sorting of the data.
The heap is a linked list, this gives fast insertion/remove methods wich don't violate the sorting, but the access is very slow (sequencial). The indexings give entry points into this heap (assigning an index to every single record would become a full indexing that wastes too much space and give too big overhead).
Of course if there are many indexings, it has not much use to have a general sorting, and every record must be supplied with and entry for every indexing.

There can be several methods of indexing; basicly, either it can be hashing like sluggy suggested, or it can be tree-based. For words there is a specialized structure and theory, called "retrievel trees".

The indexes could map any kind of relations between the words (synonims, generalization, specialization, etc.)

This brings me to the idea of the semantical web. There the words (rather concepts) were described with graphs: the vertices are the concepts the edges are relations between them... A straight, but hard road to AI, I guess.  :bg

Greets, Gábor

CoR

Hello guys :)
Thanks for replies!

@sluggy:
You propose fix field for words. But that will not work. What if I have really long words "mountainous" or much longer. I can not find better solution than some kind of EOW (end of word).
Or to allocate really long byte sequence which has to be twice the size for UNICODE :(

My god! Hashing or some equivalent is GREAT idea! Thanks a lot!

@Tedd:
Great! Slot idea is also great.
But it also require re indexing if I need to add whole bunch of data like pictures :(

I already had in mind to speed up reading by putting a,b,c... beginning offsets ;) That's only fancy scheme I thought of :D

But I do not understand what are you saying by " don't aim to create database"? I need to have some king of database for dictionary. If you have some better idea for dictionary plz say :)

@gabor:
Thanks. I need to meditate (to google) upon your "retrievel trees" and that advanced indexing tables. I need to learn much more!


To be quite honest I thought that making a dictionary will be really simple exercise. Almost every program has dictionary in itself. Even Firefox has spell checker when you type.
Wooooo... How I was wrong!



CoR

Ok here it is:
First goes table of letters where a,b,c... points at the index table.
Than goes index of all words. I didn't find any smart way of making index table so I'll use every fifth space which allows inserting of new words without disturbing the whole DB :)
After that goes DB with explanations of words. It will be sorted at first, but any adding will be just appending that explanation to the end of DB. So index will be crucial for quick search.

Tedd

So what happens when you add the fifth 'extra' word and it places between the end of the 'gap' and the next word?
You can put this case off as much as you like, but you will have to deal with it at some point. If you can deal with it then obviously you should, but in that case: why put it off? I'd guess that even just one spare space per word would work just as well.
The actual time to re-index and re-shuffle is pretty negligable, so you could get away with doing it each time the dictionary is closed and re-saved (if it has been modified.)
No snowflake in an avalanche feels responsible.

CoR

Tedd you are absolutely right :)
I just thought it would be "easier" or more time saving to reindex if those gaps are full.

"The actual time to re-index and re-shuffle is pretty negligible"
I didn't know that. I was speaking with my friend an hour ago, and he told me that Access also add new data to the end of file and reindex before app closes. If it's good enough for MS it'll be good for me :D

I just thought there is some smarter method of inserting data in DB. At least without rewriting whole file. If there's not... well life goes on ;)

Tedd

Quote from: CoR on January 22, 2007, 01:46:36 PM
I just thought there is some smarter method of inserting data in DB. At least without rewriting whole file. If there's not... well life goes on ;)
There is, but they tend to use more complex structures for indexing - such as a b-tree.
No snowflake in an avalanche feels responsible.

CoR

I started coding. Is it ok to post all stupid questions here or I must open new thread?
As beginner... my questions will be really  :dazzled:

Mark_Larson


  here is a link of a comparison of hashes.  It also has the code.  I prefer the Zobrist hash, least amount of collisions. 

http://www.burtleburtle.net/bob/hash/doobs.html



BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

CoR



  Ok, as I hear no complains I'll continue to post here  :bg

        First I would like to thanks thomas_remkus for asking a lot of relevant questions to me. And thanks to all the people who provided good answers.

Now on with my set of confusion:
1. Iczelion Tutorial says "You can use any function name instead of WinMain or no function at all."
Is it somehow better not to use winmain function and avoid that stack allocation? Does avoiding WinMain makes program more stable? 

2. In WinMain I put hInstance as parameter and it becomes hInst. Why? I mean if I can normally use hInstance all over the program. Is there any benefit in that transfer and duplicating hInstance variable?
I can not think anything but that's only leftover of C programming... :(

3.
LOCAL  msg:MSG, hwnd, wc:WINDCLASSEX. Why?
Is there any benefit on making variables on the stack? Speed, stability, CPU faster response...
I am noob but somehow it seems better if I put those variables in .DATA or .DATA? Is it possible?
Aplication seems to work fine if:

.DATA
hwnd dd 0


4.
lea eax,wc = OFFSET wc
so I can use lea always instead OFFSET?
As far as I can see mnemonic for ADDR is also lea. Right?




CoR

#12
Ok, now I can answer to my own questions  :green

1. TomRiddle and lingo says that's very OK not to use winmain. It's leftover from C/C++. Iczelion used WinMain so people could easier transfer to asm. Off course, any additional pro-con for "to use or not to use" WinMain will be appreciated.

2 & 3. Benefit is in smaller app. But app will be little faster if U use global variable. TNick presented some good examples here

4. still do not know  :red

Now  I need some major help to understand lingo's advanced code:
.data
wc dd sizeof WNDCLASSEX
   dd CS_HREDRAW or CS_VREDRAW or CS_BYTEALIGNWINDOW
   dd offset WndProc, 0, 0, 400000h, 0, 0
   dd COLOR_WINDOW +1, 0
   dd offset  ClassName, 0

sizeof WNDCLASSEX - should be dec. number 12 because  WNDCLASSEX has 12 elements. Right? So basicly there's no calculation. WNDCLASSEX will always have 12 elements? But anyway how does sizeof work in .DATA ?

next goes window styles for redrawing as constants.

dd offset WndProc - is address for WndProc which hutch explained before. But HOW offset can work in .DATA ???
How offser WndProc becomes lea eax, WndProc? Or there's something else going on?

cbClsExtra, cbWndExtra  defined as zeros.

400000h - this is hInstance. How did lingo know that hInstance will be that value?

0,0 - empty space for hIcon and hCursor. it will be filled in .code

Background color is understandable

dd offset  ClassName

0 - reserved space for small icon



I just made my first simpe window without WinMain. Main thing is that I knew what I have to do with msg, wc and hInstance on .DATA?
And I understood (correctly I hope) that I can do this on applications end:jmp MessageLoop
Exit:
;mov eax,msg.wParam ;return exit code in eax
;push eax
;ret
push msg.wParam
call ExitProcess


Anyway here's my nooblike code. Window without WinMain and with wc in .DATA.
Trouble is that I still do not understand how and why it works :( Any example of OFFSET and SIZEOF will be appreciated.

[attachment deleted by admin]

Rockoon

One common data structure (and algorithm) used in spell checking is the Bloom Filter (aka Bloom Hash)

http://en.wikipedia.org/wiki/Bloom_filter

The advantage of the Bloom over other methods is the storage space. You could easily get away with 2 bytes per word with an acceptably low false positive rate (where the filter would incorrectly report the word as being part of the set of correctly spell words)

Also, its very fast if you use a zobrist hash function (a zobrist can produce an arbitrarily long hash key which can be split up into multiple smaller keys)
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

zooba

Quote from: CoR on January 30, 2007, 10:31:23 PM
4.
lea eax,wc = OFFSET wc
so I can use lea always instead OFFSET?
As far as I can see mnemonic for ADDR is also lea. Right?

Yes, you can always use lea in place of OFFSET. The advantage of OFFSET is that you can avoid using a register. The disadvantage of OFFSET is that it only works on variables allocated at compile-time, ie. variables in .DATA/.DATA?/.CODE/.CONST or another .<section>. You cannot use OFFSET with a stack variable as stack variables are referenced as [ebp+some number] which can't be evaluated at compile-time. In this case, you need to use lea.

ADDR is a special instruction to the INVOKE macro. You can only use it in INVOKE and it compiles directly to lea eax, parameter // push eax. This is why you can't use ADDR and eax in the same invoke statement, since eax is overwritten. You can, however, use OFFSET in an invoke statement, which will compile to push xxxxxxxxh, hence allowing you to use eax as a parameter and saving an instruction. As I said earlier though, you can only use OFFSET on a variable allocated at compile-time.

Quote from: CoR on February 01, 2007, 10:25:18 PM
how does sizeof work in .DATA ?

SIZEOF can work anywhere. Like OFFSET which calculates the address of a fixed variable and uses that as if you'd typed in the address manually, SIZEOF will calculate the number of bytes required for the specified variable or type and use that number as if you'd typed it in. It saves you having to calculate the size yourself and insulates you from changing structures. If you add a member to one of your own structures and then recompile, the SIZEOF operator will update and crashes from under-allocating memory and the like are prevented.

Quote from: CoR on February 01, 2007, 10:25:18 PMdd offset WndProc - is address for WndProc which hutch explained before. But HOW offset can work in .DATA ???
How offser WndProc becomes lea eax, WndProc? Or there's something else going on?

One type of variables which are allocated at compile time are functions. The assembler decides while it is working where in memory the function will be, hence it is able to specify that wherever it is used.

Quote from: CoR on February 01, 2007, 10:25:18 PM400000h - this is hInstance. How did lingo know that hInstance will be that value?

Because Windows gives all applications (not DLLs) the same base address. They don't really have the same base address, Windows lies to them and fixes it up elsewhere. To calculate it properly, use GetModuleHandle with the parameter as zero/NULL.


Cheers,

Zooba :U