Attached is a demonstration of the generation and use of an index into a sorted table. The program reads a sorted list of strings from a text file. It then counts the lines and generates
-- a table of the addresses of those lines in memory
-- a table of dwords, one for each line, in which the lower word is a negative offset and the upper word a positive offset, allowing the main routine (IndexedLookup) to find a string among the 198 given ones after at most 8 string comparisons. :8)
The ASM file is about 350 lines. It's approach would be okay for a list of up to 16K different strings, but the code could be easily adjusted for larger tables.
Thanks Larry.
I was not expecting re-entrant code, but now I see how to use re-entrant code with indexs/arrays/tables.
Of course with me programing, I would have to add many safety checks everywhere, as I know there is a limit before I blow it up.
Steve
Since each re-entry splits a portion of table into two nearly equal pieces, the stack growth is predictable, and it is only on the order of log_2(tablesize). There might be a better way to construct that sort of index. It's easy if the table has 2^n - 1 entries, but not so easy in the general case. For say 15 entries, it looks like this:
1 0 0
2 -1 +1
3 0 0
4 -2 +2
5 0 0
6 -1 +1
7 0 0
8 -4 +4
9 0 0
10 -1 +1
11 0 0
12 -2 +2
13 0 0
14 -1 +1
15 0 0
Can you elaborate more on how this works please, how does the at most 8 searches yeild the found entry always with up to 16k entries? I get you're turning them into dwords or something, but how does signed dwords help you? Also why do you start in the middle of the table?
When you do a serach starting at the midway point, there are three posible outcomes. One: the serach table and the serach item match and you are done, two: the search item is less than the current (midway) value therefore you must seek the value to the right persay, three: the search item is greater than the current (midway) value therefore you must seek the value to the left persay.
Each decision made reduces your search area in approximatelly have.
Example 16 items, midway point is 8, min is 0, search for 2.
first evaluation [16,8,0,2] 8>4 therefore look left (adjust your pointers here min=min, max=oldmid, newmid = (max-min)/2
second evaluate [8,4,0,2] 4>2 therefore leftagain
third evalue [4,2,0,2] 2=2 we are done.
When quanties grow each decision is approximately n/2 away if they don't match up.
E^cube - i think the 8 passes was for 198 entries - sounds like a recursive binary search tree
Wow that's awesome, wikipedia said D. A. Heger (2004)[2] presented a performance comparison of binary search trees.Treap was found to have the best average performance, while red-black tree was found to have the smallest amount of performance fluctuations. How do you think recursive binary search trees rate against those, as thing seems really fast from your explanation.
Dedndave is right.
Use recursive to build your lookup table as Larry as shown.
Then control your three pointers to do the evaluations, no recursive at this level.
With word ptrs we have 64K possible outcomes, divide in have for signed control gives 32K and with 8 decisions you have 16k per side.
Sorry my earilier message was bit off, Daves reply always puts me back on track.
Do a search for multiway trees, you get better examples that the red-black examples.
thanks i'll bing that,also what about really large searches, of files containing megabytes/gigabytes of entries, will this method still be plausible?
8 searches are enough for up to 255 strings. For 2^n - 1 strings you would need n searches, never more, but usually n or n-1.
Quote from: E^cube on December 14, 2009, 01:03:06 PM
thanks i'll bing that,also what about really large searches, of files containing megabytes/gigabytes of entries, will this method still be plausible?
Good question. :D I'm gonna try it, and that's why I am right now writing a tool for making very big obj files containing data only -- world's crudest assembler, written in QBasic for DOS!
In the source code that I posted, the index entries are multiples of 4. Better idea: make them just signed short integers and let the searching routine apply the scaling factor of 4.
Now try skip lists ^^ Those are even cooler
Larry,
Give either FDA.EXE or FDA2.EXE a try, they work is already done to make object modules from data files. FDA and FDA2 are two toys in the masm32 project.
Here's a variant of the re-entrant routine called "tree" that generates an index table. This time the records are 8 bytes apiece, and the scaling factor of 8 is left out of the index. Thus it's general enough, in principle, for tables of up to 2^32 - 1 records.
tree_8 proc bottom:DWORD,top:DWORD
;Re-entrant routine, fills in a table of quadwords with pairs
;of signed integers determined by the size of the table.
;The input variables are the top and bottom of a memory chunk of
;uninitialized 8-byte records. tree_8 fills it in with positive
;and negative offsets (by record number, not record address)
;and returns eax=the address (not record number) of the entry point
;for a search routine.
local midd:DWORD
mov edx,bottom
mov eax,top
sub eax,edx
shr eax,4
shl eax,3 ;sets or clears the zero flag
lea eax,[eax+edx]
mov midd,eax
mov dword ptr[eax],0
jz short @F
invoke tree_8,edx,eax
mov edx,midd
sub eax,edx
sar eax,3
mov [edx],eax
@@: mov edx,midd
add edx,8
cmp edx,top
je short @F
invoke tree_8,edx,top
mov edx,midd
sub eax,edx
sar eax,3
mov [edx+4],eax
@@: mov eax,midd
ret
tree_8 endp
I googled up a word list of 90,000+ words, and my low-tech data object assembler is working, so, soon I'll have a demo of a honkin' fast dictionary-style lookup.
@slugsnack: Very nice evolution to multiway trees/nodes and I like the probalistic (conditional logic) approach.
@E^cube: if you have 300 keys (indexs) per a node and using a multiway tree approach with an order hieght of 5, you can have approximately 6 billions keys to search/find. I have included a small spreadsheet that you can change the number of keys per node and then look at the 5 levels and see how many keys you will have. If you evolutionize to skip tree, I don't know what will happen, but at first glance it appears to be better than multiway process.
@Larry: Question: Your first listing had a cmp 0A0Ah, if zero then exit out. Does this mean a match was found?
Quote from: z941998 on December 16, 2009, 06:10:23 AM
@Larry: Question: Your first listing had a cmp 0A0Ah, if zero then exit out. Does this mean a match was found?
Yes. The strings have been trimmed and the line-ends have been replaced with the line feed character 0Ah.
http://www3.telus.net/ldh/asm/WORDLIST.ZIP
That zip contains the source code and build files for wordlist.dll, wordlist.inc, wordlist.lib, and demo.exe. The dll contains the indexing and lookup routines, and demo.exe specifies a text file (included) of 90,000+ sorted strings, and calls the dll to look up specific strings or determine if they are not present.
Update: that demo breaks down for bigger text files, because of these errors in wordlist.asm:
jc short IL_look_higher
;IL_look_lower:
movsx eax,word ptr[esi+8*ebx+4]
jmp short IL_adjust
IL_look_higher:
movsx eax,word ptr[esi+8*ebx]
IL_adjust:
Should be:
jc short IL_look_higher
;IL_look_lower:
mov eax,[esi+8*ebx+4]
jmp short IL_adjust
IL_look_higher:
mov eax,[esi+8*ebx]
IL_adjust:
I used my old friend QBasic to generate a text file of 450,000+ "words", and after a test or two, I'll upload the demo for files of that size.
Update:
Okay, the above zip file is now corrected and contains the new and bigger word list.
Outstanding work :U I tested and it seems to work correctly :bg this would be great on databases, I think with your permission Hutch should try and add to masm32 so all can benefit from blazing fast searches
Quote from: E^cube on December 25, 2009, 09:32:15 PM
Outstanding work :U I tested and it seems to work correctly :bg this would be great on databases, I think with your permission Hutch should try and add to masm32 so all can benefit from blazing fast searches
Anybody can do anything he likes with that one.
I took the next step to creating an automatic platform for handling Trees/Indexed Lookups for my own library of tools.
I added a few extra features, like being able to control the terminator value when normalizing and added more compare routines (ones that can handle numeric data - integer format at this time).
I can load my test data, create the tree/index's with no apparent problems, however, when I perform the lookup, I am getting GPFs.
I think this is a result of not understanding how each designed level (ie: byte, word, and dword) of Trees/index's should be controlled. I am thinking I may have an issue with creating the Tree and also the lookup to sync with the tree.
I need to get back on track, so have a crack and provide me your feedback. See attachment. Thanks Steve