A demonstration of indexed lookup

Larry Hammick · December 13, 2009, 02:40:24 PM

Attached is a demonstration of the generation and use of an index into a sorted table. The program reads a sorted list of strings from a text file. It then counts the lines and generates
-- a table of the addresses of those lines in memory
-- a table of dwords, one for each line, in which the lower word is a negative offset and the upper word a positive offset, allowing the main routine (IndexedLookup) to find a string among the 198 given ones after at most 8 string comparisons. :8)
The ASM file is about 350 lines. It's approach would be okay for a list of up to 16K different strings, but the code could be easily adjusted for larger tables.

z941998 · December 14, 2009, 07:29:18 AM

Thanks Larry.

I was not expecting re-entrant code, but now I see how to use re-entrant code with indexs/arrays/tables.
Of course with me programing, I would have to add many safety checks everywhere, as I know there is a limit before I blow it up.

Steve

Larry Hammick · December 14, 2009, 11:39:17 AM

Since each re-entry splits a portion of table into two nearly equal pieces, the stack growth is predictable, and it is only on the order of log_2(tablesize). There might be a better way to construct that sort of index. It's easy if the table has 2^n - 1 entries, but not so easy in the general case. For say 15 entries, it looks like this:

Code Select

ecube · December 14, 2009, 11:51:41 AM

Can you elaborate more on how this works please, how does the at most 8 searches yeild the found entry always with up to 16k entries? I get you're turning them into dwords or something, but how does signed dwords help you? Also why do you start in the middle of the table?

z941998 · December 14, 2009, 12:14:03 PM

When you do a serach starting at the midway point, there are three posible outcomes. One: the serach table and the serach item match and you are done, two: the search item is less than the current (midway) value therefore you must seek the value to the right persay, three: the search item is greater than the current (midway) value therefore you must seek the value to the left persay.

Each decision made reduces your search area in approximatelly have.

Example 16 items, midway point is 8, min is 0, search for 2.
first evaluation [16,8,0,2] 8>4 therefore look left (adjust your pointers here min=min, max=oldmid, newmid = (max-min)/2
second evaluate [8,4,0,2] 4>2 therefore leftagain
third evalue [4,2,0,2] 2=2 we are done.

When quanties grow each decision is approximately n/2 away if they don't match up.

dedndave · December 14, 2009, 12:30:39 PM

E^cube - i think the 8 passes was for 198 entries - sounds like a recursive binary search tree

ecube · December 14, 2009, 12:44:53 PM

Wow that's awesome, wikipedia said D. A. Heger (2004)[2] presented a performance comparison of binary search trees.Treap was found to have the best average performance, while red-black tree was found to have the smallest amount of performance fluctuations. How do you think recursive binary search trees rate against those, as thing seems really fast from your explanation.

z941998 · December 14, 2009, 12:50:57 PM

Dedndave is right.

Use recursive to build your lookup table as Larry as shown.

Then control your three pointers to do the evaluations, no recursive at this level.

With word ptrs we have 64K possible outcomes, divide in have for signed control gives 32K and with 8 decisions you have 16k per side.

Sorry my earilier message was bit off, Daves reply always puts me back on track.

z941998 · December 14, 2009, 12:54:01 PM

Do a search for multiway trees, you get better examples that the red-black examples.

ecube · December 14, 2009, 01:03:06 PM

thanks i'll bing that,also what about really large searches, of files containing megabytes/gigabytes of entries, will this method still be plausible?

Larry Hammick · December 15, 2009, 01:40:22 AM

8 searches are enough for up to 255 strings. For 2^n - 1 strings you would need n searches, never more, but usually n or n-1.

Quote from: E^cube on December 14, 2009, 01:03:06 PM
thanks i'll bing that,also what about really large searches, of files containing megabytes/gigabytes of entries, will this method still be plausible?

Good question. :D I'm gonna try it, and that's why I am right now writing a tool for making very big obj files containing data only -- world's crudest assembler, written in QBasic for DOS!

In the source code that I posted, the index entries are multiples of 4. Better idea: make them just signed short integers and let the searching routine apply the scaling factor of 4.

Slugsnack · December 15, 2009, 01:44:37 AM

Now try skip lists ^^ Those are even cooler

hutch-- · December 15, 2009, 02:31:52 AM

Larry,

Give either FDA.EXE or FDA2.EXE a try, they work is already done to make object modules from data files. FDA and FDA2 are two toys in the masm32 project.

Larry Hammick · December 15, 2009, 04:38:41 PM

Here's a variant of the re-entrant routine called "tree" that generates an index table. This time the records are 8 bytes apiece, and the scaling factor of 8 is left out of the index. Thus it's general enough, in principle, for tables of up to 2^32 - 1 records.

Code Select

tree_8 proc bottom:DWORD,top:DWORD
;Re-entrant routine, fills in a table of quadwords with pairs
;of signed integers determined by the size of the table.
;The input variables are the top and bottom of a memory chunk of
;uninitialized 8-byte records. tree_8 fills it in with positive
;and negative offsets (by record number, not record address)
;and returns eax=the address (not record number) of the entry point
;for a search routine.
local midd:DWORD
    mov  edx,bottom
    mov  eax,top
    sub  eax,edx
    shr  eax,4
    shl  eax,3         ;sets or clears the zero flag
    lea  eax,[eax+edx]
    mov  midd,eax
    mov  dword ptr[eax],0
    jz   short @F
      invoke tree_8,edx,eax
    mov  edx,midd
    sub  eax,edx
    sar  eax,3
    mov  [edx],eax
@@: mov  edx,midd
    add  edx,8
    cmp  edx,top
    je   short @F
      invoke tree_8,edx,top
    mov  edx,midd
    sub  eax,edx
    sar  eax,3
    mov  [edx+4],eax
@@: mov  eax,midd
    ret
tree_8 endp

I googled up a word list of 90,000+ words, and my low-tech data object assembler is working, so, soon I'll have a demo of a honkin' fast dictionary-style lookup.

z941998 · December 16, 2009, 06:10:23 AM

@slugsnack: Very nice evolution to multiway trees/nodes and I like the probalistic (conditional logic) approach.

@E^cube: if you have 300 keys (indexs) per a node and using a multiway tree approach with an order hieght of 5, you can have approximately 6 billions keys to search/find. I have included a small spreadsheet that you can change the number of keys per node and then look at the 5 levels and see how many keys you will have. If you evolutionize to skip tree, I don't know what will happen, but at first glance it appears to be better than multiway process.

@Larry: Question: Your first listing had a cmp 0A0Ah, if zero then exit out. Does this mean a match was found?

News:

A demonstration of indexed lookup

Larry Hammick

Larry Hammick

ecube

ecube

ecube

Larry Hammick

Larry Hammick