News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Character Tree code question

Started by hutch--, March 19, 2005, 01:47:23 PM

Previous topic - Next topic

hutch--

I wondered if anyone has some C code floating around that does a character tree without just using a prebuilt HLL routine. I know basically how to code this design but having a look at an existing working one would save me a lot of time.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

doomsday

Do you mean a binary tree which uses an ascii string as the sort key?

regards,
-Brent

Tedd

Or do you mean as in a "tree-view" in textual format?


Here's something I have lying around (in perl/modem-line-noise :bdg)

#@_[0] = hash ref (sub-tree)
#@_[1] = current indenting string
sub prin {
    my $h = @_[0];
    my $s = @_[1];
    my @k = sort keys %{$h};
    if (@k > 0) {
        if (@k > $maxkids) {
            $maxkids = @k;
        }
        foreach my $p (@k) {
            my $ns = ($p eq $k[$#k])?"     ":" |   ";
            print("$s +-- $p\n");
            prin($h->{$p},($s.$ns));
        }
    }
}
No snowflake in an avalanche feels responsible.

GregL

#3
I did a Google search on 'Character Tree', this is what came up first.


:lol

(And nothing related to programming).

hutch--

Its a character based tree that is a normal binary tree using a 3 member structure something like,


    CTREE STRUCT
      left dd ?
      right dd ?
      value dd ?
    CTREE ENDS


The "value" member is multipurpose, can either store the character or if its the end of the word, it stores a pointer to whatever you wish to associate with the word. Its standard node / leaf layout with the leaf holding the pointer while the nodes hold characters.

I have got the recursion going and have rudimentary memory handling working but its useful to look at how its ben done before because its been many years since I have coded a binary tree and I am very rusty in the area. I want to be able to try this technique out against a hash table for words and while I thnk the hash table is probably faster, a binary tree is probably more flexible with unknown word counts which may end up very large.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

This is the logic I am after.


    mov
    movs
    movsb
    movsd
    movsw
    movsx
    movzx

    m
    |
    o
    |
    v
    |
    z---s-------------------0
    |   |                   |
    x   x---w---d---b---0  [7]
    |   |   |   |   |   |
    0   0   0   0   0  [6]
    |   |   |   |   |
   [1] [2] [3] [4] [5]

    1 = movzx
    2 = movsx
    3 = movsw
    4 = movsd
    5 = movsb
    6 = movs
    7 = mov


These have been loaded in reverse order (longer first) so that the zero terminator is among the characters to test. The number in [3] square brackets is a location for a DWORD pointer.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

gabor

Hi!

In my language (hungarian) this theory is called "word-tree". According to a book on algos, the english name for them is trie created from the word retrieval. These trees are used for storing and retrieving strings/words.
Here is the theory:
- we have a set of symbols S (alphabet) containing all the characters that are used for creating words.
- words are finite long sequences of members of S, the set of these words is marked with S*.
- a sorting is needed, in case of words of natural languages the sorting is lexicographic (follows the alphabetic order).
- a special stop symbol is used to indicate the end of the words, this symbol is the last one in the sorting order.
- a node in the tree is either NULL or stores pointers to other nodes. The maximum number of the pointers is |S|+1. So the "biggest" node would contain pointers to all those nodes which hold different characters plus the stop symbol.(example: the alphabet is {a,b,c}, the max number of pointers is then 3+1) This means the trie is not a binary tree!
- words can be retrieved by getting the node of the first character and then simply walking along the branches.

My book states that the trie implemented with arrays is a very efficient lookup for words.

Here's a figure for Hutch's words:


If every node was stored in arrays, the arrays hold the pointers to other arrays (nodes).
In the case shown on the figure this means we have 17 nodes in total
- one 5 long array (node s)
- one 3 long array (node v)
- fifteen 1 long array, seven containing no pointers, just the eos symbal.

I believe there could be many simplifications and optimizations made...

I hope, I could give some ideas or infos.

ps: Writing this post made me keen on implementing a trie building and look up algo. :P

Greets, Gábor

Tedd

Quote from: hutch-- on March 22, 2005, 07:56:34 AM
This is the logic I am after.
...

Well why didn't you say so :bdg

You might want to look into Finite State Recognisers, which are designed for exactly that. Although they're table-based, so you do get a certain amount of wastage depending on the dictionary.
The problem with binary trees is that they become very lop-sided, especially if you insert in-order - which is the point of red-black trees. But then, in this instance, I suppose it's a static tree once it's built, so you could balance it afterwards :bg
No snowflake in an avalanche feels responsible.

gabor

Hi again!

Tedd, please don't have mercy on me and hit me with an explanation or link to resources about the Finite State Recognisers!

I love finite things:
- finit state machines
- filters with finit impulse response
and finite downloading times of course!


Hutch, have you got anywhere? If you won't mind I'd like to learn about your efforts and results.
(I wrote it public, because anyone else might be interested too.)

Thanks and greets, Gábor

hutch--

gabor,

If I can get it going I will post it but its mainly nutting out how to do it efficiently at the moment. I have the recursion and memory allocation working properly but I have not worked out the right way to place the pointers in the left or right locations yet. I have been testing the written output in hex and the charactes and terminators are being writen in the correct location. I need to make the original memory block so it can be reallocated and this means I will have to store an offset from the base address  in the left or right locations as each reallocation of the main memory will change the base address.

I need to have the procedure so it is self contained and re-entrant as it makes it a lot easier to use and will be able to manage a number of trees at one time.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Biterider

Hi Hutch
Reading your post, it comes to my mind that what you are looking for is the structure of the index files of DBase3. The logic and code to generate it is not easy... I've implemented it for ObjAsm32 DBase and Index objects. You can get it from our Homepage from the DBase project.

Regards,

Biterider

Randall Hyde

Quote from: hutch-- on March 19, 2005, 01:47:23 PM
I wondered if anyone has some C code floating around that does a character tree without just using a prebuilt HLL routine. I know basically how to code this design but having a look at an existing working one would save me a lot of time.

Your finite state machine (which is the term you should probably be searching for on the net) will run quite a bit faster if you process four characters at a time rather than one.  Just "blank pad" strings that are not a multiple of four characters and then do dword comparisons.

e.g., assume eax contains first four chars, ebx contains next four chars

   cmp eax, asDword( "movs" )
   jb tryMov
   ja tryMovz

   cmp ebx, asDword( "d    " )
   ja tryMovsXW
   jb tryMovs_B
  ; << it's movsd>>


tryMov: cmp eax, asDword( "mov " )
           jne NotReservedWord
          ; << it's mov >>

tryMovz: cmp eax, asDword("movz" );
            ja etc
            jb etc2
            cmp ebx, asDword( "    " )
            jne NotReservedWord
            ; it's movz


etc., etc.
The only problem with a pure table-driven FSM is that you'll bash a lot of cache memory with access to the table (ditto for accessing elements of a binary search tree). True, the code above also bashes the cache by accessing lots of code, but code access to the cache is often more efficient than data access.  The nice thing about the code above is that it uses 1/4 the number of instructions for the comparisons (as it is comparing four bytes at a time rather than one).  This is comparable to the scheme that I use in HLA v2.0, though I use a hash function rather than a pure finite-state machine in order to quickly jump to the sequence that checks for a given reserved word. At a cost of only two instruction per character for the hashing function, I'm fairly sure the result is a bit faster than using a pure FSM.

One other problem I see with your code is that you explicitly assume that there is an EOS marker. To me, this suggests that you've already scanned the word in memory and moved it to a string buffer (marked with an EOS, such as zero). In fact, this scanning and copying can be combined with the word lookup, sparing you the copy operation (which can be expensive). Of course, you need a slightly improved FSM machine that recognizes a set of "not identifier symbols" rather than just EOS to mark the end of your strings.
Cheers,
Randy Hyde

Mark_Larson


  I use hashing for strings.  My favorite is a Zobrist Hash.  Inserting a new string and finding a string that already exists is extremely fast.   It also does not suffer from the problem of having a lopsided tree.  When you have inserted 1000 words, inserting a 1001st word requires a lot of compares with the kind of tree you are looking at.  With Zobrist Hashing ( and with hashing in general) the insertion and retrieval times do not grow as the number of strings grow.  They both stay the same.  Zobrist hashing XORs the characters in the string with an array of randomly generated dwords, and then ANDs it with the table size, which is a power of 2.  If you want to learn more, let me know.   
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

hutch--

Mark,

I have hash algos that work fine and are fast enough but I am looking at tree construction even though it is probably slower as it can be repeatedly reallocated as the memory for the data is filled.

Randy,

Thanks for the example, I will have a play with it.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

gabor

Hi there!

I must say those are nice solutions you've come up with. But I must also say that I'm feeling a bit low, because nobody commented about what I wrote. Is it so because I didn't post any code? Anyway I promise I'll implement it soon.

Hutch,
I don't really see why do you stick to binary trees. I think binary is not the best way when handling non-boolean variables.

Hashing in general is faster than trees, but does not have as flexible structure as trees. If navigation and complex operations (like pattern search, join) is not needed hashing is quite adequate.

B the way, how many strings do you want to store? Is it possible that the brute force method, storing in an array with an index array would be enough?

I "feel" that for really large databases (with million entries) a combined algo would be satisfactory.

I would like to see a working hashing code, I I'll ship a working code for word trees! Deal? :bg

Greets, Gábor