Print Page - Character Tree code question

Title: Character Tree code question
Post by: hutch-- on March 19, 2005, 01:47:23 PM

I wondered if anyone has some C code floating around that does a character tree without just using a prebuilt HLL routine. I know basically how to code this design but having a look at an existing working one would save me a lot of time.

Title: Re: Character Tree code question
Post by: doomsday on March 21, 2005, 05:48:52 PM

Do you mean a binary tree which uses an ascii string as the sort key?

regards,
-Brent

Title: Re: Character Tree code question
Post by: Tedd on March 21, 2005, 06:37:24 PM

Or do you mean as in a "tree-view" in textual format?

Here's something I have lying around (in perl/modem-line-noise :bdg)

Code Select


#@_[0] = hash ref (sub-tree)
#@_[1] = current indenting string
sub prin {
    my $h = @_[0];
    my $s = @_[1];
    my @k = sort keys %{$h};
    if (@k > 0) {
        if (@k > $maxkids) {
            $maxkids = @k;
        }
        foreach my $p (@k) {
            my $ns = ($p eq $k[$#k])?"     ":" |   ";
            print("$s +-- $p\n");
            prin($h->{$p},($s.$ns));
        }
    }
}

Title: Re: Character Tree code question
Post by: GregL on March 21, 2005, 08:14:33 PM

I did a Google search on 'Character Tree', this is what came up first.
(http://graphics.samsclub.com/images/products/0009114178030_LG.jpg)

:lol

(And nothing related to programming).

Title: Re: Character Tree code question
Post by: hutch-- on March 22, 2005, 01:37:59 AM

Its a character based tree that is a normal binary tree using a 3 member structure something like,

Code Select


    CTREE STRUCT
      left dd ?
      right dd ?
      value dd ?
    CTREE ENDS

The "value" member is multipurpose, can either store the character or if its the end of the word, it stores a pointer to whatever you wish to associate with the word. Its standard node / leaf layout with the leaf holding the pointer while the nodes hold characters.

I have got the recursion going and have rudimentary memory handling working but its useful to look at how its ben done before because its been many years since I have coded a binary tree and I am very rusty in the area. I want to be able to try this technique out against a hash table for words and while I thnk the hash table is probably faster, a binary tree is probably more flexible with unknown word counts which may end up very large.

Title: Re: Character Tree code question
Post by: hutch-- on March 22, 2005, 07:56:34 AM

This is the logic I am after.

mov
movs
movsb
movsd
movsw
movsx
movzx

m
|
o
|
v
|
z---s-------------------0
| | |
x x---w---d---b---0 [7]
| | | | | |
0 0 0 0 0 [6]
| | | | |
[1] [2] [3] [4] [5]

1 = movzx
2 = movsx
3 = movsw
4 = movsd
5 = movsb
6 = movs
7 = mov

These have been loaded in reverse order (longer first) so that the zero terminator is among the characters to test. The number in [3] square brackets is a location for a DWORD pointer.

Title: Re: Character Tree code question
Post by: gabor on March 22, 2005, 10:38:50 AM

Hi!

In my language (hungarian) this theory is called "word-tree". According to a book on algos, the english name for them is trie created from the word retrieval. These trees are used for storing and retrieving strings/words.
Here is the theory:
- we have a set of symbols S (alphabet) containing all the characters that are used for creating words.
- words are finite long sequences of members of S, the set of these words is marked with S^*.
- a sorting is needed, in case of words of natural languages the sorting is lexicographic (follows the alphabetic order).
- a special stop symbol is used to indicate the end of the words, this symbol is the last one in the sorting order.
- a node in the tree is either NULL or stores pointers to other nodes. The maximum number of the pointers is |S|+1. So the "biggest" node would contain pointers to all those nodes which hold different characters plus the stop symbol.(example: the alphabet is {a,b,c}, the max number of pointers is then 3+1) This means the trie is not a binary tree!
- words can be retrieved by getting the node of the first character and then simply walking along the branches.

My book states that the trie implemented with arrays is a very efficient lookup for words.

Here's a figure for Hutch's words:
(http://members.chello.hu/nemeth.gabor1/trie.gif)

If every node was stored in arrays, the arrays hold the pointers to other arrays (nodes).
In the case shown on the figure this means we have 17 nodes in total
- one 5 long array (node s)
- one 3 long array (node v)
- fifteen 1 long array, seven containing no pointers, just the eos symbal.

I believe there could be many simplifications and optimizations made...

I hope, I could give some ideas or infos.

ps: Writing this post made me keen on implementing a trie building and look up algo. :P

Greets, Gábor

Title: Re: Character Tree code question
Post by: Tedd on March 22, 2005, 12:28:04 PM

Quote from: hutch-- on March 22, 2005, 07:56:34 AM
This is the logic I am after.
...

Well why didn't you say so :bdg

You might want to look into Finite State Recognisers, which are designed for exactly that. Although they're table-based, so you do get a certain amount of wastage depending on the dictionary.
The problem with binary trees is that they become very lop-sided, especially if you insert in-order - which is the point of red-black trees. But then, in this instance, I suppose it's a static tree once it's built, so you could balance it afterwards :bg

Title: Re: Character Tree code question
Post by: gabor on March 24, 2005, 08:26:44 AM

Hi again!

Tedd, please don't have mercy on me and hit me with an explanation or link to resources about the Finite State Recognisers!

I love finite things:
- finit state machines
- filters with finit impulse response
and finite downloading times of course!

Hutch, have you got anywhere? If you won't mind I'd like to learn about your efforts and results.
(I wrote it public, because anyone else might be interested too.)

Thanks and greets, Gábor

Title: Re: Character Tree code question
Post by: hutch-- on March 24, 2005, 08:45:42 AM

gabor,

If I can get it going I will post it but its mainly nutting out how to do it efficiently at the moment. I have the recursion and memory allocation working properly but I have not worked out the right way to place the pointers in the left or right locations yet. I have been testing the written output in hex and the charactes and terminators are being writen in the correct location. I need to make the original memory block so it can be reallocated and this means I will have to store an offset from the base address in the left or right locations as each reallocation of the main memory will change the base address.

I need to have the procedure so it is self contained and re-entrant as it makes it a lot easier to use and will be able to manage a number of trees at one time.

Title: Re: Character Tree code question
Post by: Biterider on March 24, 2005, 09:03:38 AM

Hi Hutch
Reading your post, it comes to my mind that what you are looking for is the structure of the index files of DBase3. The logic and code to generate it is not easy... I've implemented it for ObjAsm32 DBase and Index objects. You can get it from our Homepage (http://objasm32.tripod.com) from the DBase project.

Regards,

Biterider

Title: Re: Character Tree code question
Post by: Randall Hyde on March 24, 2005, 05:18:39 PM

Quote from: hutch-- on March 19, 2005, 01:47:23 PM
I wondered if anyone has some C code floating around that does a character tree without just using a prebuilt HLL routine. I know basically how to code this design but having a look at an existing working one would save me a lot of time.

Your finite state machine (which is the term you should probably be searching for on the net) will run quite a bit faster if you process four characters at a time rather than one. Just "blank pad" strings that are not a multiple of four characters and then do dword comparisons.

e.g., assume eax contains first four chars, ebx contains next four chars

cmp eax, asDword( "movs" )
jb tryMov
ja tryMovz

cmp ebx, asDword( "d " )
ja tryMovsXW
jb tryMovs_B
; << it's movsd>>

tryMov: cmp eax, asDword( "mov " )
jne NotReservedWord
; << it's mov >>

tryMovz: cmp eax, asDword("movz" );
ja etc
jb etc2
cmp ebx, asDword( " " )
jne NotReservedWord
; it's movz

etc., etc.
The only problem with a pure table-driven FSM is that you'll bash a lot of cache memory with access to the table (ditto for accessing elements of a binary search tree). True, the code above also bashes the cache by accessing lots of code, but code access to the cache is often more efficient than data access. The nice thing about the code above is that it uses 1/4 the number of instructions for the comparisons (as it is comparing four bytes at a time rather than one). This is comparable to the scheme that I use in HLA v2.0, though I use a hash function rather than a pure finite-state machine in order to quickly jump to the sequence that checks for a given reserved word. At a cost of only two instruction per character for the hashing function, I'm fairly sure the result is a bit faster than using a pure FSM.

One other problem I see with your code is that you explicitly assume that there is an EOS marker. To me, this suggests that you've already scanned the word in memory and moved it to a string buffer (marked with an EOS, such as zero). In fact, this scanning and copying can be combined with the word lookup, sparing you the copy operation (which can be expensive). Of course, you need a slightly improved FSM machine that recognizes a set of "not identifier symbols" rather than just EOS to mark the end of your strings.
Cheers,
Randy Hyde

Title: Re: Character Tree code question
Post by: Mark_Larson on March 24, 2005, 06:51:13 PM

I use hashing for strings. My favorite is a Zobrist Hash. Inserting a new string and finding a string that already exists is extremely fast. It also does not suffer from the problem of having a lopsided tree. When you have inserted 1000 words, inserting a 1001st word requires a lot of compares with the kind of tree you are looking at. With Zobrist Hashing ( and with hashing in general) the insertion and retrieval times do not grow as the number of strings grow. They both stay the same. Zobrist hashing XORs the characters in the string with an array of randomly generated dwords, and then ANDs it with the table size, which is a power of 2. If you want to learn more, let me know.

Title: Re: Character Tree code question
Post by: hutch-- on March 25, 2005, 12:55:21 AM

Mark,

I have hash algos that work fine and are fast enough but I am looking at tree construction even though it is probably slower as it can be repeatedly reallocated as the memory for the data is filled.

Randy,

Thanks for the example, I will have a play with it.

Title: Re: Character Tree code question
Post by: gabor on March 25, 2005, 09:07:28 AM

Hi there!

I must say those are nice solutions you've come up with. But I must also say that I'm feeling a bit low, because nobody commented about what I wrote. Is it so because I didn't post any code? Anyway I promise I'll implement it soon.

Hutch,
I don't really see why do you stick to binary trees. I think binary is not the best way when handling non-boolean variables.

Hashing in general is faster than trees, but does not have as flexible structure as trees. If navigation and complex operations (like pattern search, join) is not needed hashing is quite adequate.

B the way, how many strings do you want to store? Is it possible that the brute force method, storing in an array with an index array would be enough?

I "feel" that for really large databases (with million entries) a combined algo would be satisfactory.

I would like to see a working hashing code, I I'll ship a working code for word trees! Deal? :bg

Greets, Gábor

Title: Re: Character Tree code question
Post by: hutch-- on March 25, 2005, 09:35:32 AM

The tree design I want is a simple binary tree as I am going to store words in it that are user defined that contain a user value, usualy a pointer to other text or data. I have hash tables working fine and I have a very fast known word technique but I want a standard binary tree because of its flexibility.

The test pieces I have played with so far have been recursive but I think it can be done simpler by iteration without the stack overhead and parameter passing required. I want the simple binary tree because the tree build time is a factor as well as the word lookup time. With this task, I need only build the tree and test words, I don't need deletion as the task will allow me to dump the entire memory once its finished.

Title: Re: Character Tree code question
Post by: Biterider on March 25, 2005, 01:07:32 PM

Hi Mark!
I hope that the offer for Hutch also valid for me. I'm interested in the Zobrist algo. If you have some info or an implementation to share, please let me know. :wink

Regards,

Biterider

Title: Re: Character Tree code question
Post by: Mark_Larson on March 25, 2005, 01:38:42 PM

Quote from: Biterider on March 25, 2005, 01:07:32 PM
Hi Mark!
I hope that the offer for Hutch also valid for me. I'm interested in the Zobrist algo. If you have some info or an implementation to share, please let me know. :wink

Regards,

Biterider

Sure. Here's a place to get some sample source code and some information. They actaully compare quite a few different hashes and show how fast they run relative to each other, and how many collisions they get. Zobrist hashing has the least amount of collisions for the testing he did.

http://burtleburtle.net/bob/hash/doobs.html

He also has a main page that has lots of hash stuff

http://burtleburtle.net/bob/hash/

Title: Re: Character Tree code question
Post by: Biterider on March 25, 2005, 04:22:39 PM

Thanks :U

Title: Re: Character Tree code question
Post by: rea on March 26, 2005, 01:28:20 AM

Hi there.

Time a go I was playing with something like the attached picture, tought I was stoped because reading something aparently the idea was called a dictionary algorithm??? and because I thinked already such thing exist, I stop my work :D.. LOL (sorry Im that way :S.. trying change that) and also not continue reading more deep in the dicrionary algorithm then at the end I dont know if my taught that what I was doing where the same or not, was only a feeling.......

OK, anyway, here is the "graph" I start doing that for match words at runtime and for generate words with a variety of nice repetition... like supose we have goooooooooooooooool that can be generated with g->o->l... where the nuber of "o" is not handled or puted there for mark ;).. anyway perhaps I will do a little work at this???

gabor, interesting that, then I supose I must define some rules and things like that (a more mathematical way...) in my way of think I supose, also was suposed to compact the data necesary for represent the complete subset, also have some adventages in some cases, for example how you check if goooooooooool is valid with a hash (because is mmore like static) tought like with a tree can be generated and checked very easely, because handled in some way can generate derived words but not defined in the first tree like gogogogooool tought with a hash will be diicult to find that such thing mean gol ;), with this type of tree can be checked and generated very easely... I think ... whant a proof???? :P.....

with the tree (also I remember when I constructructed that I was not thinking in a tree, I was thinking in some more like a spider (or a "enredadera") ), then is more a spider... perhaps I will do more work on this if you say me that is not like the Dictionary algorithm or representation.... :P or perhaps no more than only think in the posbilities????

Im thinking that I have a friend (girl) that I call normally (nickname) gogomounstrua but I can generate (if playng futball) a little modification like gogoooolmoustra not only mean that is the same nickname that I say to she but also that a gol (goal) was marked ;)....

By the way, I also make a mistake typing the second time, I delete a "n" between "u" and "s" but a human at less I normally cal she in any of the two ways.. gogomoustrua or gogomounstrua... and I still get that is she ;).

By the way I thing that I not clarificate in the graph....., in each node can by attached a information related to the path followed..... in the mean time for identificate if was a gol in any of the ways: gol, gooooogogogogogogoooooooool, goooooooooool or any other or was a diferent path like gogo or go (the last mean go) the anterior can mean go or the dance ala gogo .... I supose ;)

[attachment deleted by admin]

Title: Re: Character Tree code question
Post by: hutch-- on March 26, 2005, 05:00:40 AM

I found some reasonable reference for creating a simple binary tree.

The document allows it to be reproduced under these conditions.

Quote
This is article #110 in the Stanford CS Education Library. This and other free CS materials are available at the
library (http://cslibrary.stanford.edu/). That people seeking education should have the opportunity to find it.
This article may be used, reproduced, excerpted, or sold so long as this paragraph is clearly reproduced. Copyright
2000-2001, Nick Parlante, nick.parlante@cs.stanford.edu.

[attachment deleted by admin]

Title: Re: Character Tree code question
Post by: James Ladd on March 26, 2005, 05:27:52 AM

hutch,
I have examples of many algo's in C.
There are also many on the net.
If you want one in C I can give you one, I could also translate it if you wish.
Let me know and Ill help.

Title: Re: Character Tree code question
Post by: hutch-- on March 26, 2005, 07:53:45 AM

James,

A tree is C would be fine if you have the source available. I have been slowly grinding through test pieces getting bits of it to work but a complete working example in C would be very useful.

Title: Re: Character Tree code question
Post by: rea on March 26, 2005, 04:27:46 PM

hutch--

Your fist representation not feet completely a definition of a binary tree ;)..... (one father 2 childs that have the same level)

A little forat show where is the problem ;).

Code Select


 m
   \
    o
      \
       v
      /  \
     z    s-------------------0
     /\
    0  x   x---w---d---b---0  [7]
   /  /     |   |   |   |
 [1] 0      0   0   0  [6]
    /       |   |   |
   [2]     [3] [4] [5]

I suguest what gabor as suguested plus the suguestion of randall to search for "whole" parts and that give my suguestion ;).

I suguest go in a construction like:

Code Select


         |->s
         |
         |->sb
         |
         |->sd
         |
[mov,6]---->sw
         |
         |->sx
         |
         |->zx



                    |->w
                    |
[mov,2]---->[s,4]----->b
         |          |
         |->zx      |->d
                    |
                    |->x

Also you can get more sense if you use separate emanings ;) (something I have talked attached to each node ;) )

Code Select



                    |->w <-|
                    |      |
[mov,2]---->[s,4]----->b <-|
         |          |      |
         |->zx      |->d <-|
                    |      |
                    |->x   |
[stos,3]-------------------|
[scas,3]-------------------|

Title: Re: Character Tree code question
Post by: James Ladd on March 26, 2005, 10:23:10 PM

hutch,
how soon do you need it ?

Title: Re: Character Tree code question
Post by: hutch-- on March 27, 2005, 02:40:33 AM

Yesterday. :toothy

Title: Re: Character Tree code question
Post by: James Ladd on March 27, 2005, 04:27:33 AM

Title: Re: Character Tree code question
Post by: hutch-- on March 27, 2005, 05:48:20 AM

James,

Just the C code if you can find it, simpler the better.

rea,

Thanks for your comments, I have a particular approach with handling the terminator that does not require another leaf node. Place the return value in the left structure member and exit. For testing words against the tree, if it find a zero character that does not match, it will branch right to test for the next character. The tree looks something like this.

Code Select


movsx      => 0   0 <= movs
               \ /
movsb    => 0   x
             \ /
movsw  => 0   b   0 <= movsz
           \ /     \
movsd=> 0   w       x   0 <= mov
         \ /         \ /
          d           z
           \         /
            \       /
             \     /
              \   /
               \ /
                s   -  <= not used
                 \ /
                  v   -  <= not used
                   \ /
                    o   -  <= not used
                     \ /
                      m

Title: Re: Character Tree code question
Post by: James Ladd on March 27, 2005, 10:33:01 PM

hutch,
i have it right here, just have to type it in. :)

Title: Re: Character Tree code question
Post by: rea on March 28, 2005, 12:59:45 AM

I have a little one ;).. finded with look Im cleaning my documents :) and have finded a app for the school, was something for search for capital/country it is not a balanced tree or some like that in that way is a raw tree, see the bt.c (I remember that work ok... :P)

[attachment deleted by admin]

Title: Re: Character Tree code question
Post by: Tedd on March 29, 2005, 10:53:16 AM

Quote from: gabor on March 24, 2005, 08:26:44 AM
Tedd, please don't have mercy on me and hit me with an explanation or link to resources about the Finite State Recognisers!

Just for you :lol

Strings to recognize
mov movs movsb movsd movsw movsx movzx

Table for the recognizer

.abcdefghijklmnopqrstuvwxyz
0 -------------1-------------
1 ---------------2-----------
2 ----------------------3----
3 #------------------4-------
4 #-#-#------------------##-5
5 ------------------------#--

the . means null (end of string) - it's required when a string is also the start of another string (eg. mov is the start of movs/movsb/etc)

Examples
input = "movsw"
start @ state 0
(m,0) = 1 ;in state 0, with char 'm', therefore move to state 1
(o,1) = 2
(v,2) = 3
(s,3) = 4
(w,4) = # ;reached accept-state, so "movsw" is recognized

input = "mov"
start @ state 0
(m,0) = 1
(o,1) = 2
(v,2) = 3
(.,3) = # ;actually no more input, so check '.' ==> recognize

input = "moo"
start @ state 0
(m,0) = 1
(o,1) = 2
(o,2) = - ;failed!! (not recognized)

input = "jmp"
start @ state 0
(j,0) = - ;failed!! (not recognized)

for general instruction set (no fpu/mmx/sse) ~ 260 instructions, alphabet = {null,a..z,0..9}: table is 220 states :- size < 8kB

(inserting strings by length, rather than alphabetically, allows to use state indices as offsets from current - allowing >256 states; while still allowing indices to be bytes)

Title: Re: Character Tree code question
Post by: QvasiModo on March 31, 2005, 08:30:38 PM

Hutch--, the biggest problem with this trees is not speed but size. Maybe you could "compact" the tree by storing variable length strings instead of just characters (I think this variant of trie is called "patricia").

Title: Re: Character Tree code question
Post by: hutch-- on April 01, 2005, 06:08:56 AM

I have the test piece nearly working but its not much related to any of the tree structures I have so far seen. With the example by Rea and what I could scour off the internet, it seems what I had in mind is different to the common notion of a binary tree. The one I have nearly going is iterative, not recursive and it does the comparison on match/nomatch, not greater/less than.

Its shape will be dictated by the order of words written to it and it does not look like it will balance all that well but as an iterative algo it should be fast enough anyway but I won't know that until its up and reliable so I can benchmark it.

What is critical to its operation is that is can be used while scanning text to both add new words and test words so running a balancing phase is not really possible. More to come later.

The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: hutch-- on March 19, 2005, 01:47:23 PM