Is there any way to build a hash table in assembler for key/value pairs, where the key is a DWORD and the value a string?
Or is it better to work with an API such as AddAtom, GetAtomName, etc.?
If the Atom APIs do the trick then use them.
If you look through this site then you will find an excellent HashTable API by Hutch.
It should not be too hard to change the key to a hash of a DWORD.
Rgs
Hi!
Exactly where can this hash API be found?
Greets, gábor
Hi,
Use two tables, one with the keys and the other with the values, such that the key for value 1 is at index 1 and so on. You can get the corresponding value for a key using the key's index.
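In C the idea would look something like this sketch (find_value, the names and the sizes are all just illustrative, not any real code):
#define NKEYS 128
unsigned int keys[NKEYS];      /* table 1: the DWORD keys */
const char  *values[NKEYS];    /* table 2: the value for keys[i] is values[i] */
unsigned int nused = 0;
const char *find_value(unsigned int key)
{
    for (unsigned int i = 0; i < nused; i++)   /* scan the key table */
        if (keys[i] == key)
            return values[i];                  /* same index in the value table */
    return 0;                                  /* key not stored */
}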
Thomas :U
This is the last piece I remember doing that did word replacements on a source file from an equates file. It does its hash from an array of primes. It will stop with an error if the equates file has duplicates or other problems, and it seems to run reasonably fast.
[attachment deleted by admin]
What is a hash table and what are they used for?
Aero,
A classic but not exclusive use is with source files, where you replace all of the equates with their actual values. It is done from a normal style of include file that is loaded into the hash table; then, when the source file is scanned word by word, any of the words found in the list are replaced with the replacement value.
Some databases use a much larger and more powerful hash system for indexing keys to data items in the database.
So it is just a lookup table? Why then is it called a "hash" table?
Quote from: hutch-- on May 02, 2005, 03:37:34 PM
It does its hash from an array of primes.
What does this mean?
Here is a link to the corresponding wikipedia article: http://en.wikipedia.org/wiki/Hashtable
That should answer all of your questions regarding hashtables.
Aero,
There are various ways of matching a word to a replacement value, and a hash table is only one of them. With very simple stuff you can scan an array for a value, or you can create a slightly better version which is a linked list, but if you are serious about finding words FAST, you start looking at data structures.
The main players are either hash table layouts or binary tree type structures, the hash table generally being faster where the tree is more flexible. Many of the older assemblers/compilers used hash tables, but you generally allocate them first and then fill them up, so you get the problem of running out of hash space. Better and later stuff tends to use tree structures that can be reallocated on the fly if more memory is needed.
A hash table works by allocating a known number of fixed-size slots in memory (generally called buckets), and then for each word added to the hash table, it uses the characters of the word to try and generate a unique number that fits within the range of the table. The basic theory is simple, but because characters are base 256 instead of base 10 like numbers, you can never get a perfect hash by this method, so at times you get two different words trying to use the same location. The term is "collision".
A hash table design MUST handle collisions in a reliable way, and there are different techniques to do this. The simplest tries the next slot to see if it's empty, but this leads to a problem called "clustering". A better technique is to use a variable to try another slot, and a very convenient one is the length of the word. The hash table must have a wrap-around method so you don't run off the top of it, and to make sure you don't loop endlessly, the hash table count should always be a prime number.
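As a rough illustration, a sketch of that scheme in C could look like the following; the table size, the hash mix and the probe details are only example choices, not any particular implementation:
#include <string.h>
#define TABLE_SIZE 1009                    /* prime, so a fixed step cannot loop early */
static const char *table[TABLE_SIZE];      /* 0 = empty bucket */
static unsigned hash_word(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;  /* derive a number from the characters */
    return h % TABLE_SIZE;
}
/* Returns the slot used for the word, or -1 if the table is full.
   The table keeps the caller's pointer; it does not copy the string. */
static int insert_word(const char *word)
{
    unsigned idx  = hash_word(word);
    unsigned step = strlen(word) % TABLE_SIZE;   /* word length as the probe step */
    if (step == 0)
        step = 1;                          /* the probe must always move */
    for (unsigned tried = 0; tried < TABLE_SIZE; tried++) {
        if (table[idx] == 0) {             /* empty slot: claim it */
            table[idx] = word;
            return (int)idx;
        }
        if (strcmp(table[idx], word) == 0) /* the same word is already stored */
            return (int)idx;
        idx = (idx + step) % TABLE_SIZE;   /* wrap around and keep probing */
    }
    return -1;                             /* every slot occupied */
}
Because TABLE_SIZE is prime, any non-zero step is coprime with it, so the probe visits every slot before repeating; that is the reason for the prime count.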
A hash table, if it's well designed, only slows up as it is getting close to being filled, and this means a good hashing algo to get the widest range of numbers from unique words, plus a fast collision handling technique. A tree design never fills up if it's written to be reallocated, and while they do get slower as they get bigger, it's a linear time increase, not a sudden rise like a hash table before it fills up.
Algo design is the PHUN stuff in assembler and while it can be hard work, it is almost restriction free and open to different architectures that take advantage of speed and size.
Using a hash table with a fixed size array is known as open addressing - see the wikipedia link by mnemonic.
I generally prefer a hash table using linked lists (chaining). You can use whatever allocation functions are available to expand the table when it fills the first set of table entries. The number of hash values (buckets) is fixed, but the number of entries is open-ended. The search time depends on the average length of the list in each bucket. If you have lots of input data that tends to cluster, you can potentially get better average performance by moving found entries to the head of the list. A hash table with linked lists does not have the sudden degradation of open addressing, as you are only searching through entries that have the same hash value.
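A rough C sketch of that layout, where the bucket count, the hash mix and the move-to-front policy are purely illustrative choices:
#include <stdlib.h>
#include <string.h>
#define NBUCKETS 256                      /* fixed number of hash values */
typedef struct entry {
    const char   *key;
    const char   *value;
    struct entry *next;                   /* chain of entries sharing a hash */
} entry;
static entry *buckets[NBUCKETS];
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}
/* The entry count is open-ended: each insert just allocates a node. */
static entry *insert(const char *key, const char *value)
{
    unsigned b = hash(key);
    entry *e = malloc(sizeof *e);
    if (!e)
        return NULL;
    e->key = key;
    e->value = value;
    e->next = buckets[b];                 /* link in at the head of the chain */
    buckets[b] = e;
    return e;
}
/* On a hit, move the entry to the head so clustered lookups speed up. */
static entry *lookup(const char *key)
{
    unsigned b = hash(key);
    entry *e = buckets[b], *prev = NULL;
    while (e) {
        if (strcmp(e->key, key) == 0) {
            if (prev) {                   /* unlink, then re-link at the head */
                prev->next = e->next;
                e->next = buckets[b];
                buckets[b] = e;
            }
            return e;
        }
        prev = e;
        e = e->next;
    }
    return NULL;                          /* only entries with this hash were searched */
}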
Yes,
It is a good distinction of type, but chaining at a given bucket location to a linked list or similar is a substantial performance hit, and I normally associate this type of design with languages like Java that are so slow it does not matter. An open addressing hash is substantially faster than a linked list hash, and from memory a tree structure outperforms the linked list type as well while retaining its basic flexibility advantage.
Perhaps SmallAlloc could be useful here.
http://www.masmforum.com/simple/index.php?topic=1006.0
A probabilistic skip list could also be used to maintain the links in the bucket.
But hey, it really depends on what you want to use the table for.
Hi! I've read through the post, and it seems like Hutch knows a lot about the theory. I think he explained all the important points briefly.
I would like to add some things I have learned and experienced so far.
The power of hashing shows itself when optimal average performance is needed: when the task is to store and retrieve (look up) many records (DWORDs, strings, structures), just like in the mentioned situation of symbol tables in compilers.
I too bumped into hashing with my LZW encoder, which also stores symbols. (Actually I found that my linear storing and retrieving is a very crude and primitive solution that makes my algo run incredibly slowly.)
Linear storing
- best case and worst case are the same: I put the next record at the end of the row. It is done in 1 step. This part is fast...
Linear retrieving
- best case: the first record is looked for; it is retrieved in 1 step
- worst case: the last record is needed, so all the records must be read through. This is n steps, very slow
- when the record is not stored, this can be found out only when the whole list has been read through. Again n steps, very bad performance
Hashing
Storing and retrieving are based upon the same idea: a hash function which produces keys for the records.
- best case: the keys generated by the hash function are always unique; for every record a different key is given. The cost is the steps of creating the key plus 1 step to store the record.
- worst case: all keys are the same (the probability of this is 0 in practice); the insert takes n steps plus the cost of generating the key.
When more records get the same key, a collision occurs (like Hutch explained), which can be handled in different ways. The most primitive way is the linear way again: this method tries to store the data in the space next to that of the key; when this neighbor is occupied then the next one is tried, and so on... This can end in a long cluster.
A better solution uses pseudo-random indexes. I learned that the best performance is achieved by using a second hash function; this is double hashing.
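A tiny C sketch of the double hashing idea, assuming a prime table size; both mixing functions here are arbitrary examples of mine:
#define TSIZE 509                        /* prime table size */
static unsigned hash1(const char *s)     /* picks the starting slot */
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TSIZE;
}
static unsigned hash2(const char *s)     /* picks the probe step */
{
    unsigned h = 0;
    while (*s)
        h = h * 131 + (unsigned char)*s++;
    return 1 + h % (TSIZE - 1);          /* never 0, so the probe always moves */
}
/* Slot visited on the i-th probe: keys that collide on hash1 usually
   have different hash2 steps, so their probe sequences diverge and
   clustering is avoided. */
static unsigned probe(const char *s, unsigned i)
{
    return (hash1(s) + i * hash2(s)) % TSIZE;
}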
I'm about to create my own hash library, I'll post it when finished.
I have a question as well: what do you recommend, how could variable-length strings be used to create a 32-bit key? (This could be useful not only for hashing.)
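To show the kind of function I mean, here is a rough C sketch; the constants are the well-known 32-bit FNV-1a ones, just one possible choice, not a recommendation:
/* Fold a variable-length string into one 32-bit key. */
unsigned int string_key(const char *s)
{
    unsigned int key = 2166136261u;      /* FNV-1a offset basis */
    while (*s) {
        key ^= (unsigned char)*s++;      /* mix in each character */
        key *= 16777619u;                /* FNV-1a 32-bit prime */
    }
    return key;                          /* reduce modulo the table size as needed */
}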
Greets, gábor
Hi,
I have a system in my OS for matching commands to their corresponding functions. First I go through a list of commands declared as
length,"command"
and find the index of the command entered. Now I use this index into a table of addresses of functions. Is this a hash table?
Thomas :U
NO, a hash table is where you use a hash function to make your lookup table (can't think why you would want to do that, but hey).
Aero,
Have a look at the example I posted: for any word in the equate file, it will replace that equate in a source file with the value of that equate, and the important thing with this type of application is that it does ALL of the replacements in 1 pass. The example is a hash table, but you can also use a tree to do the same thing. In comparison, a lookup table is very slow if you have to scan it from the beginning until you find the word you are searching for.
Typical use of this type of capability is in assemblers and compilers, and another application for this type of data structure is the keys to data in a BIG database, where searching sequentially would be extremely slow.
With both hash tables and trees you can build the structure as you scan through text, testing each word to see if it is already in the data structure. Neither technique is simple to code, but they can do things much faster than simplistic techniques.
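Schematically, the one-pass replacement loop has roughly this shape in C; lookup() here is only a stand-in for whatever hash or tree search is used, not a real API:
#include <stdio.h>
#include <ctype.h>
extern const char *lookup(const char *word);  /* stand-in: returns NULL if absent */
void replace_words(FILE *in, FILE *out)
{
    char word[256];
    int c, n = 0;
    while ((c = fgetc(in)) != EOF) {
        if (isalnum(c) || c == '_') {         /* accumulate word characters */
            if (n < (int)sizeof word - 1)
                word[n++] = (char)c;
        } else {
            if (n) {                          /* a word just ended: one table lookup */
                word[n] = '\0';
                const char *rep = lookup(word);
                fputs(rep ? rep : word, out); /* emit replacement or the original */
                n = 0;
            }
            fputc(c, out);                    /* separators pass straight through */
        }
    }
    if (n) {                                  /* flush a trailing word at EOF */
        word[n] = '\0';
        const char *rep = lookup(word);
        fputs(rep ? rep : word, out);
    }
}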
Actually, it appears that both Thomas and AeroASM are correct, although the method described by Hutch and Thomas is not one that I am familiar with. Programming in Perl uses a method to build a hash table 'on the fly' similar to what AeroASM describes, so I guess he is referring to that method. It is used to build a table as a result of a search function involving a large database. The data is then hex encoded in a certain way as an ASCII representation, which is why it is called a 'hash.' The method described by Thomas and Hutch is not a 'proper' hash and I wonder why it is called such?
Paul
The logic of a hash table is the use of a "hash" algo to correlate a numeric value derived from a word or text to a location in an array of predetermined member lengths. The method that TenKey described handles collisions by appending to or arranging a sequential list at every bucket that is used, where the alternative is looking for the next empty bucket.
The method Thomas uses is a length based tree with a lookup table for each length. Works fine on small word counts but feed it 100000 words and you will understand why you need a more sophisticated data structure.
Speed is the method of comparison: as a well written collision handling technique is fast, you balance the hash complexity against its speed, and you trade between collisions and complexity to pick the fastest method. The advantage of a hash over a lookup is that its calculation time plus the direct jump to the location is shorter, on any reasonable amount of data, than sequentially scanning for words.
Does it work like this: you compute the hash of a string, then look that hash up using the hash value as an index into an array, then the other field of that row contains a pointer or something?
Who are you asking, me or Hutch? If it is me, then that kind of explains it. Here is an example:
# Read in all the variables set by the form
read (STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
@cgiPairs = split(/&/,$buffer);
foreach $cgiPair (@cgiPairs)
{
# Create the hash
($name,$value) = split(/=/,$cgiPair);
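# Decode the URL encoding: turn '+' into spaces and %XX into raw bytes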
$name =~ s/\+/ /g;
$name =~ s/%(..)/pack("c",hex($1))/ge;
$value =~ s/\+/ /g;
$value =~ s/%(..)/pack("c",hex($1))/ge;
# Plug SSI security hole
$value =~ s/<!--(.|\n)*-->//g;
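# Join multiple values for the same field name with a NUL separator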
$form{$name} .= "\0" if (defined($form{$name}));
$form{$name} .= "$value";
}
This is actually a CGI example that takes input from HTML forms. This is not really proper discussion for this forum, so I think we should go back to talking about the other version of a hash, which does fit here.
Paul
Aero,
That is basically what a hash table does. The action is in producing a wide enough variation within the number range of the array of "buckets" so that you get a natural distribution that does not get too many collisions. What I did with the example I posted was to use an array of prime numbers in conjunction with the characters of each word, to try for a fast method of getting the final number within the array for each word.
It has a reasonable distribution, but the next step was to make the collision handling as fast as possible, as you always get collisions. From testing, and in line with most of the hash table designs I researched, fast collision handling allows you to reduce the complexity of the hashing algo to get its speed up.
You can think of a hash table as a one step tree where you have a root node but very many branches.
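As a guess at the general shape of that prime-array idea (the primes and the combining step below are illustrative, not the actual code from the posted example):
static const unsigned primes[] = { 3, 5, 7, 11, 13, 17, 19, 23 };
#define NPRIMES (sizeof primes / sizeof primes[0])
unsigned prime_hash(const char *s, unsigned table_size)
{
    unsigned h = 0, i = 0;
    while (*s) {
        /* each character is weighted by a prime that varies with position */
        h += (unsigned char)*s++ * primes[i++ % NPRIMES];
    }
    return h % table_size;               /* final bucket number within the array */
}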
Hutch,
You are missing his point, which is "are you computing the hash of the string?" I think not; it is a search in ASCII, which is something very different.
Here is an example of a hashed string:
"Hi there" is the same as "9E472F09082BB797429D07545C323836B7D175C0" minus the quotes, of course.
If I am wrong, then I apologize.
Paul
Paul,
It's probably within the range of algo variations; although I am not familiar with the technique you have used, it does look like an interesting one. The crypto guys use the term hash in a similar way, but they use far more complicated hashes because of the application they are working with.
It is Perl's method, and yes, the crypto guys use the same method. Perl needs the method for CGI, as all transmissions to HTML are in the clear, so the data needs to be encoded somehow. This was the result and it has been around for a long time.
Paul
Hutch,
I think I have wasted enough of this topic. I just thought it was interesting, and I knew the method you are using is really a souped-up lookup table and not really a hash, though everyone calls it a hash (pretty cheeky). I will stop posting in this topic now.
Paul
Just for completeness, this is not a hash table:
(http://www.virtualcities.com/ons/me/o/meo8503c.jpg) (but google lists lots of real hash tables.) :bg
Mark,
Is that before or after you have digested it?
Before... else it would be a "gnash" table. :toothy
I don't know why that photo came up when searching google for "hash table"... I can't think of anything that is "hashed" for breakfast. :)
Sorry, could not resist,
We eat hash potatoes with our breakfast, or is that just something we do in New England? I also like Corned Beef Hash with Poached Eggs on top.
Paul
Quote from: pbrennick on May 04, 2005, 12:36:54 PM
We eat hash potatoes with our breakfast, or is that just something we do in New England?
Yeah it is - you're freaks :bg
Tedd,
Yep, we certainly are a strange bunch, even the rest of the US thinks we are weird! I don't care, they are just jealous! :lol
Paul
We do that in "Old" England too.
Aero,
Believe it or not, we still miss the Monarchy and I guess we still pattern our lives accordingly.
Paul
Really? I hate the monarchy and both the two biggest parties.
Aero,
Nobody really appreciates what they have until it is gone. Be happy. You live in a wonderful country with an incredibly rich history.
Paul
Hi!
I was wondering when we will turn back to the original topic: hashing.
Does anyone have an implemented algo that works according to the theory?
I created some raw code, not optimized and surely not bug free, and I could stick it here if anyone is interested.
I'm a bit shy and pessimistic, because up to now it didn't seem to have any significant appeal, since no one responded... :'(
But I still believe in asm and in you people. :U Please tell me it is worth sharing everyone's own efforts!
Greets, gábor
Gabor,
The one I posted is reasonably straightforward; it's just that it does its own scan of the equates file and does the replacements in the source file as well.
Hi Hutch and folks!
Thanks for your code. I would say I mainly see the point of the algo. You applied bucket hashing for strings, and I'd say the hash key provider proc is a special one.
I have some questions:
- Did I understand it right: you sum the char codes multiplied by primes to generate a key?
- The string length is not limited? What if overflow occurs?
- Can it happen that several strings get the same key?
I have suggestions about these questions, but to be sure I would like to have the "official" answers.
BTW, I thought of a hash lib with all the necessary stuff for a skeleton: variables, structures and procedures. I'll post mine right away, after I've added some comments. Later I'll try to utilize it with your key gen function and your application, if you don't mind...
Greets, gábor
Its basic design is pretty straightforward; it is different with the array of primes, as I was looking for a technique that would give a wider range of numbers and was fast enough. You determine the maximum string length by what you allocate in the main array, making sure you have enough room for both source and replacement words, and in production code you would place a limit on the string length based on that array member size.
It has a collision handling method that skips up to the next bucket based on the word length, so it avoids clustering to a reasonable extent this way.
Feel free to design whatever you think will work well. The important thing with this style of hash table is to make the array count a prime number so you cannot get an infinite loop after it wraps around. You should probably set a count limit for the number of items you can put in it, and this would best be determined empirically, by seeing when it starts to slow down as it fills up.
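A minimal sketch of those two safeguards, with arbitrary example numbers:
#define BUCKETS   2003                   /* prime: a fixed probe step cannot cycle early */
#define MAX_ITEMS (BUCKETS * 3 / 4)      /* fill limit; 75% is only a guess, tune empirically */
static unsigned item_count = 0;
/* Check before inserting; refuse new items once the table is getting full. */
int room_left(void)
{
    return item_count < MAX_ITEMS;
}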
Thanks a lot Hutch!
I've already encountered the infinite loop anomaly, and did not take the time to dig deeper for the reasons and solutions. Honestly, I just gave up on creating a complex key generator; instead I used a random generator and a linear method. In a few hours I'll post my code, and you will easily understand what I wanted to say.
I did not mention that I need something like open addressing, the version of hashing with a bucket size of exactly one record. So when the array is full, that's all, no more storing. I need all this stuff to finally complete my LZW encoder/decoder. (But it will come in handy later on.)
Greets, gábor
Hi!
Here is what I promised. Please check it out and add your comments. And also feel free to ask, suggest and modify!
Greets, gábor
[attachment deleted by admin]