News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Hash table

Started by lamer, April 20, 2005, 08:51:25 PM

Previous topic - Next topic

lamer

Is there any way to build a hash table in assembler for key/value pairs, where key is dword and value - string?
Or it is better to work with API such as AddAtom, GetAtomName etc?

James Ladd

If the Atom api's for the trick then use them.
If you look through this site then you will find an excellent HashTable API by Hutch.
It should not be too hard to change the key to hash of a DWORD.
Rgs

gabor

Hi!

Exactly where can this hash api be found?

Greets, gábor

thomasantony

Hi,
  Use two tables. One with keys and the other with the values such that the key for value 1 is at index 1 and so on. You can get the correspoding value for a key using the key's index.

Thomas :U
There are 10 types of people in the world. Those who understand binary and those who don't.


Programmer's Directory. Submit for free

hutch--

This is the last piece I remember doing that did word replacements from an equates file on a source file. It does its hash from an array of primes. It will stop with an error if the equates file had duplicates or other problems and it seems to run reasonably fast.

[attachment deleted by admin]
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

AeroASM

What is a hash table and what are they used for?

hutch--

Aero,

A classic but not exclusive use is with source files where you replace all of the equates with their actual values. It is done from a normal style of include file that is loaded into the hash table then when the source file is scanned at a word by word level, it replaces any of the words found in the list with the replacement value.

Some databases use a much larger and more powerful hash system for indexing keys to data items in the database.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

AeroASM

So it is just a lookup table? Why then is it called a "hash" table?

Quote from: hutch-- on May 02, 2005, 03:37:34 PM
It does its hash from an array of primes.

What does this mean?

mnemonic

Here is a link to the corresponding wikipedia article: http://en.wikipedia.org/wiki/Hashtable
That should answer all of your questions regarding hashtables.
Be kind. Everyone you meet is fighting a hard battle.--Plato
-------
How To Ask Questions The Smart Way

hutch--

Aero,

There are various ways of matching a word to a replacement value and a hash table is only one of them. With very simple stuff you can scan an array for a value, you can create a slightly better version which is a linked list but if you are serious about finding words FAST, you start looking at data structures.

The main players are either hash table layouts or binary tree type structures, the hash table generally being faster where the tree is more flexible. Many of the older asemblers/compilers used hash tables but you generally allocate them first then fill them up so you got the problem of running out of hash space. Better and later stuff tends to use tree structures that can be reallocated on the fly if more memory is needed.

A hash table works by allocating fixed size slots in memory to a known number of slots (generally called buckets) and then for each word added to the hash table, it uses the characters of the word to try and generate a unique number that fits within the range of the table. The basic theory is simple but because characters are base 256 instead of base 10 with numbers, you can never get a perfect hash by this method so at tmes you get two different words trying to use the same location. The term is "collision".

A hash table design MUST handle collisions in a reliable way and there are different techniques to do this. The simplest try the next slot to see if its empty but this leads to a problem called "clustering". A better technique is to use a variable to try another slot and a very convenient one is the length of the word. The hash table must have a wrap around method so you don't run off the top of it and to make sure you don't endless loop, the hash tabel count should always be a prime number.

A hash table if its well designed only slows up as it is getting close to being filled and this means a good hashing algo to get the widest rangle of numbers from unique words and a fast collision handling technique. A tree design never fills up if its writen to be reallocated and while they do get slower as they get bigger, its a linear time increase, not a sudden rise like a hash table before it fils up.

Algo design is the PHUN stuff in assembler and while it can be hard work, it is almost restriction free and open to different architectures that take advantage of speed and size.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

tenkey

Using a hash table with a fixed size array is known as open addressing - see the wikipedia link by mnemonic.

I generally prefer a hash table using linked lists (chaining). You can use whatever allocation functions available to expand the table when it fills the first set of table entries. The number of hash values (buckets) is fixed, but the number of entries is open-ended. The search time is dependent on the average length of lists in each bucket. If you have lots of input data that tend to cluster, you can potentially get better average performance by moving found entries to the head of the list. The hash table with linked list does not have the sudden degradation of open addressing, as you are only searching through entries that have the same hash value.
A programming language is low level when its programs require attention to the irrelevant.
Alan Perlis, Epigram #8

hutch--

Yes,

It is a good distinction of type but chaining at a given bucket location to a linked list or similar is a substantial performance hit and I normaly associate this type of design with languages like java that are so slow it does not matter. An open addressing hash is substantially faster that a linked list hash and from memory a tree structure outperforms the linked list type as well while retaining its basic flexibility advantage.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Mark Jones

"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

James Ladd

A probablistic skip list could also be used to maintain the links in the bucket.
But hey, it really depends on what you want to use the table for.

gabor

Hi!

I've read through the post and seems like Hutch knows a lot about the theory. I think he explained briefly all the important points.
I would like to add some stuff, what I learned and experienced so far

The power of hashing shows itself when an average performance optimum is needed. When the task is to store and retrieve (look up) many records (dword, string, structures), just like in the mentioned situation of symbol tables in compilers.

I too bumped into hashing with my LZW encoder, which also stores symbols. (Actually I found that my linear storing and retrieving is a very brute and primitive solution causing my algo run incredibly slow).
Linear storing

  • best case and worst case are the same: I put the next record at the end of the row. It is done in 1 step. This part is fast...
Linear retriving

  • best case: the first record is looked for, in 1 step it is retrieved
  • worst case: the last record is needed, so all the records must be read through. This is n step, very slow
  • when the record is not stored this can be found out only when the whole list is read through. Again n step, very bad performance

Hashing
Storing and retriving base upon the same idea, a hash function which produce keys for the records.

  • best case: the keys generated by the hash function are always unique, for all records a different key is given. The cost is the steps of creating the key plus 1 step storing the record.
  • worst case: all keys are the same (the probability of this is 0 in practice) the insert takes n steps and the cost of generating the key.

When more records get the same key a collision occures (like Hutch explained) which can be handled in different ways. The primitivest way is the linear way again, this method tries to store the data in the next space next to that of the key, when this neighbor is occupied then the next comes and so on... This can end in a long cluster.
A better solution uses pseudo-random indexes. I learned that the best performance is achieved by using a second hash function, this is double hashing.

I'm about to create my own hash library, I'll post it when finished.



I have a question as well: what do you recommend, how could strings with variable length be used to create a 32bit key? (This could be usefull not only for hashing)


Greets, gábor