News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Hash functions to optimize

Started by gabor, May 20, 2005, 07:59:41 AM

Previous topic - Next topic

gabor

Hello folks!

I was very pleased about the lot of posts and suggestions given to my other optimization topic http://www.masmforum.com/simple/index.php?topic=1467.0.
Now here I have another. It has a connection with the LZW encoder since it is recommended to collect the LZW dictionary entries in a hash (open addressing) structure.


I used a structure:


HASH1_STRUCT    STRUCT

                START       dd ?        ; start of hash array
                rSIZE       dd ?        ; size of one record
                COUNT       dd ?        ; max entry number
                FREE        dd ?        ; number of free slots 
                GETKEY      dd ?        ; pointer of hashing function
                PROBE       dd ?        ; pointer of probe function
                MAXPROBE    dd ?        ; maximum number of probes
               
HASH1_STRUCT    ENDS
[/size]

There are two main functions:


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;
; Stores a data in HASH
;
; IN:     hd - pointer to a HASH structure
;        eax - key of record
;     lpData - PTR to record to store
;
; OUT: ebx - key/index
;
; -------------------------------------------------------------------------
HASH1_store     PROC USES eax ecx edx esi edi hd:DWORD,lpData:DWORD
align 4
                mov     edx,hd
                ASSUME edx:PTR HASH1_STRUCT
                cmp     [edx].FREE,0
                jz  @_4
                xor     ecx,ecx                         ; init number of probes with 0
                mov     esi,[edx].START

                call    DWORD PTR [edx].GETKEY          ; hash function: ebx - key/index
@_1:
;PrintHex ebx,'hashkey'
                push    eax
                push    edx
                mov     eax,[edx].rSIZE                 ; size of record
                add     eax,1                           ; size of key
                shl     eax,2                           ; size in DWORDs
                mul     ebx
                pop     edx
                mov     edi,eax
                pop     eax

                cmp     DWORD PTR [esi+edi],0           ; is the slot empty?
                jz @_2

                add     ecx,1                           ; increase number of probes
                call    DWORD PTR [edx].PROBE           ; hash function2 >> ebx: key/index
                jmp @_1
@_2:
                add     esi,edi
                mov     [esi],eax                       ; store the key
;PrintHex eax,'key'
                mov     ebx,lpData
                push    ecx
                mov     ecx,[edx].rSIZE
                mov     eax,lpData
align 4
@_21:
                mov     eax,[ebx+ecx*4-4]               ; store the...
                mov     [esi+ecx*4],eax                 ; ...record
;PrintHex ecx,'cycle'
;PrintHex eax,'data'
                sub     ecx,1
                jnz @_21
                pop     ecx
                cmp     ecx,[edx].MAXPROBE              ; current number of probes greater?
                jc  @_3
                mov     [edx].MAXPROBE,ecx              ; store new number of probes
@_3:
                sub     [edx].FREE,1
                ASSUME edx:NOTHING
@_4:
                ret

HASH1_store     ENDP


and


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
;
; Searches for data in HASH
;
; IN:       hd - pointer to a HASH structure
;          eax - key of record
;
; OUT:  esi - PTR to found slot
;           - NULL when not found
;
; -------------------------------------------------------------------------
HASH1_search    PROC USES ecx edi hd:DWORD
align 4
                mov     edx,hd
                ASSUME edx:PTR HASH1_STRUCT
                mov     ecx,[edx].MAXPROBE
                call    DWORD PTR [edx].GETKEY          ; hash function: ebx - key/index
align 4
@_2:
                mov     esi,[edx].START
                push    eax
                push    edx
                mov     eax,[edx].rSIZE
                add     eax,1
                shl     eax,2
                mul     ebx
                pop     edx
                add     esi,eax
                pop     eax
;PrintHex esi
                mov     edi,[esi]                       ; read key
                or      edi,edi                         ; is empty?
                jz @_3
                cmp     edi,eax                         ; key found?
                jz @_4
                call    DWORD PTR [edx].PROBE           ; hash function: ebx - key/index
;PrintHex ecx,'probe'
                sub     ecx,1                           ; number of probes
                jnc  @_2
                ASSUME edx:NOTHING
; ----------------
; Empty...
@_3:
                xor     esi,esi
                ret
; ----------------
; Found...
@_4:
                add     esi,4
                ret

HASH1_search    ENDP


Furthermore the two functions hash and probeIt are important, I would really like to have your suggestions on them.
I created two default functions they are in the code as well, please see them in the zip file. There is an init and a destroy function as well.

I thought as first it would be nice to optimize those store and search procedures.

I am looking forward for your posts.

ps.: in the zip there is a test.asm to test the hash with timing. The timing macros are those of MichaelW and probably you will need to change the include path...

Greets, gábor



[attachment deleted by admin]

Mark_Larson


Hey Gabor.  I would first look at different hash algorithm implementations before trying to optimize any.  As a general rule you want to pick hashes and hash table sizes that either greatly reduce collisions or completely get rid of them.  There is a webpage I use a lot that has a comparison of a large number of hash functions.  My favorite is the Zobrist hash.  You might want to take a look see and see, which ones fit your application best.  As a general rule I always use power of 2 hash algorithms because they let you do an "and" to do the Modululs, where Prime based algorithms require a "div" to get the Modulus.

http://burtleburtle.net/bob/hash/doobs.html


he also has a whole bunch of different hash table webpages and source code here.

http://burtleburtle.net/bob/hash/

 
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

gabor

Thank you very much Mark!

These are handy resources indeed. In my posted code I used a random keygen function for the hash, it is quite good. But I must admit, for a specific purpose (like LZW encoding) a specific hash function is more effective. That's why I defined the "hash library" so that it can take user defined hash and probe functions. Again in the post there is a probe function to resolve collisions, now that's a stupid one: simple linear method. In my testings this makes the whole hash store/retrieve very slow.
I wanted another random function, but allways ended up in an endless loop. Any ideas how to avoid this? I suspect this has to do something with moduluses not being relative primes (is this the right expression for numbers which biggest common divider is 1, like 3,4 or 12,35... )

Well I'll read the stuff and create better hash and probe functions. When done, I'll post them.

Greets, gábor


QvasiModo

Hi :)

The hash function should produce evenly distributed outputs, and the probe should let you access anywhere in the hash table. Using the modulus and a prime for the hash table size is probably the simplest method to archieve this (but not the only one of course).

In any case separate chaining is much better... (in case you don't know what this is, it means not to use probes at all, instead make each element in the array a pointer to the head node of a linked list). Surprisingly that simple approach works fast enough for most applications, and imposes no size limits to your table.