
Fastest data structure (generally)

Started by Seb, March 02, 2008, 01:23:08 PM


u

Quote from: Jimg on March 03, 2008, 04:27:24 PM
All this really depends upon how many items there are, and how many times Seb needs to look up a value.  A one time sort of the addresses of a million items would take under a second, and then a binary search each time would probably be very quick and much simpler and quicker to initially set up than a linked list.
Yeah, I forgot to add the most important thing to the list of things one has to know before starting to design: information about the user-input data. Namely: how much data, how often, what types, what operations, how many users in parallel... in short, whatever info describes what the design/app needs to handle.
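
(For reference, the sort-then-binary-search approach Jimg describes might look like this in C. This is a sketch only; Item and its value field are hypothetical stand-ins for the real structure.)

    #include <stdlib.h>

    typedef struct { unsigned int value; /* other members */ } Item;

    /* order two Item pointers by their unique 'value' field */
    static int cmp_item(const void *a, const void *b)
    {
        const Item *x = *(const Item * const *)a;
        const Item *y = *(const Item * const *)b;
        return (x->value > y->value) - (x->value < y->value);
    }

    /* one-time O(n log n) sort of the pointer array:
       qsort(items, count, sizeof(Item *), cmp_item); */

    /* each lookup is then O(log n) */
    Item *find(Item **items, size_t count, unsigned int value)
    {
        Item key;
        Item *keyp = &key;
        Item **hit;
        key.value = value;
        hit = bsearch(&keyp, items, count, sizeof(Item *), cmp_item);
        return hit ? *hit : NULL;
    }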
Please use a smaller graphic in your signature.

hutch--

I would second the suggestion to use a hash table in this context, as I have one I use regularly for data structures like this. The trick is to use a hash design that has a decent distribution across the hash table range and very fast collision handling. The general test for a hash design is at what percentage of table fullness it starts to slow down; the good ones only start to slow down at about 80% full, which is not a problem as they are usually used only up to about 50%.

Here is a PowerBASIC (PB) assembler function that hashes a word against a reversed table of prime numbers.


' «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

FUNCTION szhash_word(ByVal wrd1 as ASCIIZ PTR,ByVal cnt as DWORD) as DWORD

    #REGISTER NONE

    ! jmp nxt

  ' ******************************
  ' reverse table of prime numbers
  ' ******************************
  phtbl:
    ! dd 2939,2927,2917,2909,2903,2897,2887,2879,2861,2857,2851,2843,2837,2833,2819,2803
    ! dd 2801,2797,2791,2789,2777,2767,2753,2749,2741,2731,2729,2719,2713,2711,2707,2699
    ! dd 2693,2689,2687,2683,2677,2671,2663,2659,2657,2647,2633,2621,2617,2609,2593,2591
    ! dd 2579,2557,2551,2549,2543,2539,2531,2521,2503,2477,2473,2467,2459,2447,2441,2437
    ! dd 2423,2417,2411,2399,2393,2389,2383,2381,2377,2371,2357,2351,2347,2341,2339,2333
    ! dd 2311,2309,2297,2293,2287,2281,2273,2269,2267,2251,2243,2239,2237,2221,2213,2207
    ! dd 2203,2179,2161,2153,2143,2141,2137,2131,2129,2113,2111,2099,2089,2087,2083,2081
    ! dd 2069,2063,2053,2039,2029,2027,2017,2011,2003,1999,1997,1993,1987,1979,1973,1951
    ! dd 1949,1933,1931,1913,1907,1901,1889,1879,1877,1873,1871,1867,1861,1847,1831,1823
    ! dd 1811,1801,1789,1787,1783,1777,1759,1753,1747,1741,1733,1723,1721,1709,1699,1697
    ! dd 1693,1669,1667,1663,1657,1637,1627,1621,1619,1613,1609,1607,1601,1597,1583,1579
    ! dd 1571,1567,1559,1553,1549,1543,1531,1523,1511,1499,1493,1489,1487,1483,1481,1471
    ! dd 1459,1453,1451,1447,1439,1433,1429,1427,1423,1409,1399,1381,1373,1367,1361,1327
    ! dd 1321,1319,1307,1303,1301,1297,1291,1289,1283,1279,1277,1259,1249,1237,1231,1229
    ! dd 1223,1217,1213,1201,1193,1187,1181,1171,1163,1153,1151,1129,1123,1117,1109,1103
    ! dd 1097,1093,1091,1087,1069,1063,1061,1051,1049,1039,1033,1031,1021,1019,1013,1009

  nxt:
    ! xor ecx, ecx
    ! xor edi, edi
    ! mov esi, wrd1

    ! jmp inlp

  stlp:
    ! add edi, 1
    ! add esi, 1
    ! mov ebx, phtbl[edi*4]         ; select prime based on character position
    ! mul ebx                       ; multiply character by that prime
    ! add ecx, eax                  ; add result to running total
  inlp:
    ! movzx eax, BYTE PTR [esi]     ; zero extend character
    ! test eax, eax
    ! jnz stlp

  lpout:
    ! mov eax, ecx                  ; square the running total
    ! mul ecx

    ! mul cnt                       ; scale low DWORD of the square by cnt
    ! mov FUNCTION, edx             ; high DWORD = index in range 0 to cnt-1

End FUNCTION

' «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
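
(A note on the last two instructions: instead of dividing by the table size and taking a remainder, the code scales the 32-bit hash into the range 0 to cnt-1 by keeping the high DWORD of a 32x32 multiply. A rough C equivalent of that one step, as a sketch:)

    #include <stdint.h>

    /* map a 32-bit hash into [0, cnt): the high 32 bits of
       hash * cnt equal floor(hash * cnt / 2^32) */
    static uint32_t scale_hash(uint32_t hash, uint32_t cnt)
    {
        return (uint32_t)(((uint64_t)hash * cnt) >> 32);
    }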
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Seb

Thanks for all the replies, guys, it's been interesting reading! After doing some thinking and comparing algorithms on my own, I ended up identifying hash tables as the most efficient and suitable method for my needs.

Am I right in thinking that hash tables are fast enough to suit my needs here? If so, I'm still not sure how to use hash tables with structure pointers, or whether that's even possible; I'll need to look into that if I do use hash tables.
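
(Storing structure pointers in a hash table is indeed possible: the slots hold pointers rather than copies of the structures. A minimal C sketch assuming open addressing with linear probing; ENTRY and the sizes are hypothetical stand-ins for the 48-byte structure described later in the thread:)

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {          /* stand-in for the real 48-byte structure */
        uint64_t value;       /* the unique field used for lookups */
        char     name[24];
        /* ... other members ... */
    } ENTRY;

    #define TABLE_SIZE (1u << 21)        /* power of two, ~2x expected items */
    static ENTRY *table[TABLE_SIZE];     /* slots are pointers, not copies */

    /* insert a pointer; probe forward on collision
       (a real table would mix the key bits first) */
    void ht_insert(ENTRY *e)
    {
        uint32_t i = (uint32_t)e->value & (TABLE_SIZE - 1);
        while (table[i] != NULL)
            i = (i + 1) & (TABLE_SIZE - 1);
        table[i] = e;
    }

    /* returns the stored pointer, or NULL if no item has that value */
    ENTRY *ht_find(uint64_t value)
    {
        uint32_t i = (uint32_t)value & (TABLE_SIZE - 1);
        while (table[i] != NULL) {
            if (table[i]->value == value)
                return table[i];
            i = (i + 1) & (TABLE_SIZE - 1);
        }
        return NULL;
    }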

Regards,
Seb

Rockoon

Hmm. Let's recap the situation.

Ok, you have a single collection of items where each item has a unique 'value' field. You will never delete items from this collection; you will be adding and altering items dynamically, but will never alter the value field of an existing item.

You intend to perform many queries. These queries entail asking "Does an item with this value exist?" and "If true then where is it?"

If the above recap is so far true, then can you detail how often (overall) these queries will fail?

This is important in deciding what data structure is probably "best".

A high rate of "does it exist?" failures means that the initial query's efficiency is significantly more important than anything else. In extreme cases the "where is it?" step could be very, very slow (such as reading items off disk, or rescanning an entire file!) without any noticeable impact on performance.

Also, can you detail the limits of the 'value' field? Are they strings? 32-bit integers? If they are 32-bit integers, then consider that 4 gigabits is "only" 512 megabytes, so a single bit-test instruction could in fact perform your initial "does it exist?" query with perfect recall and zero hashing. It's hard to imagine a faster method for that initial query, but of course it does nothing to help the second query. Statistical details are important.
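
(To make the bit-table idea concrete, a hedged C sketch: one bit per possible 32-bit value, 512 MB total, so it only applies when that much memory is acceptable:)

    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t *bits;   /* 2^32 bits = 2^27 32-bit words = 512 MB */

    int bits_init(void)
    {
        bits = calloc(1u << 27, sizeof(uint32_t));
        return bits != NULL;
    }

    /* record that a value exists */
    void bits_set(uint32_t v)
    {
        bits[v >> 5] |= 1u << (v & 31);
    }

    /* the entire "does it exist?" query: one load, one shift, one mask */
    int bits_test(uint32_t v)
    {
        return (bits[v >> 5] >> (v & 31)) & 1;
    }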

An additional consideration is that the initial "does it exist?" query doesn't have to be 100% accurate. False positives are the mainstay of several hashing schemes (for instance, the Bloom filter); these schemes trade away significant space complexity for an insignificant false-positive rate (when the query returns "true", a slower method kicks in to prove that it's true). If you only used 24 bits of a 32-bit key (a 2 megabyte bit table) but the collection itself only held 1 million items, then false positives could be fairly rare, while that bit table would actually fit in a modern L3 cache, meaning very few cache misses.
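
(And the 24-bit variant, as a sketch: a 2 MB bit table indexed by 24 bits of the key, where a clear bit means "definitely absent" and a set bit means "possibly present, verify with the exact lookup". Keeping the low 24 bits is an assumption here:)

    #include <stdint.h>

    static uint32_t filter[1u << 19];   /* 2^24 bits = 2 MB */

    void filter_add(uint32_t key)
    {
        uint32_t k = key & 0xFFFFFF;    /* keep 24 bits of the key */
        filter[k >> 5] |= 1u << (k & 31);
    }

    /* 0 = definitely not present; 1 = possibly present, so the
       slower exact lookup must confirm (false positives expected) */
    int filter_maybe(uint32_t key)
    {
        uint32_t k = key & 0xFFFFFF;
        return (filter[k >> 5] >> (k & 31)) & 1;
    }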

Details...

How many items will this collection actually hold?
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

Seb

Rockoon,

thanks for the answer.

Just a quick comment on the first part of your answer: I will only be adding elements to the collection/array once, that is, when I'm parsing the file (I could also apply sorting or whatever at that point, in case the algorithm I end up using needs it).

How often will the queries fail, generally speaking? I'd say very often.

The structure members are a mix of QWORDs, DWORDs and strings at the moment (but as I mentioned before, the structure stays below 50 bytes; the current size is 48 bytes). The string member (an array of 24 characters) will not be used to find items in the collection, at least not right now (the implementation might change in the future).

The algorithm/data structure I end up using shouldn't give off a lot of false positives, though (ideally none), or perhaps I misunderstood your point there.

The collection will grow progressively (that is, as I add more and more stuff to the file, which is then parsed into a collection of structures), so it may be 1000 items today and 1 million in the future.

Regards,
Seb

Rockoon

Quote from: Seb on March 14, 2008, 07:07:20 PM
The collection will grow progressively (that is, as I add more and more stuff to the file, which is then parsed into a collection of structures), so it may be 1000 items today and 1 million in the future.

Is 1,000,000 an upper limit? I ask because you currently say the structure is 48 bytes per element, so at a minimum you are looking at 48 megs of memory. You might run into design issues if both of these figures were to triple (then over 400 megs).

If the upper limit is indeed only 48 megs of data, then you could easily shoot for a 128 meg regular hash table with some smaller table leveraged to handle collisions.
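
(The arithmetic behind those figures, spelled out: 1,000,000 items x 48 bytes = 48 MB of data, and tripling both figures gives 3,000,000 x 144 bytes = 432 MB. One reading of a "128 meg regular hash table", which is an assumption here, is 2^25 four-byte slots = 128 MB; a million items would leave such a table about 3% full, far below the load where collisions start to hurt.)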
When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.