Need help with Hash

Started by mineiro, November 21, 2011, 03:36:53 PM


mineiro

Hello Sirs, sorry if this is not the right place to post.
I need some help with hashing.
Can anybody suggest something like a "one byte hash"? It can be a bit longer if necessary, but I'm counting bits, so if possible less than 16 bits.
About hashing, I have some questions:
What is the minimum data size needed to generate a hash? I'm dealing with the ASCII table, so what is the smallest hash that can be generated without ambiguity?
I have found some 1-byte hashes on the net, but their results can be ambiguous.
Any comments or ideas are very welcome.
PS: I have been out of school for a long time, so I cannot follow formulas; if you can express things using logic or assembly, it is easier for me to understand.
I talked with a friend, and he suggested taking the last digit of the MD5 result. Would that work? Yes, no, why? Sorry about my English.
Thank you so much, have a nice day.

clive

I don't think any of these methods guarantees to generate a "unique" code, just that the code will be consistent for the same data. Good hashes will change drastically with marginal differences in the input. It would be quite difficult to find an MD5 collision, and even harder to find one where the colliding text makes any sense.

To stand any chance of being unique, the hash bit length would need to exceed the bit length of the data being hashed.

You could use a check byte, computed with something like CRC-7 or CRC-8, or, as suggested, a truncated portion of MD5 or SHA.
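For illustration, a minimal CRC-8 sketch in C (assuming the plain CRC-8 variant with polynomial 0x07, initial value 0, no reflection; pick whatever variant suits your data):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0x00. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07) : (uint8_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    const uint8_t msg[] = "123456789";
    printf("CRC-8 = %02X\n", crc8(msg, 9));   /* prints F4 for this standard check string */
    return 0;
}

A table-driven version is faster, but the bitwise loop keeps the idea visible.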
It could be a random act of randomness. Those happen a lot as well.

Tedd

You simply can't guarantee a unique value for each set of data when you only have 256 possibilities (unless there can only ever be 256 or fewer sets of unique data).
If it's only for simple error checking, I'd go for a zero-checksum or CRC8. If you want a good check for uniqueness, you're going to need to sacrifice more bits.
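For reference, a minimal sketch of the zero-checksum idea in C (assuming the usual convention that the check byte is chosen so the whole block, check byte included, sums to zero mod 256):

#include <stdint.h>
#include <stddef.h>

/* Check byte chosen so that the sum of all bytes, including it, is 0 mod 256. */
static uint8_t zero_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return (uint8_t)(0u - sum);      /* append this byte; re-summing then yields 0 */
}

To verify a block, sum every byte including the stored check byte and test for zero; it catches many simple corruptions but, unlike a CRC, not reordered bytes.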

If you can tell us how you'll use it, we might have better ideas :wink
No snowflake in an avalanche feels responsible.

mineiro

Quote: To stand any chance of being unique, the hash bit length would need to exceed the bit length of the data being hashed.
Quote: If you can tell us how you'll use it, we might have better ideas
So, in this case, is it best to use cryptography? I'm just supposing, because a crypto algorithm generates more data than I'm dealing with, and the consequence is unique values. Is that it?
I'm working with sequences of 8 bytes at a time (not absolutely true), so, in your opinion, avoiding crypto, would CRC-8 generate fewer collisions for this scheme? And if I mix different schemes, something like two consecutive calculations, say CRC-7 and CRC-8, and in the final pass add or XOR both, do you think I could get fewer collisions that way?
And, about crypto, is there some algorithm that can turn these 8 bytes into a 10-byte result?

Oh, I'm working on a compression program; that is why I need to avoid collisions (ambiguity). And, to get any gain, I need an algorithm that can represent 1 byte with fewer than 4 bits. My first thought was to add all the values together, because with that I can (ignoring the first byte) represent each byte with one bit while doing the addition.
If I use MD5 or SHA, I know it can be done, but in that case time will be a problem. If the algorithm needs a big sequence of bytes, the calculation grows, and so does the time.

Appreciated, Sir Clive and Sir Tedd, thank you.

clive

I've built several compression engines that use hash tables to improve the speed at finding potential matches, they all use a chaining algorithm because collisions will occur, and because you don't know the length of the best match before you start looking.

The bit size of your hash will then control the size of the table. Ideally this would correspond to your minimal match length, i.e. 16 bits for two characters, or 14 bits if you get clever with ASCII. Now, if you use the first two bytes as the hash, you won't need to compare them again as you search for better matches, and if you have a hit in the hash table, you're guaranteed to have at least a 2-byte match.
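As a rough illustration of that layout (a simplified sketch, not the actual engine code; the sizes and the names insert_pos, best_match and WINDOW_SIZE are arbitrary assumptions), a head/prev chained table indexed by the first two bytes could look like this in C:

#include <stdint.h>
#include <string.h>

#define HASH_BITS   16
#define HASH_SIZE   (1 << HASH_BITS)
#define WINDOW_SIZE 65536                 /* assumed sliding-window size */

static int32_t head[HASH_SIZE];           /* most recent position for each 2-byte hash */
static int32_t prev[WINDOW_SIZE];         /* previous position with the same hash */

/* The hash is simply the two bytes themselves, so a table hit
   guarantees at least a 2-byte match. */
static uint32_t hash2(const uint8_t *p) { return (uint32_t)p[0] | ((uint32_t)p[1] << 8); }

static void init_matcher(void) { memset(head, 0xFF, sizeof head); }  /* -1 = empty chain */

static void insert_pos(const uint8_t *buf, int32_t pos)
{
    uint32_t h = hash2(buf + pos);
    prev[pos & (WINDOW_SIZE - 1)] = head[h];
    head[h] = pos;
}

/* Walk the chain and return the position of the longest match, or -1. */
static int32_t best_match(const uint8_t *buf, int32_t pos, int32_t end, int32_t *best_len)
{
    int32_t cand = head[hash2(buf + pos)], best = -1;
    *best_len = 0;
    while (cand >= 0 && cand > pos - WINDOW_SIZE) {
        int32_t len = 2;                  /* the two hashed bytes already match */
        while (pos + len < end && buf[cand + len] == buf[pos + len])
            len++;
        if (len > *best_len) { *best_len = len; best = cand; }
        cand = prev[cand & (WINDOW_SIZE - 1)];
    }
    return best;
}

insert_pos() is called for every input position as you slide forward, and best_match() only compares bytes from offset 2 onward because the hash already guarantees the first two.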

You cannot for example use a 16-byte MD5 signature to represent a unique block of 1600 bytes, and achieve 100:1 compression.

If you want to compress a list of possible values into the least number of bits, you might want to look at Shannon-Fano and Huffman coding.

Edit: spelling
It could be a random act of randomness. Those happen a lot as well.

mineiro

I have read about Claude Shannon (the entropy law), and many strange trees too, and I think more than 10 different compression schemes, from RLE through LZ and its implementations, prediction-based ones ... and then arithmetic coding. After this, I tried to implement some ternary logic (I think Kleene was the creator), but no luck.
I know that it is impossible to compress more than the entropy law permits, but, looking at it as a novice, it also seems to me that all files can be compressed (it is difficult to imagine a file that cannot be compressed), because if you sit in front of that file you can reduce it with some logic, but that way takes "all the time in the world".
If I use, let's suppose, MD5 with some strange compression logic while compressing, then I could use a brute-force approach to decompress, but that is nonsense, and brings back the "takes all the time" question.
Again, thank you for your reply, Sir, be in peace.


Tedd

It's a nice thought that it should be possible to compress any data, even if only by 1 bit - simply keep applying the compression until you compress the entire data to 1 bit. It's even possible that's true. The problem is that it's not possible to decompress it unambiguously again.
So, you're limited to compression schemes that leave enough data remaining for you to recover the original data unambiguously again.
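To spell the counting step out: there are 2^N distinct files of N bits, but only 2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1 files strictly shorter than N bits, so any scheme that shortens every input must map two different inputs to the same output, and one of them can then no longer be recovered.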

You could possibly 'compress' fixed-size blocks by taking both the SHA and MD5 hash values, and storing just those. But again, the problem is re-obtaining the original data, and that would likely require a full search of both value-spaces, which would obviously take more time than is practical. Also, you'd probably find that the required block size to ensure success (a unique decoding) would be smaller than the 'compressed' data.

Anyway, 'better than entropy' compression can be obtained (though not always guaranteed) using arithmetic encoding, and most other compression methods will also manage it in most cases.


A note: unless there is a bias in the values of the data you're trying to compress, many of these schemes won't help you very much, so you'll have to go with the more common general compression schemes, such as LZW. (Binary data has little bias because all byte values are almost equally likely, but language text has a good bias because some values are much more likely than others.) So it also depends what data you're trying to compress.
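To put a rough number on 'bias', here is a quick C sketch (only an illustration, the names are arbitrary) that estimates the zero-order entropy of a buffer in bits per byte; plain English text usually lands well below 8, while random or already-compressed data sits very close to 8:

#include <math.h>
#include <stdio.h>
#include <stddef.h>

/* Zero-order (byte-frequency) entropy estimate, in bits per byte. */
static double entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    size_t count[256] = {0};
    for (size_t i = 0; i < len; i++)
        count[buf[i]]++;

    double h = 0.0;
    for (int v = 0; v < 256; v++) {
        if (count[v] == 0) continue;
        double p = (double)count[v] / (double)len;
        h -= p * log2(p);
    }
    return h;   /* len * h / 8 is a rough lower bound on the compressed size in bytes */
}

int main(void)
{
    const unsigned char sample[] = "this is some plain english text, with plenty of bias";
    printf("%.3f bits/byte\n", entropy_bits_per_byte(sample, sizeof sample - 1));
    return 0;
}

(Link with -lm; this only measures byte-value bias, not repeated phrases, which is why LZ-style schemes can still beat it on text.)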
No snowflake in an avalanche feels responsible.

mineiro

Thank you for the link, Sir jj2007, that is useful for creating some dictionary approach.
Sir Tedd, you have described exactly the compression problems, progression and recursion (e.g. we cannot compress the same file again and again).
Quote: So it also depends what data you're trying to compress.
I get the point of your explanations, so I think I need to explain what I have in my head.
After watching some kids playing with LEGO, I came up with the compression logic that follows.
The intention was to compress an entropic (already high-entropy) file, recursively.

Some entropic bytes follow:
D0 5A FE 49 ...
1101 0000 0101 1010 1111 1110 0100 1001
Now the LEGO part: take each nibble and try to overlap it with the end of the output by 3 bits; if that is not possible, try 2; if not, 1; if not, just put it at the right side. Then get the next nibble.

11010000
       0101
        1010
            1111
             1110
                0100
                 1001

Final result (compressed):
110100001010111101001
*   *  **   **  **

So, the original has 32 bits and the compressed result has 21 bits. Note that every nibble of the source appears in the final result (the asterisks below the sequence mark where each one starts). And I can split that compressed sequence into a new group of nibbles and do the compression again (not in this example, but in my tests, yes).
There are nibbles that do not glue onto each other, but when I shift some previous one, that changes everything...
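If I describe the LEGO rule correctly (overlap the new nibble with the tail of the output by 3, then 2, then 1 bits, otherwise append it whole), a small C sketch of just the packing step, using '0'/'1' characters for clarity, reproduces the example above; the names are placeholders:

#include <stdio.h>
#include <string.h>

/* Append one 4-bit nibble (as '0'/'1' chars) to the output string,
   overlapping it with the current tail by 3, 2 or 1 bits if possible. */
static void glue_nibble(char *out, const char *nib)
{
    size_t len = strlen(out);
    for (int ov = 3; ov >= 1; ov--) {
        if (len >= (size_t)ov && memcmp(out + len - ov, nib, ov) == 0) {
            strcat(out, nib + ov);          /* keep only the non-overlapping bits */
            return;
        }
    }
    strcat(out, nib);                       /* no overlap: append the whole nibble */
}

int main(void)
{
    /* Nibbles of D0 5A FE 49, as in the example. */
    const char *nibs[] = { "1101", "0000", "0101", "1010",
                           "1111", "1110", "0100", "1001" };
    char out[64] = "";
    for (int i = 0; i < 8; i++)
        glue_nibble(out, nibs[i]);
    printf("%s (%zu bits)\n", out, strlen(out));   /* 110100001010111101001 (21 bits) */
    return 0;
}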

Now, if I take an MD5 of the whole file, I think it can be done while avoiding ambiguity.
Thank you for your time and patience with me.

Tedd

Interesting idea, but you still have the problem of unambiguously decoding the compressed data.

Given those 21 bits, there are numerous (even infinite, though you'd store the original filesize) possible correct decodings.
As I understand it, you're hoping to store the MD5 hash of the whole file, and then enumerate every possible decoding until you find one with the matching hash. But, even for files of only a few kB, this isn't going to complete in any decent amount of time. Compression will be quick, decompression will take.. weeks.. months.. Also, even if you found a decoding that matched, it's not assured this was the correct one - MD5 (or any other hashing method) can have collisions, and that chance increases with the amount of data you're hashing.
You could split the file into blocks of a 'good' size and hash each of those, allowing a decoding in at least a reasonable amount of time, but obviously that reduces the compression achieved considerably, and the blocks would most likely be too small to make it worthwhile.
It may be better to indicate the offsets for the start of each nibble (as you did in the example, with '*') and find a way to store those efficiently.
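For what it's worth, a sketch of that last direction (only an illustration, the names are made up): if, alongside the packed bit string, you keep a mask marking where each nibble starts (the '*' row from the example), recovery is unambiguous; the catch is that a plain mask already costs one bit per packed bit, so it would need to be stored much more cleverly to gain anything.

#include <stdio.h>
#include <string.h>

/* Recover the original nibbles from the packed string plus a start mask
   ('*' marks the first bit of each nibble, as in the example above). */
static void unpack(const char *packed, const char *mask)
{
    size_t n = strlen(packed);
    for (size_t i = 0; i < n; i++) {
        if (mask[i] == '*')
            printf("%.4s ", packed + i);    /* each nibble is 4 consecutive bits */
    }
    printf("\n");
}

int main(void)
{
    const char *packed = "110100001010111101001";
    const char *mask   = "*   *  **   **  **   ";  /* padded to the same length */
    unpack(packed, mask);   /* prints: 1101 0000 0101 1010 1111 1110 0100 1001 */
    return 0;
}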
No snowflake in an avalanche feels responsible.

jj2007

Quote from: mineiro on November 22, 2011, 02:07:44 AM
Thank you for the link, Sir jj2007, that is useful for creating some dictionary approach.

I realise you know much more about the technical bits :bg

> The first 100 make up about one-half of all written material
That means you could encode 50% of the words in a text with a single byte each, without even using the sign bit yet, and keep the dictionary outside of the archive. I wonder how much compression one could achieve on assembler source code ;-)
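As a toy sketch of that idea (the word list and token values below are placeholders, not a real frequency dictionary): map each of the most common words to a single byte token above 0x7F, so plain ASCII passes through untouched and the table never has to travel with the archive.

#include <stdio.h>
#include <string.h>

/* Tiny stand-in dictionary; a real one would hold the ~100 most common words. */
static const char *dict[] = { "the", "of", "and", "to", "in" };
#define DICT_SIZE (sizeof dict / sizeof dict[0])
#define TOKEN_BASE 0x80              /* tokens live above plain 7-bit ASCII */

/* Encode one whitespace-separated word: emit a 1-byte token if it is in the
   dictionary, otherwise emit the word literally. */
static void encode_word(const char *word, FILE *out)
{
    for (size_t i = 0; i < DICT_SIZE; i++) {
        if (strcmp(word, dict[i]) == 0) {
            fputc(TOKEN_BASE + (int)i, out);
            return;
        }
    }
    fputs(word, out);
}

int main(void)
{
    const char *words[] = { "the", "art", "of", "assembly" };
    for (int i = 0; i < 4; i++) {
        encode_word(words[i], stdout);
        fputc(' ', stdout);
    }
    fputc('\n', stdout);
    return 0;
}

Using the high bit to flag tokens is just one possible layout; keeping tokens in the 7-bit range and reserving the sign bit, as hinted above, would be another.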

Just made a test with WinZip vs FreeArc 0.666 on a bunch of small assembler files:
Compression
57.9% WinZip (max portable)
82.6% FreeArc (-mx setting)


Here is an overview of current compressors. FreeArc is really good, there are some that compress better but at the expense of horribly long encoding times. For example, a 300 MB corpus takes 83.9 seconds to compress and 33.5 seconds to decompress with FreeArc, 75.22% compression ratio. To get 80% compression (PAQ8), decompression time jumps to 22019 seconds, i.e. over six hours ::)

mineiro

I visited that link, Sir jj2007, very useful comparisons. In the past, in BBS times, I did not have compression programs, so I used .gif files (LZ..., lossless) to compress my files, but manually. PAQ (arithmetic) is what made me study more.
You get the idea, Sir Tedd, and if I store their positions (*), then the compression needs to save more than half of the file.
What I have figured out is that, in the compressed sequence, if I find 4 identical consecutive bits (0 or F), they are in their correct position in relation to the others and they appear inside the source file... But I cannot see how this can be useful. In some tests, if I ignore both of those nibbles and store the positions of the others, the compressed data grows bigger than the original.
My other approach was to create a dictionary of division numbers to store their positions, so the table has the numbers 6 and 7, and if I divide I get a long number, 6/7 = 0.857142..., but my skills with mathematics are not good.
Thank you, have a nice day.

clive

Quote from: Tedd
It's a nice thought that it should be possible to compress any data, even if only by 1 bit - simply keep applying the compression until you compress the entire data to 1 bit. It's even possible that's true. The problem is that it's not possible to decompress it unambiguously again.

I remember seeing an article on this in The Journal of Irreproducible Results. http://www.jir.com/index.html
It could be a random act of randomness. Those happen a lot as well.

mineiro

I'm convinced now that a hash cannot offer a solution to this problem. Thank you for all the answers, I'm abandoning this idea.
From Clive's links:
Quote: Programming in visual basic, the author was able to spiff up Infinite Loop v23.6 with a progress bar at the bottom of the screen
:toothy

Farabi

Oh, I misunderstood; I thought this was about a hash table for a parser. So hash here means an MD5 hash, right?
Those who have universal knowledge can control the world with a microprocessor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"