100 to 1 Compressor

Started by Sergiu FUNIERU, February 23, 2010, 11:15:06 PM


Sergiu FUNIERU

Quote from: Slugsnack on February 25, 2010, 08:50:03 PM
Just to confirm, when you say 'any file'. Do you mean you could continually compress a given file that had already been previously compressed with your algo?
No, that's why I mentioned in my first post "no matter what OTHER algorithm was used to compress it before."

It's like trying to zip a zip file. I was able to 7z a particular rar file (with negligible compression), but I wasn't able to 7z a 7z file or, for that matter, to recompress any compressed file using the same algorithm.

MichaelW

The attachment uses the cryptographic service provider to generate a 100000-byte sequence of random numbers, and zlib to test the compressibility (internally). Per the Microsoft documentation for the CryptGenRandom function:
Quote
The data produced by this function is cryptographically random. It is far more random than the data generated by the typical random number generator such as the one shipped with your C compiler.

Two sequences are generated, first the one generated by CryptGenRandom, and then a modified version of that sequence with every eighth byte set to zero. The zlib compressibility results are determined internally, and each of the sequences is stored in a file so it can be tested externally. In my tests of zlib, winzip, 7-zip, and winrar, for goodrand.bin they all produced an archive that was larger than goodrand.bin.
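
For anyone who wants to reproduce that kind of check, here is a minimal C sketch along the same lines (an illustration, not the attached program; it assumes zlib is available and links against Advapi32.lib):

```c
/* Sketch: generate CSP random bytes, zero every eighth byte in a copy,
   and compare the zlib-compressed size of each buffer.
   Not the attached program; assumes zlib, link with Advapi32.lib. */
#include <windows.h>
#include <wincrypt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "zlib.h"

#define LEN 100000

/* Return the zlib-compressed size of buf, or 0 on failure. */
static uLong zsize(const unsigned char *buf, uLong len)
{
    uLongf destLen = compressBound(len);
    unsigned char *dest = malloc(destLen);
    if (dest == NULL) return 0;
    if (compress2(dest, &destLen, buf, len, Z_BEST_COMPRESSION) != Z_OK)
        destLen = 0;
    free(dest);
    return destLen;
}

int main(void)
{
    static unsigned char good[LEN], modified[LEN];
    HCRYPTPROV hProv;
    DWORD i;

    if (!CryptAcquireContext(&hProv, NULL, NULL, PROV_RSA_FULL,
                             CRYPT_VERIFYCONTEXT))
        return 1;
    CryptGenRandom(hProv, LEN, good);           /* CSP random bytes      */
    CryptReleaseContext(hProv, 0);

    memcpy(modified, good, LEN);
    for (i = 7; i < LEN; i += 8)                /* every eighth byte = 0 */
        modified[i] = 0;

    printf("goodrand : %d -> %lu bytes\n", LEN, (unsigned long)zsize(good, LEN));
    printf("modified : %d -> %lu bytes\n", LEN, (unsigned long)zsize(modified, LEN));
    return 0;
}
```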

dedndave

Quote
In my tests of zlib, winzip, 7-zip, and winrar, for goodrand.bin they all produced an archive that was larger than goodrand.bin.

Probably as good a test for randomness as there is   :P
I would take that over the chi-squared value any day.
Although, with every 8th byte a zero, if the compression algo knew that was there, it might be able to take advantage of it.

Sergiu FUNIERU

Quote from: Gunner on February 24, 2010, 11:03:19 PM
If you can do what you say you can do, I am sure I can get some backers and buy it from you for a couple of mil!  Would make that back in no time!
Okay, how do we do that?

I ran my first test on some files. My algorithm was able to compress them up to 120 times better than 7z (which was better than zip). Some files were compressed only 20 times better than 7z, but there is a lot of room for improvement. I simply lack the skills to take my program that far.

I contacted Microsoft and they told me, and I quote, "We are not interested in new ideas at this time, but you can call another time if you wish".

For years, the idea that something heavier than air might fly left people choking with indignation. Please put the skepticism aside for a moment and tell me if you can help me present my idea to the right investors. Needless to say, you'll have your cut when all this is over.

I'm also very interested in publishing a paper that explains the algorithm, at least the basics of it. I don't know how I can get the investor to accept this if (s)he is going to patent my idea.

I need an investor with an open mind. Someone who will not reject my idea without listening first, just because all the books he has read told him it is not possible. It is definitely possible, and there is a very logical reason why.

I also need a programmer to implement the full algorithm. It might take me a while to do it myself.

In my opinion, the idea of how to compress an already compressed file is worth all the money. The implementation part is just a detail.

oex

Did you test it on the random data in this post, or indeed compress a photo to smaller than its JPEG?

I can compress a row of 65535 0's to 3 bytes (21845 to 1), but that is rather irrelevant. It is very easy to create data sets to test on that compress well but are unreflective of real-world scenarios.... Other good examples are the counting numbers (1,2,3,4,5,6,7,8,9,10 etc), even numbers, odd numbers, numbers generated from formulae. The problem is that by the time you have put all this information in the header, either the data does not reflect real-world scenarios or the header is too big.
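
As a concrete illustration of that point, a naive run-length encoder (a sketch, not oex's code) stores each run as a 16-bit count plus the repeated byte, so 65535 zeros collapse to 3 bytes while run-free data roughly triples in size:

```c
/* Minimal RLE sketch: each run becomes a (count, value) pair with a
   16-bit count, so 65535 zero bytes encode to 3 bytes, while data with
   no runs roughly triples in size. Illustration only. */
#include <stdio.h>
#include <string.h>

static size_t rle_encode(const unsigned char *src, size_t len,
                         unsigned char *dst)
{
    size_t i = 0, out = 0;
    while (i < len) {
        unsigned int run = 1;
        while (i + run < len && src[i + run] == src[i] && run < 65535)
            run++;
        dst[out++] = (unsigned char)(run >> 8);   /* count, high byte   */
        dst[out++] = (unsigned char)(run & 0xFF); /* count, low byte    */
        dst[out++] = src[i];                      /* the repeated value */
        i += run;
    }
    return out;
}

int main(void)
{
    static unsigned char zeros[65535];            /* a row of zeros     */
    static unsigned char out[3 * 65535];          /* worst-case output  */
    memset(zeros, 0, sizeof(zeros));
    printf("65535 zeros -> %zu bytes\n",
           rle_encode(zeros, sizeof(zeros), out)); /* prints 3 */
    return 0;
}
```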

As far as I'm aware people publish papers on proof of concept not just concept.

I don't wish to deride your idea; implementation is the only detail that matters. Bill Gates didn't say "I have an idea called DOS", he wrote it. If he hadn't, someone else would have made his billions. Youtube was bought by google after it had been running for 3 months. If you don't know how to write it, the algorithm must not be complete, in which case you have no proof of concept. ASM is basically math, so if you have a completed algorithm it shouldn't take more than a few weeks or months to implement it, whatever your current skill level in ASM.

Anyway, I wish you the best with your project, Sergiu, or with finding a backer.

dedndave

You compressed it, but did you expand it and compare the result to the original?

As for working with information prior to obtaining a patent, there is something called a "non-disclosure agreement".
It protects the idea or invention during discussion and development.
You would want any potential investor to sign such an agreement before divulging the details.
You would also want one signed by a programmer that works on it.

You can also get a certain amount of protection by describing the idea in detail on paper.
Get someone who is technically knowledgeable to read it.
When they are done, get them to sign and date the paper with the words "Read and Understood by...(signature/date)"

MichaelW

Sergiu,

I think you need to spend some time researching the science behind data compression. Over the last 60-70 years a great deal of effort has been put into developing, and applying, this science. For your claim to be possible there would have to be some very large errors in this science, errors that no one else has uncovered.

And one detail that has bothered me from the start, why 100:1? Why not a more believable 20:1, or 1000:1, or ??  How did you arrive at this value?

hutch--

Sergiu,

There is nothing wrong with good ideas, but they must be good ideas. Except in extremely rare cases, like massively long runs of identical data where you can run algorithms like RLE, nothing comes close, and it is always data dependent. Simply generate an encryption-standard random string that passes the range of tests in a tool like ENT and you are missing the redundancy necessary to compress the data. Without some form of repeatable byte sequences it simply cannot be done. Even at the bit level, where some forms of compression algorithms utilise byte-frequency priorities by writing the byte data in fewer than 8 bits, you can only handle a very limited number of cases before the mapping overhead to put everything back into place is larger than the data it is compressing.
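
One way to see the missing redundancy hutch describes is to estimate the zero-order byte entropy of a buffer, which is essentially the figure a tool like ENT reports. The sketch below (an illustration, not ENT) prints bits per byte; anything close to 8.0 leaves a general-purpose compressor nothing to work with:

```c
/* Sketch of a zero-order entropy estimate (bits per byte), the same
   kind of figure ENT reports. Near 8.0 means essentially no byte-level
   redundancy left for a general-purpose compressor to exploit. */
#include <math.h>
#include <stdio.h>
#include <string.h>

static double entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    double counts[256] = {0.0}, h = 0.0;
    size_t i;
    for (i = 0; i < len; i++)
        counts[buf[i]] += 1.0;               /* byte frequency histogram */
    for (i = 0; i < 256; i++) {
        if (counts[i] > 0.0) {
            double p = counts[i] / (double)len;
            h -= p * log2(p);                /* Shannon entropy term     */
        }
    }
    return h;                                /* range 0.0 .. 8.0         */
}

int main(void)
{
    unsigned char text[] = "aaaaaaaabbbbccd";   /* skewed distribution   */
    printf("entropy: %.3f bits/byte\n",
           entropy_bits_per_byte(text, strlen((const char *)text)));
    return 0;
}
```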

Sergiu FUNIERU

Quote from: MichaelW on March 02, 2010, 02:31:27 AM
I think you need to spend some time researching the science behind data compression. Over the last 60-70 years a great deal of effort has been put into developing, and applying, this science.
I agree that studying what others have done is always the first thing to do. And it was the first thing I did.


Quote from: MichaelW on March 02, 2010, 02:31:27 AM
For your claim to be possible there would have to be some very large errors in this science, errors that no one else has uncovered.
I think that many people took for granted the algorithms they learned in school and tried to optimize them further, without considering that there might be other, better algorithms. I'm sure some people have thought of other, unconventional methods to compress the data. Who knows, maybe as we speak, someone else is writing a new algorithm, totally different from the ones we use today? Until that person reveals his algorithm, all we have are the classic ones.

I studied compression algorithms myself, when I was in college. We had one year dedicated to this subject. But not once did I hear the teacher say there might be better ways to compress. He presented the facts as if they were the best ways to do compression. I agree, they were the best ways known at the time.

It seems to me that you trust some opinions just because they were accepted by a large majority. However, history shows that it's possible even for large masses of people to be wrong. Before the theory of relativity, everyone believed in Newtonian physics, even the very educated people. Why? Maybe because they had no reason to doubt it until someone proved otherwise.

It might take me months, or even years, to write my program the way I want. I agree with everyone here that a proof is more spectacular, and I'm working on it.

The worst thing that happened to me in school was that I started to defend the classic theories. For a while, I ceased to ask "What if?". Nobody has proved that a zip-compressed file can't be compressed any further. I'm curious whether, when my program is ready, someone will actually be willing to try it, or whether everybody will say "It isn't worth trying, because what the program claims to do is not possible." And I'm not talking here about the idea "Want more space on your hard drive? Delete Windows!". I'm talking about a different way to compress.

"It's a known fact!", "Other people, smarter than you, didn't get better results", "If were possible, somebody else would have came with the idea until now" - I wish I got $1 each time I heard these.

I don't want to sound like Yoda, but I strongly believe that a person must let go of the idea "it's not possible" before actually writing a better compression algorithm.

Quote from: MichaelW on March 02, 2010, 02:31:27 AM
And one detail that has bothered me from the start, why 100:1? Why not a more believable 20:1, or 1000:1, or ??  How did you arrive at this value?
100:1 was a compromise between size and speed. I chose 100 as a nice round number. If you bought a compression program that promised you 100:1 compression and delivered, for some files, even 115:1 compression, you wouldn't be upset, would you?

oex

Quote from: Sergiu Funieru on March 02, 2010, 03:13:32 PM
someone will actually be willing to try it, or whether everybody will say "It isn't worth trying, because what the program claims to do is not possible."

Sergiu,

You keep putting down claim after claim, like the boy who cried wolf, with absolutely no evidence. People will quickly lose interest in your opinion if you claim to be able to do something or to know something but never deliver. Clever people with something to offer are generally far more reserved about their work until they have hard proof of concept or backing for their idea. At this stage your idea consists of a completely unsubstantiated brag concerning a problem that some of the best minds in the world would be unable to crack.

My advice would be not to brag but to actually *do the work*, then find a backer; then you won't need to brag, because you will be able to brush shoulders with people far nearer your intellect level. Sorry for my outburst, I have a short temper when people go on and on about something hypothetical with no substance :lol but we are rather straying from the purpose of this forum, in this thread, down to the level of a childish blog.

Proof of concept is all; people haven't paid for concepts since the .com bubble, and professionals on this forum get paid little enough for *real* work :lol

:'( And it's only Tuesday

Sergiu FUNIERU

Quote from: oex on March 02, 2010, 04:50:23 PM
My advice would be not to brag
If you go back to the original question, I simply wanted to know if people would be interested in such a compressor. I asked that question because of the answer I received from the creator of a certain commercial compressor.

I'm sorry if you believe that I'm bragging. I was only trying to explain why I'm so confident about my algorithm.

Maybe I need a childish point of view to actually believe that I can do what others refuse to believe is possible.

To stop offending more people with my claims, I'll reply to this thread again when my program is done.

MichaelW

Quote
100:1 was a compromise between size and speed. I chose 100 as a nice round number. If you bought a compression program that promised you 100:1 compression and delivered, for some files, even 115:1 compression, you wouldn't be upset, would you?

So there was no science involved in your selection? You simply selected a ratio that you thought would interest or impress people? And you're surprised that no one seems to be interested in your idea?

clive

Quote from: Sergiu Funieru on March 01, 2010, 04:23:56 PM
For years, the idea that something heavier than air might fly left people choking with indignation

Last I checked, birds, arrows and cannon balls were all heavier than air; there were even dinosaurs which could fly and/or glide. So I think this is a poor example of a situation where evidence of something being possible was not available.

What you are proposing looks more like Alchemy, or frankly snake-oil, to those of us who have actually looked at the science and experimented with such things.

It is unsurprising that people selling existing technologies are uninterested, for they view your claims with significant skepticism. I'm sure they've heard from dozens of people claiming similar magic, who have been equally unable to deliver. If you could even demonstrate 2:1 compression and decompression on some of the example data provided on this thread, it would be outstanding.

No doubt there is some data which can be expressed mathematically as an equation or function, but being able to derive these from a random stream of data would appear to be a herculean task.

Heck, generating an MD5 sum of a 4K block would get you an apparent 256:1 compression, although it would take you a while to find all the possible collisions, and you'd want to cache all your examples.

-Clive
It could be a random act of randomness. Those happen a lot as well.

Neo

Quote from: Sergiu Funieru on February 24, 2010, 12:00:06 AM
Quote from: dedndave on February 23, 2010, 11:55:37 PM
100 to 1 isn't very likely - not without data loss
of course, it depends on what the file has in it
but i seriously doubt any routine can maintain that on average
Not on average - EVERY time. Without any data loss. The algorithm doesn't contradict Shannon entropy.
I can't believe I'm even responding to this, as this is like responding to someone who's claimed to have created an algorithm that can always correctly decide whether a program halts in finite time, or someone who claims to have a perpetual motion machine, but I sincerely hope you're just trolling.  To humour you, in case you're not trolling, please read the following simple explanation of why this is impossible.

Suppose you want to encode bit sequences of at most n bits (for simplicity; n can be chosen arbitrarily large), and suppose that the indication of the length doesn't take up any extra space (also for simplicity; it can be done in log n extra space if you really want). There are 2^(n+1) - 1 unique bit sequences of length n or less. In order to decompress all of those sequences correctly, each of them must have a unique compressed sequence, meaning that there must be exactly 2^(n+1) - 1 unique compressed sequences. Encoding that many unique sequences requires codes of at least n bits. Therefore, no algorithm, no matter how complicated, even given infinite time and space, can compress uniform random data on average and get back the original sequence correctly every time.

If you do compress uniform random data on average, you will not always be able to correctly decompress the data, and compressing it by more than 1 bit on average means that on average you will get back wrong data.
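
The counting can be tabulated directly. The small sketch below (an illustration of the argument above, nothing more) prints, for each n, the number of inputs of length at most n bits against the number of available outputs of length at most n-1 bits; the shortfall of 2^n is the number of inputs that can never receive a shorter unique code:

```c
/* Pigeonhole arithmetic behind the argument above: inputs of length
   <= n bits versus available outputs of length <= n-1 bits. The
   shortfall (2^n) is how many inputs cannot get a shorter unique code. */
#include <stdio.h>

int main(void)
{
    int n;
    for (n = 1; n <= 16; n++) {
        unsigned long inputs  = (1UL << (n + 1)) - 1; /* lengths 0..n   */
        unsigned long outputs = (1UL << n) - 1;       /* lengths 0..n-1 */
        printf("n=%2d  inputs=%7lu  shorter outputs=%7lu  shortfall=%7lu\n",
               n, inputs, outputs, inputs - outputs);
    }
    return 0;
}
```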

That doesn't mean your algorithm isn't useful, it just isn't what you claim it is.

Arnold Archibald

I'm on Sergiu's side on this but above all he needs to get off his ass and write even a basic comp/decomp EXE.

All the current algos are too busy looking for patterns in some wild attempt to simulate neural networks.
Most usable data is indeed restricted to patterns, but most of it could be handled by optimizing known files and their structure.
To speed up image file rendering, a lot of the header data is wasted space that could be stored better, so we have an issue of balance between speed and size.
BMP palette size, for example, only needs one byte (or even 5 bits), but it is stored on a word boundary, taking up two bytes.
Even something as simple as a subtraction filter would be better optimized, in the case of a 24- or 32-bit photo BMP, by rearranging the colour channels first.
So a level of preparation can be applied to a file before compression actually takes place.
But the more times the data is passed over, the longer it takes, whilst hopefully achieving better compression.
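
As a sketch of the kind of preparation step described above (an illustration, not Arnold's method), the routine below de-interleaves 24-bit RGB pixels into per-channel planes and applies a subtraction (delta) filter to each plane; the step is lossless, since the inverse simply adds the deltas back:

```c
/* Sketch of a pre-compression transform: de-interleave 24-bit RGB
   pixels into per-channel planes, then apply a subtraction (delta)
   filter to each plane. Smooth gradients become runs of small values,
   which a following compression pass handles better. */
#include <stddef.h>
#include <stdio.h>

/* src: npixels interleaved RGB triplets; dst: R plane, G plane, B plane,
   each delta-coded against the previous sample in the same channel. */
static void filter_rgb(const unsigned char *src, unsigned char *dst,
                       size_t npixels)
{
    size_t c, i;
    for (c = 0; c < 3; c++) {                   /* one channel at a time */
        unsigned char prev = 0;
        for (i = 0; i < npixels; i++) {
            unsigned char cur = src[i * 3 + c];
            dst[c * npixels + i] = (unsigned char)(cur - prev); /* delta */
            prev = cur;
        }
    }
}

int main(void)
{
    /* A tiny 4-pixel gradient: each channel rises by 10 per pixel. */
    unsigned char src[12] = { 0,0,0, 10,10,10, 20,20,20, 30,30,30 };
    unsigned char dst[12];
    size_t i;
    filter_rgb(src, dst, 4);
    for (i = 0; i < 12; i++)
        printf("%u ", (unsigned)dst[i]);  /* planes of deltas: 0 10 10 10 ... */
    printf("\n");
    return 0;
}
```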

As for testing an algo on random data, such tests are useless, given the current usage of "random" to mean data that contains nothing but unusable noise, without any usage context.
We are missing a perspective here: depending upon how any given program decodes and uses its native files, those files will, when interpreted by another program, appear either the same (or similar) or effectively as noise.
It is relativity, on so many levels, that changes the way we see the world/data.
Is it a knife or a sword? Isn't a knife a small sword, or a sword a large knife? Aren't both simply blades with handles? (Mmmm, classes.) Why store both when you can store a template and the differences?
Have you had trouble recalling a whole bunch of similar memories, but ease in recalling different memories?

What of a header growing larger?
If the algo is fit enough to perform, it should be able to compress that as well, since all headers must have some structured quality.
Then write out a new, smaller header containing instructions on decompressing the larger one.

Clear your head of rules, and start to play around with the relationships between things.

Is this a dead-end field? By no means.
I too am interested in reinventing the wheel, compression-wise (and in many more fields, for that matter), so naturally I was attracted to this thread.
Those who truly understand random data are aware of it as THE universal set.
Inside this set are useless and useful data and all the grey levels in between.

A true random generator CAN create a file that contains RLE-encodable data (the term "clustering" is used for this data type).
It can also create data that is tricky to shrink, so we understand that, minus a header, any file being compressed has about a 50/50 chance of actually compressing.
Whether it can compress to a worthy level is down to the algo, so inevitably we must accept this chance of not compressing for at least 50% of any set of files.

And so we come to just that: clustering. I am also aware of another programmer in my city who is working on high-powered compression.
He has investors and previous patents, and apparently has caught the interest of a telecommunications company.
So this field is by no means dead.

This might help.

1. Make your plan for the project.
    Keep accurate records, diaries, plans, diagrams and dated backups of all iterations of your work.
    Keep it simple for now so you can actually finish the prototype.

2. Test it out yourself on as many files as you can.
    The internet is full of test files for just this purpose. (Another sign this field isn't dead)

3. Write a kick arse EULA. 'nuff said.
    (There are some weird and wacky ones out there! I believe it was tuneXP that had the most succinct one)

4. This was mentioned earlier and it is very important: Non Disclosure Statements.
    These are essentially EULAs with a contractual twist.
    If you are paranoid about the prototype leaking, install countermeasures in the EXE so you can track who has distributed it, or simply give the EXE an expiry date; get creative.
    The NDS should include a statement prohibiting the testers from sharing copies, and any EXE must come directly from you.

5. Get your friends and peers to test the prototype out.
    Get as detailed feedback as possible. Perhaps build an activity logger into the EXE.

6. If they are impressed, get a whole bunch of them to become investors.
    Best that none of them have connections to organised crime.
    Write up a sweet contract that all parties agree to, especially you.

7. Patent that sucker.
    How else are you gonna make money from such a great piece of intellectual property?
    This is where the records come in handy; make sure never to let the originals out of your sight.

8. Publicity.
    The Hutter Prize is a great start, and because you kept the software simple it won't require much alteration to fit their rules.
    Plus you only need to provide the decomp EXE and the compressed file to win the cash.

9. Send out some very polite letters to whoever you consider to be interested parties.
    If you still have some money left over from the investors, run a simple viral campaign extolling your wondrous invention.

10. Wait.