The zip contains a program with source that I have worked on for a couple of days, involving RLE byte compression/decompression.
It is buggy and in the alpha stage, but feel free to test and comment away. It works best on files smaller than 2 MB, as it
generates a two-byte key which would probably already exist in a larger file and mess up the decompression algo.
Click the 'Load File' button to set the input filename and 'Save File' to set the output (restored) filename. A temporary file
called 'rlecoded.tmp' is created on your hard disk.
Click the 'encode' button: it will display the actual and compressed file sizes, and write the temp file.
Click the 'decode' button: it will reverse the numbers (the 'actual' label meaning compressed size and the 'compressed' label meaning actual restored size)
and write to the save filename.
That's pretty much it for now; I'm sick as a dog and it sucks. Some of the restored files, such as .exe's, come back larger than the original. I'm not
sure why yet, but for the most part they execute and seem to function normally.
http://drarem.ms11.net/rlepkr.zip - approx 9Kb download
drarem,
I'm interested in this topic and will look at your source.
Can you explain the algorithm you are using ?
Rgs, striker
Yes, it looks at the data byte by byte, for anything repeating.
It is put into a generic string such as this:
[key][occurrences][character]
For the key I generate a 2-byte key; the default is AB. It is not complete, but it would add 1 to A and search through the file looking for that key combination, in order to create a unique key that won't confuse the decompressor.
[A][#occurrences][byte]
At the beginning of the compressed file is the key header, the key which the decompressor uses to decompress the file. Since the key sequence takes up 4 bytes, there's no point in counting anything repeating less than 4 as the key sequence would add to the file.
So if you had a data file:
XXOOOOOOXXXXABABCCC
Compressed would look like:
BBXXBB*XXXXABABCCC
* = funny symbol which is value of 6.
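A minimal sketch of this scheme (in Python rather than asm; the two-byte key search is omitted, the key 'BB' is simply assumed and written once as a header, and a run is only encoded when it is longer than the 4-byte key sequence, matching the example):

```python
def rle_encode(data: bytes, key: bytes = b"BB") -> bytes:
    out = bytearray(key)                 # key header, so the decompressor knows the key
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run > 4:                      # key(2)+count(1)+byte(1) = 4 bytes: only longer runs gain
            out += key + bytes([run, data[i]])
        else:
            out += data[i:i + run]       # short runs stay literal
        i += run
    return bytes(out)
```

Note that real files may contain the key bytes in the literal data; the key search described above (omitted here) is what avoids that.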
Getting the algo to work took lots of effort and banging my head into the wall, and it is still off it seems on some files.
It doesn't save much atm - disk storage apparently rounds up to the nearest 4096-byte cluster, so it isn't really an advantage unless you have a file full of repeating characters that can take advantage of it.
Very interesting.
Have you looked at other schemes or are you doing this to try and teach yourself ?
I look forward to seeing some more.
Yes and yes. I have a small data compression book written in the 90's by Gilbert Held. He explains the theories; I had to come up with a workable algo.
I will attempt other methods such as Huffman - he says it's stored as a tree.
Nice work :U.
Quote from: drarem
XXOOOOOOXXXXABABCCC
Compressed would look like:
BBXXBB*XXXXABABCCC
Isn't there an O missing? .. something like BBXXBB*OXXXXABABCCC
Quote from: drarem
For the key I generate a 2-byte key, default is AB - is not complete but would add 1 to A and search through the file looking for the key combination - in order to create a unique key that wont confuse the decompressor.
You could use an 'escape from the escape' to handle this situation. E.g. let a repeat count of 0 mean insert only a single symbol, then, if e.g. AB is the digraph that occurs the fewest times, you can do:
XXXXXXABXXXXXX...
compresses to
AB*XAB+ABAB*X...
where * is 6 and + is 0, and the AB+AB part encodes the occurrence of AB.
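The 'escape from the escape' can be sketched like this (a Python sketch with assumed details: escape pair 'AB', runs of 4 or more encoded, and a count of 0 meaning a single literal symbol follows):

```python
def esc_encode(data: bytes, esc: bytes = b"AB") -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 4:                                 # long run: esc + count + byte
            out += esc + bytes([run, data[i]])
            i += run
        elif data[i:i + 2] == esc:                   # data happens to contain the escape pair
            out += esc + bytes([0]) + data[i:i + 1]  # count 0 = emit a single symbol
            i += 1
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def esc_decode(packed: bytes, esc: bytes = b"AB") -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i:i + 2] == esc:
            count, sym = packed[i + 2], packed[i + 3]
            out += bytes([sym]) * (count if count else 1)
            i += 4
        else:
            out.append(packed[i])
            i += 1
    return bytes(out)
```

On the example above, `esc_encode(b"XXXXXXABXXXXXX")` yields the byte string AB,6,X, AB,0,A, B, AB,6,X.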
Quote from: drarem
At the beginning of the compressed file is the key header, the key which the decompressor uses to decompress the file. Since the key sequence takes up 4 bytes, there's no point in counting anything repeating less than 4 as the key sequence would add to the file.
This means you could subtract 4 from the repeat count -- this way you are able to encode repeats in the range 4..259 instead of 0..255. If you add the 'escape from escape' you would subtract 3 instead.
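The biased count can be sketched with a pair of hypothetical helpers (the key + count + byte layout is assumed from the posts above):

```python
def encode_run(key: bytes, run: int, byte: int) -> bytes:
    assert 4 <= run <= 259              # stored byte 0..255 maps to runs of 4..259
    return key + bytes([run - 4, byte])

def decode_run(chunk: bytes) -> bytes:
    count, byte = chunk[2] + 4, chunk[3]  # chunk = key(2) + count(1) + byte(1)
    return bytes([byte]) * count
```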
Quote from: drarem
Getting the algo to work took lots of effort and banging my head into the wall, and it is still off it seems on some files.
Playing with compression can sometimes drive you mad .. but it's lots of fun :U.
Huffman is definitely worth a look .. Lempel-Ziv compression is also fairly easy and fun.
Quote
XXOOOOOOXXXXABABCCC
Compressed would look like:
BBXXBB*XXXXABABCCC
Isn't there an O missing? .. something like BBXXBB*OXXXXABABCCC
Sorry, I was in a rush (and still am) - now I'm working on PHP, ODBC and a couple of Access databases - this, like so many other projects of mine, is taking a back burner.
I'll come back to the compression eventually.
Could you explain the 'escape from an escape'? It almost sounds like indexed dictionary compression - maybe I'm reading too much into it. It is late :/
Thanks.
Hi! It's funny, but these days I've been thinking about an RLE algo too. :bg It is a bit different and comes from an old picture format used by Deluxe Paint, but as far as I know RLE was used in PCX as well.
Maybe someone doesn't know the meaning of RLE: it is Run Length Encoding, since it encodes the length of a pattern's repetition (how long a pattern runs :bg). This algo does not use keys, but the MSB of the byte containing the repeat count. So it can only encode a count of up to 127 bytes. But this can be modified...
The count byte and the data byte make two bytes, so only sequences of 3 or more bytes should be compressed. This can be used as written in an earlier post: the count for compressed data is between 3 and 130, but remains between 0-127 for uncompressed data.
Here is an example:
Index: | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 0A | 0B | 0C | 0D | 0E | 0F |
Data: | 12 | 12 | 12 | 38 | 6B | E0 | E0 | 00 | 0F | 00 | 00 | 00 | 00 | 00 | 00 | 00 |
Compressed data would look like this:
80 12 06 38 6B E0 E0 00 0F 84 00
- at value 80 the MSB is set, indicating that compressed data follows; the count is 00+3=3, so 12 is written 3 times.
- at value 06 the MSB is not set, indicating that uncompressed data follows; the count is 6, so the next 6 bytes are copied just as they are.
- finally, 84 indicates that 7 bytes of 00 were compressed, so 00 is written 7 times.
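A sketch of this encoder in Python (the function name is mine; it assumes the 3..130 biased run count and literal chunks of up to 127 bytes described above):

```python
def msb_rle_encode(data: bytes) -> bytes:
    out = bytearray()
    lit = bytearray()                    # pending uncompressed bytes

    def flush_literals():
        while lit:
            chunk = lit[:127]
            out.append(len(chunk))       # MSB clear: 1..127 literal bytes follow
            out.extend(chunk)
            del lit[:127]

    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 130:
            run += 1
        if run >= 3:
            flush_literals()
            out.append(0x80 | (run - 3)) # MSB set: count stored biased by 3 (3..130)
            out.append(data[i])
        else:
            lit.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)
```

On the table above it reproduces exactly the stream 80 12 06 38 6B E0 E0 00 0F 84 00.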
I am coding this algo and hope to be ready after the weekend... but I have tons of other things to do, that's the brutal reality :'(
Please feel free to ask or comment, I like discussing asm coding topics!
BTW: I could write a few words about Huffman and LZW too, if anyone is interested.
Greets to all and nice coding time for everyone!
Quote
This algo does not use keys but the MSB of the byte containing the repeat count. So it makes possible the compression of 127 bytes only. But this can be modified...
The count and the data byte makes two bytes, so only 3 byte sequences should be compressed this can be used as written in one post earlier: The count for compressed data is between 3 and 130, but remains between 0-127 for uncompressed data.
Here is an example:
Index: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
Data: 12 12 12 38 6B E0 E0 00 0F 00 00 00 00 00 00 00
Isn't it possible for the MSB to be set elsewhere in a larger binary file? How would the decoder differentiate between a valid MSB flag and an ordinary piece of binary data? Are you talking about RLE on ASCII files?
What I want to know about is building an indexed dictionary - putting different repeating sequences into a table and using predefined symbols to encode them.
Gabor,
Very cool! :U
Drarem,
Read his post again. :wink
Quote
Compressed data would look like this:
80 12 06 38 6B E0 E0 00 0F 84 00
80 12 specifies 3 12's
06 specifies skip over (don't decompress) the next six bytes: 38 6B E0 E0 00 0F
84 00 specifies 7 00's
I'd be interested to see compression results!
Quote
Isn't it possible for the MSB to be elsewhere in a larger binary file? How would the decoder differentiate between a valid MSB and a binary piece of data? Are you talking about RLE on ascii files?
I'm sorry, I'm afraid I didn't really get the point. Since the MSB is the
most significant bit (highest bit of a byte), it is everywhere in the file. In this algo the MSB of the count is used as a flag. According to this flag the decoder can decide whether the data is compressed or not.
Say, the compressed data looks like this:
80 12 06 38 6B E0 E0 00 0F 84 00
Now, the decoder takes 80. The MSB is set, so the next data byte must be written out more than once. Precisely 00+3 times: 00 comes from 80 after the MSB is cleared, and I explained the +3 earlier. So the result is 3 times the next data byte: 12 12 12
The decoder now reads 06. The MSB is not set, so the next 6 bytes are uncompressed. So 38 6B E0 E0 00 0F is written out.
Finally it reads 84. This ends up writing out 00 7 times: 00 00 00 00 00 00 00
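The decoding steps above can be sketched like this (a Python sketch, not the actual asm; the function name is mine):

```python
def msb_rle_decode(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        ctrl = packed[i]
        if ctrl & 0x80:                        # MSB set: a run of (count+3) copies follows
            out += bytes([packed[i + 1]]) * ((ctrl & 0x7F) + 3)
            i += 2
        else:                                  # MSB clear: ctrl literal bytes follow
            out += packed[i + 1:i + 1 + ctrl]
            i += 1 + ctrl
    return bytes(out)
```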
This algo, just like the others, can be used on binary and ASCII files as well. But the efficiency is very different, and generally not very nice... :'(
Quote
What I want to know about is building an indexed dictionary - putting different repeating sequences into a table and using predefined symbols to encode them.
OK, RLE is not a dictionary-based algo, but Huffman and LZW indeed are.
I'll dig out my university books and post a pseudo algo for them if there's interest.
For now this:
Huffman: gathers info about the frequency of each token (character/byte or word). This is done by reading the whole file and counting the occurrence of every character. After this the algo builds up a binary tree and assigns a code to every token: the more frequent tokens get shorter codes. Importantly, the code is not "byte based" but "bit based", so the result is a bit stream. After creating the codewords the input file is read once again and encoded. However, there are modified algos which read the input only once (Adaptive Huffman).
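As a sketch of the tree/code-building step (Python, using a heap; the bit-stream output and the second pass over the file are omitted, and the function name is mine):

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Build a prefix code: more frequent bytes get shorter codes."""
    freq = Counter(data)
    # each heap item: [total_count, [symbol, code], [symbol, code], ...]
    heap = [[n, [sym, ""]] for sym, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: only one distinct byte
        return {heap[0][1][0]: "0"}
    while len(heap) > 1:
        lo = heapq.heappop(heap)             # two least frequent subtrees...
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]          # ...become the left branch
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]          # ...and the right branch
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: code for sym, code in heap[0][1:]}
```

For example, in b"aaabbc" the frequent 'a' gets a 1-bit code while 'b' and 'c' get 2-bit codes.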
LZW: This smart algo reads the input, builds the dictionary and encodes at the same time. I don't remember this algo that well, so I won't get into the details...
I post my RLE algo tomorrow.
Good work, bye then!
PS: The Dude of Dudes posted a shorter and yet clearer explanation of the first question while I was writing this. Sorry for my laziness in not rewriting my post.
So you're saying I can only encode the data if the count byte is between 3 and 127, or something like that? I am encoding 0 to 255 for EACH BYTE. That is why a unique key is required in my 'hack' method. So if I have FD FD FD FD FD it becomes 41 42 * FD. If I use the MSB of the byte, wouldn't that confuse the decoder?
Did I say RLE was a dictionary? No I didn't - I implied the LZW Lempel-whatever method, building an indexed dictionary.
I think this thread is very interesting.
Another thing I find interesting is that the encoding "appears" to me to be byte based and not bit based.
Does any of the above encode into a stream of bits?
Got excited about RLE this afternoon and came up with this as an encoder: :bg
.data?
TempBuffer db 4096 dup (?)
pSingleStart dd ? ;pointer to start of single bytes
pSequenceStart dd ? ;pointer to start of repeating bytes
bSkipSingleInstance dd ?
.code
EndSingletonChain proto
EndRepeatingChain proto
SingleInstance proto
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PackIt proc, Source: dword, Destination: dword, cbSource: dword
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mov esi, Source
mov edi, Destination
xor ecx, ecx
mov pSequenceStart, 0
mov pSingleStart, 0
.repeat ;Main Loop
mov al, [esi] ;current byte
;=== byte is repeating ===
.if [esi+1] == al
;--- Start of Series ---
.if !pSequenceStart ;it's the first byte of a series
.if [esi+2] == al ;ensure min 3-byte sequence
mov pSequenceStart, esi ;save pointer to start of series
.else ;<3 same bytes, treat as singletons
jmp Singletons
.endif
.endif
invoke EndSingletonChain
.else
;=== Byte is not repeating ===
Singletons:
mov bSkipSingleInstance, FALSE
invoke EndRepeatingChain
.if bSkipSingleInstance == FALSE
invoke SingleInstance
.endif
.endif
inc esi
inc ecx
.until ecx == cbSource
;=== End Current Chain ===
invoke EndRepeatingChain
invoke EndSingletonChain
sub edi, Destination
mov eax, edi ;return cb of compressed file
ret
PackIt endp
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EndSingletonChain proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;--- End chain of singletons ---
.if pSingleStart
push esi
push ecx
mov ecx, esi
sub ecx, pSingleStart ;cb of singletons
mov esi, offset TempBuffer ;ptr to saved singletons
.while ecx > 127
mov byte ptr[edi], 01111111b ;max single chars - 127 (MSB clear)
inc edi
push ecx
mov ecx, 127
cld
rep movsb ;copy 127 singletons
pop ecx
sub ecx, 127
.endw
mov byte ptr[edi], cl ;cb of singletons (MSB clear)
inc edi
cld
rep movsb
pop ecx
pop esi
mov pSingleStart, 0 ;clear pointer (end of singletons)
.endif
ret
EndSingletonChain endp
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EndRepeatingChain proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;--- End of a repeating series---
.if pSequenceStart
mov edx, esi
sub edx, pSequenceStart ;edx = number of repeating bytes
add edx, 1
;--- we can only represent a max of 130 with one entry ---
.while edx >= 130
mov byte ptr[edi], 0ffh ;130 repetitions
mov byte ptr[edi+1], al ;of <byte>
add edi, 2
sub edx, 130
.endw
;--- now edx < 130 ---
.if dl >2
sub dl, 3
or dl, 10000000b ;set MSB (aka escape for sequence)
mov byte ptr[edi], dl ;number of repetitions
mov byte ptr[edi+1], al ;of <byte>
add edi, 2
.else
.if dl == 2
mov byte ptr[edi], 2
mov byte ptr[edi+1], al
mov byte ptr[edi+2], al
add edi, 3
.elseif dl == 1
mov byte ptr[edi], 1
mov byte ptr [edi+1], al
add edi, 2
.endif
.endif
mov pSequenceStart, 0 ;clear pointer (end of series)
mov bSkipSingleInstance, TRUE
.endif
ret
EndRepeatingChain endp
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SingleInstance proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;--- Single Instance of Byte ---
.if !pSingleStart ;first singleton in series
mov pSingleStart, esi ;save pointer to start of singletons
.endif
mov edx, esi
sub edx, pSingleStart ;index into temp buffer
add edx, offset TempBuffer
mov [edx], al ;save singleton in tempbuffer
ret
SingleInstance endp
My tempbuffer could probably be larger (for streams of singles over 4096)
Here's the decoder:
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UnPackIt proc, Source: dword, Destination: dword, cbSource: dword
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
inc cbSource
mov ecx, 0
mov esi, Source
mov edi, Destination
.repeat
mov al, [esi]
.if (al & 10000000b) ;unpack
and al, 01111111b ;get byte count
push ecx
xor ecx, ecx
mov cl, al
add cl, 3
mov al, [esi+1] ;byte to copy
cld
rep stosb
pop ecx
add esi, 2
add ecx, 2
.else ;string of single bytes
and eax, 0ffh
add ecx, eax
inc ecx
push ecx
xor ecx, ecx
mov cl, al ;byte count
inc esi ;ptr to start of bytes to copy
cld
rep movsb
pop ecx
.endif
.until ecx == cbSource
sub edi, Destination
mov eax, edi ;return cb of uncompressed file
ret
UnPackIt endp
Here is your test, Dude of Dudes - original size=351k, compressed=0 bytes, uncompressed=0 bytes.
That would be an awesome algo, if it worked...
Look at the functions RLE_ENCODE_2 and RLE_DECODE_2; I cut and pasted with very slight mods.
[attachment deleted by admin]
Hi again!
A bit delayed, but finally here is my RLE algo. rle.zip (http://members.chello.hu/nemeth.gabor1/rle.zip)
It is wrapped very simply in a Win32 process, since the goal was to test the encoder/decoder. Please test it!
I have some ideas on how to optimize it, but I think RLE is a bit of an old method, and it's simply not worth working on it for that much time.
The proggy works like this:
1. throws an OpenFileName Dialog to get the file to be compressed.
2. throws a SaveFileName Dialog to get the output file.
3. throws an OpenFileName Dialog to get the file to be decompressed.
4. throws a SaveFileName Dialog to get the output file.
My results show what I expected: the compression ratio is not very good, and for text files it is a disaster. Once again, RLE can be used for images having large areas of the same color, or for sound samples with long quiet passages...
Note:
- it uses memory buffers to read from and write to. The buffers are 64k long, so longer files are truncated.
- rle.asm is full of debug-info generating code, but it is commented out. If no debugging is needed, all this stuff with all the includes can be deleted.
I hope it works for every case.
Furthermore, I would like to discuss the other packing algos, Huffman and LZW. Does anyone agree?
Bye for now!
Gábor
Nice work, i was able to encode/decode a small .zip file and a ms paint file.
My packed files were 5 to 200 bytes larger than yours, I see how the key system works now.
Thanks for the post.
Fixed my previous code, bundled it into (yet another!) example. Just drag and drop the file to pack/unpack onto Packer.exe's icon. It also retains the file extension when unpacking. Will take a look at the other examples when I have some time! :bg
[attachment deleted by admin]
Good Job Dude :U!
I didn't check your packer completely, but I found one bug (we could live with it :bg). When running the program directly, not dropping a file on it, it gives a memory access exception.
The results: as expected, your packer gives the same compression as mine, which means we both implemented the algo correctly :toothy
Interested in LZW?
Greets, Gábor
Yes, it does throw an exception. If you check the source, you'll see the command-line handling is quite simplistic, which is why it causes an exception! :red It could actually be reworked quite a bit; I threw it together in quite a hurry.
I'd be very interested in other compression methods!
wwooooopps... I hate to beat dead horses, but apparently with byte values of 128 or greater your decompressor gets confused - try the bitmap below; the picture shifts a few bytes to the right. How can the decompressor know which is an actual MSB flag and which is ordinary data, unless you know for certain that the uncompressed binary file won't contain a byte with the MSB set?
Like I said, I wonder what it would do with larger binary files.
[attachment deleted by admin]
Hello!
I thought I'd better not bore all of you with packing algos, so I've put some explanation on http://members.chello.hu/nemeth.gabor1. At the moment it mainly consists of the stuff already discussed here. But I've also put there an unfinished description of LZW. (I will finish it real soon!)
Greets, Gábor
Hmm... I'm not sure what you mean, Drarem. I have no trouble with the bmp. I verified my decompression by examining hex dumps of the original and decompressed files. I check for matching length and randomly select offsets and compare their values. If the image were shifted, the offsets would not contain the same values; they would be shifted.
The decompressor processes the file in order like this:
The first byte is read and it determines what the following byte(s) mean. If the MSB is set, the following byte is repeated (value of bits 0-6, plus 3) times. If the MSB is NOT set, it's a literal count of how many single bytes follow. In either case, it can be determined where the offset of the next 'control character' is. In the case of a repeating sequence, the next control character is 2 bytes past the current control character. In the case of a string of single bytes, the next control character is (control char value) + 1 bytes past the current one. So as you can see, whether the MSB is set in the bytes of the source file has no effect on the decompression. The key is getting the offset of the next control character, and the next, and so on. Hope this makes sense!
Dude, I'd say that's a clean 'n tidy explanation!
And I'd say: could we jmp to LZW? I would like anyone else (but me) to come up with encoder/decoder code. As I've posted before, I put the info about the LZW algo on a webpage.
Greets, Gábor
Maybe I will code something when I have time..
I have started to code the LZW compressor/decompressor. Just give me some time and you guys will get to see it.
I've been working on a version as well. Still buggy. The decompression is by far more work! I think my understanding of the algo may be wrong, specifically during decompression. This is what I am using:
read a character k
output k
w = k
loop
read a character k
entry = dictionary entry for k
output entry (fully translate all codes to final characters)
add w + first char of entry to the dictionary
w = entry
endloop
I think my problem is when k is a yet-undefined dictionary entry. This is how I am handling it:
If k = the next uncreated dictionary entry, the last k output is appended to w and creates the next dictionary entry
In a small test file I used a repeating series of numbers (1-5). At the 12th series it decodes to 1,2,3,4,5,5,2,3,4,5,1,2,3,4,5,1,1,3,4,5
Any ideas?
:dance: Solution: :dance:
If k = the next uncreated dictionary entry, the FIRST CHAR of the last code output is appended to w and creates the next dictionary entry
(big sigh!)
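For reference, here is how that corrected rule looks in a Python sketch of LZW (the asm version packs codes into 12/16-bit words; this sketch keeps them as plain integers, and the function names are mine):

```python
def lzw_encode(data: bytes) -> list:
    table = {bytes([i]): i for i in range(256)}
    nxt, w, out = 256, b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc                       # keep growing the current string
        else:
            out.append(table[w])
            table[wc] = nxt              # new dictionary entry
            nxt += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decode(codes: list) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    nxt = 256
    it = iter(codes)
    w = table[next(it)]
    out = bytearray(w)
    for k in it:
        if k in table:
            entry = table[k]
        elif k == nxt:                   # the special case: code not yet defined
            entry = w + w[:1]            # first char of w == first char of entry
        else:
            raise ValueError("bad code")
        out += entry
        table[nxt] = w + entry[:1]       # new entry: w + first char of entry
        nxt += 1
        w = entry
    return bytes(out)
```

The `k == nxt` branch is exactly the fix: when the decoder sees a code it hasn't built yet, the entry must be w plus the first character of w.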
Same as my RLE Packer. Drag and drop files onto Packer icon to compress/decompress.
Works GREAT with most text files (up to 50% compaction), not as good with exe's (30% maybe)
[attachment deleted by admin]
Hi Dude!
Nice work. Are your entries in the dictionary only 2 characters long? I would say you could try a bigger value, and you could use a different code length, not just 12 bits (the number of entries, and so the size of the dictionary, would change as well).
Greets, Gábor
Here's a 16-bit version. A little better compression. Here are the results for Windows.inc from the MASM package.
Windows.inc - 995kb
12bit lzw(2 byte entries) - 503kb
16bit lzw (2 byte entries) - 377kb :cheekygreen:
I am still using 2-byte entries. I believe increasing the dictionary size or the dictionary entry size achieves the same result, does it not (please tell me if I am wrong)?
[attachment deleted by admin]
Hi!
I'm still having problems with my encoder/decoder. I want to make it possible to adjust the code word length and the entry string length...
As soon as I'm finished, I'll post my code and test results.
Greets, Gábor
The problem is that LZW is supposed to be streamed, so you have to decide on your codeword size beforehand because once you start writing the compressed data you can't go back and change it. If your dictionary gets full, then that's it, you just have to keep on munching and spit out 'badly' compressed data.
Of course, you could be unconventional and compress the whole file once - find out how many entries are required - and then recompress it with the appropriate 'optimal' codeword size.
Hi Dude!
Huhh, it took me a lot of time until I understood what you have done there. Maybe I wasn't reading your post correctly, but I can't remember that solution you applied. You use 2-byte long strings - that's almost true: more precisely you use a hybrid, a dictionary index for the first characters and another for the last character. Why didn't this come to MY mind! :)) I must bow before this thinking. And yes, when handling the entries this way, longer entries don't seem to have any effect...
After this I feel a bit ashamed of my solution; it is very slow, wastes memory and seems to be buggy as well...
What really knocked me off and turned me away from packing things was when I tried to compress windows.inc with rar. The result was only 183kB and it took only a few seconds. Well, there is still much work...
Greets, Gábor
Since I am still learning ASM, when I came to this thread I tried to write an implementation of the RLE algo based on a document on the net.
When we encounter two symbols that are the same, we write the pair, then count how many more of the same follow and write that count. So if we have a string like this:
aaaabcdeee
the output string will be
aa2bcdee1
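A sketch of this variant in Python (the function names are mine; the count is written as a raw byte, whereas the example above shows it as a readable digit):

```python
def pair_rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i + 1] == data[i]:
            out += data[i:i + 2]            # write the repeated pair
            i += 2
            extra = 0
            while i < len(data) and data[i] == data[i - 1] and extra < 255:
                extra += 1
                i += 1
            out.append(extra)               # how many MORE of the same followed
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def pair_rle_decode(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        if i + 1 < len(packed) and packed[i + 1] == packed[i]:
            out += bytes([packed[i]]) * (2 + packed[i + 2])  # pair + extra count
            i += 3
        else:
            out.append(packed[i])
            i += 1
    return bytes(out)
```

So b"aaaabcdeee" packs to the pair 'aa', the count 2, the literals 'bcd', the pair 'ee' and the count 1, matching the example.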
The program actually still reads from a static buffer (I am going to make it read its input from a file) and still has a problem (or maybe more than one :lol - I am still trying to get into an asm mindset): since I write the occurrence count in a byte, if a symbol appears more than 256 times this algo fails.
include \masm32\include\masm32rt.inc
.data
szInput db "aaaaaaabcdfffffghilmooooppqr",0
OutBuffer db 128 dup(0ffh)
.code
start:
call main
exit
main PROC
lea esi, szInput
lea edi, OutBuffer
lodsw
the_loop:
test ah,ah
jz SHORT eos
test al,al
jz SHORT eos
cmp ah,al
je SHORT the_same
not_the_same:
stosb
dec esi
lodsw
jmp SHORT the_loop
the_same:
stosw
call ContaBytes
mov eax, ecx
stosb ;write the count result from the procedure into the buffer
dec esi
lodsw
jmp SHORT the_loop
eos:
ret
main endp
ContaBytes PROC
xor ecx,ecx
the_loop:
lodsb
cmp ah,al
jnz the_end
inc ecx
jmp SHORT the_loop
the_end:
ret
ContaBytes endp
end start
I am writing the decoder; as soon as I finish it I will post it up.
Any suggestions are welcome.
Marko
PS: I don't know if anyone has done this type of implementation before, because I can't access the code in this thread; if anyone did, forgive me.