RLE - encoding/decoding program and source

Started by drarem, March 11, 2005, 07:09:58 PM


drarem

The zip contains a program with source that I have worked on for a couple of days, involving RLE byte compression/decompression.
It is buggy and in the alpha stage, but feel free to test and comment away. It works best on files smaller than 2 MB, since it
generates a two-byte key which would probably already exist in a larger file and mess up the decompression algo.

Click the button 'Load File' to set the input filename and 'Save File' to set the output (restored) filename. A temporary file
called 'rlecoded.tmp' is created on your hard disk.

Click the 'encode' button; it will display the actual and compressed file sizes, and write the temp file.
Click the 'decode' button; it will reverse the numbers (the 'actual' label showing the compressed size and the 'compressed' label showing the actual restored size) and write to the save filename.


That's pretty much it for now, I'm sick as a dog and it sucks. Some of the restored files come back larger, as with .exe's - not
sure why yet, but for the most part they execute and seem to function normally.


http://drarem.ms11.net/rlepkr.zip  - approx 9Kb download

James Ladd

drarem,
I'm interested in this topic and will look at your source.
Can you explain the algorithm you are using?
Rgs, striker

drarem

Yes - it looks at the data byte by byte, for anything repeating.

It is put into a generic string such as this:

[key][occurrences][character]

For the key I generate a 2-byte key; the default is AB. This part is not complete, but the idea is to add 1 to 'A' and search through the file looking for the key combination, in order to create a unique key that won't confuse the decompressor.

[key][#occurrences][byte]

At the beginning of the compressed file is the key header - the key which the decompressor uses to decompress the file. Since the key sequence takes up 4 bytes, there's no point in counting anything repeating fewer than 4 times, as the key sequence would add to the file.

So if you had a data file:

XXOOOOOOXXXXABABCCC

Compressed would look like:

BBXXBB*XXXXABABCCC

* = funny symbol whose value is 6.

Getting the algo to work took lots of effort and banging my head against the wall, and it still seems to be off on some files.

It doesn't save much at the moment - disk storage apparently rounds up to the nearest cluster (typically 4096 bytes), so there's no real storage advantage unless you have a file full of repeating characters that could take advantage of it.
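For reference, the scheme drarem describes can be sketched in a few lines of Python (this is an illustration of the idea, not his MASM source): runs of 4 or more identical bytes become [key][count][byte], everything else is copied through. A fixed two-byte key b"AB" is assumed here, whereas his program searches for a pair that does not occur in the input.

```python
# Illustrative sketch of key-based RLE (assumed fixed key, not drarem's code).
KEY = b"AB"

def rle_encode(data: bytes) -> bytes:
    out = bytearray(KEY)              # key header, so the decoder knows the escape
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 4:                  # shorter runs would grow the output
            out += KEY + bytes([run, data[i]])
        else:
            out += data[i:i + run]    # copy literals through
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    key, body = packed[:2], packed[2:]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i:i + 2] == key:      # [key][count][byte]
            out += bytes([body[i + 3]]) * body[i + 2]
            i += 4
        else:
            out.append(body[i])
            i += 1
    return bytes(out)
```

Note that a literal "AB" anywhere in the input would break this fixed-key sketch - which is exactly why the real program has to hunt for a key combination that never occurs in the file.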

James Ladd

Very interesting.
Have you looked at other schemes or are you doing this to try and teach yourself ?
I look forward to seeing some more.

drarem

Yes and yes. I have a small data compression book from the '90s by Gilbert Held. He explains the theories; I had to come up with a workable algo.

I will attempt other methods such as Huffman - he says the codes are stored as a tree.

Jibz

Nice work :U.

Quote from: drarem
XXOOOOOOXXXXABABCCC

Compressed would look like:

BBXXBB*XXXXABABCCC

Isn't there an O missing? .. something like BBXXBB*OXXXXABABCCC

Quote from: drarem
For the key I generate a 2-byte key; the default is AB. This part is not complete, but the idea is to add 1 to 'A' and search through the file looking for the key combination, in order to create a unique key that won't confuse the decompressor.

You could use an 'escape from the escape' to handle this situation. E.g. let a repeat count of 0 mean 'insert only a single literal occurrence of the key'; then, if e.g. AB is the digraph that occurs the fewest times, you can do:

XXXXXXABXXXXXX...

compresses to

ABAB*XAB+ABAB*X...

where * is 6 and + is 0; the AB+ in the middle encodes the literal occurrence of AB.

Quote from: drarem
At the beginning of the compressed file is the key header - the key which the decompressor uses to decompress the file. Since the key sequence takes up 4 bytes, there's no point in counting anything repeating fewer than 4 times, as the key sequence would add to the file.

This means you could subtract 4 from the repeat count - that way you can encode repeats in the range 4..259 instead of 0..255. If you add the 'escape from escape' you would subtract 3 instead.
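Both refinements together can be sketched like this (hypothetical code, not from the thread, again assuming a fixed key b"AB"): a stored count of 0 means one literal occurrence of the key digraph, and real counts are stored minus 3, so one token covers runs of 4..258 bytes.

```python
# Sketch of the 'escape from the escape' plus the biased repeat count.
KEY = b"AB"
BIAS = 3

def encode(data: bytes) -> bytes:
    out = bytearray(KEY)                  # key header
    i = 0
    while i < len(data):
        if data[i:i + 2] == KEY:          # escape from the escape
            out += KEY + b"\x00"
            i += 2
            continue
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255 + BIAS:
            run += 1
        if run >= 4:
            out += KEY + bytes([run - BIAS, data[i]])
            i += run
        else:
            out.append(data[i])           # copy one literal at a time so the
            i += 1                        # key digraph is re-checked everywhere
    return bytes(out)

def decode(packed: bytes) -> bytes:
    key, body = packed[:2], packed[2:]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i:i + 2] == key:
            if body[i + 2] == 0:          # count 0: literal key digraph
                out += key
                i += 3
            else:                         # count n: run of n + BIAS bytes
                out += bytes([body[i + 3]]) * (body[i + 2] + BIAS)
                i += 4
        else:
            out.append(body[i])
            i += 1
    return bytes(out)
```

Checking the key digraph at every literal position matters: input like "AAB" contains "AB" straddling a run boundary, and the escape catches it.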

Quote from: drarem
Getting the algo to work took lots of effort and banging my head against the wall, and it still seems to be off on some files.

Playing with compression can sometimes drive you mad... but it's lots of fun :U.

Huffman is definitely worth a look .. Lempel-Ziv compression is also fairly easy and fun.

drarem

#6
Quote
XXOOOOOOXXXXABABCCC

Compressed would look like:

BBXXBB*XXXXABABCCC

Isn't there an O missing? .. something like BBXXBB*OXXXXABABCCC

Sorry, I was in a rush (and still am) - right now I'm working on PHP, ODBC and a couple of Access databases, so this, like so many other projects of mine, is going on the back burner.
I'll come back to the compression eventually.

Could you explain the 'escape from an escape'? It almost sounds like indexed dictionary compression - maybe I'm reading too much into it. It is late :/

Thanks.

gabor

Hi!

It's funny - these days I've been thinking about an RLE algo too. :bg It is a bit different and comes from an old picture format used by Deluxe Paint, but as far as I know RLE was used in PCX as well.
Maybe someone doesn't know the meaning of RLE: it is Run-Length Encoding, since it encodes the length of a pattern's repetition (how long a pattern runs :bg).

This algo does not use keys but the MSB of the byte containing the repeat count, so only 127 values remain for the count itself. But this can be modified...
The count byte and the data byte together make two bytes, so only sequences of 3 or more bytes should be compressed. This can be used as written in an earlier post: the count for compressed data is between 3 and 130, but remains between 0 and 127 for uncompressed data.

Here is an example:

Index: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
Data:  12 12 12 38 6B E0 E0 00 0F 00 00 00 00 00 00 00

Compressed data would look like this:
80 12 06 38 6B E0 E0 00 0F 84 00
- At value 80 the MSB is set, indicating that compressed data follows; the count is 00+3=3, so 12 is written 3 times.
- At value 06 the MSB is not set, indicating that uncompressed data follows; the count is 6, so the next 6 bytes are copied just as they are.
- Finally, 84 indicates that 7 bytes of 00 were compressed, so 00 is written 7 times.
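The control-byte scheme walked through above can be sketched in Python like this (an illustration of the idea, not gabor's MASM code): a control byte with the top bit set means "repeat the next byte (count & 0x7F) + 3 times"; with the top bit clear it means "copy the next count bytes verbatim".

```python
# Illustrative MSB-flag RLE, following the 3..130 / 0..127 count convention.
def msb_rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 130:
            run += 1
        if run >= 3:                      # runs of 3..130 map onto 0x80..0xFF
            out += bytes([0x80 | (run - 3), data[i]])
            i += run
        else:                             # gather literals until a run of 3 starts
            j = i
            while j < len(data) and j - i < 127:
                if data[j:j + 3] == data[j:j + 1] * 3:
                    break
                j += 1
            out += bytes([j - i]) + data[i:j]
            i = j
    return bytes(out)

def msb_rle_decode(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        ctrl = packed[i]
        if ctrl & 0x80:                   # MSB set: repeat the next byte
            out += bytes([packed[i + 1]]) * ((ctrl & 0x7F) + 3)
            i += 2
        else:                             # MSB clear: copy ctrl bytes verbatim
            out += packed[i + 1:i + 1 + ctrl]
            i += 1 + ctrl
    return bytes(out)
```

Running the encoder on the 16-byte example above reproduces exactly the compressed stream 80 12 06 38 6B E0 E0 00 0F 84 00.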

I am coding this algo and hope to be ready after the weekend... but I have tons of other things to do, that's the brute reality :'(

Please feel free to ask or comment, I like discussing asm coding topics!

BTW: I could write some words about Huffman and LZW too, if anyone is interested.

Greets to all and nice coding time for everyone!

drarem

Quote
This algo does not use keys but the MSB of the byte containing the repeat count, so only 127 values remain for the count itself. But this can be modified...
The count byte and the data byte together make two bytes, so only sequences of 3 or more bytes should be compressed. This can be used as written in an earlier post: the count for compressed data is between 3 and 130, but remains between 0 and 127 for uncompressed data.

Here is an example:

Index: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
Data: 12 12 12 38 6B E0 E0 00 0F 00 00 00 00 00 00 00

Isn't it possible for the MSB to be set elsewhere in a larger binary file? How would the decoder differentiate between a valid control byte and a plain piece of binary data? Or are you talking about RLE on ASCII files?

What I want to know about is building an indexed dictionary - putting different repeating sequences into a table and using predefined symbols to encode them.

The Dude of Dudes

Gabor,

Very cool!  :U

Drarem,

Read his post again.  :wink

Quote
Compressed data would look like this:
80 12 06 38 6B E0 E0 00 0F 84 00

80 12 specifies three 12's
06 specifies: skip over (don't decompress) the next six bytes: 38 6B E0 E0 00 0F
84 00 specifies seven 00's

I'd be interested to see compression results!

gabor

Quote
Isn't it possible for the MSB to be elsewhere in a larger binary file?  How would the decoder differentiate between a valid MSB and a binary piece of data?  Are you talking about RLE on ascii files?
I'm sorry, I'm afraid I didn't really get the point. Since the MSB is the most significant bit (the highest bit of a byte), it is everywhere in the file. In this algo the MSB of the count byte is used as a flag; according to this flag the decoder can decide whether the data is compressed or not.
Say, the compressed data looks like this:
80 12 06 38 6B E0 E0 00 0F 84 00
Now, the decoder takes 80. The MSB is set, so the next data byte must be written out more than once - precisely 00+3 times (00 comes from 80 after the MSB is cleared; about the 3 I wrote before). So the result is 3 times the next byte: 12 12 12.
The decoder now reads 06. The MSB is not set, so the next 6 bytes are uncompressed, and 38 6B E0 E0 00 0F is written out as-is.
Finally it reads 84. This ends up writing 00 out 7 times: 00 00 00 00 00 00 00.
This algo, just like the others, can be used on binary and ASCII files alike, but the efficiency differs a lot and is generally not very nice... :'(

Quote
What I want to know about is building an indexed dictionary - putting different repeating sequences into a table and using predefined symbols to encode them.
OK, RLE is not a dictionary-based algo, but Huffman and LZW indeed are.
I'll dig out my university books and post pseudo-algos for them if there's interest.

For now this:

Huffman: gathers information about the frequency of each token (character/byte or word). This is done by reading the whole file and counting the occurrences of every character. After this, the algo builds up a binary tree and assigns a code to every token: the more frequent tokens get shorter codes. What's important is that the code is not "byte based" but "bit based", so the result is a bit stream. After creating the codewords, the input file is read once again and encoded. However, there are modified algos which read the input only once (Adaptive Huffman).
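As a sketch of that description (illustrative Python, not code from the thread), the tree construction and code assignment can look like this: count frequencies, repeatedly merge the two least frequent nodes, then read the code words off the branches.

```python
# Illustrative Huffman code construction: frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    freq = Counter(data)
    # Heap entries are (frequency, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair of subtrees.
    heap = [(f, sym, sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    tiebreak = 256                        # keeps tuple comparison away from trees
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: branch on 0/1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: record the bit string
            codes[tree] = prefix or "0"
        return codes
    return walk(heap[0][2], "")
```

The result is a bit string per symbol; an encoder would concatenate those strings and pack them into bytes, which is why the output is a bit stream rather than a byte stream.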

LZW: this smart algo reads the input, builds the dictionary and encodes all at the same time. I don't remember it that well, so I won't get into the details...
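For completeness, here is a minimal LZW sketch (my illustration; the thread gives no details): the dictionary starts with all 256 single bytes and grows as the input is scanned, so encoder and decoder rebuild it in lockstep.

```python
# Illustrative LZW: dictionary built on the fly during a single pass.
def lzw_encode(data: bytes) -> list:
    table = {bytes([i]): i for i in range(256)}
    out, cur = [], b""
    for byte in data:
        nxt = cur + bytes([byte])
        if nxt in table:
            cur = nxt                     # keep extending the current phrase
        else:
            out.append(table[cur])        # emit the longest known phrase
            table[nxt] = len(table)       # learn the new, longer phrase
            cur = bytes([byte])
    if cur:
        out.append(table[cur])
    return out

def lzw_decode(codes: list) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    prev = table[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        # The only code that can be unknown is the one being defined right now.
        entry = table.get(code, prev + prev[:1])
        out += entry
        table[len(table)] = prev + entry[:1]
        prev = entry
    return bytes(out)
```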

I post my RLE algo tomorrow.

Good work, bye then!

PS: The Dude of Dudes posted a shorter and yet clearer explanation of the first question while I was writing this. Sorry for my laziness about not rewriting my post.

drarem

So you're saying I can only encode a run if the count byte stays between 3 and 130, or something like that. I am encoding values 0 to 255 for EACH BYTE - that is why a unique key is required in my 'hack' method. So if I have FD FD FD FD FD, it becomes 41 42 * FD. If I use the MSB of the count byte, wouldn't that confuse the decoder?

Did I say RLE was a dictionary? No, I didn't - I was referring to the LZW (Lempel-Ziv, whatever) method of building an indexed dictionary.

James Ladd

I think this thread is very interesting.
Another thing I find interesting is that the encoding "appears" to me to be byte-based and not bit-based.
Does any of the above encode into a stream of bits?

The Dude of Dudes

#13
Got excited about RLE this afternoon and came up with this as an encoder: :bg


.data?

TempBuffer db 4096 dup (?)

pSingleStart dd ? ;pointer to start of single bytes
pSequenceStart dd ? ;pointer to start of repeating bytes
bSkipSingleInstance dd ?
.code

EndSingletonChain proto
EndRepeatingChain proto
SingleInstance proto

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PackIt proc, Source: dword, Destination: dword, cbSource: dword
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


mov esi, Source
mov edi, Destination
xor ecx, ecx

mov pSequenceStart, 0
mov pSingleStart, 0

.repeat ;Main Loop

mov al, [esi] ;current byte

;=== byte is repeating ===
.if [esi+1] == al

  ;--- Start of Series ---
.if !pSequenceStart ;it's the first byte of a series
.if [esi+2] == al ;ensure min 3-byte sequence
mov pSequenceStart, esi ;save pointer to start of series
.else ;<3 same bytes, treat as singletons
jmp Singletons
.endif
.endif
  invoke EndSingletonChain
   
.else
;=== Byte is not repeating ===
Singletons:
mov bSkipSingleInstance, FALSE
invoke EndRepeatingChain

  .if bSkipSingleInstance == FALSE
invoke SingleInstance
  .endif

.endif

inc esi
inc ecx

.until ecx == cbSource

;=== End Current Chain ===

invoke EndRepeatingChain
invoke EndSingletonChain
sub edi, Destination
mov eax, edi ;return cb of compressed file
ret

PackIt endp


;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EndSingletonChain proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

;--- End chain of singletons ---

.if pSingleStart
push esi
push ecx

mov ecx, esi
sub ecx, pSingleStart ;cb of singletons
mov esi, offset TempBuffer ;ptr to saved singletons

.while ecx > 127
mov byte ptr[edi], 01111111b ;max single chars - 127 (MSB clear)
inc edi
push ecx
mov ecx, 127
cld
rep movsb ;copy 127 singletons
pop ecx
sub ecx, 127
.endw

mov byte ptr[edi], cl ;cb of singletons (MSB clear)
inc edi
cld
rep movsb

pop ecx
pop esi
mov pSingleStart, 0 ;clear pointer (end of singletons)
.endif

ret

EndSingletonChain endp

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EndRepeatingChain proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

;--- End of a repeating series---
.if pSequenceStart
mov edx, esi
sub edx, pSequenceStart ;edx = number of repeating bytes
add edx, 1
;--- we can only represent a max of 130 with one entry ---
.while edx >= 130
mov byte ptr[edi], 0ffh ;130 repetitions
mov byte ptr[edi+1], al ;of <byte>
add edi, 2
sub edx, 130
.endw


;--- now edx < 130 ---
.if dl >2
sub dl, 3
or dl, 10000000b ;set MSB (aka escape for sequence)

mov byte ptr[edi], dl ;number of repetitions
mov byte ptr[edi+1], al ;of <byte>
add edi, 2

.else
.if dl == 2
mov byte ptr[edi], 2
mov byte ptr[edi+1], al
mov byte ptr[edi+2], al
add edi, 3
.elseif dl == 1
mov byte ptr[edi], 1
mov byte ptr [edi+1], al
add edi, 2
.endif
.endif
mov pSequenceStart, 0 ;clear pointer (end of series)
mov bSkipSingleInstance, TRUE
.endif

ret

EndRepeatingChain endp

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SingleInstance proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

;--- Single Instance of Byte ---
.if !pSingleStart ;first singleton in series
mov pSingleStart, esi ;save pointer to start of singletons
.endif

mov edx, esi
sub edx, pSingleStart ;index into temp buffer
add edx, offset TempBuffer
mov [edx], al ;save singleton in tempbuffer

ret

SingleInstance endp


My temp buffer could probably be larger (for streams of singles over 4096 bytes)

Here's the decoder:

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UnPackIt proc, Source: dword, Destination: dword, cbSource: dword
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


mov ecx, 0
mov esi, Source
mov edi, Destination
.repeat
mov al, [esi]
.if (al & 10000000b) ;unpack
and al, 01111111b ;get byte count
push ecx
xor ecx, ecx
mov cl, al
add cl, 3
mov al, [esi+1] ;byte to copy
cld
rep stosb
pop ecx
add esi, 2
add ecx, 2
.else ;string of single bytes
and eax, 0ffh
add ecx, eax
inc ecx
push ecx
xor ecx, ecx
mov cl, al ;byte count
inc esi ;ptr to start of bytes to copy
cld
rep movsb
pop ecx
.endif



.until ecx == cbSource

sub edi, Destination
mov eax, edi ;return cb of uncompressed file
ret

UnPackIt endp

drarem

Here is your test, Dude of Dudes - original size = 351 KB, compressed = 0 bytes, uncompressed = 0 bytes.

That would be an awesome algo, if it worked...


Look at the functions RLE_ENCODE_2 and RLE_DECODE_2; I cut and pasted with very slight mods.



[attachment deleted by admin]