RLE - encoding/decoding program and source

Started by drarem, March 11, 2005, 07:09:58 PM


drarem

The zip contains a program with source that I have worked on for a couple of days, involving RLE byte compression/decompression.
It is buggy and in the alpha stage, but feel free to test and comment away. It works best on files smaller than 2 MB, since it
generates a two-byte key which would probably already exist in a larger file and mess up the decompression algo.

Click the button 'Load File' to set the input filename and 'Save File' to set the output (restored) filename. A temporary file
called 'rlecoded.tmp' is created on your hard disk.

Click the 'encode' button; it will display the actual and compressed file sizes, and write the temp file.
Click the 'decode' button; it will reverse the numbers (the 'actual' label showing the compressed size and the 'compressed' label showing the actual restored size) and write to the save filename.


That's pretty much it for now, I'm sick as a dog and it sucks. Some of the restored files come back larger, as with .exe's - not
sure why yet, but for the most part they execute and seem to function normally.


http://drarem.ms11.net/rlepkr.zip  - approx 9Kb download

James Ladd

drarem,
I'm interested in this topic and will look at your source.
Can you explain the algorithm you are using?
Rgs, striker

drarem

Yes - it looks at the data byte by byte, for anything repeating.

It is put into a generic string such as this:

[key][occurrences][character]

For the key I generate a 2-byte key; the default is AB. This part is not complete, but the idea is to add 1 to 'A' and search through the file looking for the key combination, in order to create a unique key that won't confuse the decompressor.

[key][#occurrences][byte]

At the beginning of the compressed file is the key header - the key which the decompressor uses to decompress the file. Since the key sequence takes up 4 bytes, there's no point in counting anything repeating fewer than 4 times, as the key sequence would add to the file.

So if you had a data file:

XXOOOOOOXXXXABABCCC

Compressed would look like:

BBXXBB*XXXXABABCCC

* = funny symbol whose value is 6.

Getting the algo to work took lots of effort and banging my head against the wall, and it still seems to be off on some files.

It doesn't save much at the moment - disk storage apparently rounds up to the nearest cluster (typically 4096 bytes), so there's no real storage advantage unless you have a file full of repeating characters that could take advantage of it.
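For reference, the scheme drarem describes can be sketched in a few lines of Python (this is an illustration of the idea, not his MASM source): runs of 4 or more identical bytes become [key][count][byte], everything else is copied through. A fixed two-byte key b"AB" is assumed here, whereas his program searches for a pair that does not occur in the input.

```python
# Illustrative sketch of key-based RLE (assumed fixed key, not drarem's code).
KEY = b"AB"

def rle_encode(data: bytes) -> bytes:
    out = bytearray(KEY)              # key header, so the decoder knows the escape
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        if run >= 4:                  # shorter runs would grow the output
            out += KEY + bytes([run, data[i]])
        else:
            out += data[i:i + run]    # copy literals through
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    key, body = packed[:2], packed[2:]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i:i + 2] == key:      # [key][count][byte]
            out += bytes([body[i + 3]]) * body[i + 2]
            i += 4
        else:
            out.append(body[i])
            i += 1
    return bytes(out)
```

Note that a literal "AB" anywhere in the input would break this fixed-key sketch - which is exactly why the real program has to hunt for a key combination that never occurs in the file.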

James Ladd

Very interesting.
Have you looked at other schemes or are you doing this to try and teach yourself ?
I look forward to seeing some more.

drarem

Yes and yes. I have a small data compression book from the '90s by Gilbert Held. He explains the theories; I had to come up with a workable algo.

I will attempt other methods such as Huffman - he says the codes are stored as a tree.

Jibz

Nice work :U.

Quote from: drarem
XXOOOOOOXXXXABABCCC

Compressed would look like:

BBXXBB*XXXXABABCCC

Isn't there an O missing? .. something like BBXXBB*OXXXXABABCCC

Quote from: drarem
For the key I generate a 2-byte key; the default is AB. This part is not complete, but the idea is to add 1 to 'A' and search through the file looking for the key combination, in order to create a unique key that won't confuse the decompressor.

You could use an 'escape from the escape' to handle this situation. E.g. let a repeat count of 0 mean 'insert only a single literal occurrence of the key'; then, if e.g. AB is the digraph that occurs the fewest times, you can do:

XXXXXXABXXXXXX...

compresses to

ABAB*XAB+ABAB*X...

where * is 6 and + is 0; the AB+ in the middle encodes the literal occurrence of AB.

Quote from: drarem
At the beginning of the compressed file is the key header - the key which the decompressor uses to decompress the file. Since the key sequence takes up 4 bytes, there's no point in counting anything repeating fewer than 4 times, as the key sequence would add to the file.

This means you could subtract 4 from the repeat count - that way you can encode repeats in the range 4..259 instead of 0..255. If you add the 'escape from escape' you would subtract 3 instead.
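Both refinements together can be sketched like this (hypothetical code, not from the thread, again assuming a fixed key b"AB"): a stored count of 0 means one literal occurrence of the key digraph, and real counts are stored minus 3, so one token covers runs of 4..258 bytes.

```python
# Sketch of the 'escape from the escape' plus the biased repeat count.
KEY = b"AB"
BIAS = 3

def encode(data: bytes) -> bytes:
    out = bytearray(KEY)                  # key header
    i = 0
    while i < len(data):
        if data[i:i + 2] == KEY:          # escape from the escape
            out += KEY + b"\x00"
            i += 2
            continue
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255 + BIAS:
            run += 1
        if run >= 4:
            out += KEY + bytes([run - BIAS, data[i]])
            i += run
        else:
            out.append(data[i])           # copy one literal at a time so the
            i += 1                        # key digraph is re-checked everywhere
    return bytes(out)

def decode(packed: bytes) -> bytes:
    key, body = packed[:2], packed[2:]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i:i + 2] == key:
            if body[i + 2] == 0:          # count 0: literal key digraph
                out += key
                i += 3
            else:                         # count n: run of n + BIAS bytes
                out += bytes([body[i + 3]]) * (body[i + 2] + BIAS)
                i += 4
        else:
            out.append(body[i])
            i += 1
    return bytes(out)
```

Checking the key digraph at every literal position matters: input like "AAB" contains "AB" straddling a run boundary, and the escape catches it.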

Quote from: drarem
Getting the algo to work took lots of effort and banging my head against the wall, and it still seems to be off on some files.

Playing with compression can sometimes drive you mad... but it's lots of fun :U.

Huffman is definitely worth a look .. Lempel-Ziv compression is also fairly easy and fun.

drarem

#6
Quote
XXOOOOOOXXXXABABCCC

Compressed would look like:

BBXXBB*XXXXABABCCC

Isn't there an O missing? .. something like BBXXBB*OXXXXABABCCC

Sorry, I was in a rush (and still am) - right now I'm working on PHP, ODBC and a couple of Access databases, so this, like so many other projects of mine, is going on the back burner.
I'll come back to the compression eventually.

Could you explain the 'escape from an escape'? It almost sounds like indexed dictionary compression - maybe I'm reading too much into it. It is late :/

Thanks.

gabor

Hi!

It's funny - these days I've been thinking about an RLE algo too. :bg It is a bit different and comes from an old picture format used by Deluxe Paint, but as far as I know RLE was used in PCX as well.
Maybe someone doesn't know the meaning of RLE: it is Run-Length Encoding, since it encodes the length of a pattern's repetition (how long a pattern runs :bg).

This algo does not use keys but the MSB of the byte containing the repeat count, so only 127 values remain for the count itself. But this can be modified...
The count byte and the data byte together make two bytes, so only sequences of 3 or more bytes should be compressed. This can be used as written in an earlier post: the count for compressed data is between 3 and 130, but remains between 0 and 127 for uncompressed data.

Here is an example:

Index: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
Data:  12 12 12 38 6B E0 E0 00 0F 00 00 00 00 00 00 00

Compressed data would look like this:
80 12 06 38 6B E0 E0 00 0F 84 00
- At value 80 the MSB is set, indicating that compressed data follows; the count is 00+3=3, so 12 is written 3 times.
- At value 06 the MSB is not set, indicating that uncompressed data follows; the count is 6, so the next 6 bytes are copied just as they are.
- Finally, 84 indicates that 7 bytes of 00 were compressed, so 00 is written 7 times.
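The control-byte scheme walked through above can be sketched in Python like this (an illustration of the idea, not gabor's MASM code): a control byte with the top bit set means "repeat the next byte (count & 0x7F) + 3 times"; with the top bit clear it means "copy the next count bytes verbatim".

```python
# Illustrative MSB-flag RLE, following the 3..130 / 0..127 count convention.
def msb_rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 130:
            run += 1
        if run >= 3:                      # runs of 3..130 map onto 0x80..0xFF
            out += bytes([0x80 | (run - 3), data[i]])
            i += run
        else:                             # gather literals until a run of 3 starts
            j = i
            while j < len(data) and j - i < 127:
                if data[j:j + 3] == data[j:j + 1] * 3:
                    break
                j += 1
            out += bytes([j - i]) + data[i:j]
            i = j
    return bytes(out)

def msb_rle_decode(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        ctrl = packed[i]
        if ctrl & 0x80:                   # MSB set: repeat the next byte
            out += bytes([packed[i + 1]]) * ((ctrl & 0x7F) + 3)
            i += 2
        else:                             # MSB clear: copy ctrl bytes verbatim
            out += packed[i + 1:i + 1 + ctrl]
            i += 1 + ctrl
    return bytes(out)
```

Running the encoder on the 16-byte example above reproduces exactly the compressed stream 80 12 06 38 6B E0 E0 00 0F 84 00.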

I am coding this algo and hope to be ready after the weekend... but I have tons of other things to do, that's the brute reality :'(

Please feel free to ask or comment, I like discussing asm coding topics!

BTW: I could write some words about Huffman and LZW too, if anyone is interested.

Greets to all and nice coding time for everyone!

drarem

Quote
This algo does not use keys but the MSB of the byte containing the repeat count, so only 127 values remain for the count itself. But this can be modified...
The count byte and the data byte together make two bytes, so only sequences of 3 or more bytes should be compressed. This can be used as written in an earlier post: the count for compressed data is between 3 and 130, but remains between 0 and 127 for uncompressed data.

Here is an example:

Index: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
Data: 12 12 12 38 6B E0 E0 00 0F 00 00 00 00 00 00 00

Isn't it possible for the MSB to be set elsewhere in a larger binary file? How would the decoder differentiate between a valid control byte and a plain piece of binary data? Or are you talking about RLE on ASCII files?

What I want to know about is building an indexed dictionary - putting different repeating sequences into a table and using predefined symbols to encode them.

The Dude of Dudes

Gabor,

Very cool!  :U

Drarem,

Read his post again.  :wink

Quote
Compressed data would look like this:
80 12 06 38 6B E0 E0 00 0F 84 00

80 12 specifies three 12's
06 specifies: skip over (don't decompress) the next six bytes: 38 6B E0 E0 00 0F
84 00 specifies seven 00's

I'd be interested to see compression results!

gabor

Quote
Isn't it possible for the MSB to be elsewhere in a larger binary file?  How would the decoder differentiate between a valid MSB and a binary piece of data?  Are you talking about RLE on ascii files?
I'm sorry, I'm afraid I didn't really get the point. Since the MSB is the most significant bit (the highest bit of a byte), it is everywhere in the file. In this algo the MSB of the count byte is used as a flag; according to this flag the decoder can decide whether the data is compressed or not.
Say, the compressed data looks like this:
80 12 06 38 6B E0 E0 00 0F 84 00
Now, the decoder takes 80. The MSB is set, so the next data byte must be written out more than once - precisely 00+3 times (00 comes from 80 after the MSB is cleared; about the 3 I wrote before). So the result is 3 times the next byte: 12 12 12.
The decoder now reads 06. The MSB is not set, so the next 6 bytes are uncompressed, and 38 6B E0 E0 00 0F is written out as-is.
Finally it reads 84. This ends up writing 00 out 7 times: 00 00 00 00 00 00 00.
This algo, just like the others, can be used on binary and ASCII files alike, but the efficiency differs a lot and is generally not very nice... :'(

Quote
What I want to know about is building an indexed dictionary - putting different repeating sequences into a table and using predefined symbols to encode them.
OK, RLE is not a dictionary-based algo, but Huffman and LZW indeed are.
I'll dig out my university books and post pseudo-algos for them if there's interest.

For now this:

Huffman: gathers information about the frequency of each token (character/byte or word). This is done by reading the whole file and counting the occurrences of every character. After this, the algo builds up a binary tree and assigns a code to every token: the more frequent tokens get shorter codes. What's important is that the code is not "byte based" but "bit based", so the result is a bit stream. After creating the codewords, the input file is read once again and encoded. However, there are modified algos which read the input only once (Adaptive Huffman).
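As a sketch of that description (illustrative Python, not code from the thread), the tree construction and code assignment can look like this: count frequencies, repeatedly merge the two least frequent nodes, then read the code words off the branches.

```python
# Illustrative Huffman code construction: frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    freq = Counter(data)
    # Heap entries are (frequency, tiebreak, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair of subtrees.
    heap = [(f, sym, sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    tiebreak = 256                        # keeps tuple comparison away from trees
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: branch on 0/1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                             # leaf: record the bit string
            codes[tree] = prefix or "0"
        return codes
    return walk(heap[0][2], "")
```

The result is a bit string per symbol; an encoder would concatenate those strings and pack them into bytes, which is why the output is a bit stream rather than a byte stream.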

LZW: this smart algo reads the input, builds the dictionary and encodes all at the same time. I don't remember it that well, so I won't get into the details...
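For completeness, here is a minimal LZW sketch (my illustration; the thread gives no details): the dictionary starts with all 256 single bytes and grows as the input is scanned, so encoder and decoder rebuild it in lockstep.

```python
# Illustrative LZW: dictionary built on the fly during a single pass.
def lzw_encode(data: bytes) -> list:
    table = {bytes([i]): i for i in range(256)}
    out, cur = [], b""
    for byte in data:
        nxt = cur + bytes([byte])
        if nxt in table:
            cur = nxt                     # keep extending the current phrase
        else:
            out.append(table[cur])        # emit the longest known phrase
            table[nxt] = len(table)       # learn the new, longer phrase
            cur = bytes([byte])
    if cur:
        out.append(table[cur])
    return out

def lzw_decode(codes: list) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    prev = table[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        # The only code that can be unknown is the one being defined right now.
        entry = table.get(code, prev + prev[:1])
        out += entry
        table[len(table)] = prev + entry[:1]
        prev = entry
    return bytes(out)
```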

I post my RLE algo tomorrow.

Good work, bye then!

PS: The Dude of Dudes posted a shorter and yet clearer explanation of the first question while I was writing this. Sorry for my laziness about not rewriting my post.

drarem

So you're saying I can only encode a run if the count byte stays between 3 and 130, or something like that. I am encoding values 0 to 255 for EACH BYTE - that is why a unique key is required in my 'hack' method. So if I have FD FD FD FD FD, it becomes 41 42 * FD. If I use the MSB of the count byte, wouldn't that confuse the decoder?

Did I say RLE was a dictionary? No, I didn't - I was referring to the LZW (Lempel-Ziv, whatever) method of building an indexed dictionary.

James Ladd

I think this thread is very interesting.
Another thing I find interesting is that the encoding "appears" to me to be byte-based and not bit-based.
Does any of the above encode into a stream of bits?

The Dude of Dudes

#13
Got excited about RLE this afternoon and came up with this as an encoder: :bg


.data?

TempBuffer db 4096 dup (?)

pSingleStart dd ? ;pointer to start of single bytes
pSequenceStart dd ? ;pointer to start of repeating bytes
bSkipSingleInstance dd ?
.code

EndSingletonChain proto
EndRepeatingChain proto
SingleInstance proto

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PackIt proc, Source: dword, Destination: dword, cbSource: dword
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


mov esi, Source
mov edi, Destination
xor ecx, ecx

mov pSequenceStart, 0
mov pSingleStart, 0

.repeat ;Main Loop

mov al, [esi] ;current byte

;=== byte is repeating ===
.if [esi+1] == al

  ;--- Start of Series ---
.if !pSequenceStart ;it's the first byte of a series
.if [esi+2] == al ;ensure min 3-byte sequence
mov pSequenceStart, esi ;save pointer to start of series
.else ;<3 same bytes, treat as singletons
jmp Singletons
.endif
.endif
  invoke EndSingletonChain
   
.else
;=== Byte is not repeating ===
Singletons:
mov bSkipSingleInstance, FALSE
invoke EndRepeatingChain

  .if bSkipSingleInstance == FALSE
invoke SingleInstance
  .endif

.endif

inc esi
inc ecx

.until ecx == cbSource

;=== End Current Chain ===

invoke EndRepeatingChain
invoke EndSingletonChain
sub edi, Destination
mov eax, edi ;return cb of compressed file
ret

PackIt endp


;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EndSingletonChain proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

;--- End chain of singletons ---

.if pSingleStart
push esi
push ecx

mov ecx, esi
sub ecx, pSingleStart ;cb of singletons
mov esi, offset TempBuffer ;ptr to saved singletons

.while ecx > 127
mov byte ptr[edi], 01111111b ;max single chars - 127 (MSB clear)
inc edi
push ecx
mov ecx, 127
cld
rep movsb ;copy 127 singletons
pop ecx
sub ecx, 127
.endw

mov byte ptr[edi], cl ;cb of singletons (MSB clear)
inc edi
cld
rep movsb

pop ecx
pop esi
mov pSingleStart, 0 ;clear pointer (end of singletons)
.endif

ret

EndSingletonChain endp

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
EndRepeatingChain proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

;--- End of a repeating series---
.if pSequenceStart
mov edx, esi
sub edx, pSequenceStart ;edx = number of repeating bytes
add edx, 1
;--- we can only represent a max of 130 with one entry ---
.while edx >= 130
mov byte ptr[edi], 0ffh ;130 repetitions
mov byte ptr[edi+1], al ;of <byte>
add edi, 2
sub edx, 130
.endw


;--- now edx < 130 ---
.if dl >2
sub dl, 3
or dl, 10000000b ;set MSB (aka escape for sequence)

mov byte ptr[edi], dl ;number of repetitions
mov byte ptr[edi+1], al ;of <byte>
add edi, 2

.else
.if dl == 2
mov byte ptr[edi], 2
mov byte ptr[edi+1], al
mov byte ptr[edi+2], al
add edi, 3
.elseif dl == 1
mov byte ptr[edi], 1
mov byte ptr [edi+1], al
add edi, 2
.endif
.endif
mov pSequenceStart, 0 ;clear pointer (end of series)
mov bSkipSingleInstance, TRUE
.endif

ret

EndRepeatingChain endp

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SingleInstance proc
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

;--- Single Instance of Byte ---
.if !pSingleStart ;first singleton in series
mov pSingleStart, esi ;save pointer to start of singletons
.endif

mov edx, esi
sub edx, pSingleStart ;index into temp buffer
add edx, offset TempBuffer
mov [edx], al ;save singleton in tempbuffer

ret

SingleInstance endp


My temp buffer could probably be larger (for streams of singles over 4096 bytes)

Here's the decoder:

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UnPackIt proc, Source: dword, Destination: dword, cbSource: dword
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


mov ecx, 0
mov esi, Source
mov edi, Destination
.repeat
mov al, [esi]
.if (al & 10000000b) ;unpack
and al, 01111111b ;get byte count
push ecx
xor ecx, ecx
mov cl, al
add cl, 3
mov al, [esi+1] ;byte to copy
cld
rep stosb
pop ecx
add esi, 2
add ecx, 2
.else ;string of single bytes
and eax, 0ffh
add ecx, eax
inc ecx
push ecx
xor ecx, ecx
mov cl, al ;byte count
inc esi ;ptr to start of bytes to copy
cld
rep movsb
pop ecx
.endif



.until ecx == cbSource

sub edi, Destination
mov eax, edi ;return cb of uncompressed file
ret

UnPackIt endp

drarem

Here is your test, Dude of Dudes - original size = 351 KB, compressed = 0 bytes, uncompressed = 0 bytes.

That would be an awesome algo, if it worked...


Look at the functions RLE_ENCODE_2 and RLE_DECODE_2; I cut and pasted with very slight mods.



[attachment deleted by admin]