Compression

Parse · December 21, 2005, 05:08:12 PM

Hi, I would like to write my own compression engine and I'm not sure where to start. I've read some texts about Huffman compression, bubble sort and Lev-Zimpel algorithms, and I think I understand most of what they talk about, but I don't know where to start to write code. I've also looked at other people's work like apLib, LZO and UCL but it seems abit too advanced for me. Does anyone have any simpler programs or can give me a hint on where to start. thanks

P1 · December 21, 2005, 05:43:11 PM

It's best to not re-invent the wheel, unless you are really, really committed to this.

Something to check out:
http://www.zlib.net/
http://www.collakesoftware.com/ When he has his website back up.

Regards, P1 :8)

Tedd · December 22, 2005, 01:08:29 PM

Try writing a simple RLE compressor first. Which will basically just read in a file, compress runs of bytes, and write the result to a new file. Forget about any special headers or fancy stuff at first, just get it working. And then you can add to it once you have something :wink
Part 2: write the decompressor :lol

Scalemail Ted · March 07, 2010, 11:01:03 PM

First, I hate to raise the dead, but seeing as this topic already existed I didn't want to waste it.

Ok, now to the point!

I am trying to build my own simple compressor/decompressor. Since I am still new to masm I decided to take an incremental approach in the implementation of this project by establishing clearly defined goals.

I set three goals for myself:

1.) BUILD AN INTERFACE - status: Complete!
I started by designing an interaction pane that ask the user if they want to COMPRESS or DECOMPRESS and then it jumps to the selected procedure. If neither is selected then it loops. I call this my main menu. :p

2.)ADD FILE READING/WRITING/CREATION FUNCTIONALITY - status: Complete!
Next, I decided to expand my program's functionality. When the user selects to compress/decompress, the system now prompts them to enter a file name. Once passed in, it reads the file's contents and saves it into "buffer". Then the user is prompted to enter a new file name. My program than creates a new file with this name and writes the "buffer" to it. On both the initial read and the final write I print to the screen the byte size of the file.
Yay! I just built "copy+paste". :p

3.)ADD COMPRESSOR - status: INCOMPLETE. :(
Final stage- the last boss is hard!
My strategy was to read a file, save its content into a "buffer", run a procedure to compress the buffer, then write the compressed "buffer" to a new file. I have completed everything up to the COMPRESS PROC. Now I have hit a wall in my understanding.

...Then I found this thread, and it sounds like Tedd's suggestion is the route I'm already on. Question is: "Where can i learn about these simple RLE compressors? Would I be able to write it as a PROC in my existing code? In other words, Do I have a sound proof of concept?"

Quote
Try writing a simple RLE compressor first. Which will basically just read in a file, compress runs of bytes, and write the result to a new file. Forget about any special headers or fancy stuff at first, just get it working. And then you can add to it once you have something wink
Part 2: write the decompressor

Thanks for any help!

FORTRANS · March 07, 2010, 11:19:01 PM

Hi,

Look for the "packbits" algorithm. It was on Apple's site
among others. That is a good beginning to RLE.

Good luck,

Steve N.

Scalemail Ted · March 07, 2010, 11:28:41 PM

Quote from: FORTRANS on March 07, 2010, 11:19:01 PM
Hi,

Look for the "packbits" algorithm. It was on Apple's site
among others. That is a good beginning to RLE.

Good luck,

Steve N.

Thanks for the quick response and the resource!

I'll check it out and update the thread as I advance! :)

BlackVortex · March 08, 2010, 12:36:52 AM

aplib is a very good compression library (suitable for small files)

It's free and has asm library support
www.ibsensoftware.com/products_aPLib.html

For bigger files, lzma is probably a good choice.

Scalemail Ted · March 08, 2010, 02:44:05 AM

It appears that RLE seems best suited for compressing data that is in long redundant uninterrupted segments composed of like pieces.

Which may be problematic since I want to try to compress text.

I think I may try either a Huffman or LZ compressor!

I'm attempting to build a Huffman model first.

Quote
Huffman, The algorithm is:

1. List all possible messages with their probabilities.
2. Locate the two messages with the SMALLEST probabilities.
3. Replace these by a single set containing them both, with a
new probability equal to the sum of the two probabilities.
4. Repeat until the list contains only one member.

For STEP1 I built a procedure that creates a frequency table. It scans the string and counts the quantity of each character within the string. I did this by iterating an array that is 256 elements large. Each assignable HEX address for every element within the array (00-FF) represents the hex equivalent for that alphanumeric character, and whatever number is stored in that location represents that characters count within the string

For step2 I tried to make a reverse bubblesort procedure to get the count from highest to lowest. My thought that I could pop them into a stack to get them in reverse order. But then I realized that this moves my numbers from their location and I lose their HEX character value. :-/

joemc · March 08, 2010, 06:04:24 AM

Quote from: Scalemail Ted on March 08, 2010, 02:44:05 AM
Which may be problematic since I want to try to compress text.

compressing text should be fairly different than most other compression. Is all of the text in a specific language (or two)? Does it only use ASCII characters? Does it have predictable content (e.g. html file)?

hutch-- · March 08, 2010, 06:12:16 AM

Joe,

Have a look at JIBZ's stuf as BlackVortex mentioned, JIBZ has been at compression for years and from memory he did a text compression algo that worked well.

Scalemail Ted · March 08, 2010, 11:36:52 AM

I checked out Jibz' APlib Compressor/Decompressor. It's a very impressive resource but I have to admit that I have a tough time following it.

The coding is obviously sharp and efficient, but its also spread across several asm files... and its also more advanced than what I can proficiently read at the moment.

I have searched for hours but I can't find any compressor that is consolidated into a single asm.

I guess tomorrow I'll have to print everything out and try to follow it through the comments. Always a scary prospect ;p

I need to give up tonight... too frustrated and drained to continue.

FORTRANS · March 08, 2010, 10:56:56 PM

Hi,

I thought I had written a program for Huffman coding, but
apparently not. So I coded a piece of one using an example
I was working on by hand. My reference was "The Data
Compression Handbook", by Mark Nelson. He suggests using
a tree structure to hold the information.

So I started off with:

Code Select


NumSymb EQU    16      ; Make this an EQUate to ease translation to 256

TreeNode        STRUC
 Count  DD      0       ; Value of the node, 0 = inactive.
 DCount DD      0       ; Debug value.
 Leaf0  DW      0       ; The 1, 0 branches of the binary tree.
 Leaf1  DW      0
 Stem   DW      0       ; The node that this is a leaf of.
TreeNode        ENDS

NodeSize EQU    SIZE TreeNode

; Need twice the number of symbols of nodes, plus a guard value.
Nodes   TreeNode        2*NumSymb DUP(<>), < 0FFFFFFFFH,,,>

He scans the array/tree to find the two smallest elements/nodes
to combine into a new node and then zeroes those to to show that
they are now inactive. As he notes, the debug value is optional. I
put in the "Stem" value to debug as well. He scans the array
without needing such.

Regards,

Steve

hutch-- · March 09, 2010, 03:37:07 AM

Tewd,

Haqve a look at the masm32 example using an early version of JIBZ's library. Its no big deal to use and the asm files are for decompressing the data. You will note that it places a 4 byte header at the start of the data so that the uncompressed length is known.

Scalemail Ted · March 17, 2010, 02:52:04 PM

UPDATE!!!!

I am getting closer to getting my compressor complete!

It uses Huffman's and so far:

I can read a file, one character at a time, on this first read the program constructs a frequency table and the total character size of the file is also calculated. Next I have a procedure that sorts my freq table and contructs a new array with only the relevant letters from the file. I order the array from most common to least common.

next step is to read the file again, evaluate the character , check its spot on the sorted array and shift the bits over that number of times. then write that keycode to a file...

So far most of my time has been spent learning how to read a file... now I'm having issues writing to a file.

are there any good resources for writing out to a file?

oex · March 17, 2010, 05:51:07 PM

test OutputFile(FilePath,Data,Length), eax

News:

Compression

Parse

Scalemail Ted

Scalemail Ted

Scalemail Ted

Scalemail Ted

Scalemail Ted