View on GitHub

LZWImplementation

Download this project as a .zip file Download this project as a tar.gz file

Welcome

The objective is to read dictionary-based data compression techniques from David Salomon's "Data Compression - The Complete Reference" and to implement the famous LZW algorithm.

I was hooked on to the book the moment I read the introduction! The book has some great quotes that would make the reader remember certain points while reading and very engaging tone of communication.
The last second paragraph in the introduction did bring some smile. On why the name is "The Complete Reference", he says as James Joyce used to claim that if Dublin of his time were to be destroyed, it could be reconstructed from his books, substantial part of knowledge about data compression could be reconstructed from this book. viola!

Contents of the repository

LZWImplementation

Java Implementation of LZW compression algorithm.

Canterbury Corpus

The Canterbury Corpus is a collection of files introduced in 1997 to provide an alternative to the Calgary corpus for evaluating lossless compression methods. You can read the "Introduction" chapter on David Salomon's "Data Compression - The Complete Reference" for more details.


If you want to know high levels concepts of Data compression and dictionary based compression algorithms, read on ..

Introduction

Data compression is the process of converting an input data stream (the source stream or the original raw data) into another data stream (the output / bitstream / compressed stream) that has a smaller size. This suggests that there is redundancy in input stream and that by assigning short codes to common events and long codes to rare events, one can create an output that is smaller than the input. Even though redundancy would not be commonplace in a text created by random choice of letters in the alphabet, the fact is that the relatively few files that have redundancy and can be compressed are the ones that we want to compress. They have redundancy, are nonrandom and are therefore useful and interesting.

Key terms and concepts related to compression are discussed below to give a better understanding of the topic before we go into the details of LZW compression algorithm. A nonadaptive compression method is rigid and does not modify its operations, its parameters, or its tables in response to the particular data being compressed. Such a method is best used to compress data that is all of a single type. Lossy/lossless compression: Certain compression methods are lossy. They achieve better compression by losing some information. When the compressed stream is decompressed, the result is not identical to the original data stream. Symmetrical compression is the case where the compressor and decompressor use basically the same algorithm but work in “opposite” directions.

The compression ratio is defined as
 Compression Ratio = Size of the output stream / Size of the input stream.

The inverse of the compression ratio is called the compression factor.

Dictionary based data compression

Dictionary- based compression methods do not use a statistical model, nor do they use variable-size codes. Instead they select strings of symbols and encode each string as a token using a dictionary. The dictionary holds strings of symbols, and it may be static or dynamic (adaptive). The former is permanent, sometimes allowing the addition of strings but no deletions, whereas the latter holds strings previously found in the input stream, allowing for additions and deletions of strings as new input is being read.

LZ77 & LZ78 Compression - Overview

LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978.
They are both theoretically dictionary coders. LZ77 maintains a sliding window during compression. This was later shown to be equivalent to the explicit dictionary constructed by LZ78—however, they are only equivalent when the entire data is intended to be decompressed. LZ78 decompression allows random access to the input as long as the entire dictionary is available while LZ77 decompression must always start at the beginning of the input.

LZW Compression

This is a variant of LZ78 and was developed by Terry Welch in 1984. It is similar to LZ77 and LZ78 but the main difference is that it doesn’t need the second field of a token. An LZW token is just a pointer to the dictionary.

Overview of LZW algorithm

At a high level, the method consists of initializing the dictionary with codes for known symbols and as new strings are encountered, new codes are added which are later used when these string occur in the input stream. The core principle is that the compressor reads one string (x) at a time and searches for it in the dictionary. If the lookup succeeds, the process continues and the next time the string to be searched is the previous string concatenated with the new one (xy). If lookup fails for this, the compressor puts the new string into the dictionary as the next available entry.

Example to illustrate LZW

This can be best illustrated with this example. We will use a portion of the example given in “Data Compression – The Complete Reference”1 for the sake of brevity - “sir sid eastman”. We’ll represent space with “_”

The steps are as follows:

  1. Initialize: The dictionary is initialized with the first 256 characters to give codes from 0 to 255.
  2. Process “s”: The compressor reads one character at a time and the first one is “s”. The lookup returns true and since it is found, the process continues.
  3. Process “si”: The current word/phrase = current character + previous word/phrase found in the dictionary. In the current step, it would be “si” a. Since “si” is not found in the dictionary, the compressor does the following – i. Outputs code for s – 115 ii. Stores si as the next entry in the dictionary with code 256. iii. Sets “prev word” = i
  4. Process “i”: The current word/phrase = current char + prev word. In the current step, it would be “ir” a. Since “ir” is not found in the dictionary, the compressor does the following – i. Outputs code for i – 105 ii. Stores ir as the next entry in the dictionary with code 257. iii. Sets “prev word” = r  
  5. The process continues. The table below shows all the steps for “sir_sid_eastman”

x In dictionary New Entry Output

s Y
si N si:256 115(s) i Y
ir N ir:257 105(i) r Y
r_ N r_:258 114(r ) _ Y
s N _s:259 32() s Y
si Y
sid N sid:260 256(si) d Y
d_ N d_:261 100(d) _ Y
e N _e:262 32() e Y
ea N ea:263 101(e) a Y
as N as:264 97(a) s Y
st N st:265 115(s) t Y
tm N tm:266 116(t) m Y
ma N ma:267 109(m) a Y
an N an:268 97(a) n Y
n_ N n_:269 110(n) _ Y
eof N 32()

The complete output stream is (only the numbers are output, not the strings in parentheses) as follows: 115 (s), 105 (i), 114 (r), 32 (), 256 (si), 100 (d), 32 (), 101 (e), 97 (a), 115 (s), 116 (t), 109 (m), 97 (a), 110 (n), 32(_), eof.