Data Compression is used just about everywhere. Data compression involves the development of a compact representation of information. Most representations of information contain large amounts of redundancy. Redundancy can exist in various forms. Internet users who download or upload files from/to the web, or use email to send or receive attachments will most likely have encountered files in compressed format.
Data Compression Overview
With the extending use of computer in various disciplines, number of data processing applications is also increasing which requires processing and storage of large volumes of data. Data compression is primarily a branch of information theory which deals with techniques related to minimizing the amount of data to be transmitted and stored. Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of efficient coding and its consequences, in the form of speed.
What is Data Compression ?
Today, with the growing demands of information storage and data transfer, data compression is becoming increasingly important. Compression is the process of encoding data more efficiently to achieve a reduction in file size. One type of compression available is referred to as lossless compression. This means the compressed file will be restored exactly to its original state with no loss of data during the decompression process. This is essential to data compression as the file would be corrupted and unusable should data be lost. Data compression is the art of reducing the number of bits needed to store or transmit data. Data compression is one of the enabling technologies for multimedia applications. It would not be practical to put images, audio and video on websites if do not use data compression algorithms. Mobile phones would not be able to provide communication clearly without data compression. With data compression techniques, we can reduce the consumption of resources, such as hard disk space or transmission bandwidth.
Data Compression Principles
Below, data compression principles are listed. Data compression:
- It is the substitution of frequently occurring data items, or symbols, with short codes that require fewer bits of storage than the original symbol.
- Saves space, but requires time to save and extract.
- Success varies with type of data.
- Works best on data with low spatial variability and limited possible values.
- Works poorly with high spatial variability data or continuous surfaces.
- Exploits inherent redundancy and irrelevancy by transforming a data file into a smaller one
Figure 1: Data Compression Process
Data Compression Technique
Data compression is the function of presentation layer in OSI reference model. Compression is often used to maximize the use of bandwidth across a network or to optimize disk space when saving data.
There are two general types of compression technique:
Figure 2: Classification of Data Compression
Lossless compression compresses the data in such a way that when data is decompressed it is exactly the same as it was before compression i.e. there is no loss of data. A lossless compression is used to compress file data such as executable code, text files, and numeric data, because programs that process such file data cannot tolerate mistakes in the data. Lossless compression will typically not compress file as much as lossy compression techniques and may take more processing power to accomplish the compression.
Lossless data compression is compression without any loss of data quality. The decompressed file is an exact replica of the original one. Lossless compression is used when it is important that the original and the decompressed data be identical. It is done by re-writing the data in a more space efficient way, removing all kinds of repetitions (compression ratio 2:1). Some image file formats, notably PNG, use only lossless compression, while those like TIFF may use either lossless or lossy methods.
Lossless Compression Algorithms
The various algorithms used to implement lossless data compression are:
Run Length Encoding
- This method replaces the consecutive occurrences of a given symbol with only one copy of the symbol along with a count of how many times that symbol occurs. Hence the names ‘run length’.
- For example, the string AAABBCDDDD would be encoded as 3A2BIC4D.
- A real life example where run-length encoding is quite effective is the fax machine. Most faxes are white sheets with the occasional black text. So, a run-length encoding scheme can take each line and transmit a code for while then the number of pixels, then the code for black and the number of pixels and so on.
- This method of compression must be used carefully. If there is not a lot of repetition in the data then it is possible the run length encoding scheme would actually increase the size of a file.
Differential Pulse Code Modulation
- In this method first a reference symbol is placed. Then for each symbol in the data, we place the difference between that symbol and the reference symbol used.
- For example, using symbol A as reference symbol, the string AAABBC DDDD would be encoded as AOOOl123333, since A is the same as reference symbol, B has a difference of 1 from the reference symbol and so on.
Dictionary Based Encoding
- One of the best known dictionary based encoding algorithms is Lempel-Ziv (LZ) compression algorithm.
- This method is also known as substitution coder.
- In this method, a dictionary (table) of variable length strings (common phrases) is built.
- This dictionary contains almost every string that is expected to occur in data.
- When any of these strings occur in the data, then they are replaced with the corresponding index to the dictionary.
- In this method, instead of working with individual characters in text data, we treat each word as a string and output the index in the dictionary for that word.
- For example, let us say that the word “compression” has the index 4978 in one particular dictionary; it is the 4978thword is usr/share/dict/words. To compress a body of text, each time the string “compression” appears, it would be replaced by 4978.
A lossy compression method is one where compressing data and then decompressing it retrieves data that may well be different from the original, but is “close enough” to be useful in some way. The algorithm eliminates irrelevant information as well, and permits only an approximate reconstruction of the original file. Lossy compression is also done by re-writing the data in a more space efficient way, but more than that: less important details of the image are manipulated or even removed so that higher compression rates are achieved. Lossy compression is dangerously attractive because it can provide compression ratios of 100:1 to 200:1, depending on the type of information being compressed. But the cost is loss of data.
The advantage of lossy methods over lossless methods is that in some cases a lossy method can produce a much smaller compressed file than any known lossless method, while still meeting the requirements of the application.
Examples of Lossy Methods are:
 “Compression Concepts”, available online at: http://www.gitta.info/DataCompress/en/html/CompIntro_learningObject2.html
 Dinesh Thakur, “Data Compression-What is the Data Compression? Explain Lossless Compression and Lossy Compression”, available online at: http://ecomputernotes.com/computer-graphics/basic-of-computer-graphics/data-compression
 Gaurav Sethi, Sweta Shaw, Vinutha K and Chandrani Chakravorty, “Data Compression Techniques”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4) , 2014, pp. 5584-5586
 Hosseini, Mohammad, “A survey of data compression algorithms and their applications.” Network Systems Laboratory, School of Computing Science, Simon Fraser University, BC, Canada (2012).