Data Compression

In Part1 article Ive described the need for compression in SQL Server and also gave a brief on Compression enhancement overview. As a next step, in this article Im going to describe the existing compression techniques, overview of basic compression techniques and goal of compression. Now the questions hits your mind is, why do we need to know the basic techniques? First we need to understand the basic compression techniques available and how it works, this will be the base to understand the compression technique used in SQL Server. Lets discuss on this one by one

Compression Techniques

We can segregate compression techniques into two types. Lossless and Lossy, lets discuss on them

Lossless

In this technique when you uncompress your data there wont be any loss in your actual data. This technique will compress the exact data without truncating any small piece of information. Since in SQL Server we need the data to be in consistent state after uncompressing it, so this is the best suited technique. There are two commonly used dictionary based algorithms Huffman Coding & LZ Coding for lossless compression. Some examples of lossless compression is RAR, ZIP etc

Lossy

In this technique when you uncompress the data the result data will have some data loss compared to the actual data. So if you use this technique obviously there will be some loss in your actual data hence this technique cannot be used in SQL Server since we need data to be in consistent state. Some examples of lossless techniques are JPG, MP3 etc

Overview of Basic Lossless Compression Techniques

Huffman Coding

Analyze the entire data to get frequency of letters used
Weigh the letters based on the frequency
Update these details in a tabular format and use these values for the entire set of data

More information on Huffman coding can be found at http://www.huffmancoding.com/david-huffman/huffman-algorithm

LZ (Lempel-Ziv) Algorithm

Initialize the dictionary
Read consecutive bytes and add it to dictionary if its not there

More information in LZ coding can be found at http://www.data-compression.com/lossless.html#lz

Goals Of SQL Server Data Compression

No data should be loosed when data is uncompressed. i.e Uncompressed Data = Actual Data
Ultimate aim is to balance between compression and cost. Achieving optimal compression sacrificing minimum cost. Lets say you are attaining 10 % of compression ratio, if your CPU is bouncing between 75~100 % while decompressing the data, then it shouldnt be a optimal threshold. Lets say you are attaining 30% compression with CPU 30%, then we can consider this as an optimal threshold. So we need to find acceptable compression and cost involved.
Data distribution and schema plays a role in compression. In row compression you will be making the fixed data type to variable data type by suppressing the tailing empty data. Consider you have declared a column as bit data type and if you enforce row compression then the compression will not be effective since the schema plays an important role.
Considering all these things Storage team has chooses dictionary based algorithm

Comments

One response to “Data Compression”

Leave a Reply Cancel reply