In Part1 article Ive described the need for compression in SQL Server and also gave a brief on Compression enhancement overview. As a next step, in this article Im going to describe the existing compression techniques, overview of basic compression techniques and goal of compression. Now the questions hits your mind is, why do we need to know the basic techniques? First we need to understand the basic compression techniques available and how it works, this will be the base to understand the compression technique used in SQL Server. Lets discuss on this one by one
Compression Techniques
We can segregate compression techniques into two types. Lossless and Lossy, lets discuss on them
Lossless
In this technique when you uncompress your data there wont be any loss in your actual data. This technique will compress the exact data without truncating any small piece of information. Since in SQL Server we need the data to be in consistent state after uncompressing it, so this is the best suited technique. There are two commonly used dictionary based algorithms Huffman Coding & LZ Coding for lossless compression. Some examples of lossless compression is RAR, ZIP etc
Lossy
In this technique when you uncompress the data the result data will have some data loss compared to the actual data. So if you use this technique obviously there will be some loss in your actual data hence this technique cannot be used in SQL Server since we need data to be in consistent state. Some examples of lossless techniques are JPG, MP3 etc
Overview of Basic Lossless Compression Techniques
Huffman Coding
- Analyze the entire data to get frequency of letters used
- Weigh the letters based on the frequency
- Update these details in a tabular format and use these values for the entire set of data
More information on Huffman coding can be found at http://www.huffmancoding.com/david-huffman/huffman-algorithm
LZ (Lempel-Ziv) Algorithm
- Initialize the dictionary
- Read consecutive bytes and add it to dictionary if its not there
More information in LZ coding can be found at http://www.data-compression.com/lossless.html#lz
Goals Of SQL Server Data Compression
-
No data should be loosed when data is uncompressed. i.e Uncompressed Data = Actual Data
-
Ultimate aim is to balance between compression and cost. Achieving optimal compression sacrificing minimum cost. Lets say you are attaining 10 % of compression ratio, if your CPU is bouncing between 75~100 % while decompressing the data, then it shouldnt be a optimal threshold. Lets say you are attaining 30% compression with CPU 30%, then we can consider this as an optimal threshold. So we need to find acceptable compression and cost involved.
-
Data distribution and schema plays a role in compression. In row compression you will be making the fixed data type to variable data type by suppressing the tailing empty data. Consider you have declared a column as bit data type and if you enforce row compression then the compression will not be effective since the schema plays an important role.
-
Considering all these things Storage team has chooses dictionary based algorithm
Leave a Reply to Data Compression (SQL Server 2008, 2008R2 Enterprise Edition) « dbadiary Cancel reply