Sunday, February 24, 2013

SQL Server Data Compression – Beauty or Beast?

Data compression is an Enterprise/Developer Edition feature available from SQL Server 2008 onward. The benefit of this feature is that it reduces disk space and memory requirements; on the other hand, it incurs some CPU cost for compressing and decompressing data pages repeatedly. Therefore, it is recommended to have some CPU headroom on any server where it is implemented.

We can easily estimate the benefits in disk space and memory, but how do we measure the CPU cost associated with data compression? To date, there is no specific indicator that can measure the associated CPU headroom or cost directly.
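One indirect approach is to snapshot the accumulated worker time of our workload from the sys.dm_exec_query_stats DMV before and after enabling compression and compare the two. A minimal sketch (the TOP 10 and the millisecond conversion are just for readability):

```sql
-- Top statements by accumulated CPU (worker) time since plan cache entry
SELECT TOP ( 10 )
        qs.execution_count ,
        qs.total_worker_time / 1000 AS total_cpu_ms ,
        qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms ,
        SUBSTRING(st.text, ( qs.statement_start_offset / 2 ) + 1,
                  ( ( CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                      END - qs.statement_start_offset ) / 2 ) + 1) AS statement_text
FROM    sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

Keep in mind these numbers reset when the plan cache is cleared, so take the before/after snapshots within the same cache lifetime.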

What can we compress?
Unfortunately, we cannot compress the entire database in one shot; there is no such option. We compress data table by table. We have the following options:
1.       A HEAP or a clustered index
2.       One or all non-clustered indexes
3.       One or more table partitions, each with a different compression option
4.       Indexed views
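Before compressing anything, it is handy to see what compression (if any) is already applied. Each partition's setting is exposed in the sys.partitions catalog view; a quick check against our sample table (tblLarge, created later in this post) might look like:

```sql
-- Current compression setting per partition of a table
SELECT  p.index_id ,
        p.partition_number ,
        p.rows ,
        p.data_compression_desc   -- NONE, ROW or PAGE
FROM    sys.partitions AS p
WHERE   p.object_id = OBJECT_ID('dbo.tblLarge');
```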

We have three options for data compression:
1.       ROW: stores fixed-length data types in a variable-length format, compressing each individual row.
2.       PAGE: includes ROW compression, plus prefix and dictionary compression on leaf-level pages.
3.       NONE: the default; no compression.
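Before choosing ROW or PAGE, we can ask SQL Server to estimate the space savings with the built-in sys.sp_estimate_data_compression_savings procedure (also Enterprise/Developer Edition), which samples the table into tempdb and reports current versus estimated compressed size:

```sql
-- Estimate PAGE compression savings for all indexes/partitions of a table
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = 'dbo' ,
    @object_name      = 'tblLarge' ,
    @index_id         = NULL ,      -- NULL = all indexes
    @partition_number = NULL ,      -- NULL = all partitions
    @data_compression = 'PAGE';
```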

Basic syntax of data compression:
The following command changes the compression of a table and the heap or clustered index will be rebuilt. Please note that if the table is a HEAP (has no clustered index), all non-clustered indexes will be rebuilt.
ALTER TABLE TableA REBUILD WITH (DATA_COMPRESSION = PAGE);

(a)    To rebuild an existing Clustered Index:
CREATE CLUSTERED INDEX [PK_TableA] ON [dbo].[TableA]
([xID] ASC)
WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON) 

(b)   To rebuild an existing non-clustered Index:
CREATE NONCLUSTERED INDEX [IX_TableA_Col1] ON [dbo].[TableA]
([Col1] ASC)
WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON) 

(c)    To rebuild all indexes:
ALTER INDEX ALL ON TableA
       REBUILD WITH ( DATA_COMPRESSION = PAGE)

(d)   To rebuild Partition number 1 (one)
ALTER TABLE TableA
REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = PAGE)

(e)   To rebuild all partitions while applying PAGE compression to partition 1 only:
ALTER TABLE TableA
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE ON PARTITIONS(1) ) ;

Our Sample table and testing approach:
We will create a sample table and populate it with 1,000,000 rows. Then we will:

1.       Review data compression on HEAP with default fill factor
2.       Compare the physical size versus cached size
3.       Compare IO and CPU

DMV and script for comparing compression and data buffer:

1.       We will use the following script to get buffer and disk pages:

/*************************************************************
** Script to check buffer and disk pages
*************************************************************/
-- dbcc dropcleanbuffers

SELECT  DB_NAME(bd.database_id) AS database_name ,
        obj.[name] AS 'table_name' ,
        obj.index_id ,
        si.name index_name ,
        obj.type_desc ,
        obj.total_pages AS total_pages_in_disk ,
        obj.size_in_disk_mb ,
        CONVERT(NUMERIC(10), ( COUNT(*) )) AS total_cached_pages ,
        CONVERT(NUMERIC(10, 2), ( COUNT(*) * 8 ) / 1024.0) AS 'cached_size_mb'
FROM    sys.dm_os_buffer_descriptors AS bd
        INNER JOIN ( SELECT OBJECT_NAME(object_id) AS name ,
                            type_desc ,
                            index_id ,
                            allocation_unit_id ,
                            total_pages ,
                            CONVERT(NUMERIC(10, 2), ( total_pages * 8 ) / 1024.0) AS 'size_in_disk_mb'
                     FROM   sys.allocation_units AS au
                            INNER JOIN sys.partitions AS p ON au.container_id = p.hobt_id
                                                              AND ( au.type = 1
                                                                    OR au.type = 3 )
                     UNION ALL
                     SELECT OBJECT_NAME(object_id) AS name ,
                            type_desc ,
                            index_id ,
                            allocation_unit_id ,
                            total_pages ,
                            CONVERT(NUMERIC(10, 2), ( total_pages * 8 ) / 1024.0) AS 'size_in_disk_mb'
                     FROM   sys.allocation_units AS au
                            -- LOB_DATA allocation units (type 2) map through partition_id
                            INNER JOIN sys.partitions AS p ON au.container_id = p.partition_id
                                                              AND au.type = 2
                   ) AS obj ON bd.allocation_unit_id = obj.allocation_unit_id
        JOIN sys.indexes si ON si.index_id = obj.index_id
                               AND si.[object_id] = OBJECT_ID(obj.name)
WHERE   bd.database_id = DB_ID()
        AND obj.[name] = 'tbllarge'
GROUP BY obj.name ,
        bd.database_id ,
        obj.index_id ,
        obj.type_desc ,
        si.name ,
        size_in_disk_mb ,
        obj.total_pages

Let’s get started:

1.       Create a “tblLarge” with default fill factor (which is 0 or 100) and default data compression (which is NONE)

SET NOCOUNT ON
IF OBJECT_ID('tblLarge') IS NOT NULL
    DROP TABLE tblLarge
GO

CREATE TABLE tblLarge
    (
      xID INT IDENTITY(1, 1) ,
      sName1 VARCHAR(100) ,
      sName2 VARCHAR(1000) ,
      sName3 VARCHAR(400) ,
      sIdentifier CHAR(100) ,
      dDOB DATETIME ,
      nWage NUMERIC(20, 2) ,
      sLicense VARCHAR(25)
     )
GO

2.       Populate 1,000,000 rows

INSERT  INTO tblLarge
        ( sName1 ,
          sName2 ,
          sName3 ,
          sIdentifier ,
          dDOB ,
          nWage ,
          sLicense
        )
VALUES  ( LEFT(CAST(NEWID() AS VARCHAR(36)), RAND()*50) ,          -- sName1
          LEFT(CAST(NEWID() AS VARCHAR(36)), RAND()*60) ,          -- sName2
          LEFT(CAST(NEWID() AS VARCHAR(36)), RAND()*70) ,          -- sName3
          LEFT(CAST(NEWID() AS VARCHAR(36)), 2) ,                  -- sIdentifier    
          DATEADD(dd, -RAND() * 20000, GETDATE()) ,                -- dDOB
          ( RAND() * 1000 ) ,                                      -- nWage
          SUBSTRING(CAST(NEWID() AS VARCHAR(36)), 6, 7)            -- sLicense
         )

GO 1000000

3.       We need to clear the data buffer cache with the following DBCC command
DBCC DROPCLEANBUFFERS
4.       We will also capture the STATISTICS IO and STATISTICS TIME output
5.       We will use the following query to perform the test
SELECT * FROM tblLarge WHERE sIdentifier='AB'
6.       Our preferred data compression setting will be PAGE.

Reviewing performance on the HEAP with 0 Fill Factor without Data Compression:

We will run the following query at least ten times and average the ten outputs.

SET STATISTICS IO ON
SET STATISTICS TIME ON
DBCC DROPCLEANBUFFERS
SELECT * FROM tblLarge WHERE sIdentifier='AB'

After running the above query ten times on my system, the STATISTICS TIME output is as follows:

-- Output from SET STATISTICS TIME ON
 SQL Server Execution Times:
   CPU time = 358 ms,  elapsed time = 3190 ms.
   CPU time = 406 ms,  elapsed time = 2926 ms.
   CPU time = 421 ms,  elapsed time = 3074 ms.
   CPU time = 297 ms,  elapsed time = 3291 ms.
   CPU time = 546 ms,  elapsed time = 3218 ms.
   CPU time = 359 ms,  elapsed time = 3169 ms.
   CPU time = 437 ms,  elapsed time = 3057 ms.
   CPU time = 437 ms,  elapsed time = 3007 ms.
   CPU time = 312 ms,  elapsed time = 3011 ms.
   CPU time = 327 ms,  elapsed time = 2931 ms.

Average CPU time = 390 ms
Average elapsed time = 3087.4 ms.

Whereas STATISTICS IO output as follows:
Table 'tblLarge'. Scan count 3, logical reads 28832, physical reads 0, read-ahead reads 28810, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Buffer review from the given DMV query:
Reviewing performance on the HEAP with 0 Fill Factor with Data Compression:

Now compress the HEAP “tblLarge” table using the PAGE-level setting:

DBCC DROPCLEANBUFFERS
ALTER TABLE tblLarge REBUILD WITH (DATA_COMPRESSION = PAGE);

STATISTICS TIME output from above is:

SQL Server Execution Times:
   CPU time = 6302 ms,  elapsed time = 18245 ms.

Once the data compression is done, we run the same script as before.
SET STATISTICS IO ON
SET STATISTICS TIME ON
DBCC DROPCLEANBUFFERS
SELECT * FROM tblLarge WHERE sIdentifier='AB' 

From the STATISTICS TIME output we got following:

-- Output from SET STATISTICS TIME ON
 SQL Server Execution Times:
   CPU time = 342 ms,  elapsed time = 643 ms.
   CPU time = 313 ms,  elapsed time = 470 ms.
   CPU time = 281 ms,  elapsed time = 630 ms.
   CPU time = 375 ms,  elapsed time = 478 ms.
   CPU time = 359 ms,  elapsed time = 666 ms.
   CPU time = 327 ms,  elapsed time = 475 ms.
   CPU time = 375 ms,  elapsed time = 643 ms.
   CPU time = 312 ms,  elapsed time = 470 ms.
   CPU time = 297 ms,  elapsed time = 544 ms.
   CPU time = 328 ms,  elapsed time = 471 ms.

Average CPU time = 330.9 ms
Average elapsed time = 549 ms.

From the STATISTICS IO output we got the following:
Table 'tblLarge'. Scan count 3, logical reads 13981, physical reads 0, read-ahead reads 1640, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Buffer review from the given DMV query:
Comparison Without vs. With Compression:

STATISTICS IO          Without Compression   With Compression    Difference
Scan count             3                     3                   0
logical reads          28832                 13981               14851
physical reads         0                     0                   0
read-ahead reads       28810                 1640                27170

STATISTICS TIME        Without Compression   With Compression    Difference
Average CPU time       390 ms                330.9 ms            59.1 ms
Average elapsed time   3087.4 ms             549 ms              2538.4 ms

Data Pages             Without Compression   With Compression    Difference
# of pages on disk     28841 (225.32 MB)     14001 (109.38 MB)   115.94 MB
# of pages in memory   28833 (225.26 MB)     13982 (109.23 MB)   116.03 MB


Comparison Summary:
The above comparison shows that data retrieval cost improved dramatically, and that the average CPU cost was also reduced. In our example there is a one-time CPU cost for compressing the table. There is also some ongoing CPU cost for DML operations when data compression is enabled on a table, partition, or index, but I believe it is much less than the overall benefits.

Benefits from Data compression:
1.       Reduces disk space overhead
2.       Reduces memory requirement
3.       Fewer reads – improves performance
4.       Less worker time – increases concurrency
5.       Reduces Lock contention

Disadvantages:
Other than some CPU overhead, there are no significant side effects mentioned in Microsoft Books Online (BOL) or in MSDN articles. However, there are guidelines and information worth reading for a better understanding before implementing data compression:

Data Compression: Strategy, Capacity Planning and Best Practices

“Microsoft SQL Server 2008 Internals” by Kalen Delaney
MS Press Book

Conclusion:
Data compression is an Enterprise Edition feature which helps to improve performance in many ways. We should take advantage of this feature to enhance performance.

So, beauty or beast? The answer is “The best way to predict your future is to create it” - Abraham Lincoln
