SQL Server – performance and other stories: Choosing an efficient clustered key

Other side of the moon

Choosing an efficient clustered key is a crucial factor of good database performance. However this factor is often neglected during database schema design time leading to poor performance. It also becomes difficult to resolve issues when the database grows into multi-terabytes in size and even using huge hardware will not reach the satisfactory performance goal.

What is a Clustered Key?

“Clustered indexes sort and store the data rows in the table based on their key values. There can only be one clustered index per table, because the data rows themselves can only be sorted in one order”. To improve performance, reducing the IO overhead is necessary to have a clustered key on almost all tables in a high transactional database. Although there are numerous guidelines and good practices that are available to understand the necessity of an appropriate clustered key in a table; the question is how often do we consider those practices while choosing a clustered key?

It is generally advisable that a clustered key should be the one which is narrow in length and has a unique value column (such as primary key). If the column is not unique then Database Engine will add a 4-byte uniqueifier value to each row to make the column unique. This added value is internal and can’t be seen or accessed by the user and has some internal overhead. However, the more inefficiency occurs when the clustered key is wider than needed.

Pitfall of in-efficient clustered key:

1. Fragmentation: Rapidly introduces more fragmentation.

2. Page Split: A huge number of page allocations and de-allocations happen.

3. Space: Requires more disk & memory, and IO cost will be high.

4. CPU Usage: Observe high CPU due to excessive page split.

5. Slowness: Query response time decreases.

6. Optimization: Index optimization requires more time.

Good Clustered Key:

1. A unique key column is the best candidate for a clustered key.

2. IDENTITY column is a good choice as they are sequential.

3. The column which is used on a JOIN clause.

4. The column used to retrieve data sequentially.

5. The column used in SORT (GROUP or ORDER) operation frequently.

6. Frequently used in range scan (such as BETWEEN, >=, =< )

7. Static Column: such as EmployeeID, SSN.

In-efficient choice for clustered key:

1. Wide Keys: multi-columns key. Such as LastName + FirstName + MiddleName or Address1 + Adddress2 + Phone, so on. “The key values from the clustered index are used by all non-clustered indexes as lookup keys. Any non-clustered indexes defined on the same table will be significantly larger because the non-clustered index entries contain the clustering key and also the key columns defined for that non-clustered index.”

2. GUID: Randomly generated unique values leads to highest possible fragmentation. NEWSEQUENTIALID() can be used instead of NEWID() to create GUID to reduce fragmentation in a table.

3. Data Changes: The column which has frequent value change is not a good choice for a clustered key.

Narrow vs. Wide Clustered Key Test:

Here we will be observing how a wide clustered key introduces performance issues. In our example,

(a) “xID” is the Clustered Key which is a Primary Key and an Identity column.

(b) Later we will create a multi-column clustered key by using “sName1”, “sName2” and “sName3” which are varchar columns.

(d) We will review fragmentation and page split for type of indexes.

DMV Query:

--To check table and index level changes:

SELECT OBJECT_NAME(ios.object_id, ios.database_id) as table_name,

ios.index_id ,

si.name AS index_name,

ios.leaf_insert_count +

ios.leaf_update_count +

ios.leaf_delete_count AS leaf_changes,

ios.leaf_allocation_count AS leaf_page_splits,

ios.nonleaf_insert_count +

ios.nonleaf_update_count +

ios.nonleaf_delete_count AS nonleaf_changes,

ios.nonleaf_allocation_count AS nonleaf_page_splits,

(ios.range_scan_count + ios.leaf_insert_count

+ ios.leaf_delete_count + ios.leaf_update_count

+ ios.leaf_page_merge_count + ios.singleton_lookup_count

) total_changes

FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) ios

JOIN sys.objects so ON so.object_id = ios.object_id

JOIN sys.indexes si ON si.object_id = ios.object_id

AND si.index_id = ios.index_id

JOIN sys.schemas ss ON so.schema_id = ss.schema_id

WHERE OBJECTPROPERTY(ios.object_id, 'IsUserTable') = 1

ORDER BY leaf_changes DESC

--To check index fragmentation:

SELECT a.index_id ,

b.name AS [object_name],

CONVERT(NUMERIC(5,2),a.avg_fragmentation_in_percent) pct_avg_fragmentation

FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, NULL) AS a

JOIN sys.indexes AS b ON a.object_id = b.object_id

AND a.index_id = b.index_id;

Script to test:

CREATE DATABASE TestDB

USE [TestDB]

CREATE TABLE [tblLarge](

[xID] [int] IDENTITY(1,1) NOT NULL,

[sName1] [varchar](10) DEFAULT 'ABC' NOT NULL,

[sName2] [varchar](13) DEFAULT 'ABC' NOT NULL,

[sName3] [varchar](36) DEFAULT 'ABC'NOT NULL,

[sIdentifier] [char](2) NULL,

[dDOB] [date] NULL,

[nWage] [numeric](12, 2) NULL,

[sLicense] [char](7) NULL,

[bGender] [bit] NULL

) ON [PRIMARY]

-- Clustered key on xID

ALTER TABLE tblLarge ADD CONSTRAINT PK_tblLarge

PRIMARY KEY CLUSTERED (xID) WITH (FILLFACTOR=90)

-- DROP constraint

ALTER TABLE [dbo].[tblLarge] DROP CONSTRAINT [PK_tblLarge]

-- Multi-column clustered key

ALTER TABLE tblLarge ADD CONSTRAINT PK_tblLarge

PRIMARY KEY CLUSTERED (sName1, sName2, sName3) WITH (FILLFACTOR=90)

-- Insert 100,000 records

INSERT INTO tblLarge

( sName1 ,

sName2 ,

sName3 ,

sIdentifier ,

dDOB ,

nWage ,

sLicense ,

bGender

)

VALUES ( LEFT(CAST(NEWID() AS VARCHAR(36)), 8) ,

LEFT(CAST(NEWID() AS VARCHAR(36)), 13) ,

CAST(NEWID() AS VARCHAR(36)) ,

LEFT(CAST(NEWID() AS VARCHAR(36)), 2) ,

DATEADD(dd, -RAND() * 20000, GETDATE()) ,

( RAND() * 1000 ) ,

SUBSTRING(CAST(NEWID() AS VARCHAR(36)), 6, 7) ,

COALESCE(ISNUMERIC(LEFT(CAST(NEWID() AS VARCHAR(36)),1)),0))

GO 100000

Fragmentation Comparison:

As you can see from the following picture given below that the fragmentation and page split has been increased dramatically when wide key has been used.

Figure#1: Narrow clustered key

Figure#2: Multi-column clustered key

Conclusion:
While using wide key or multi-columns for clustered index is supported by SQL Server, but we should not overlook the dangerous performance consequences that occurs silently.

Reference:

Clustered Index Design Guidelines

http://technet.microsoft.com/en-us/library/ms190639(v=sql.105).aspx

SQL Server – performance and other stories

Total Pageviews

Tuesday, September 3, 2013

Choosing an efficient clustered key - Other side of the moon

No comments:

Post a Comment