Total Pageviews

Tuesday, February 19, 2013

COUNT(*) - Galileo sees the stars

In a busy OLTP environment, executing COUNT (*) against a large and heavily used table is often considered a performance hindering activity. A DBA or support personnel may immediately think that this execution caused a Table Scan and incurs huge I/O, CPU, Memory pressure, and so on.

Some of the common patterns we often see are as follows:

(a)           SELECT COUNT(*) FROM aTable....

(b)           SELECT sCustomerName, COUNT(*) as nCount FROM aTable......

(c)           DECLARE @nCount INT
              SELECT @nCount=COUNT(*)
               FROM aTable a
               JOIN bTable b ON a.xID=b.xID.......

              IF @nCount=0
                     --DO this
              ELSE
                     --DO that

(d)           IF ( SELECT COUNT(*) FROM aTable ) >=1000
                     --Do this please



A common practice is that we should always use “IF EXISTS” instead of COUNT(*)  to check existing records for a certain condition. Usually,” IF EXISTS” provides much faster execution and stop processing/scanning/seeking the next record as soon as it finds the first one. However, if the business requires counting a certain number of records or all records from a table to make a critical business decision then it will be a terrific task to replace the count functionality.

Reality by design:
So does COUNT(*) really cause performance issues we think it does?

Well, it depends on what form of indexes exists on the underlying table.  If there are no indexes on the underlying table (is clustered and has no non-clustered indexes) or the table is a HEAP (is not clustered and has no non-clustered indexes) then there will be an obvious clustered index or table Scan which will lead to poor performance. But if there is an index, SQL Server will always use that index to perform COUNT(*).  By design, SQL Server must use the smallest index.

We can reveal this behavior by executing the following sample T-SQL script.

Sample T-SQL Code:
Let’s create a sample table with 100,000 rows with a couple of different columns by defining different data types. We will be creating indexes step by step while testing each COUNT(*) statement.
Sample Table and data:

-- Create a sample table
SET NOCOUNT ON;
IF object_id('tblLarge') IS NOT NULL
  DROP TABLE tblLarge
GO
CREATE TABLE tblLarge(
  xID INT IDENTITY (1, 1),
  sName1 VARCHAR(10),
  sName2 VARCHAR(13),
  sName3 VARCHAR(36),
  sIdentifier CHAR(2),
  dDOB DATE,
  nWage NUMERIC(12, 2),
  sLicense CHAR(7),
  bGender BIT
)
GO

-- Populate the table with 100,000 rows
INSERT INTO tblLarge(sName1
                   , sName2
                   , sName3
                   , sIdentifier
                   , dDOB
                   , nWage
                   , sLicense
                   , bGender)
VALUES
  (left(cast(newid() AS VARCHAR(36)), 8)
 , left(cast(newid() AS VARCHAR(36)), 13)
 , cast(newid() AS VARCHAR(36))
 , left(cast(newid() AS VARCHAR(36)), 2)
 , dateadd(dd, -rand() * 20000, getdate())
 , (rand() * 1000)
 , substring(cast(newid() AS VARCHAR(36)), 6, 7)
 , coalesce(isnumeric(left(cast(newid() AS VARCHAR(36)), 1)), 0))

GO 100000


Test against a HEAP:
Let’s execute the following COUNT((*) query against “tblLarge” where this table is still in HEAP condition. Thus we see a Table Scan.

SELECT  COUNT(*)
FROM    tblLarge
GO
Figure 1: A table scan was performed
  
Test against a CLUSTERED Index:
Now create a clustered index on column “xID” and execute the same query. This time SQL Server optimizer uses the clustered index.

/* Create Primary and Clustered index on "xID" column*/
ALTER TABLE dbo.tblLarge ADD CONSTRAINT PK_tblLarge PRIMARY KEY CLUSTERED (xID)
GO

SELECT  COUNT(*)
FROM    tblLarge
GO

Figure 2: A clustered index scan was done

Test against a wider non-clustered index:
In this test we will be creating a non-clustered index on column “sName3” which is a varchar(36).

/* Create Non-Clustered index on "cName3" column*/
CREATE NONCLUSTERED INDEX IX_tblLarge_cName3 ON dbo.tblLarge (sName3)
GO

SELECT  COUNT(*) FROM    tblLarge
GO


Figure 3: A non-clustered index has been chosen by SQL Server query optimizer

Test against a narrow non-clustered index:
In our example table, the column “sName2” is a varchar(13), so let’s  create a non-clustered index for this column and execute the same query.

/* Create Non-Clustered index on "cName2" column*/
CREATE NONCLUSTERED INDEX IX_tblLarge_sName2 ON dbo.tblLarge (sName2)
GO

SELECT  COUNT(*) FROM    tblLarge
GO


Figure 4: The narrower non-clustered index was chosen by SQL Server query optimizer

Test against a narrowest non-clustered index:
Let’s choose the narrower column to crate the non-clustered index. In our example “bGender” is the narrower column.

/* Create Non-Clustered Key on "bGender" column*/
CREATE NONCLUSTERED INDEX IX_tblLarge_bGender ON dbo.tblLarge (bGender)
GO

SELECT COUNT(*) FROM tblLarge
GO

Figure 5: The narrower non-clustered index was chosen by SQL Server query optimizer

Test against the HEAP while there are non-clustered indexes:
What if we drop the clustered index and execute the same query.

/* Drop the Clustered index*/
ALTER TABLE tblLarge DROP CONSTRAINT PK_tblLarge
GO

SELECT COUNT(*) FROM tblLarge
GO

Figure 6: Still the narrower non-clustered index was chosen by SQL Server query optimizer

Conclusion:
From the above series of tests, we saw that the SQL Server Optimizer always uses the smallest non-clustered indexes. Therefore when a business needs a repetitive COUNT(*) against  a table, it may be good to choose to create a non-clustered index on a smallest width column which will obviously reduce IO and CPU cost while reducing resource contention.

2 comments: