
Monday, July 8, 2013

Secret of "sp_updatestats" – feed your dragon

Query Optimizer vs. HR Manager
Let’s say you’re working in a company where there are a couple of departments. Each department has a certain number of employees.


feed your dragon every day
The HR manager maintains a list of all staff in each department. He updates the list whenever there is an addition or any change in staff. At any time, if anyone asks the HR manager how many staff he has in the Finance department, he will have the correct answer. Now, let’s say the HR manager went on vacation for two months, and during that time there were a number of hires and turnovers, but the staff list was not updated.

The HR manager came back from his vacation and someone asked him, “How many staff do you have in the Finance department?” What will be the answer? As the list has not been updated, he will not be able to answer correctly. He needs to update the staff list for each department, counting heads the old-fashioned way with a Chinese abacus. This will not be enjoyable if the company is big, and the HR manager will have to do a lot of work to correct the staff list.

This analogy applies to the SQL Server query optimizer. If the internal data distribution information (statistics) is outdated, SQL Server will eventually correct it on its own. So instead of leaving it to SQL Server, why don’t we just update it ourselves, proactively and on a regular basis?

Why is it so important?
To generate an optimal execution plan, SQL Server needs up-to-date statistics. With poor-quality statistics, SQL Server creates sub-optimal plans that cause severe performance issues such as high CPU, table/index scans, and unnecessary parallel plans. Queries start taking longer than expected to finish, and the extra locking overhead introduces blocking and even time-out issues. Not a good experience for the application user.

How are statistics created?
1.      Using a column in a query predicate, such as a WHERE or JOIN clause, causes statistics to be created on that column when automatic statistics creation is enabled.
2.      The query optimizer creates statistics automatically as a byproduct of creating an index on a table or view. These statistics (single- or multi-column) are created on the key columns of the index.
3.      Statistics can be created manually with the CREATE STATISTICS command.
4.      “sp_createstats” can be used to create statistics explicitly for all columns of all tables in a database. It creates single-column statistics.

Statistics created automatically by the SQL Server query optimizer are always single-column statistics and are named with the prefix “_WA_Sys_”.
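
As a quick illustration of points 3 and 4 above, here is a minimal sketch that creates a statistics object manually and then lists every statistics object on a table, including the auto-created “_WA_Sys_” ones. The table dbo.Sales and column CustomerID are assumed names, not objects from this post:

-- Create a single-column statistics object manually (assumed table/column names)
CREATE STATISTICS st_Sales_CustomerID
ON dbo.Sales (CustomerID)
WITH FULLSCAN;

-- List all statistics on the table: auto-created ("_WA_Sys_" prefix),
-- user-created, and index-based, with the date each was last updated
SELECT  s.name                              AS stats_name,
        s.auto_created,
        s.user_created,
        STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM    sys.stats AS s
WHERE   s.object_id = OBJECT_ID('dbo.Sales');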

Auto statistics update and threshold:
Generally SQL Server determines whether to update statistics based on changes to column modification counters (colmodctr).

An algorithm determines when SQL Server updates statistics. SQL Server keeps track of the number of changes made to each column and triggers a statistics update (and plan recompilation) once a threshold, known as the recompilation threshold (RT), is crossed. RT depends on the table type (permanent versus temporary) and the number of rows in the table (cardinality) when a query plan is compiled. The recompilation thresholds for all tables referenced in a batch are stored with the query plans of that batch.

RT is calculated as follows (where n is the number of rows in the table when the statistics were gathered):

Permanent table
    If n <= 500, RT = 500.
    If n > 500, RT = 500 + 0.20 * n.
Temporary table
    If n < 6, RT = 6.
    If 6 <= n <= 500, RT = 500.
    If n > 500, RT = 500 + 0.20 * n.
Table variable
    Since a table variable has no statistics, RT does not exist.
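
To see how close each statistics object is to its threshold, the sketch below compares the modification counter against the 500 + 20% rule for a permanent table. It assumes SQL Server 2008 R2 SP2 / 2012 SP1 or later, where sys.dm_db_stats_properties is available, and a hypothetical table dbo.Sales:

-- Modifications since the last statistics update vs. the 500 + 20% threshold
SELECT  OBJECT_NAME(s.object_id)              AS table_name,
        s.name                                AS stats_name,
        sp.last_updated,
        sp.rows,
        sp.modification_counter,
        CASE WHEN sp.rows <= 500 THEN 500
             ELSE 500 + 0.20 * sp.rows
        END                                   AS recompilation_threshold
FROM    sys.stats AS s
        CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE   s.object_id = OBJECT_ID('dbo.Sales');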

An important feature for high-throughput OLTP environments is the ability to update statistics automatically in an asynchronous fashion (AUTO_UPDATE_STATISTICS_ASYNC ON).
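
It is a database-level setting; a minimal sketch of enabling it is shown below (the database name is a placeholder, and AUTO_UPDATE_STATISTICS must remain ON for the asynchronous option to have any effect):

-- Enable asynchronous automatic statistics updates (placeholder database name)
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS_ASYNC ON;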

Manual statistics update and drawbacks:
1.      When an index is created or rebuilt, only the statistics associated with that index are updated, nothing else.
2.      When executing ALTER INDEX ALL … on a table, only the statistics associated with indexes are updated. Automatic or manual statistics created on the table are never updated.
3.      “REORGANIZE” never updates any statistics.
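
A brief sketch of these limitations and the workaround with UPDATE STATISTICS (dbo.Sales and IX_Sales_CustomerID are assumed names):

-- Rebuilding an index refreshes only that index's statistics (with a full scan)
ALTER INDEX IX_Sales_CustomerID ON dbo.Sales REBUILD;

-- Reorganizing leaves statistics untouched
ALTER INDEX IX_Sales_CustomerID ON dbo.Sales REORGANIZE;

-- Column statistics (auto-created or user-created) must be refreshed separately,
-- for example by updating every statistics object on the table
UPDATE STATISTICS dbo.Sales WITH FULLSCAN;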

Auto update and the consequence:
How will statistics that were created automatically by the query optimizer, or manually, actually get updated? If automatic statistics creation and update have not been disabled, SQL Server will attempt to update those statistics whenever required. For a large table this can cause performance issues, such as blocking, if it triggers during peak hours. And if for some reason the update does not happen, the old execution plan (which is now sub-optimal due to changes in the underlying data) will be used to execute the query. As a result, query performance suffers badly.

How will you update “user or system created statistics”?
As “automatic or user-defined statistics created on the table will never be updated” by the ALTER INDEX ALL … REBUILD command, the better way to update those statistics is “sp_updatestats”; the alternatives are updating each statistics object manually, one at a time, or depending solely on SQL Server’s auto-update mechanism. The last two options may not be effective in many environments. Disabling auto update statistics is not recommended, but we should not rely on the automatic mechanism completely either; we want to update the statistics beforehand.


“sp_updatestats” – how does it work?
MSDN says “sp_updatestats updates only the statistics that require updating based on the rowmodctr information in the sys.sysindexes catalog view, thus avoiding unnecessary updates of statistics on unchanged rows”. Also note that “sp_updatestats updates statistics on disabled non-clustered indexes and does not update statistics on disabled clustered indexes.”

“sp_updatestats” utilizes the rowmodctr column, and it also utilizes an undocumented internal function, “stats_ver_current”, to check each existing statistics object and decide whether it needs to be updated or not.
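
To get a feel for what sp_updatestats looks at, the sketch below reads rowmodctr from the same sys.sysindexes compatibility view (deprecated, but still available); the filter on user tables is just for illustration:

-- Rows changed per index/statistics entry since the last statistics update
SELECT  OBJECT_NAME(si.id)          AS table_name,
        si.name                     AS index_or_stats_name,
        si.rowmodctr,
        STATS_DATE(si.id, si.indid) AS last_stats_update
FROM    sys.sysindexes AS si
WHERE   OBJECTPROPERTY(si.id, 'IsUserTable') = 1
        AND si.rowmodctr > 0
ORDER BY si.rowmodctr DESC;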

A better approach to optimize databases:
Though each environment is unique, statistics optimization is mandatory for the sake of the best possible query execution plans, regardless of the size of the table or the database. “sp_updatestats” will ensure a complete update of all statistics, irrespective of how they were created. In my experience, updating statistics is even more important and crucial than rebuilding indexes.

A good way to optimize indexes and statistics:
1.      First rebuild all indexes based on a predefined fragmentation level, say 30%. Some indexes will not be touched, and there will be some outdated statistics left over.
2.      In the second step, run “sp_updatestats”, which will take care of all the statistics.

A sample script on how to use “sp_updatestats” against all databases:
The following two scripts do the same job. You can use whichever you prefer.

/*******************************************************
** Statistics update on all db
*******************************************************/
EXEC sp_MSforeachdb 'if ''?'' NOT IN (''tempdb'') begin use [?]; exec sp_updatestats @resample = ''resample''; end'

/*******************************************************
** Statistics update on all db
*******************************************************/
declare @name varchar(100)
declare db_cursor cursor
for
    select  s.[name]
    from    sys.databases s
    where   s.[name] not in ( 'tempdb' )
            and s.[state_desc] = 'ONLINE'
            and s.[user_access_desc] = 'MULTI_USER'
    order by s.[name]
open db_cursor
fetch next from db_cursor into @name
while @@FETCH_STATUS = 0
    begin
        print '-----------------------------------------------'
        print 'Updating stats on db: ' + @name
        print '-----------------------------------------------'
        -- USE must run in the same dynamic batch as sp_updatestats;
        -- a standalone EXEC ('use ...') loses the database context as soon as it returns
        exec ('use [' + @name + ']; exec sp_updatestats;')
        fetch next from db_cursor into @name
    end
close db_cursor
deallocate db_cursor

Output of “sp_updatestats”

References:
sp_updatestats (Transact-SQL)

Statistics Used by the Query Optimizer in Microsoft SQL Server


Statistical maintenance functionality (autostats) in SQL Server

Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005

Thursday, July 4, 2013

Index optimization - REBUILD vs. REORGANIZE

Index optimization is probably one of the most critical tasks every database support person has to perform on a regular basis. Based on the DML operations in a particular environment, we adopt various optimization tasks, steps, and strategies that suit our needs. Some tables or indexes may need frequent optimization; some do not need it at all for a long period of time.

To optimize an index we have two options, REBUILD and REORGANIZE. They work differently and have different effects, and there are some differences we should be aware of to better understand how each T-SQL command works and what it does for us.

Good-to-know key points:
1.      When a non-clustered index is rebuilt, only the associated statistics for that index will be rebuilt.
2.      Rebuilding a clustered index does not rebuild associated non-clustered indexes unless the keyword ALL is specified.
3.      A “HEAP” cannot be optimized. If “ALL” is specified and the underlying table is a heap, the rebuild operation has no effect on the table itself; any non-clustered indexes associated with the table are rebuilt.
4.      The rebuild operation can be minimally logged if the database recovery model is set to either bulk-logged or simple.
5.      The options ONLINE = ON and IGNORE_DUP_KEY = ON are not valid while rebuilding an XML index or a spatial index.
6.      “STATISTICS_NORECOMPUTE = ON” means out-of-date statistics are not automatically recomputed. “STATISTICS_NORECOMPUTE = OFF” means automatic statistics updating is enabled.
7.      If index options are not specified, the existing index option values stored in sys.indexes will be used.
8.      ALTER INDEX cannot be used to repartition an index or move it to a different filegroup. It also cannot be used to modify the index definition, such as adding or deleting columns or changing the column order.
9.      The values for ONLINE, MAXDOP, and SORT_IN_TEMPDB are not stored in the system catalog. You need to specify these options in the index rebuild statement each time.
10.  Reorganizing an index is always performed online. The process does not hold locks for long, so it does not block queries or updates that are running.
11.  When you execute ALTER INDEX ALL … on a table, only the statistics associated with indexes are updated. Automatic or manual statistics created on the table will not be updated.
12.  Index REBUILD can be a parallel operation. Index REORGANIZE is always a serial operation.
13.  Rebuilding or reorganizing small indexes (those with 128 or fewer extents) often does not reduce fragmentation.
14.  Reorganizing an index uses minimal system resources and also compacts the index pages.
15.  Reorganizing an index does not update associated statistics.
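
A minimal sketch contrasting the two commands (the table and index names are illustrative only; ONLINE = ON requires Enterprise Edition):

-- Rebuild all indexes on a table; ONLINE, MAXDOP and SORT_IN_TEMPDB must be
-- specified every time because they are not stored in the system catalog
ALTER INDEX ALL ON dbo.Sales
REBUILD WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON, MAXDOP = 4, ONLINE = ON);

-- Reorganize a single index: always online, always single-threaded,
-- compacts the leaf pages but does not update statistics
ALTER INDEX IX_Sales_CustomerID ON dbo.Sales
REORGANIZE;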

Index Optimization Guideline:
The guideline Microsoft provides on MSDN is a general one, regardless of the DML activity in the database; it needs to be reviewed further by the database administrator, based on his or her workload, to establish a better threshold.

The sys.dm_db_index_physical_stats function can be used to determine fragmentation levels in a specific index, in all indexes on a table or indexed view, in all indexes in a database, or in all indexes in all databases. The following table explains three important columns of this system function that need to be looked at closely:

Column: Description
avg_fragmentation_in_percent: The percent of logical fragmentation (out-of-order pages in the index).
fragment_count: The number of fragments (physically consecutive leaf pages) in the index.
avg_fragment_size_in_pages: The average number of pages in one fragment of the index.

Action thresholds recommended by Microsoft:
avg_fragmentation_in_percent > 5% and <= 30%: ALTER INDEX REORGANIZE
avg_fragmentation_in_percent > 30%: ALTER INDEX REBUILD (all editions), or ALTER INDEX REBUILD WITH (ONLINE = ON) (Enterprise Edition)
Number of extents > 128: a good candidate for index optimization

The above thresholds are recommendations only. As every environment is different, it is a good idea to research the threshold that best suits your needs.

DMV Query:
The following DMV query can be used to pull detailed information about indexes.
/*********************************************************************************
Script: Index Fragmentation Status (includes Partitioned Tables/Indexes)
**********************************************************************************/
select  schema_name(o.schema_id) as [schema_name] ,
        object_name(o.object_id) as [table_name] ,
        i.name as [index_name] ,
        i.type_desc as [index_type] ,
        dmv.page_count ,
        dmv.fragment_count ,
        round(dmv.avg_fragment_size_in_pages, 2, 2) [avg_fragment_size_in_pages] ,
        round(dmv.avg_fragmentation_in_percent, 2, 2) [avg_fragmentation_in_percent] ,
        case when dmv.avg_fragmentation_in_percent <= 5 then 'RELAX'
             when dmv.avg_fragmentation_in_percent <= 30 then 'REORGANIZE'
             when dmv.avg_fragmentation_in_percent > 30 then 'REBUILD'
        end as [action] ,
        stats_date(dmv.object_id, i.index_id) as stats_update_date ,
        case when isnull(ps.function_id, 1) = 1 then 'NO'
             else 'YES'
        end as partitioned ,
        coalesce(fg.name, fgp.name) as [file_group_name] ,
        p.partition_number as [partition_number] ,
        p.rows as [partition_rows] ,
        prv_left.value as [partition_lower_boundary_value] ,
        prv_right.value as [partition_upper_boundary_value] ,
        case when pf.boundary_value_on_right = 1 then 'RIGHT'
             when pf.boundary_value_on_right = 0 then 'LEFT'
             else 'NONE'
        end as [partition_range] ,
        pf.name as [partition_function] ,
        ds.name as [partition_scheme]
from    sys.partitions as p with ( readpast )
        inner join sys.indexes as i with ( readpast ) on i.object_id = p.object_id
                                                         and i.index_id = p.index_id
        inner join sys.objects as o with ( readpast ) on o.object_id = i.object_id
        inner join sys.dm_db_index_physical_stats(db_id(), null, null, null,
                                                  N'LIMITED') dmv on dmv.OBJECT_ID = i.object_id
                                                              and dmv.index_id = i.index_id
                                                              and dmv.partition_number = p.partition_number
        left join sys.data_spaces as ds with ( readpast ) on ds.data_space_id = i.data_space_id
        left join sys.partition_schemes as ps with ( readpast ) on ps.data_space_id = ds.data_space_id
        left join sys.partition_functions as pf with ( readpast ) on pf.function_id = ps.function_id
        left join sys.destination_data_spaces as dds with ( readpast ) on dds.partition_scheme_id = ps.data_space_id
                                                              and dds.destination_id = p.partition_number
        left join sys.filegroups as fg with ( readpast ) on fg.data_space_id = i.data_space_id
        left join sys.filegroups as fgp with ( readpast ) on fgp.data_space_id = dds.data_space_id
        left join sys.partition_range_values as prv_left with ( readpast ) on ps.function_id = prv_left.function_id
                                                              and prv_left.boundary_id = p.partition_number
                                                              - 1
        left join sys.partition_range_values as prv_right with ( readpast ) on ps.function_id = prv_right.function_id
                                                              and prv_right.boundary_id = p.partition_number
where   objectproperty(p.object_id, 'ISMSShipped') = 0
order by [avg_fragmentation_in_percent] DESC,
        [table_name] ,
        [index_name]

Output of the above script:

Good practice:
1.      Try not to DROP an index and re-create it; use ALTER INDEX … REBUILD instead.
2.      To change the index definition, use CREATE INDEX with the DROP_EXISTING clause (see the sketch after this list).
3.      Be careful with the “ALL” option. When “ALL” is specified, all indexes on the table are dropped and rebuilt in a single transaction, and the transaction log can grow rapidly.
4.      Rebuilding indexes ONLINE may take longer, and you may still see short-duration blocking.
5.      Always choose off-peak hours to optimize indexes, and try to use MAXDOP to take advantage of parallel index creation.
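
For point 2, a minimal sketch of the DROP_EXISTING clause (the index, table, and column names are assumptions for illustration):

-- Change an index definition (e.g. add a key column) in one operation
-- instead of a separate DROP INDEX followed by CREATE INDEX
CREATE NONCLUSTERED INDEX IX_Sales_CustomerID
ON dbo.Sales (CustomerID, OrderDate)
WITH (DROP_EXISTING = ON, ONLINE = ON, SORT_IN_TEMPDB = ON);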

Index Optimization Script:
There are a number of automated index optimization scripts available on the internet, but the following two are FREE automated scripts you can use to optimize your indexes reliably and efficiently.

Index Defrag Script, v4.1


SQL Server Maintenance Solution

References:
Reorganize and Rebuild Indexes
ALTER INDEX (Transact-SQL)

Wednesday, July 3, 2013

Self-join incurs more I/O activities and increases locking overhead!

The self-join technique is commonly used to create a hierarchical tree set or to find duplicate records. It is also used to find the previous or next values in a table by utilizing an inline (derived) table. Generally, a self-join is a join in which a table is joined with itself; for example, when we need details about an employee and his manager, where both the employee and manager IDs reside in the same table.

Use of a self-join usually indicates that the table is not normalized. This may not be a problem for a small table, but for a large and busy OLTP table with high concurrency it may lead to poor performance and degraded query response times.

The key point is that each self-joined table is treated as a separate table while retrieving data from disk; thus a self-join incurs more I/O activity and increases locking overhead because it touches the same table twice or more. Adding appropriate indexes on the JOIN and WHERE clause columns reduces I/O and improves performance. It is a good approach to avoid self-joins whenever possible for heavy-duty OLTP tables and queries.
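
As a hedged illustration of that indexing point, consider the employee/manager example above (dbo.Employee, EmployeeID, ManagerID, and EmployeeName are assumed names, with EmployeeID as the clustered primary key):

-- Self-join: each employee row is matched to its manager's row in the same table
SELECT  e.EmployeeID, e.EmployeeName, m.EmployeeName AS ManagerName
FROM    dbo.Employee AS e
        INNER JOIN dbo.Employee AS m ON m.EmployeeID = e.ManagerID;

-- A covering index on the join column keeps the "employee" side of the join narrow,
-- while the "manager" side can seek on the clustered primary key (EmployeeID)
CREATE NONCLUSTERED INDEX IX_Employee_ManagerID
ON dbo.Employee (ManagerID)
INCLUDE (EmployeeName);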

Let’s perform a simple self-join test to see what the execution plan looks like. We can observe that as the number of records retrieved increases, each query behaves differently, such as requiring a specific index or introducing a parallel plan.

Queries:
-- self-join one
select top (10)  a.xid, a.sIdentifier
from    tbllarge a
        inner join tbllarge b on a.xid = b.xid
where   a.sIdentifier ='A1'

-- self-join two
select top (100)  a.xid, a.sIdentifier
from    tbllarge a
        inner join tbllarge b on a.xid = b.xid
where   a.sIdentifier ='A1'

-- self-join three
select top (1000)  a.xid, a.sIdentifier
from    tbllarge a
        inner join tbllarge b on a.xid = b.xid
where   a.sIdentifier ='A1'

Execution Plan: