Total Pageviews

Tuesday, April 2, 2013

DBCC CHECKDB – “After death comes the doctor”

It is a common practice for a lot of us to run DBCC CHECKDB before backing up the database. Doing so on a small database performance impact may be negligible. However, for a large and high-end OLTP system, impact is noticeable and significant.

The point is why do we run DBCC CHECKDB and what does it achieve? Generally DBCC CHECKDB is considered as a maintenance command (corruption detection and repairing) to make sure that a database is in a consistent state. it can be used to repair a suspect or corrupt database with or without losing data. Losing a single byte of production data is not desirable and running DBCC CHECKDB isn’t the only way to make sure if the database is in good shape.

In this article we will be focusing on some other options and a easy to follow checklist which will help to prevent a possible database corruption.

What is database corruption?
When one or more pages cannot be read by the database engine, the database will be marked as Suspect or Restore pending. Generally there are two types of page corruptions that can happen– logical and physical.

(a)    Physical error: 823
A Windows read or write request has failed. In the read case, SQL Server will have already retried the read request four times. This error is often the result of a hardware error, but may be caused by the device driver. This is known as CRC (cyclic redundancy check) failure.
(b)   Logical error: 824
This error indicates that Windows reports that the page is successfully read from disk, but SQL Server has discovered something wrong with the page. This error is similar to error 823 except that Windows did not detect the error. This usually indicates a problem in the I/O subsystem, such as a failing disk drive, disk firmware problems, faulty device driver, and so on. This is known as torn page detection.
(c)    Restore Pending: 829
A page has been marked as restore pending.

Why do database corruptions happen?
SQL Server Engine highly cares about database corruption and although Microsoft put excellent and tremendous effort to protect user databases from every angle, database corruption still may happen due to the following identified reasons:

(a)    Faulty disk and controller system
(b)   Outdated bios firmware
(c)    Disk Caching issue
(d)   inappropriate system driver
(e)   Power supply failure

How to prevent database corruption:
There is no precise answer to this. Generally, database corruption happens due to faulty hardware and it is more or less random in nature. Therefore proactive monitoring of various logged events and taking action against suspicious occurrences is the key to prevent database corruption.

Starting from SQL 2005, there are features in Database Engine which reduces the likelihood of page level corruption, one is "Checksum" mechanism and another is "automatic page repair".

(a)    PAGE_VERIFY  CHECKSUM (SQL 2005, and up): This is a per database setting.  When CHECKSUM is enabled for the PAGE_VERIFY database option, the SQL Server Database Engine calculates a checksum over the contents of the whole page, and stores the value in the page header when a page is written to disk. When the page is read from disk, the checksum is recomputed and compared to the checksum value that is stored in the page header.

However, there is a glitch about this mechanism - the checksum are computed on the pages that are written only after the database has been enabled for checksum and re-computed when such pages are read back. Since the database was not initially created with checksum enabled, there will be pages in the database that have no checksums.

(b)   Database Mirroring (SQL 2005 SP1, and up): Automatic page repair, this feature is based on the fact that the principal and mirror databases are exactly the same. So, if a page becomes corrupt on the principal, SQL Server should be able to read the page from the mirror and use it to fix the principal. Similarly, if a page becomes corrupt on the mirror, the page can be read from the principal to fix the mirror.

(c)    AlwaysON (SQL 2012): Automatic page repair is supported by AlwaysON Availability Groups. After certain types of errors corrupt a page, making it unreadable, an availability replica (primary or secondary) attempts to automatically recover the page. The partner/replica that cannot read the page requests a fresh copy of the page from its partner or from another replica. If this request succeeds, the unreadable page is replaced by the readable copy, and this usually resolves the error.

Performance Impact:
Running DBCC CHECKDB against large database will incur significant performance impact. DBCC CHECKDB statement typically must read each allocated page from disk into memory so that it can be checked. By design this command will check allocation and structural integrity of all tables, indexes, indexed views and as well as text and image objects. Below are three major impacted areas.

(a)    Significant memory usage
(b)   Extensive I/O activities
(c)    Heavy spooling to tempdb

DBCC CHECKDB does not create any blocking and it works on a snapshot taken internally. Please note that executing DBCC CHECKDB automatically executes DBCC CHECKTABLE for each table in the database, as well as DBCC CHECKALLOC and DBCC CHECKCATALOG, eliminating the need to run them separately.

What to monitor?
Along with others routine checks, we should include the following areas in order to understand the system and database health.

(a)    SQL Server ERROR LOG for ERROR 823, 824
(b)   Windows System Events (disk, iSCSI, ftdisk, etc)
(c)    Querying msdb for possible events: SELECT * FROM msdb..suspect_pages
(d)   AlwaysON Availability Groups: SELECT * FROM sys.dm_hadr_auto_page_repair
(e)   Database mirroring: SELECT * FROM sys.dm_db_mirroring_auto_page_repair

Last ran of DBCC CHECKDB against all databases:
/***********************************************************
** Last ran of DBCC CHECKDB against all databases
***********************************************************/
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SET NOCOUNT ON

DECLARE @databases TABLE
        (
          Id INT IDENTITY(1, 1)
                 PRIMARY KEY ,
          ParentObject VARCHAR(255) ,
          Object VARCHAR(255) ,
          Field VARCHAR(255) ,
          Value VARCHAR(255)
        )

INSERT  INTO @databases
        ( ParentObject ,
          Object ,
          Field ,
          Value
        )
        EXEC sp_msforeachdb N'DBCC DBINFO(''?'') WITH TABLERESULTS;';

INSERT  INTO @databases
        ( ParentObject ,
          Object ,
          Field ,
          Value
        )

        SELECT  'Final Record' ,
                'Final Record' ,
                'dbi_dbname' ,
                'Final Record';

WITH    database_name ( Id, Field, Value, DBID )
          AS ( SELECT   Id ,
                        Field ,
                        Value ,
                        ROW_NUMBER() OVER ( PARTITION BY Field ORDER BY ID )
               FROM     @databases
               WHERE    Field = 'dbi_dbname'
             ),
        Last_DBCC_CHECKDB ( Id, Field, Value )
          AS ( SELECT   Id ,
                        Field ,
                        Value
               FROM     @databases
               WHERE    Field = 'dbi_dbccLastKnownGood'
             )

     SELECT DISTINCT
            db1.Value [database_name] ,
            ldc.Value [last_DBCC_CHECKDB]
     FROM   Last_DBCC_CHECKDB ldc
            INNER JOIN database_name db1 ON ldc.Id > db1.Id
            INNER JOIN database_name db2 ON ldc.Id < db2.Id
                                            AND db2.DBID = db1.DBID + 1
     ORDER BY db1.Value;

Automated Monitoring and SQL Alert Feature:
DBCC CHEKDB is a more reactive and corrective kind of command which tells us the sad story of a database’s consistency after it corrupted. It will be too late to take action when the corruption is discovered. 

In order to get early notifications regarding database issues, we can easily implement a notification service task which will let us know immediately when an issue occurred in a database.

How to enable Alert System: 
1.       Configure Database Mail and Create a mail profile
2.       Create operator who will be notified in the event of issue
3.       In SQL Server Agent attach the mail profile to the Alert System
4.        SQL Agent Server needs to be restarted to take effect of above changes
5.       Expand SQL Agent tree nodes and choose Alert and create New Alert
6.       In the Alert Dialog box – choose “SQL Server Event Alert”
7.       For “Severity” level there are couples of options, select “Fatal” type such as 023, 024, 025, etc.

Below is a Notification Configuration screenshot for alerting a database corruption error in SQL Server Instance level for all databases.


We will receive the following alert message in case of any integrity issue.

DESCRIPTION:   SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x94a8f51f; actual: 0x0275a963). It occurred during a read of page (1:0) in database ID 7 at offset 0000000000000000 in file 'C:\Temp\TestDB.mdf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

Database Corruption related message can be found in SQL Server ERROR LOG:

2013-04-02 11:30:15.920 spid57 Setting database option ONLINE to ON for database 'TestDB'.
2013-04-02 11:30:15.920 spid57 Starting up database 'TestDB'.
2013-04-02 11:30:15.940 spid57 Error: 824, Severity: 24, State: 6.
2013-04-02 11:30:15.940 spid57 SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x94a8f51f; actual: 0x0275a963). It occurred during a read of page (1:0) in database ID 7 at offset 0000000000000000 in file 'C:\Temp\TestDB.mdf'. Additional messages


Result of querying msdb for all possible recorded events happened in SQL Server instance level:

 
Recommendations:
Preventing database corruptions are one of the critical tasks of any DBA’s or SysAdmin’s job responsibilities. Along with the regular database backups and system checks described above, Automatic Page Repair features of database mirroring or AlwaysON can be used to prevent page level corruption.

Summary:
Honestly, no matter whatever we do, regular database and frequent transaction log backup is the only safeguard against database corruption or any disastrous situation. And even though, if the backup has not been restored successfully somewhere and periodically, nobody can guarantee that the backup is good enough to restore the database– we are not prepared for a database disaster.

So “After corruption comes the DBCC CHECKDB” is similar to “After death comes the Doctor”.

Learn from Guru: 
SQL Server DBCC CHECKDB
http://www.sqlskills.com/blogs/paul/category/checkdb-from-every-angle/

Description of using disk drive caches with SQL Server that every database administrator should know

2 comments:

  1. Sarjen

    A very good to the point explanation
    thanks

    Khalid

    ReplyDelete
  2. Thanks for sharing this informative article with us, I have found another nicely explained article on SQL Server Error 824. Must read from here: http://sqltechtips.blogspot.com/2015/11/troubleshooting-error-824.html

    ReplyDelete