First Fix and Second Fix

When DBAs are troubleshooting issues, solutions are sometimes applied in the heat of a critical incident which solve the problem, which are then walked away from because the issue has been “fixed”. With experience and a better understanding of the broader issue of long-term solutions, you realise that for a lot of the work there are two parts to a solution. To borrow a couple of terms from the building trade we can call these the First Fix and the Second Fix. Understanding (and implementing) both parts of this are key to evolving into a more Senior DBA mindset.

There are a couple of classic examples of this to illustrate the point, and one more recent addition to SQL Server which will likely start to emerge as a more common problem in the near future.

Recompiles

When you have a high-CPU scenario where a particular stored procedure is dominating, there is a tendency to force recompile it (sp_recompile) in case it was running with a “bad” plan. Sometimes this can work if the execution plan had regressed, and the emergency is then suddenly resolved and you can relax.

But, if this has happened once, it can happen again. Ideally the plan in place when the performance was poor was captured, and also the new better plan was also saved. There then needs to be investigation done on the query as it is obviously capable of generating multiple plans, including ones that can significantly degrade the overall SQL Server performance. This may be parameter sniffing or poor estimates, but there needs to be an investigation and some form of solution to help prevent this from happening again.

Just recompiling (First Fix) and walking away is not good enough, you can get to the point where you can keep on re-compiling but you cannot get back to a good plan as you are relying on luck. Better to address the long-term solution early by getting to an understanding of why the plans vary, and helping the optimiser achieve more stable performance (Second Fix).

Maintenance over-run

Most regular maintenance runs in a predictable time-frame, such as backups, DBCC CHECKDB, and statistics updates. But if you are running percentage-based index fragmentation maintenance (such as the Ola scripts), then this can vary wildly as certain large indexes creep over the threshold from one day to the next. It is for this reason that I prefer to put index maintenance as the last step in the maintenance chain, as it can on occasion run longer than expected, which then needs manually stopping if it is affecting in-day system performance. In this scenario you may be responding to a performance alert, and once you notice the maintenance job was running long and then stopped it, the performance issue goes away (First Fix).

But, as with as the recompile issue, there is a high probability that this will just happen again and cause the same performance issues when fragmentation gets back to the same levels. The Second Fix in this case is to implement a kill/stop job, whose purpose is to force the maintenance job to terminate at a given time if it’s still running to prevent the overrun. If you don’t add this in, then the problem will come back.

Query Store

This new feature in SQL Server 2016 allows us to quickly and easily identify execution plan regressions, and then force the better plan (First Fix). This is a huge step forward compared to the lucky-dip method of recompiles, or the cryptic method of using Plan Guides. But again, pinning a plan is a fast way of forcing behaviour and relieving the pain of a performance issue, but there really needs to be some assessment afterwards to see if there is a better long-term solution. Forced Plans are not guaranteed anyway, as indexes changes can invalidate them, so a Second Fix of evaluating why multiple plans are generated for that query can lead to a better resolution.

Summary

When fixing issues, we often need a fast solution to alleviate the pain being caused by the situation, and it feels great to get the problem fixed quickly. But the professional DBA will then schedule time to look at this in a calmer light later on, to see what can be done to ensure this will never happen again. Problems should ideally be fixed once, but fixed well.

First Fix and Second Fix

The Coeo Blog

Subscribe to Email Updates

Related posts

Troubleshooting Ola Hallengren’s Maintenance Solution

Domain-Independent Windows Failover Cluster for SQL Server AlwaysOn Availability Group

SQLBits 2022 session - Field Testing Ola Hallengren’s Maintenance Solution

Windows Active Directory Detached Cluster with SQL Server Basic Availability Group