
Stay C.A.L.M. - Analyzing Space Alerts

Ben Goode

Hello again. We've talked about a process to help you deal with alerts - so now let's start looking at some specifics, shall we? I'm going to lay these out in no particular order - this one is inspired by a colleague of mine, who asked me a great question: "Why do we care about low disk space?" So, without further ado - why do we? We've got a low disk space alert, let's walk through it.

Low Space Alerts come in all shapes and sizes, so first of all - is the file causing the alert a Data or Log file? If you're not sure, run something like TreeSize on the disk in question, give it a few minutes and run again, comparing the size of the biggest objects. Right, found it? Read on.
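
If you'd rather check from inside SQL Server than eyeball the disk with TreeSize, sys.master_files will tell you which files live on the offending drive and whether each is a data or log file. A minimal sketch - the 'F:\' drive letter is just a placeholder for whichever volume is alerting:

-- Which SQL Server files live on the alerting drive, and are they data or log files?
-- Swap 'F:\' for the volume that raised the alert.
SELECT  DB_NAME(mf.database_id) AS database_name,
        mf.name                 AS logical_name,
        mf.physical_name,
        mf.type_desc,           -- ROWS = data file, LOG = log file
        mf.size * 8 / 1024      AS size_mb
FROM    sys.master_files AS mf
WHERE   mf.physical_name LIKE 'F:\%'
ORDER BY size_mb DESC;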

Data

If you don't have the space to write your data into, you... can't write data - which is often seen as a problem in a database. These cases usually reflect a genuine need for more space, but there are some exceptions - OLTP/Staging databases, where you might hold transient data and can reclaim some of it between runs if you're hurting for space, for instance. Treat those data file alerts like they're log file alerts. If you're NOT looking at a situation where you can easily reclaim space in the data file, you've got two options: you either need to use less space or have more space - there's no magic third option.

Use Less Space

Archive old data. Clear down temp tables. Review and drop unused indexes. Do you really need the access logs from 10 years ago? Do you need them on 'hot' storage or can you partition that table off and stick anything over a year old on some other, cheaper storage? Can we apply compression? If this is an SSAS database - do you need all the columns in your model, or can you crop out the ones you don't use to improve compression? Have you got an nvarchar(max) for your middle initial field?
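
If you want somewhere to start on the 'use less space' front, here are a couple of rough sketches - the table name below is hypothetical, and bear in mind sys.dm_db_index_usage_stats resets on a restart, so check instance uptime before trusting it:

-- Run in the database in question: indexes that are maintained but never read.
-- (Indexes that have never been touched at all won't appear in this DMV.)
SELECT  OBJECT_NAME(s.object_id)                     AS table_name,
        i.name                                       AS index_name,
        s.user_seeks + s.user_scans + s.user_lookups AS reads,
        s.user_updates                               AS writes
FROM    sys.dm_db_index_usage_stats AS s
JOIN    sys.indexes AS i
        ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE   s.database_id = DB_ID()
  AND   s.user_seeks + s.user_scans + s.user_lookups = 0
ORDER BY s.user_updates DESC;

-- Rough idea of what PAGE compression might buy you on a given table (name is a placeholder).
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'AccessLog',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';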

Gotchas here - structural changes are blocking operations, so they're one to put in a maintenance window, and for Sagan's sake, make sure that change isn't going to interfere with your index maintenance work, because those two will collide. Honestly, in these scenarios I'd rather have a meatbag on hand than a scheduled task, because if it takes too long, errors out, or otherwise goes sideways, I want someone there to make a judgement call based on what else is in the mix. YMMV.

Have More Space

Extend the volume. Allocate another disk to the pool. Whatever. Really, there's nothing interesting for me to write here. 'Get more disk' is the entirety of the advice.

Log

Log space is interesting, so first a brief bit on how it works - the Transaction Log is a chunk of space, broken up into smaller sections called 'Virtual Log Files' (VLFs). As a transaction begins, a Begin Transaction Log Record is written to the end of the log. As the transaction ends, an End Transaction Log Record is written to the end of the log. Once all the 'Begin Tran' records within a VLF have had a matching 'End Tran' record written, the VLF is marked as ready for re-use, allowing the log to overwrite it once the active point wraps back around (so when the log hits the end of its allocated space, it goes back to the beginning - like Word Wrap).
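
If you want to see those VLFs for yourself, sys.dm_db_log_info (available from SQL Server 2016 SP2 onwards, as far as I recall) lists one row per VLF, including whether it's still active. A quick sketch for the current database:

-- How many VLFs does this database's log have, and how many are still active?
SELECT  COUNT(*)                                        AS vlf_count,
        SUM(CASE WHEN vlf_active = 1 THEN 1 ELSE 0 END) AS active_vlfs,
        CAST(SUM(vlf_size_mb) AS decimal(18,2))         AS total_log_mb
FROM    sys.dm_db_log_info(DB_ID());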

But what if there is no 'End Tran' record? Well, then the VLFs remain active and can't be overwritten. So, the log has two options. One, disallow any new 'Begin Tran' records - i.e. refuse any new incoming transactions. If that sounds OK to you, a) you can stop worrying about space alerts, and b) are you sure DBA is the job for you? Because that is also known as 'server down' in user parlance - SQL has just started declining new transactions - and everything you want to do is a transaction... So, option two is preferred (and is the one SQL will take if at all possible) - extend the log. The Transaction Log grows (according to your autogrowth settings - which you haven't left as default, right? Those defaults are embarrassingly bad, and will hurt performance), expanding itself and creating a bunch more VLFs to use (Paul Randal and Kimberly Tripp over at SQLSkills have fantastic posts on exactly how many and how big, so I'm not going to repeat them - not important for now). This expansion is another blocking operation - the file HAS to be 'Zero-Initialised' - literally just filled with zeroes - before SQL can create the VLFs and take over the space (imagine if the space allocated came from the Log file of a different instance, and the chaos that would cause as SQL tried to redo operations on databases that don't even exist).
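
Two quick checks fall out of that: what's currently stopping each log from clearing, and what the autogrowth is actually set to. Something along these lines:

-- What's holding each log open? ACTIVE_TRANSACTION is the long-runner case described above.
SELECT  name, log_reuse_wait_desc
FROM    sys.databases;

-- Log file autogrowth settings - percentage growth on a large log is the classic bad default.
SELECT  DB_NAME(database_id) AS database_name,
        name                 AS logical_name,
        CASE WHEN is_percent_growth = 1
             THEN CAST(growth AS varchar(10)) + '%'
             ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
        END                  AS autogrowth
FROM    sys.master_files
WHERE   type_desc = 'LOG';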

So, if there's space, the Log has grown out and the long-running transactions have finally finished one way or another, so SQL quietly cleans up, shrinking the Log back down... wait, no. SQL has no way of knowing whether or not that huge transaction is the new normal. It's entirely possible that you don't either. So, the Log remains grown, taking up more space.

So, that's how the Log grows and behaves. The question you have to answer when looking at low disk space alerts on your Log disk is simply whether that growth is valid or whether it's a one-off event - runaway transaction, massive OLTP load, etc. This is generally a question of trends - look back at the Log Usage as far back as you've got it (and if you're not capturing it, maybe now's the time to start? Even persisting it to an SQL table once an hour/day/week or so can give you a good working baseline...). If there's been a recent spike, it's probably a long-runner - deal with that first (let it run, unblock it, kill it and have a quiet word with a BI Developer - whatever), because it's still causing the log to grow. Once the transaction is done and usage has dropped back down, you can usually shrink the log back down to where it started.
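
If you're not capturing that trend yet, something as simple as a table plus an hourly Agent job will do - the table name here is mine, not any kind of standard:

-- One-off setup: somewhere to keep the history.
CREATE TABLE dbo.LogUsageHistory
(
    capture_time  datetime2(0)  NOT NULL DEFAULT SYSDATETIME(),
    database_name sysname       NOT NULL,
    log_size_mb   decimal(18,2) NOT NULL,
    log_used_pct  decimal(5,2)  NOT NULL,
    status        int           NOT NULL
);

-- Scheduled hourly: DBCC SQLPERF(LOGSPACE) reports size and % used for every log on the instance.
INSERT INTO dbo.LogUsageHistory (database_name, log_size_mb, log_used_pct, status)
EXEC ('DBCC SQLPERF(LOGSPACE)');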

OK, so you've eliminated long-runners, you've eliminated genuine growth - you have a file that is oversized, probably a log file, but potentially a data file.

If it's a data file:

Wait for your next maintenance window and shrink it back down to a reasonable size - I generally take the 'allocated' size (i.e. the actually-full part of the file, rather than the reserved space), divide it by the autogrowth increment, round up and add one, then multiply that number back out by the autogrowth increment. That leaves you in a place where you're not going to immediately trigger another growth event and have some breathing room, but don't have that huge empty overhead (so, on a 1024 MB autogrowth, you'd reclaim down to 1024/2048/3072/etc because otherwise it looks deeply odd and bugs me. No better technical reason I'm familiar with).
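
To put numbers on that: with 9,300 MB actually allocated and a 1,024 MB autogrowth increment, 9300 / 1024 rounds up to 10, plus one is 11, times 1,024 gives a target of 11,264 MB. The same sum as a sketch (the figures and logical file name are hypothetical):

-- Round used space up to the next autogrowth multiple, then add one more increment of headroom.
SELECT CAST((CEILING(1.0 * 9300 / 1024) + 1) * 1024 AS int) AS target_mb;   -- = 11264

-- In the maintenance window, run in the context of the database that owns the file:
DBCC SHRINKFILE (N'MyDatabase_Data', 11264);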

If it's a Log file:

You have some options. Check that there's space to reclaim (again, observing the autogrowth values). Now, look back at that historical data. If you have trends over a whole business cycle, you're aiming to resize so that the highest point of log usage is 75% of the log (give yourself some room to be wrong at this stage) - shrink (by truncating, if in-hours and possible, rather than moving data around, to avoid blocking user work) down to that size. If you don't have a full business cycle, but you've got a decent portion of one, you're aiming to shrink down so that double the highest point of log usage is around 75% (since you have less data, there's much more room to be wrong here, and this is an iterative process, so slow and steady wins the race). If you're seeing zero usage, cut the log by half. If you've got less than, say, a week, and you aren't at risk of immediately blowing out the log drive, set up some sort of trend capture so you can make an informed decision on this.
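
As a worked example of the full-business-cycle case: if the highest log usage you've seen is 30,720 MB, sizing so that's 75% of the file gives 30720 / 0.75 = 40,960 MB, and the shrink itself is just DBCC SHRINKFILE against the log's logical name (a placeholder below):

-- Peak usage of 30,720 MB over a full business cycle; aim for that to be ~75% of the log.
SELECT CAST(30720 / 0.75 AS int) AS target_mb;   -- = 40960

-- In a quiet period, run in the context of the database that owns the log:
DBCC SHRINKFILE (N'MyDatabase_Log', 40960);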

Now, it's a week later. The magic of time-travel by blog post! Seriously though - capture the trends for another week, and compare to where you were last week - have any of your shrinks been undone? Are you still seeing unused space in that big log file? Keep tweaking downwards - and every week, you're getting more efficient space usage. Don't shrink down to the point where max usage is 100%, though - you always want some breathing room (that less-than-two-autogrowths increment we discussed before).

I'd hope it goes without saying that if you're in pain - if logs are filling up and transactions are starting to fail - truncate the log with the MOST free space by 50% of its free space first, and see if that gives you time to take a more measured approach. If not, rinse and repeat until either you have that time or you've had to extend the volume.
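
In that firefighting scenario, DBCC SQLPERF(LOGSPACE) is the quickest way to see which log has the most free space to give back; the target is then the current size minus half the free space. A rough sketch - the file name and figures are placeholders:

-- Size and % used for every log on the instance: look for big files with a low % used.
DBCC SQLPERF(LOGSPACE);

-- Example: a 100,000 MB log that's only 20% used has ~80,000 MB free;
-- giving back half of that free space means shrinking it to ~60,000 MB.
DBCC SHRINKFILE (N'BigDatabase_Log', 60000);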

I hope this has been interesting, useful or at least remotely entertaining! As always, any questions, objections, or, well... comments? There's a button down there for that.
