Working with 30000 data files in one database – why do my log backups take so long?

Working in the managed support team at Coeo, we often get interesting and unique problems to solve, and one particular issue had landed on our queue that was just that.

The Issue

Log backups for a customer’s production database had started taking 15-20 minutes to complete, even though the size of the log backups were no different to the log backups on the previous day. Previously, log backups would take approximately 2 minutes to complete. The problem with an increased duration for log backups was that, with a frequency of 15 minutes for each log backup, we were seeing a number of missed runs of the backups and therefore the recovery point objective was reduced.

And then of course, there’s the natural curiosity that comes with seeing a problem that you’ve never come across before.

The issue was occurring on only one production database on a SQL Server 2012 Enterprise instance (with SP1) that was on a 2-node cluster. The affected database was approximately 700 GB in size and the log was 100 GB in size. Typical compressed log backup sizes (outside of maintenance windows) was anywhere between 40 MB to 150 MB.

Interestingly, the duration of a log backup (around 15-20 minutes) seemed to be unaffected by the size of the corresponding log backup. In other words, a log backup of 40 MB was just as likely to take the same amount of time as a log backup of 150 MB.

Also, there were other databases on the same SQL Server instance that produced similarly sized log backups but were unaffected by this issue.

The cause

This is the point where I would normally take you through a journey of investigation and root cause analysis, before performing the final reveal and explain what the cause is.

However, in this instance, I’m going to jump straight to the end of the story and tell you what caused the issue. I’ll then show you how to make a log backup take 20 minutes on an empty database with just a 500 MB log file.

And to be fair, the cause is kind of given away in the title of this post.

What we found was that the recorded duration of the log backup (in the msdb.dbo.backupset table) was showing that the backup completed in around 2 minutes. However, for 17 minutes subsequent to that, something else was happening.

It turns out, that for those 17 minutes, SQL Server was writing to the msdb.dbo.backupfile table.

Here’s the description of this table from BOL:

Contains one row for each data or log file of a database. The columns describes the file configuration at the time the backup was taken. Whether or not the file is included in the backup is determined by the is_present column. This table is stored in the msdb database.

What perhaps is not necessarily obvious from that description is that the table can contain one row for each data file (even if the corresponding data file no longer exists in the database). Additionally, BOL describes 8 possible states in the state column (states 0-7). However there is an undocumented 9^th state. State = 8 has a corresponding state_desc value of “DROPPED”.

In other words, if you create a data file and then drop it, subsequent log backups will record a row in the backupfile table, even though that file no longer exists.

In this particular case, what we found was that, every log backup was recording 32,767 rows in the backupfile table, and whilst the log backup was only taking 2 minutes to complete, it was the process of writing out data into the backupfile table that was taking so long.

And yes, you guessed it – this database did have 32,767 data files. The reason for this was that a partitioning was in play with some partitions being created at a daily level. As we’ll see however, the database doesn’t need to have that many data files – it just has to have had that many data files.

We did raise a case with Microsoft and they were able to reveal that this is indeed the case and they were able to give some additional information.

There is an undocumented system table called sys.sysprufiles, which is accessible through an admin connection. This table contains one row for every data file that has been created in the database since the last full or log backup.

As part of the log backup process, SQL Server, subsequent to the log backup, runs through the data in this table and generates the data needed to insert into the backupfile table in msdb. It is this process that was taking so long.

Note that this table never contains more than 32,767 rows. So if you’ve created more data files than 32,767 in the lifetime of the database, rows at the beginning of the table are replaced with newer rows, which represent newer files.

Reproducing the Problem

Create a database called ManyFiles, containing 1 data file and 1 log file – make sure the log file is around 500 MB in size. The reason for this is that we’re going to create and delete many files, which will generate some transactional activity and we want to make sure that the log doesn’t auto-grow, which would create additional Virtual Log Files.

USE [master]
GO
CREATE DATABASE [ManyFiles]
CONTAINMENT = NONE
ON  PRIMARY
( NAME = N'ManyFiles', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\ManyFiles.mdf' , SIZE = 8192KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
LOG ON
( NAME = N'ManyFiles_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\ManyFiles_log.ldf' , SIZE = 512000KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO
ALTER DATABASE [ManyFiles] SET RECOVERY FULL
GO

Next, perform a full backup of the database:

backup database ManyFiles to disk = 'D:\Backup\ManyFiles.bak' with init

After creating the procedure “generate_many_files” in the database (see scripts section), execute the procedure to create (and delete) just under 4000 data files:

exec generate_many_files1, 4000

This will take about 10 minutes or so, depending on the environment you’re using.

Next, perform a log backup of the database:

backup log ManyFiles to disk = 'D:\Backup\ManyFiles.trn' with init, stats = 1

Notice how the percentage progress moves along nicely and the 100% mark is reached within 10 seconds or less.

97 percent processed.

98 percent processed.

99 percent processed.

100 percent processed.

Processed 42018 pages for database 'ManyFiles', file 'ManyFiles_log' on file 1.

However, the log backup seemingly continues to run for another 30 seconds or so before finally displaying the message:

BACKUP LOG successfully processed 42018 pages in 8.636 seconds (38.010 MB/sec)

Note how the backup completed in about 8 seconds.

Next, run the following query to see what was recorded in the backupset table:

select backup_start_date, backup_finish_date

from msdb.dbo.backupset

where type = 'L'

and database_name = 'ManyFiles '

order by backup_finish_date desc

Note how the backup_start_date and backup_finish_date corroborate the fact that, as far as history is concerned, that log backup only took 8 seconds or so to complete.

A clue to the cause of the issue lies in the following query:

SELECT *
FROM [msdb].[dbo].[backupfile] bf
inner join msdb.[dbo].[backupset] bs on bf.backup_set_id = bs.backup_set_id
where database_name = 'ManyFiles'
and type = 'L'

The above query should return 4001 rows. A quick glance at the data in the table reveals that it contains a row for all of the files that we earlier created and deleted. Note how each of these rows contain a status_desc value of “DROPPED”.

If you were to now run a subsequent log backup, the above query would return 8002 rows of data.

We can also view the data contained in sys.sysprufiles by opening an admin connection to the SQL instance and running the following query:

use ManyFiles

select * from sys.sysprufiles

Here you’ll be able to see that the table contains 4001 rows, one for every file that we created in the database.

Note that, if you were to create another 100 files in the database, there would still only be 4001 rows in the sys.sysprufiles. It appears that subsequent to any log backup (or full backup) data in sys.sysprufiles is cycled – meaning that the oldest files with a status of 2 (i.e. dropped) are removed from the top of the table to allow for new rows to be inserted.

Also note, that in our reproduction we have only created 4000 data files, which results in 35-40 second log backup. However, the log backup duration (or rather, the time taken to write to the msdb history tables) increases exponentially with the number of rows in sys.sysprufiles.

Using the exact same database in the reproduction, create 32,000 data files instead of 4000 and in my test environment the time to complete was over 20 minutes.

Fixing the Issue

It appears that once the data is in sys.sysprufiles, there is no way to manually clear it down. I haven’t explored the possibility of manually deleting the data from sys.sysprufiles but I would recommend doing so on a production environment.

In the end, under Microsoft’s advice, the only way to reliably reduce the log backup time (and in effect workaround the issue) was to enable trace flag 3001 (http://social.technet.microsoft.com/wiki/contents/articles/13105.trace-flags-in-sql-server.aspx), which stops the process that writes to the msdb backup tables.

There is of course a trade-off in enabling this trace flag – namely the fact that we no longer record in the backup history tables when a backup was last taken. However, given the alternative of having 20 minute log backups, which ultimately reduce the recovery point objective, it was a worthwhile workaround.

Avoiding the Issue

In the first instance, let’s be clear, this is a pretty edge-case scenario that you’re not likely to see every day. To start with, you need to have created many data files to even begin to notice the problem.

So the first thing to be aware of is that once you’re in this situation (where you have many rows in sys.sysprufiles) you cannot, seemingly go back. So if you’re need to created many data files for your databases, there are a couple of steps you can take to prevent the issue from happening.

Put the database into simple recovery mode’. In simple recovery mode, whilst data is generated in sys.sysprufiles, after each checkpoint, it is subsequently removed. I would assume that this is because SQL Server doesn’t need to keep track of LSN records relating to each data file. However, if your database is in simple recovery mode then you don’t care about log backups and you’d never see the issue anyway.

If you do need full recovery model, a better option does exist. After each data file is created and/or deleted, backup your transaction log. Try it:

Create the ManyFiles database as described above. Take a full backup and then create just one additional data file. Examine the sys.sysprufiles table and you’ll see three rows – one of the log file, the primary data file and the additional file you created.
Delete the data file you created above. Note how that row still exists in sys.sysprufiles.
Now take a log backup and create a 2^nd data file. The row that corresponded to the first data file you created, will have been replaced by this new data file.

So, to summarise, if you’re in the habit of creating and deleting many data files (thousands) then it pays to take a log backup immediately after the deletion of the data file.

If however, you’re maintaining a database with thousands of data files that you’re not deleting then, if you’re experiencing seemingly long log backups you might want to explore the option of trace flag 3001.

Scripts

USE [ManyFiles]
GO
/****** Object:  StoredProcedure [dbo].[generate_many_files]    Script Date: 04/07/2014 23:54:38 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create procedure [dbo].[generate_many_files]
(
@count int = 1,
@end int = 100
)
as
set nocount on
declare @debug bit
declare @file_name nvarchar(128)
declare @file_path nvarchar(1024)
declare @sqlcmd nvarchar(max);
set @debug = 0
while @count < @end
begin
set @file_name = 'file' + convert(nvarchar(11), @count);
set @file_path = 'C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\' + @file_name + '.ndf'
set @sqlcmd = N'ALTER DATABASE [ManyFiles] ADD FILE ( NAME = ''' + @file_name + ''', FILENAME = ''' + @file_path + ''' , SIZE = 1024KB , FILEGROWTH = 1024KB ) TO FILEGROUP [PRIMARY]';
print ''
print 'creating file ' + @file_name + '...'
print ''
if @debug = 1
print @sqlcmd
else
exec sp_executesql@sqlcmd
set @sqlcmd = N'ALTER DATABASE [ManyFiles] REMOVE FILE [' + @file_name + ']';
print ''
print 'deleting file ' + @file_name + '...'
print ''
if @debug = 1
print @sqlcmd
else
exec sp_executesql@sqlcmd
set @count = @count + 1
end

Working with 30000 data files in one database – why do my log backups take so long?

The Coeo Blog

Working with 30000 data files in one database – why do my log backups take so long?

The Issue

The cause

Reproducing the Problem

Fixing the Issue

Avoiding the Issue

Scripts

Subscribe to Email Updates

Related posts

Troubleshooting Ola Hallengren’s Maintenance Solution

Domain-Independent Windows Failover Cluster for SQL Server AlwaysOn Availability Group

SQLBits 2022 session - Field Testing Ola Hallengren’s Maintenance Solution

Windows Active Directory Detached Cluster with SQL Server Basic Availability Group