The current world record for the number of people to fit into a Mini is 28. For SQL Server, the maximum number of databases you can have on a single instance is 32,767. However, like the people in the Mini, just because you can cram that many databases onto one server doesn’t mean you should: it’s not only uncomfortable for everyone involved, it also makes using your server (or Mini) extremely difficult.
At Coeo we manage hundreds of SQL Server instances and come across nearly every architecture imaginable. In some rare cases, we manage instances that have over 10,000 databases, which presents unique challenges in how we manage the instance.
Let’s therefore take a look at some of the reasons why you should not treat SQL Server’s limits as a challenge to beat in your production environment.
The Problems
1. Patching, restarts and failovers
One of the headaches we frequently deal with is how long it takes instances like this to get through planned patching. We saw this while carrying out planned maintenance for one of our customers: the work simply involved failing a SQL Server instance over from the active node to the passive node so that the former active node could be patched. The passive node had already had its SQL Server patched earlier in the day, while the active node was still on the older patch level.
This SQL Server had 10,000 databases, and all we needed to do was click the buttons to fail over the instance and then wait for it to come online on the other node. In the end, this failover took nearly four hours to complete because upgrade scripts had to be applied to each of the 10,000 databases.
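If you want to keep an eye on how far those upgrade scripts have got during a failover like this, one option is to search the error log, which SQL Server writes to as it upgrades each database. A minimal sketch, assuming the default error log and that the message text contains the word "upgrade" (the exact wording varies between versions):

```sql
-- Read the current SQL Server error log (log 0, type 1 = SQL Server log) and
-- filter for the per-database upgrade script messages written during the failover.
-- The search strings are illustrative; the exact message text varies by version.
EXEC xp_readerrorlog 0, 1, N'upgrad', N'database';

-- A quick count of databases on the instance, to gauge how long the failover might take.
SELECT COUNT(*) AS database_count
FROM sys.databases;
```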
2. High Availability and Disaster Recovery (HA-DR)
There are practical limits to the number of databases you can have in an Availability Group or can mirror, and that limit depends very heavily on the number of worker threads available on the server. We typically start to see problems arise with AGs containing over a hundred databases, let alone thousands, and it is widely accepted that mirroring or Availability Groups are simply not a viable solution in an environment with thousands of databases.
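You can get a feel for how close an instance already is to its worker thread ceiling with a couple of DMV queries; a minimal sketch:

```sql
-- The maximum number of worker threads the instance can create
-- (driven by the CPU count and the 'max worker threads' setting).
SELECT max_workers_count
FROM sys.dm_os_sys_info;

-- How many workers currently exist across the visible schedulers.
SELECT SUM(current_workers_count) AS current_workers
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- Cumulative THREADPOOL waits since the last restart; a rising count here
-- usually means tasks are queuing because no worker thread is free.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';
```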
Because of this, the only real-world solution we have is to use log shipping to provide a working DR solution and a failover cluster instance (FCI) to provide HA.
While log shipping works, configuring it for this kind of environment is a challenge and the main issues are summarised below:
- The job history and backup history tables in msdb grow rapidly because of the sheer number of job executions and backup entries (see the purge sketch after this list).
- For example, with 10,000 log backups running every 15 minutes, that’s 96 executions per database per day, or almost a million rows added to the job history table every day.
- The CPU on the server regularly spikes because, by default, the log shipping jobs are all set to the same 15-minute schedule.
- THREADPOOL waits appear at the points when too many of the log shipping jobs are running at once.
- Setting up log shipping from scratch can take a long time, as can reinitialising it.
- Taking a log backup of every database every 15 minutes can become very difficult and is not always possible.
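To keep those msdb history tables in check we rely on regular purging rather than letting them grow unchecked. A minimal sketch using the standard msdb procedures; the 14-day retention is illustrative and should be set to whatever your auditing requirements allow:

```sql
USE msdb;

-- Illustrative 14-day retention window.
DECLARE @cutoff datetime = DATEADD(DAY, -14, GETDATE());

-- Remove backup history (backupset, backupmediafamily, etc.) older than the cutoff.
EXEC dbo.sp_delete_backuphistory @oldest_date = @cutoff;

-- Remove SQL Server Agent job history older than the cutoff.
EXEC dbo.sp_purge_jobhistory @oldest_date = @cutoff;
```

Both procedures can be slow on a badly bloated msdb, so the first run is best done out of hours.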
Once log shipping has been configured and is up and running, what can you expect in terms of RPO and RTO? With the large-scale deployments we have seen, and in the case of one with just over 10,000 databases, the log backup jobs take around 20 minutes to complete, the copy jobs around 15 minutes, and the restores can take in excess of 20 minutes. This obviously limits the best RPO you can hope to achieve, and in this case the RTO will realistically be the 20 or so minutes it takes to restore all of the logs, plus the time it takes to recover all of the databases.
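To see what RPO you are actually achieving, the log shipping monitor table on the secondary is a good place to start; a minimal sketch, run on the secondary:

```sql
-- How far behind is each log shipped database on this secondary?
-- The gap between last_restored_date and now is a rough measure of the RPO/RTO
-- you are actually achieving.
SELECT  secondary_database,
        last_copied_date,
        last_restored_date,
        last_restored_latency,  -- minutes between the log backup and its restore
        DATEDIFF(MINUTE, last_restored_date, GETDATE()) AS minutes_since_last_restore
FROM msdb.dbo.log_shipping_monitor_secondary
ORDER BY last_restored_date;
```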
Last year, we had a case where the log shipping secondary server had been shut down and the copy and restore jobs did not run for over 10 hours on a SQL Server instance with 1,200 databases. When the server came back online, the copy and restore jobs began running again, but because there were so many backups to process, they all took much longer than normal and all tried to run at once. Eventually the server ran out of worker threads because it could not handle every job trying to run simultaneously.
Trying to disable the copy and restore jobs and then re-enable them in batches was unsuccessful: we would regularly hit the worker thread limit once around 1,800 of the 2,400 jobs had been re-enabled and allowed to run on their normal 15-minute schedule.
This was eventually resolved with a batching system that cleared not only the backlog but also the CPU spikes and threadpool waits we had been seeing. In fact, running every single log shipping job every 15 minutes was simply not possible for this server, and we compromised on an hourly schedule instead.
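The batching approach itself is straightforward to script against the msdb job tables. The sketch below re-enables the copy and restore jobs one at a time and staggers their schedule start times so they no longer all fire on the same minute; the LSCopy/LSRestore name pattern, the 15-minute spread and the delay are all assumptions to adapt to your own environment:

```sql
USE msdb;

DECLARE @job_id      uniqueidentifier,
        @schedule_id int,
        @offset      int = 0,
        @start_time  int;

-- Walk the disabled log shipping copy/restore jobs (the name pattern assumes
-- the default LSCopy_/LSRestore_ naming) together with their schedules.
DECLARE job_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT j.job_id, js.schedule_id
    FROM dbo.sysjobs AS j
    JOIN dbo.sysjobschedules AS js ON js.job_id = j.job_id
    WHERE j.enabled = 0
      AND (j.name LIKE 'LSCopy%' OR j.name LIKE 'LSRestore%');

OPEN job_cursor;
FETCH NEXT FROM job_cursor INTO @job_id, @schedule_id;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Spread start times across the quarter hour: 00:00:00, 00:01:00, ... 00:14:00.
    -- @active_start_time is an int in HHMMSS form, so 100 = 00:01:00.
    SET @start_time = (@offset % 15) * 100;

    EXEC dbo.sp_update_schedule
         @schedule_id       = @schedule_id,
         @active_start_time = @start_time;

    EXEC dbo.sp_update_job @job_id = @job_id, @enabled = 1;

    SET @offset += 1;

    WAITFOR DELAY '00:00:05';  -- small gap so the re-enabled jobs don't all start together

    FETCH NEXT FROM job_cursor INTO @job_id, @schedule_id;
END

CLOSE job_cursor;
DEALLOCATE job_cursor;
```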
3. Monitoring
We monitor our customers' environments primarily with System Center Operations Manager (SCOM), covering both the OS and the SQL Server instances that run on each server. During rollout, SCOM discovers SQL Server at the computer and instance level before discovering all of the databases.
In an environment with over 2,000 databases on a single SQL Server instance, the SCOM discovery process would time out and never complete successfully. Additionally, monitoring that many databases (and all of their data and log files) would place too great a load on the monitoring system even if discovery did complete. Consequently, we’re unable to monitor individual databases on instances with too many of them. What counts as too many depends on a number of factors and has to be tested against your specific environment. Microsoft’s guidance recommends monitoring no more than 50 databases per agent:
“We recommend that you monitor no more than 50 databases and 150 database files per agent to avoid spikes in CPU usage that may affect the performance of monitored computers”
However, with some tweaks we have managed to monitor up to 1000 databases.
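A quick way to see how an instance compares to that guidance is simply to count what the agent would have to discover; a minimal sketch:

```sql
-- How many databases and data/log files would the monitoring agent have to track?
SELECT COUNT(*) AS database_count
FROM sys.databases;

SELECT COUNT(*) AS database_file_count
FROM sys.master_files;
```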
In addition to this, we also use SentryOne as a monitoring tool to complement SCOM, mainly to review and monitor both the performance of SQL Server and the queries that are run against it. In order to monitor query performance, SentryOne pulls data out of the sysjobhistory table in the msdb database. However, with thousands of databases that each have a log backup, copy and restore job, we commonly see increased CPU on the server, blocking, and failures in the processes which try to clear down this table because of the amount of contention.
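One mitigation that helps here is capping how much Agent history is kept in the first place, so the clear-down processes have less to fight over. The sketch below uses sp_set_sqlagent_properties, which is what the SSMS Agent properties dialog runs under the covers but is only lightly documented, so treat the parameter names as something to verify on your version; the row limits are illustrative:

```sql
USE msdb;

-- Cap the Agent job history kept in sysjobhistory so the table stays small.
-- This is a lightly documented procedure; the values below are illustrative only.
EXEC dbo.sp_set_sqlagent_properties
     @jobhistory_max_rows         = 200000,  -- total rows kept across all jobs
     @jobhistory_max_rows_per_job = 5;       -- rows kept per job
```

With tens of thousands of log shipping jobs even a low per-job limit still adds up, so this is a complement to, not a replacement for, the scheduled purge shown earlier.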
Summary
There are multiple challenges in managing a SQL Server instance with thousands of databases, and only limited options available for managing such an environment.