In my previous post I discussed an instance where SCOM was failing to monitor the space of a SQL Server database. In this latest post I want to bring to light an issue that was altogether more serious.
I particularly wanted to write this post because, although there are a number of possible reasons for SCOM not discovering SQL Server instances, this specific reason is not particularly well documented (if at all) and I’m hoping that it will help others avoid the situation or, at least, be able to fix it if they encounter it.
The Problem
The problem became apparent when we noticed some backup failures had occurred on a specific SQL Server instance but we had received no corresponding SCOM alerts. Browsing through the list of SQL Servers in discovered inventory, it quickly became obvious that none of the SQL Server instances on this particular cluster were present (5 instances in all – a mixture of SQL Server 2005 and SQL Server 2008 R2 instances).
Initially, attention was focussed on the SQL discovery process, for both the 2008 and 2005 discoveries. Following Kevin Holman’s post on troubleshooting discovery scripts, we manually executed DiscoverSQL2005DBEngineDiscovery.vbs and its 2008 equivalent on the cluster nodes that were hosting the SQL Server instances. In both cases, the scripts succeeded.
However, we did spot a warning (Event ID 10000) with the following description:
A scheduled discovery task was not started because the previous task for that discovery was still executing.
Discovery name: Microsoft.Windows.2008.Cluster.Monitoring.Discovery
Instance name: Cluster Service
Management group name: ########
This warning was appearing every 4 hours, which matches the interval at which the cluster discovery runs. It was also puzzling because, by all accounts, the cluster had been discovered: I could see the resources and their groups in discovered inventory, and we did receive alerts if resources failed or went offline.
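To confirm how often the warning recurs, the events can be pulled from the event log on the affected node. This is a minimal sketch, assuming the warning is written to the “Operations Manager” event log (where the SCOM agent normally logs; the log name is not stated above) and that it is run locally on the cluster node:

```powershell
# Assumption: the SCOM agent logs to the 'Operations Manager' event log on this node.
$events = Get-WinEvent -FilterHashtable @{ LogName = 'Operations Manager'; Id = 10000 } `
                       -MaxEvents 50 -ErrorAction SilentlyContinue

# Keep only the warnings that name the cluster discovery quoted in the event text,
# then list them oldest first so the gap between occurrences is easy to read.
$events |
    Where-Object { $_.Message -like '*Microsoft.Windows.2008.Cluster.Monitoring.Discovery*' } |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, Id, LevelDisplayName
```

If the timestamps are evenly spaced at the discovery interval, the previous discovery run is probably never finishing.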
Unfortunately, I’d missed something. Whilst the cluster service had been discovered, as had the resources, none of the associated computer objects for the cluster had been discovered. Even the cluster name itself was missing from the BaseManagedEntity table in the OperationsManager database.
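A quick way to check what SCOM actually holds for the cluster is to query BaseManagedEntity directly. The sketch below assumes the SqlServer PowerShell module is available; the server name `SCOMDB01` and the cluster name `MYCLUSTER` are hypothetical placeholders, and the query is read-only:

```powershell
# Assumption: the SqlServer module (which provides Invoke-Sqlcmd) is installed.
Import-Module SqlServer

# 'MYCLUSTER' is a hypothetical cluster name; replace it with the real one.
$query = @'
SELECT FullName, DisplayName, IsDeleted
FROM dbo.BaseManagedEntity
WHERE DisplayName LIKE '%MYCLUSTER%'
ORDER BY FullName;
'@

# 'SCOMDB01' is a hypothetical server hosting the OperationsManager database.
Invoke-Sqlcmd -ServerInstance 'SCOMDB01' -Database 'OperationsManager' -Query $query
```

An empty result for the cluster name, while the cluster resources appear elsewhere in the console, would match the situation described here.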
So clearly, whilst some parts of cluster discovery had completed, the process as a whole hadn’t.
The Cause
There is an article describing a situation in which not all cluster resources are discovered if the cluster has orphaned entries in the cluster registry hive, specifically under HKEY_LOCAL_MACHINE\Cluster\Resources.
Running Get-ClusterResource against the cluster returned some interesting results. Firstly, 3 resources were being returned that did not show up in Failover Cluster Manager. Secondly, these resources (2 SQL Server resources and one SQL Network Name) appeared to belong to the “Available Storage” group.
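The entries under that key can be listed without changing anything. This is a rough sketch, run on a cluster node, which assumes each resource subkey carries `Name` and `Type` values (an assumption about the hive layout, not something confirmed above):

```powershell
# Read-only listing of the resource entries in the cluster hive.
# Assumption: each subkey is a resource ID holding 'Name' and 'Type' values.
Get-ChildItem -Path 'HKLM:\Cluster\Resources' | ForEach-Object {
    $props = Get-ItemProperty -Path $_.PSPath
    [pscustomobject]@{
        ResourceId = $_.PSChildName   # the subkey name
        Name       = $props.Name
        Type       = $props.Type
    }
} | Sort-Object Name | Format-Table -AutoSize
```

Entries that refer to instances which no longer exist are candidates for being orphaned.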
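A check along these lines can surface such resources. It is a sketch only, assuming the FailoverClusters module is available on the node and that anything in “Available Storage” other than a physical disk is worth a second look (my heuristic, not a rule):

```powershell
Import-Module FailoverClusters

# List every resource with its owning group and type.
Get-ClusterResource |
    Sort-Object { $_.OwnerGroup.Name } |
    Format-Table Name, State, OwnerGroup, ResourceType -AutoSize

# Heuristic: "Available Storage" normally holds unassigned disks, so any
# non-disk resource in that group is a candidate for an orphaned entry.
Get-ClusterResource |
    Where-Object {
        $_.OwnerGroup.Name -eq 'Available Storage' -and
        $_.ResourceType.Name -ne 'Physical Disk'
    } |
    Format-Table Name, State, ResourceType -AutoSize
```

Anything returned by the second command should be compared against what Failover Cluster Manager shows before it is treated as orphaned.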
The 3 resources used to belong to old SQL Server instances that had previously been uninstalled. For whatever reason, the resources hadn’t been completely cleaned up from the cluster and SCOM was trying to discover them (unsuccessfully).
Also of note is the fact that the cluster discovery wasn’t actually failing – or if it was, there were no reported errors. It’s almost as if the discovery was just stuck in a loop of trying to discover some non-existent resources, getting nowhere, but determined to keep on trying. A consequence of this was that the cluster (as a whole) never finished being discovered and the SQL Server instances, which were part of the cluster, were never discovered. I assume that the same potential problem could befall other products that belong to a cluster.
The Fix
The fix, in the end, was simple. Running the PowerShell command Remove-ClusterResource against the “non-existent” resources worked first time, and within a short period the rest of the cluster, and the SQL Server instances, were successfully discovered.
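For illustration, the removal might look like the following. The resource names are hypothetical placeholders, and `-WhatIf` is included so that the command only reports what it would do:

```powershell
Import-Module FailoverClusters

# Hypothetical names; use the exact names returned by Get-ClusterResource.
$orphans = @(
    'SQL Server (OLDINST1)',
    'SQL Server (OLDINST2)',
    'SQL Network Name (OLDSQLVS)'
)

foreach ($name in $orphans) {
    # -WhatIf makes this a dry run; nothing is removed.
    Remove-ClusterResource -Name $name -WhatIf
}
```

Once the dry-run output lists only the intended resources, removing `-WhatIf` performs the deletion, with a confirmation prompt for each resource unless `-Force` is added.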
Although it’s likely to be quite a rare occurrence, hopefully this post might save someone time and effort should they encounter the same issue.
 
 