Configuring the lease timeout threshold for AGs deployed in Azure

Written by Amie Coleman | 18-Aug-2016 13:29:53

Recently we helped one of our customers with a Microsoft support case regarding a clustered AG deployed in Azure.
We regularly received reports of the clustered nodes failing to heartbeat. This terminated the cluster service and sometimes affected the synchronisation status of the AG. The loss of connectivity between nodes was caused by transient network issues linked to the fact that the servers were hosted in Azure.
We worked with Microsoft, who advised that under these circumstances we might want to relax the delay and threshold settings for the AlwaysOn environment.
The full recommendation included:

Adjust the following settings for both same-subnet and cross-subnet (cross-region) deployments of AlwaysOn availability groups. (Note: SameSubnetDelay and CrossSubnetDelay values are in milliseconds; the Threshold values are a count of missed heartbeats.)
PS C:\Windows\system32> (get-cluster).SameSubnetDelay = 1000
PS C:\Windows\system32> (get-cluster).SameSubnetThreshold = 20
PS C:\Windows\system32> (get-cluster).CrossSubnetDelay = 1000
PS C:\Windows\system32> (get-cluster).CrossSubnetThreshold = 20
Verify the changes:
PS C:\Windows\system32> get-cluster | fl *subnet*

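By default, SQL Server configures the availability group lease timeout at 20000 ms. The aim of the recommended adjustment is that the heartbeat tolerance given by the new Delay and Threshold settings (Delay multiplied by Threshold, quoted as 30000 ms in the recommendation) is greater than the default availability group lease timeout (20000 ms). Note that the example values listed above multiply out to 1000 ms x 20 = 20000 ms, which only equals the lease timeout, so check the product for the values you actually apply.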
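Since the comparison is between Delay x Threshold and the lease timeout, it can help to work the numbers out from the live cluster rather than by eye. The following is only a rough sketch: it assumes the FailoverClusters module is available, that it is run on one of the cluster nodes, and that the lease timeout is still at the 20000 ms default (the `$leaseTimeoutMs` variable is our own placeholder, not something read from the cluster).

```powershell
# Sketch: compare heartbeat tolerance (Delay x Threshold) with the AG lease timeout.
# Assumes this runs on a cluster node with the FailoverClusters module installed.
Import-Module FailoverClusters

$cluster = Get-Cluster

# Assumption: lease timeout is at the SQL Server default; change this if yours differs.
$leaseTimeoutMs = 20000

# Delay is the heartbeat interval in ms; Threshold is the number of missed heartbeats tolerated.
$sameSubnetMs  = $cluster.SameSubnetDelay  * $cluster.SameSubnetThreshold
$crossSubnetMs = $cluster.CrossSubnetDelay * $cluster.CrossSubnetThreshold

"Same-subnet tolerance : $sameSubnetMs ms"
"Cross-subnet tolerance: $crossSubnetMs ms"
"Lease timeout         : $leaseTimeoutMs ms"

if ($sameSubnetMs -le $leaseTimeoutMs -or $crossSubnetMs -le $leaseTimeoutMs) {
    "Warning: heartbeat tolerance is not greater than the lease timeout."
}
```

If the warning line appears, the Delay or Threshold values would need to go up before the tolerance exceeds the lease timeout.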
Confirm the lease timeout property
Launch Failover Cluster Manager. Click on Roles in the left pane.
In the Roles pane, click on the availability group resource.
In the Resource pane, click the Resources tab, right-click the availability group resource and choose Properties. Click the Properties tab to view the availability group properties, which include LeaseTimeout.
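The same property can also be read from PowerShell instead of Failover Cluster Manager. This is a minimal sketch: the resource name `MyAvailabilityGroup` is a hypothetical placeholder, so substitute the name of your own availability group resource (Get-ClusterResource with no arguments lists them).

```powershell
# Sketch: read the LeaseTimeout parameter of the availability group cluster resource.
# Assumes this runs on a cluster node with the FailoverClusters module installed.
Import-Module FailoverClusters

# Hypothetical name; replace with your own AG cluster resource name.
$agResourceName = 'MyAvailabilityGroup'

Get-ClusterResource -Name $agResourceName |
    Get-ClusterParameter -Name LeaseTimeout
```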

Since amending these settings we have not experienced the same loss of connectivity. To put this into perspective, at one point the cluster could be seen failing almost every minute, which is not something we're used to seeing.

Hopefully this information will help with any existing environments that use Azure, or any upcoming projects that you may have in the pipeline!
