Availability groups in Azure – when always on is sometimes off

You picked the cloud because of the promises around VM availability. You deployed your data platform using availability groups to achieve the highest availability possible, but now you’re seeing outages – what’s going wrong?

Recent posts by Andy & Matt have highlighted the importance of managing and reducing costs – it might be tempting to drop the VM size down to a bare minimum, but as we’ll see, this might be a bad idea.

While creating an availability group on Azure VMs does require you to jump through a couple of additional hoops compared to an on-prem deployment, the process is largely the same, which can lead us to forget that our operating model has changed drastically.

Whenever you buy a VM on the Azure platform, details of what you’re getting and the limits you are subject to are presented front and centre. These include limits on the number (and type) of disks you can attach, and limits on throughput.

Let’s work through an example, starting by picking a VM size.

Azure VM Limits

In addition to buying a certain amount of CPU and memory, each VM comes with limits on the maximum throughput permitted, expressed as both an IOPS and an MB/s limit. Here I’m using a D2s v3 with a limit of 3200 IOPS and 48 MB/s.

As you can see in the screenshot, this size of VM allows for up to 4 data disks, with support for premium disks. So let’s add some disks.

Azure VM sizes

Azure Data Disk Throttling

Azure data disks are tiered by capacity and throughput limits, expressed in terms of IOPS and MB/second per disk. For example, a P30 tier disk allows a capacity between 512 GiB and 1024 GiB, while limiting you to 5000 IOPS and 200 MB/s of throughput.

Disk throttling means you won’t be able to exceed this maximum figure and your disks are now the limiting factor for you – you’re currently redlining your disks. If you need more performance, you can add higher tier disks or even add multiple disks and create a striped storage pool – right?

Our example VM can take up to four data disks, so we could go right up to four P80 tier disks, each offering up to 20k IOPS and 900 MB/second of throughput, for a combined maximum of 80k IOPS.
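To make the arithmetic concrete, here is a minimal Python sketch that adds up the per-disk limits of a striped set. The tier figures are the ones quoted in this post; the `DiskTier` class and `striped_limits` function are hypothetical names used only for illustration, and the sum assumes IO is spread evenly across the disks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskTier:
    name: str
    iops: int  # per-disk IOPS limit
    mbps: int  # per-disk throughput limit in MB/s


# Per-disk limits as quoted in this post.
P30 = DiskTier("P30", iops=5_000, mbps=200)
P80 = DiskTier("P80", iops=20_000, mbps=900)


def striped_limits(disks):
    """Sum the per-disk limits of a striped set (best case, IO spread evenly)."""
    return sum(d.iops for d in disks), sum(d.mbps for d in disks)


total_iops, total_mbps = striped_limits([P80] * 4)
print(f"4 x P80: {total_iops} IOPS, {total_mbps} MB/s")
```

On paper, four P80 disks come to 80,000 IOPS and 3,600 MB/s. These are ceilings for the disks alone and say nothing yet about what the VM will let through.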

Snap back to reality

Adding disks capable of 80k IOPS doesn’t mean we’re going to get 80k IOPS. Think of it like your internet provider promising speeds “up to” a figure: they mean “not more than”, so this is a limit, not a guarantee. A key limiting factor will be the VM we have deployed these disks to.

When an individual disk hit its limit, only that disk was throttled. VM level throttling, however, affects every disk attached to the VM… including the OS disk.
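As a rough illustration of how far apart the two limits are in this example, the sketch below compares the combined limits of four P80 disks with the D2s v3 caps quoted earlier. The function names are hypothetical, and the model is deliberately simple: it only takes the lower of the two limits and does not attempt to describe how Azure shares the cap between disks.

```python
# VM-level caps for the D2s v3, as quoted in this post.
VM_IOPS_CAP = 3_200
VM_MBPS_CAP = 48

# Combined limits of four P80 disks (20k IOPS and 900 MB/s each).
DISK_IOPS = 4 * 20_000
DISK_MBPS = 4 * 900


def effective_limit(disk_total, vm_cap):
    """The lower of the two limits is the one the workload actually hits."""
    return min(disk_total, vm_cap)


def oversubscription(disk_total, vm_cap):
    """How many times the disk capability exceeds what the VM permits."""
    return disk_total / vm_cap


print(effective_limit(DISK_IOPS, VM_IOPS_CAP), oversubscription(DISK_IOPS, VM_IOPS_CAP))
print(effective_limit(DISK_MBPS, VM_MBPS_CAP), oversubscription(DISK_MBPS, VM_MBPS_CAP))
```

With these figures the disks are specced at 25 times the VM's IOPS cap and 75 times its MB/s cap, so the VM cap will be reached long before any single disk limit.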

In my own test subscription, this tends to surface in much the same way as disk throttling. For some of our customers, though, it’s been a different story. We’ve seen a number of occurrences of availability groups moving to a resolving state after logging event IDs 35206 and 35267 (connection timeouts) before returning to a healthy condition.

It took a conversation with Microsoft support for me to understand what was happening here. They identified that this can be caused when VM level throttling is severely affecting the OS disk. When this throttling continues for an extended period, key components that reside on the OS disk (like the SQL Server resource DLL) can fail to respond in a timely manner. We were seeing failures as a direct consequence of our disk throughput being over-specified compared to our VM.

How to fix this? The initial response is often to extend the relevant timeouts, the session timeout for the AG replica and the lease timeout being the two most obvious. This may reduce how often you see the problem, but it should really only be thought of as a temporary measure.

The solution is simple enough: make sure that the total maximum disk IO doesn’t exceed that permitted by the VM SKU. We have a few options to achieve this, though:

Planning – when building your environment, make sure you’ve considered the required throughput and specced both your VM and your storage to cope. It’s counter-intuitive, but picking lower spec disks will avoid this throttling, save you money and give you a more predictable experience.
Scale up the VM – the obvious one: higher tier VMs offer a higher threshold before throttling occurs. This could get really pricey though.
Scale down the disks used to a lower tier – while this is possible, it’s not straightforward. Not something I’d like to have to do on a production server.
If neither of the above two options is available, reducing the IO requirements of the platform can help. There are options built into the platform that can assist (appropriate disk caching settings are one such solution), but as with any tuning work this won’t be a quick operation. For a SQL Server workload there is an additional option – you can use Resource Governor and MAX_IOPS_PER_VOLUME to directly control your IO and prevent this.
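As a sketch of the Resource Governor option, the Python below works out a per-volume IOPS budget from the VM cap and builds the T-SQL that would apply it. The 20% held back for the OS disk, the two data volumes, and both function names are assumptions made for illustration, not recommendations from Microsoft or from this post; only the `MAX_IOPS_PER_VOLUME` option and the 3200 IOPS cap come from the text above.

```python
def iops_budget_per_volume(vm_iops_cap, data_volumes, os_reserve_pct=20):
    """Split the VM's IOPS cap across data volumes, holding some back.

    os_reserve_pct is a hypothetical safety margin kept free for the OS disk.
    """
    if data_volumes < 1:
        raise ValueError("need at least one data volume")
    if not 0 <= os_reserve_pct < 100:
        raise ValueError("os_reserve_pct must be between 0 and 99")
    usable = vm_iops_cap * (100 - os_reserve_pct) // 100
    return usable // data_volumes


def resource_governor_sql(max_iops, pool="default"):
    """Build the T-SQL that applies the cap to a single resource pool."""
    return (
        f"ALTER RESOURCE POOL [{pool}] WITH (MAX_IOPS_PER_VOLUME = {max_iops});\n"
        "ALTER RESOURCE GOVERNOR RECONFIGURE;"
    )


# D2s v3 cap from this post; two data volumes is an assumed layout.
cap = iops_budget_per_volume(vm_iops_cap=3_200, data_volumes=2)
print(cap)
print(resource_governor_sql(cap))
```

Note that `MAX_IOPS_PER_VOLUME` is set per resource pool, so if you run more than one pool the individual caps need to add up to no more than the budget. Treat the generated statement as a starting point to review and test rather than something to run as-is in production.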

The individual components of our data platform might each be simple to understand and implement, but it’s in combination, under a production workload, that they become complex to troubleshoot.


The Coeo Blog
