You picked the cloud because of the promises around VM availability. You deployed your data platform using availability groups to achieve the highest availability possible, but now you’re seeing outages – what’s going wrong?
Recent posts by Andy & Matt have highlighted the importance of managing and reducing costs – it might be tempting to drop the VM size down to a bare minimum, but as we’ll see, this might be a bad idea.
While creating an availability group on Azure VMs does require you to jump through a couple of additional hoops compared to an on-prem deployment, the process is largely the same – which can lead us to forget that our operating model has changed drastically.
With any VM purchased on the Azure platform, the details of what you’re getting – and the limits you’re subject to – are presented front and centre at purchase time. This includes limits on the number (and type) of disks you can attach, and limits on throughput.
Let’s work through an example, starting by picking a VM size.
In addition to buying a certain amount of CPU and memory, each VM comes with limits on the maximum throughput permitted, expressed as both an IOPS and an MB/s limit. Here I’m using a D2s v3 with a limit of 3,200 IOPS and 48 MB/s.
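To get a feel for which of those two caps bites first, a rough back-of-the-envelope calculation is enough. This is just a sketch using the D2s v3 figures above – the IO sizes are illustrative assumptions, not measurements from any particular workload:

```python
# Rough check of which VM cap binds first for a given IO size.
# Figures are the D2s v3 limits quoted above; IO sizes are illustrative.

VM_IOPS_CAP = 3_200
VM_MBPS_CAP = 48

for io_kb in (8, 64, 256):
    # IOPS needed to saturate the bandwidth cap at this IO size
    iops_at_mbps_cap = VM_MBPS_CAP * 1024 / io_kb
    binding = "MB/s" if iops_at_mbps_cap < VM_IOPS_CAP else "IOPS"
    print(f"{io_kb:>4} KB IOs: bandwidth cap reached at ~{iops_at_mbps_cap:,.0f} IOPS "
          f"-> {binding} limit binds first")
```

At small IO sizes the IOPS cap is the ceiling; at the larger IO sizes typical of scans and backups, the 48 MB/s cap is hit long before 3,200 IOPS.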
As you can see in the screenshot, this size of VM allows for up to 4 data disks, with support for premium disks. So let’s add some disks.
Azure data disks are tiered by capacity and throughput limits, expressed in terms of IOPS and MB/s per disk. For example, a P30 tier disk covers capacities from 512 GiB up to 1,024 GiB, while limiting you to 5,000 IOPS and 200 MB/s of throughput.
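If you want to sanity-check a disk choice without clicking through the portal, a small lookup table goes a long way. The figures below are the published per-disk caps for a handful of premium SSD tiers at the time of writing – treat them as illustrative and confirm against the current Azure documentation:

```python
# Published per-disk caps for a few Azure premium SSD tiers
# (illustrative figures - confirm against the current Azure docs).
PREMIUM_DISK_TIERS = {
    #       capacity_gib, iops,  mbps
    "P30": (1_024,   5_000, 200),
    "P40": (2_048,   7_500, 250),
    "P50": (4_096,   7_500, 250),
    "P80": (32_767, 20_000, 900),
}

def smallest_tier_for(capacity_gib, iops, mbps):
    """Return the smallest listed tier meeting all three requirements."""
    for tier, (cap, tier_iops, tier_mbps) in PREMIUM_DISK_TIERS.items():
        if cap >= capacity_gib and tier_iops >= iops and tier_mbps >= mbps:
            return tier
    return None

print(smallest_tier_for(capacity_gib=900, iops=4_000, mbps=150))  # -> P30
```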
Disk throttling means you won’t be able to exceed those figures – once you hit them, your disks are the limiting factor and you’re effectively redlining them. If you need more performance, you can move to higher tier disks, or even add multiple disks and create a striped storage pool – right?
With our example VM we can attach four data disks – right up to four P80 tier disks, each offering up to 20,000 IOPS and 900 MB/s of throughput, for a total maximum of 80,000 IOPS.
Adding disks capable of 80,000 IOPS doesn’t mean we’re going to get 80,000 IOPS. Think of it more like your internet provider promising speeds “up to” a figure – they mean “not more than”. These are limits, not guarantees, and a key limiting factor is the VM those disks are attached to: our D2s v3 is capped at 3,200 IOPS and 48 MB/s no matter what the disks can deliver.
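The effective ceiling is simply the lower of the two sets of caps – the sum of the attached disks’ limits, or the VM’s own limits. A minimal sketch of that arithmetic, using the D2s v3 and P80 figures above:

```python
# Effective throughput is capped by whichever is lower:
# the sum of the attached disks' limits, or the VM's own limits.

def effective_caps(vm_iops, vm_mbps, disks):
    """disks is a list of (iops, mbps) tuples for each attached data disk."""
    disk_iops = sum(d[0] for d in disks)
    disk_mbps = sum(d[1] for d in disks)
    return min(disk_iops, vm_iops), min(disk_mbps, vm_mbps)

# Four P80s behind a D2s v3 (figures quoted earlier)
iops, mbps = effective_caps(3_200, 48, [(20_000, 900)] * 4)
print(f"Effective limit: {iops:,} IOPS, {mbps} MB/s")  # 3,200 IOPS, 48 MB/s
```

80,000 IOPS of disk behind a 3,200 IOPS VM still gives you 3,200 IOPS.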
When an individual disk hit its limit, only that disk was throttled; VM-level throttling, on the other hand, affects every disk attached to the VM… including the OS disk.
In my own test subscription this tends to surface much like ordinary disk throttling; for some of our customers, though, it’s been a different story. We’ve seen a number of occurrences of availability groups moving to a resolving state after logging error IDs 35206 and 35267 (connection timeouts), before returning to a healthy state.
It took a conversation with Microsoft support for me to understand what was happening here. They identified that this can be caused when VM-level throttling is severely affecting the OS disk. When that throttling continues for an extended period, key components that reside on the OS disk (like the SQL Server resource DLL) can fail to respond in a timely manner. As a direct consequence of our disk throughput being over-specified compared to our VM, we were seeing failures.
How to fix this? The initial response is often to extend the relevant timeouts – session-timeout for the AG replica and lease timeout being the most obvious two. This may decrease the number of times you see this occur, but should really only be thought of as a temporary measure.
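If you do go down that road, the current session timeout for each replica is easy to check from sys.availability_replicas, and can be changed with ALTER AVAILABILITY GROUP (the lease timeout lives on the cluster resource rather than inside SQL Server, so it isn’t covered here). The sketch below uses pyodbc with hypothetical server, AG and replica names – adjust for your environment:

```python
import pyodbc  # assumes an ODBC driver for SQL Server is installed

# Hypothetical connection details - replace with your own.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=sql-replica-1;"
    "DATABASE=master;Trusted_Connection=yes;Encrypt=yes;TrustServerCertificate=yes;"
)
cursor = conn.cursor()

# Current session timeout (seconds) for each replica in every AG
cursor.execute("""
    SELECT ag.name, ar.replica_server_name, ar.session_timeout
    FROM sys.availability_replicas AS ar
    JOIN sys.availability_groups AS ag ON ag.group_id = ar.group_id
""")
for ag_name, replica, timeout in cursor.fetchall():
    print(f"{ag_name} / {replica}: session_timeout = {timeout}s")

# Raise the timeout on one replica (hypothetical AG and replica names)
cursor.execute("""
    ALTER AVAILABILITY GROUP [MyAg]
    MODIFY REPLICA ON N'sql-replica-1' WITH (SESSION_TIMEOUT = 30)
""")
conn.commit()
```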
The solutions available to us are simple enough – make sure that the total maximum disk IO doesn’t exceed what the VM SKU permits, whether that means scaling the VM up or sizing the disks down.
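A quick check like the one below makes this easy to bake into a build or review process. It’s only an illustrative helper – the figures should come from the Azure documentation for your chosen SKU and disk tiers:

```python
# Flag a disk configuration whose combined caps exceed the VM's limits
# (illustrative helper - figures from the Azure docs for your SKU/tiers).

def check_provisioning(vm_iops, vm_mbps, disks):
    """disks: list of (iops, mbps) tuples per data disk. Prints any overshoot."""
    total_iops = sum(d[0] for d in disks)
    total_mbps = sum(d[1] for d in disks)
    if total_iops > vm_iops or total_mbps > vm_mbps:
        print(f"Over-provisioned: disks total {total_iops:,} IOPS / {total_mbps} MB/s, "
              f"VM allows {vm_iops:,} IOPS / {vm_mbps} MB/s - "
              "scale the VM up or the disks down.")
    else:
        print("Disk caps fit within the VM limits.")

check_provisioning(3_200, 48, [(20_000, 900)] * 4)  # our D2s v3 + 4x P80 example
```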
The individual components of our data platform might each be simple to understand and implement, but it’s in combination, under a production workload, that these simple pieces become complex to troubleshoot.