Azure VM Storage Performance - Part 1

In this blog series I want to take a look at what Azure VM Storage is actually capable of. To do this, I ran some Azure VMs through a series of test loops using Microsoft's diskspd tool to see how Azure storage performs under a variety of access patterns. The focus of this first post will be on single disk Standard Storage performance.

There are two types of disk storage available to VMs running in Azure: Standard Storage and Premium Storage.

Standard Storage is backed by regular spinning disk and is suitable for general purpose VM workloads. Microsoft provides the following performance targets for Standard Storage:

  • Maximum disk size: 1023GB
  • Maximum 8K IOPS per disk on a Standard tier VM: 500
  • Maximum 8K IOPS per disk on a Basic tier VM: 300

While these targets in and of themselves don't seem that high, Microsoft's guidance for achieving higher performance on Standard Storage has been to use Storage Spaces to aggregate individual disks into pools. Creating a pool of four standard 1023GB disks would give a theoretical performance target of 2000 IOPS for a 4TB virtual disk, built roughly as sketched below.
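For reference, here's a minimal sketch of how such a pool can be built with the Storage Spaces cmdlets, assuming a VM with four un-initialised data disks attached (not the single-disk setup tested in this post); the pool and virtual disk names are placeholders rather than anything from my test rig:

```powershell
# Grab all attached disks that are eligible for pooling (the raw data VHDs).
$disks = Get-PhysicalDisk -CanPool $true

# Create a pool from those disks on the VM's local storage subsystem.
New-StoragePool -FriendlyName "DataPool" `
                -StorageSubSystemFriendlyName "Windows Storage*" `
                -PhysicalDisks $disks

# Carve out a simple (striped) space with one column per disk so that IO is
# spread over every underlying VHD, aggregating their individual 500 IOPS targets.
New-VirtualDisk -StoragePoolFriendlyName "DataPool" -FriendlyName "DataDisk" `
                -ResiliencySettingName Simple -NumberOfColumns 4 -UseMaximumSize
```

The detail that matters for performance is NumberOfColumns: with a column per disk, each IO stream is striped across all four VHDs rather than landing on a single one.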

Premium Storage is backed by SSD and is designed for high-IOPS, high-throughput, low-latency workloads. Premium Storage comes in three SKU sizes, each with its own fixed performance targets. We'll come back to Premium Storage later.

Test Setup

The diskspd tests were conducted in the following Azure VM configuration.

  • Azure Platform: Resource Manager
  • Azure Region: Australia Southeast
  • VM Size: A2_Standard
  • Storage Accounts: 2 (1 OS / 1 Data)
  • Storage Account Configuration: Standard LRS
  • VHD Caching: None
  • NTFS Cluster Size: 64KB

  • Disk Configurations Tested

    • 1 VHD attached as a standard data disk (prepared as sketched below)
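For completeness, the data disk was brought online and formatted along these lines. This is a sketch only; the disk number and volume label are assumptions for illustration:

```powershell
# Bring the raw data disk online with a GPT partition table.
Initialize-Disk -Number 2 -PartitionStyle GPT

# Create a single partition and format it with a 64KB NTFS cluster size,
# matching the configuration listed above.
New-Partition -DiskNumber 2 -UseMaximumSize -AssignDriveLetter |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel "Data" -Confirm:$false
```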

NB: the physical hosts backing a given VM series can vary between regions and over time, so absolute numbers may differ from what you see elsewhere.

Tests

Against this test setup the following test matrix was run (a representative diskspd invocation is sketched after the list):

  • IO Patterns: Random, Sequential
  • IO R/W: 0/100, 20/80, 50/50, 80/20, 100/0
  • Block Size: 4K, 8K, 64K, 256K, 1M
  • Queue Depth per thread: 1, 2, 16, 32
  • Worker Threads: 2
  • Duration: 300s
  • Warmup/Cooldown: 60s
  • Buffer Size: 1G
  • Software caching: Disabled
  • Hardware caching: Disabled

Two IO patterns, five read/write mixes, five block sizes and four queue depths gives 200 test combinations, with a total run time of slightly over a day!
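To give a concrete idea of what each cell of that matrix looks like, a single run was of roughly this shape. The target path and the use of a 1GB test file are my assumptions for illustration; the -r, -w, -b and -o values obviously change from test to test:

```powershell
# 8K block, random IO, 80/20 read/write, 2 threads, 16 outstanding IOs per thread,
# 300s duration with 60s warmup and cooldown, software and hardware caching disabled,
# latency statistics captured. F:\test.dat is an assumed target path.
.\diskspd.exe -c1G -d300 -W60 -C60 -t2 -o16 -b8K -r -w20 -h -L F:\test.dat
```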

Test Results - A2_Standard - 1 VHD

A few high-level results came out of the single VHD test.
The following maximums were achieved:

  • Max throughput: 99.61 MB/s (random / 100% read / 256K block / depth 16)
  • Max IOPS: 496.24 IO/s (sequential / 100% write / 4K block / depth 2)

4K Block

Looking at 4K block workloads the following conclusions can be drawn:

  • Standard LRS VHDs can deliver approximately 500 IOPS in the majority of conditions tested.
  • Under random read IO with total queue depth <2, Standard LRS VHDs suffer considerable IO degradation.
  • The IO latency of Standard LRS VHDs scales linearly with queue depth regardless of read/write mix or sequential/random access pattern.
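For what it's worth, that linear scaling is roughly what Little's Law predicts once the disk is pinned at its IOPS ceiling: average latency ≈ outstanding IOs ÷ IOPS. At the ~500 IOPS cap, 2 outstanding IOs (2 threads at queue depth 1) works out to around 4ms, while 64 outstanding IOs (2 threads at queue depth 32) works out to around 128ms. That's a back-of-envelope interpretation on my part rather than a separately measured result, but it matches the linear scaling described above.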

8K Block

Moving on to 8K block workloads, we can see the same patterns. Performance is broadly consistent with the 4K workloads, including the drop-off when conducting random reads at low queue depth.

64K Block

As we move into the 64K block workloads the patterns change slightly.
All operations at queue depth <2 fall significantly short of the 500 IOPS target, and the worst of these are still the random patterns that involve reads.

256K Block

Moving up to the 256K workloads we see a significant change in the results. At a 256K block size, 500 IOPS would require more throughput than the ~100MB/s ceiling of a Standard LRS VHD, so IOPS instead caps out at around 400 (the arithmetic is sketched after the list below). This yielded a few additional interesting results:

  • Standard LRS can read at up to 100MB/s
  • Standard LRS can write at up to 67MB/s
  • As with previous workload profiles, queue depth is critical to achieving good performance
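The ~400 IOPS cap falls straight out of the arithmetic: 100MB/s ÷ 256KB ≈ 400 IOPS, whereas sustaining the full 500 IOPS at a 256K block size would need roughly 125MB/s, which is above the throughput ceiling.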

1M Block

Finally, looking at the 1M workloads, we see that when working with very large blocks Azure can give pretty good performance, showing between 44 and 96MB/s of throughput. At a 1M block size the relative significance of queue depth also tapers off slightly, with queue depth <2 read workloads showing 80% of the performance of queue depth 32 workloads.
We also didn't see the maximum write performance of 67MB/s observed in the previous test; instead, 1M 100% random write maxed out at 56.5MB/s.
Additionally, it appears the native IO size of Standard LRS is smaller than 1M, as there is a ~4x latency penalty for using a 1M block size compared to 256K, which would be consistent with each 1M IO being split into several smaller back-end operations.

Conclusions

So, from this first set of results focusing on a single VHD, we can draw some initial conclusions about Azure Standard LRS:

  • It's capable of up to ~96MB/s read and ~55MB/s write.
  • For block sizes below 256K, sequential workloads show better performance than random ones.
  • Reaching the published performance maximums depends on your application/OS being able to build sufficient IO depth (a quick way to check this with Windows performance counters is sketched after this list).
  • There is very little difference between read and write performance until you hit the write throughput ceiling.
  • Latency scales linearly until you start to push blocks larger than the native IO size of Azure storage. At that point your IO latency falls off a cliff.
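On the IO depth point, the standard Windows disk counters are a quick way to see whether your own workload actually builds enough queue depth to reach the published numbers. A minimal sketch, sampling every physical disk instance; adjust the counter path and sampling interval to taste:

```powershell
# Sample the average disk queue length across all physical disks every 5 seconds
# for one minute. Values consistently below ~2 suggest the workload isn't building
# enough IO depth to reach the per-disk IOPS targets discussed above.
Get-Counter -Counter '\PhysicalDisk(*)\Avg. Disk Queue Length' -SampleInterval 5 -MaxSamples 12 |
    Select-Object -ExpandProperty CounterSamples |
    Format-Table InstanceName, CookedValue -AutoSize
```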

In my head, these conclusions roughly align with my experience of using Azure on a day-to-day basis. General disk IO in Windows does not build a great deal of queue depth, is generally quite random, is heavily read focused, and has very mixed IO sizes, all of which play to the weaknesses of Standard LRS.

In the next post we'll look at how these performance numbers scale out when we aggregate the performance of multiple Standard LRS VHDs using Windows Storage Spaces.
