5 Common Causes of Storage I/O Performance Problems

In theory, modern hard disks are capable of I/O rates (the speed at which they can read and write data) of hundreds of megabytes per second. You can potentially achieve even higher I/O rates if you use a solid-state drive (SSD) or configure multiple disks into a RAID array.

In reality, though, the I/O performance you actually attain from your disks may be far below the theoretical maximum, even if you use high-end hardware or advanced disk configurations. That’s due to a variety of bottlenecks that can cause I/O latency for storage devices.

Here’s a look at the five most common causes of I/O latency in storage, along with tips on what you can do to avoid them.

#1. Software bottlenecks

Disks don’t read and write data on their own. They do it when software — in other words, an application or operating system — tells them to. And they can only process read and write requests as quickly as software sends those requests.

If your application or operating system slows down, then, disk I/O rates may also plummet.

Software may perform slowly for a wide array of reasons. An application may be retrieving data from a poorly structured database, so that queries take longer than it would take the disk to read the underlying data. You may have too many applications running on a single server, which causes them to max out available CPU resources and slow down. Or your operating system may be bogged down by runaway or orphaned processes (sometimes loosely called “zombie” processes) that consume resources unnecessarily.

To prevent these sorts of issues and avoid storage I/O latency problems, you need to monitor your applications and operating systems constantly. And if you detect an I/O slowdown, one of the first places you should look to identify its cause is your software.
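One way to spot an I/O slowdown is to sample a disk's cumulative read and write counters at intervals and compare the resulting throughput against a known baseline. The following is a minimal sketch of that idea, not a complete monitoring tool. The `DiskSample` structure, the function names and the 50 percent tolerance are all assumptions made for illustration; in practice the counters would come from whatever your operating system or monitoring agent exposes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DiskSample:
    """One reading of cumulative disk counters (hypothetical structure)."""
    timestamp_s: float   # when the sample was taken, in seconds
    bytes_read: int      # cumulative bytes read so far
    bytes_written: int   # cumulative bytes written so far


def throughput_mb_per_s(earlier: DiskSample, later: DiskSample) -> Tuple[float, float]:
    """Return (read, write) throughput in megabytes per second between two samples."""
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    read = (later.bytes_read - earlier.bytes_read) / elapsed / 1_000_000
    write = (later.bytes_written - earlier.bytes_written) / elapsed / 1_000_000
    return read, write


def is_slow(observed_mb_per_s: float, baseline_mb_per_s: float,
            tolerance: float = 0.5) -> bool:
    """Flag throughput that falls below a fraction (assumed 50%) of the baseline."""
    return observed_mb_per_s < baseline_mb_per_s * tolerance


if __name__ == "__main__":
    first = DiskSample(timestamp_s=0.0, bytes_read=0, bytes_written=0)
    second = DiskSample(timestamp_s=10.0, bytes_read=1_000_000_000,
                        bytes_written=500_000_000)
    read_rate, write_rate = throughput_mb_per_s(first, second)
    print(f"read {read_rate:.1f} MB/s, write {write_rate:.1f} MB/s")
    print("slow read?", is_slow(read_rate, baseline_mb_per_s=150.0))
```

A low reading on its own does not tell you whether the disk or the software is at fault. A disk that is mostly idle while throughput is low suggests the software simply isn't issuing requests fast enough.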

#2. Network bottlenecks

Modern applications are often deployed as a set of microservices, which are distributed across a cluster of servers and rely on the network to communicate with each other.

In theory, distributed applications should achieve storage I/O rates that are comparable to those of applications running on a single server because network throughput on modern local networks (typically 1 gigabit per second or more, with 1 gigabit per second translating to about 125 megabytes per second) can match or exceed the I/O capacity of most storage devices.
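Network speeds are usually quoted in bits per second, while storage speeds are usually quoted in bytes per second, so comparing the two requires a conversion. The short sketch below shows the arithmetic using decimal units (1 gigabit = 1,000,000,000 bits, 1 megabyte = 1,000,000 bytes). The function name is made up for illustration, and the result is the raw line rate, so real-world throughput will be somewhat lower once protocol overhead is taken into account.

```python
def gigabits_to_megabytes_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in gigabits per second to megabytes per second."""
    bits_per_s = gbit_per_s * 1_000_000_000   # gigabits -> bits
    bytes_per_s = bits_per_s / 8              # 8 bits per byte
    return bytes_per_s / 1_000_000            # bytes -> megabytes


if __name__ == "__main__":
    for speed in (1, 10):
        print(f"{speed} Gbit/s is about {gigabits_to_megabytes_per_s(speed):.0f} MB/s")
```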

But in practice, a variety of problems may lead to network bottlenecks, which in turn cause disk I/O bottlenecks when an application is not able to issue read/write requests quickly enough over the network. The network could become flooded with more traffic than its switches and interfaces can handle, for instance. Or, problems with network service discovery within the distributed environment could lead to mismatched mappings between IP addresses and endpoints, which means network traffic will be sent to the wrong location. Issues with network switches or network address translation in cases where services are running on multiple subnets could likewise cause a slowdown in network throughput.

As with software, it’s important to monitor the network for signs of problems like these, and to ensure that you can correlate networking performance with disk I/O performance.
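A simple way to start correlating the two is to compute a correlation coefficient between network throughput and disk throughput sampled over the same time windows. The sketch below implements the standard Pearson coefficient by hand, so it needs nothing beyond the standard library. The sample values are invented for illustration, and a strong correlation is only a hint that the two are moving together, not proof that the network is the cause.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical samples taken over the same five time windows.
    network_mb_per_s = [110.0, 95.0, 60.0, 30.0, 105.0]
    disk_mb_per_s = [100.0, 90.0, 55.0, 25.0, 98.0]
    print(f"correlation: {pearson(network_mb_per_s, disk_mb_per_s):.2f}")
```

A value close to 1 means disk throughput rises and falls with network throughput, which makes the network a reasonable place to look next.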

#3. Virtual storage problems

In addition to relying heavily on the network, modern applications also frequently make use of virtual storage. Virtual storage refers to a software-defined storage layer that runs on top of underlying physical storage.

Virtual storage offers the advantage of greater flexibility than storage that runs directly on physical devices because virtual storage makes it easier to pool devices together, configure automated failover between disks and add or remove physical disks from the storage pool.

However, virtual storage also increases the complexity and the number of potential points of failure within the overall storage stack. If the software-defined storage platform that manages your virtual storage hits a bug or simply runs out of CPU or memory resources, storage I/O is likely to suffer. In addition, the extra layer almost always carries some I/O cost: Even under ideal conditions, a virtual disk will generally not read and write data as quickly as the physical disks beneath it.

This means that monitoring any software-defined storage pools, and the tools that manage them, is just as important as monitoring the rest of the software and infrastructure within your environment.

#4. RAID configuration problems

A RAID array is a set of disks that are pooled together to act as a single storage unit. (In this sense, RAID is comparable to virtual storage, but because RAID can be managed at the hardware or software level, it’s different from software-defined virtual storage.)

RAID arrays provide two key benefits: They can increase both storage I/O (by spreading I/O requests across multiple disks) and storage reliability (by spreading copies of data across multiple disks). However, the extent to which a given RAID array achieves either of these goals depends on how it is configured. There are multiple types of RAID configurations, some of which improve I/O more than others.

In addition, problems with the RAID controller — which is either a physical device or a software utility that manages the RAID array — can lead to slowdowns in I/O.

If you use a RAID array and you’re not achieving your desired I/O, then, you should check the health of the RAID controller while also ensuring that the RAID configuration you’ve selected provides the optimal balance between I/O performance and data reliability for your needs. If you are backing up data separately and don’t need to rely on the RAID array to ensure data integrity, you may wish to migrate to a “RAID 0” setup, which maximizes I/O but provides no data redundancy: if any single disk fails, the data on the entire array is lost.
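To make the trade-off between capacity and redundancy concrete, the sketch below compares a few common RAID levels for a set of identical disks. It reports usable capacity and the number of disk failures the array is guaranteed to survive. The function name is made up for illustration, RAID 10 is assumed to use two-way mirrors, and the figures ignore controller overhead and hot spares. It says nothing about actual I/O speed, which depends on the workload and the controller.

```python
def raid_summary(level: str, disks: int, disk_tb: float) -> dict:
    """Return usable capacity (TB) and guaranteed tolerated disk failures."""
    min_disks = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}
    if level not in min_disks:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < min_disks[level]:
        raise ValueError(f"RAID {level} needs at least {min_disks[level]} disks")
    if level == "10" and disks % 2 != 0:
        raise ValueError("RAID 10 needs an even number of disks")

    if level == "0":        # striping only: all capacity, no redundancy
        usable, tolerated = disks, 0
    elif level == "1":      # every disk holds a full copy
        usable, tolerated = 1, disks - 1
    elif level == "5":      # one disk's worth of parity
        usable, tolerated = disks - 1, 1
    elif level == "6":      # two disks' worth of parity
        usable, tolerated = disks - 2, 2
    else:                   # RAID 10: striped two-way mirrors (assumed)
        usable, tolerated = disks // 2, 1
    return {"usable_tb": usable * disk_tb, "tolerated_failures": tolerated}


if __name__ == "__main__":
    for level in ("0", "1", "5", "6", "10"):
        print(f"RAID {level}:", raid_summary(level, disks=4, disk_tb=2.0))
```

With four 2 TB disks, RAID 0 offers the full 8 TB but survives no failures, while RAID 6 and RAID 10 each give up half of that capacity in exchange for redundancy.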

#5. Storage hardware failure

The final major cause of slow storage I/O is problems with individual storage devices. As hard drives become older, they become less and less likely to achieve their maximum theoretical I/O rates, even if the disks remain healthy in the sense that they are not experiencing I/O errors. The temperature of disks can also impact I/O latency. And when disks do wear out to the point that they begin experiencing I/O errors, I/O latency will increase because each failed I/O request needs to be repeated until it succeeds.
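The effect of retries on latency can be estimated with a little arithmetic. As a simplified model, assume that every attempt takes the same amount of time and fails independently with the same probability p. The expected number of attempts per request is then 1 / (1 - p). Real disks do not fail this uniformly, so treat the sketch below, including its function name, as an illustration rather than a prediction.

```python
def expected_latency_ms(attempt_ms: float, failure_probability: float) -> float:
    """Expected latency when a request is retried until it succeeds.

    Assumes each attempt costs attempt_ms and fails independently with
    the given probability, so expected attempts = 1 / (1 - p).
    """
    if not 0.0 <= failure_probability < 1.0:
        raise ValueError("failure_probability must be in [0, 1)")
    expected_attempts = 1.0 / (1.0 - failure_probability)
    return attempt_ms * expected_attempts


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.5):
        print(f"failure rate {p:.0%}: {expected_latency_ms(10.0, p):.1f} ms per request")
```

Under this model, a disk on which half of all attempts fail takes twice as long per request on average as a healthy one.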

The takeaway here is that it’s important to monitor the health of physical storage media by tracking metrics such as disk age and operating temperature. You can also use S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) utilities to evaluate the health of disks and detect signs of aging. Older disks should, of course, be replaced before they deteriorate to the point that I/O declines precipitously.

The I/O rates you need

Achieving optimal I/O for storage requires more than just purchasing storage devices that promise the desired I/O rates. You must also carefully monitor and manage the software and network infrastructure with which your disks interact, while also ensuring that virtual storage and RAID arrays are properly configured. Maintaining the health of physical devices is critical, too, for ensuring that you reap maximum returns on your investment in storage infrastructure.