In the era of consolidation, storage has not been left out. Different systems are made to share the same storage boxes, fiber-channel switches and networks. Inside a typical storage box, we have front-end and back-end controllers, cache, physical spindles shared amongst different applications, databases, backup destinations, and so on.
Backups slowing down normal database activity, or batch processing in one database hurting transactional processing in another: these are two real-life consequences of storage consolidation known to almost every DBA. Of course, it’s easy to suggest separating databases onto different physical disks, but what about SAN box controllers and shared cache? And don’t forget about the cost factor and ubiquitous consolidation that forces storage administrators to pack as much data as possible into a single SAN or NAS storage device.
Some of our customers use hosting services — they outsource hardware hosting just like they outsource DBA work to Pythian. In such scenarios, hosting service providers usually have storage hardware shared amongst different customers to provide higher utilization and on-demand storage capacity at a lower cost.
It is typical for a hosting service provider to have several different tiers of storage with different resource characteristics and prices, and to allocate storage in chunks. For example, tier-one would be 15K RPM RAID-10 storage; tier-two, 10K RPM RAID-10; and tier-three, 10K RPM RAID-5. They have different prices per gigabyte and allocate storage capacity in chunks of, let’s say, 32 GB. However, behind the scenes they have, let’s say, a 16-disk RAID group, and they carve chunks from this storage pool. Often, it turns out that the same disks are shared amongst a dozen completely unrelated customers. And I’m not even considering shared front-end and back-end controllers.
Now, some customers run a data warehouse database with heavy batches every hour. Some process short transactions, while another class of customers run badly-designed and poorly-implemented purge processes every night, generating heavy I/O activity.
Assuming each physical disk can sustain 100 random I/Os per second without degradation, 16 disks can provide about 1600 random I/Os per second. If we have ten different databases or other storage consumers, and one or two of them generate 2000 I/O requests per second while the rest generate a moderate 100 each, we have close to 3000 I/Os per second hitting disks that can serve 1600, and a database administrator wondering why the average I/O response time has jumped from 10ms to 50ms for two hours, affecting all his online users.
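To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. The per-disk and per-consumer figures are the illustrative ones from the example, not measurements, and I am assuming the heavy consumers generate 2000 I/Os per second between them while the remaining eight generate 100 each.

```python
# Illustrative figures from the example above, not measured values.
IOPS_PER_DISK = 100   # random I/Os per second one disk sustains without degradation
DISKS = 16            # disks in the shared RAID group


def capacity(disks=DISKS, iops_per_disk=IOPS_PER_DISK):
    """Aggregate random IOPS the RAID group can serve."""
    return disks * iops_per_disk


def total_demand(heavy_iops, light_consumers, light_iops=100):
    """Combined IOPS requested by the heavy and the moderate consumers."""
    return heavy_iops + light_consumers * light_iops


cap = capacity()
# Assumption: the heavy consumers total 2000 IOPS, the other eight do 100 each.
demand = total_demand(heavy_iops=2000, light_consumers=8)
oversubscription = demand / cap

print(cap, demand, oversubscription)
```

With these assumptions the group is asked for 2800 I/Os per second against a capacity of 1600, or 1.75 times what it can serve. Requests queue up, and every consumer on those disks sees the longer response times, not just the one causing the load.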
In the networking world there is a solution: QoS (Quality of Service). QoS is a set of mechanisms that guarantee the availability of resources and control their distribution. It helps to avoid situations such as a single misbehaving user impacting others. It provides flexibility and maximizes the use of network capacity, while delivering a guaranteed minimum level of service to every user.
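As a rough illustration of that idea applied to the 16-disk example, the sketch below gives every consumer a guaranteed minimum and then shares whatever capacity is left evenly among those who still want more. The `allocate` function, the consumer names, and the 100 IOPS guarantee are all hypothetical; this is a toy model of the principle, not how any particular network or storage product implements QoS.

```python
def allocate(capacity, demands, guaranteed):
    """Split `capacity` IOPS among consumers.

    Each consumer first receives its demand, capped at `guaranteed`.
    Spare capacity is then shared evenly among consumers whose demand
    is not yet met (a simple max-min fair, "water-filling" scheme).
    """
    alloc = {name: min(d, guaranteed) for name, d in demands.items()}
    spare = capacity - sum(alloc.values())
    if spare < 0:
        raise ValueError("guaranteed minimums exceed total capacity")

    unsatisfied = {n for n, d in demands.items() if alloc[n] < d}
    while spare > 1e-9 and unsatisfied:
        share = spare / len(unsatisfied)
        for name in list(unsatisfied):
            need = demands[name] - alloc[name]
            if need <= share:
                # Demand fully met; the unused part of the share stays spare.
                alloc[name] = demands[name]
                spare -= need
                unsatisfied.discard(name)
            else:
                alloc[name] += share
                spare -= share
    return alloc


# Hypothetical consumers: one heavy batch job and eight moderate OLTP databases.
demands = {"batch": 2000}
demands.update({f"oltp{i}": 100 for i in range(1, 9)})

shares = allocate(capacity=1600, demands=demands, guaranteed=100)
print(shares["batch"], shares["oltp1"], sum(shares.values()))
```

In this toy model the eight moderate consumers each keep their full 100 I/Os per second, and the batch job is held to the remaining 800 instead of dragging everyone down. The disks stay fully used, which is the combination of guarantee and utilization that QoS is meant to deliver.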
It’s time for Storage QoS now (actually, it’s long overdue). However, it’s not as easy as it sounds. The storage subsystem is typically more complex than the network. In fact, the network is usually just one of the components of the storage infrastructure, alongside disks, controllers, cache, etc. Moreover, storage components are more complex than a “simple” communication pipe, and modeling a single physical disk is more difficult than modeling a packet-switched network.
I did a quick web search and found a few interesting papers on storage QoS, but I couldn’t find any industrial-strength implementation. Here is some reading for you:
- Zygaria: Storage performance as a managed resource (PDF)
- Polus: Growing Storage QoS Management Beyond a “4-Year Old Kid” (registration is free).
Virtualization puts a new twist on consolidation, but storage virtualization methods are far less developed than computing resource virtualization. Storage QoS and storage virtualization must necessarily be very closely related areas with a lot of overlap.
As I’m not an expert in storage technologies, it could very well be that I’ve missed something, so your comments are very welcome as usual.