In the era of consolidation, storage has not been left out. Different systems are made to share the same storage boxes, Fibre Channel switches, and networks. Inside a typical storage box, we have front-end and back-end controllers, cache, and physical spindles shared amongst different applications, databases, backup destinations, and so on.
The impact of backups on normal database activity, or batch processing in one database degrading transactional processing in another: these are two real-life consequences of storage consolidation known to almost every DBA. Of course, it’s easy to suggest separating databases onto different physical disks, but what about SAN box controllers and shared cache? And don’t forget the cost factor and the ubiquitous consolidation that forces storage administrators to pack as much data as possible into a single SAN or NAS storage device.
Some of our customers use hosting services — they outsource hardware hosting just like they outsource DBA work to Pythian. In such scenarios, hosting service providers usually have storage hardware shared amongst different customers to provide higher utilization and on-demand storage capacity at a lower cost.
It is typical for a hosting service provider to have several tiers of storage with different resource characteristics and prices, and to allocate storage in chunks. For example, tier one might be 15K RPM RAID-10 storage; tier two, 10K RPM RAID-10; and tier three, 10K RPM RAID-5. They have different prices per gigabyte, and capacity is allocated in chunks of, let’s say, 32 GB. Behind the scenes, however, the provider has, say, a 16-disk RAID group and carves those chunks from this storage pool. Often, it turns out that the same disks are shared amongst a dozen completely different customers. And I’m not even considering shared front-end and back-end controllers.
Now, some customers run a data warehouse database with heavy batches every hour. Some process short transactions, while another class of customers runs badly designed and poorly implemented purge processes every night, generating heavy I/O activity.
Assuming each physical disk can sustain 100 I/Os per second without degradation, 16 disks can provide 1,600 random I/Os per second. If we have ten different databases or other storage consumers, and one or two of them generate 2,000 I/O requests per second between them while the rest issue a moderate 100 each, we end up near 3,000 I/Os per second against a capacity of 1,600, and the database administrator is left wondering why the average I/O response time has jumped from 10 ms to 50 ms for two hours, affecting all his online users.
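A back-of-the-envelope sketch of that arithmetic is below. The disk count, per-disk IOPS, and tenant demand figures are just the illustrative numbers from the paragraph above, not measurements from any particular array:

```python
# Back-of-the-envelope sketch of the shared-spindle scenario above.
# Assumed figures (illustrative only): 16 disks, ~100 random IOPS per
# disk before response time degrades, ten tenants on the same RAID group.

DISKS = 16
IOPS_PER_DISK = 100
capacity = DISKS * IOPS_PER_DISK          # ~1,600 random IOPS in total

# One noisy tenant plus nine well-behaved ones (per-second demand),
# roughly the 3,000 IOPS figure discussed above.
demand = [2000] + [100] * 9

total = sum(demand)
utilization = total / capacity
print(f"capacity={capacity} IOPS, demand={total} IOPS, "
      f"utilization={utilization:.0%}")

if utilization >= 1:
    # Requests arrive faster than the disks can serve them, so queues
    # build and response time keeps climbing until the burst ends --
    # the 10 ms to 50 ms jump that every tenant on the box observes.
    print("Oversubscribed: every tenant's I/O latency degrades.")
```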
In the networking world there is a solution: QoS (Quality of Service). QoS is a mechanism that guarantees the availability of resources and controls their distribution. It helps avoid situations where a single misbehaving user impacts everyone else. It provides flexibility and maximizes network capacity usage, while delivering a guaranteed minimal level of service to every user.
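To make the idea concrete, one classic building block behind network QoS rate guarantees is the token bucket, which caps a consumer’s sustained rate while allowing short bursts. The sketch below is purely illustrative and not tied to any particular vendor’s QoS implementation; a storage-side equivalent would meter I/O requests per tenant in the same way.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter, a classic building block of
    network QoS traffic shaping. Illustrative sketch only; real QoS
    stacks combine classification, queuing and scheduling as well."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens (e.g. packets or I/Os) per second
        self.burst = burst        # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at burst.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True           # request fits within the guaranteed rate
        return False              # over quota: delay, drop or deprioritize

# A tenant limited to 100 requests/s with bursts of up to 20:
bucket = TokenBucket(rate=100, burst=20)
admitted = sum(bucket.allow() for _ in range(50))
print(f"{admitted} of 50 back-to-back requests admitted immediately")
```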
It’s time for Storage QoS now (actually, it’s long overdue). However, it’s not as easy as it sounds. The storage subsystem is typically more complex than the network. In fact, the network is usually just one component of the storage infrastructure, alongside disks, controllers, cache, and so on. Moreover, storage components are more complex than a “simple” communication pipe, and modeling a single physical disk is more difficult than modeling a packet-switched network.
I did a quick web search and found a few interesting papers on storage QoS, but I couldn’t find any industrial-strength implementation. I did find some interesting reading for you:
- Zygaria: Storage performance as a managed resource (PDF)
- Polus: Growing Storage QoS Management Beyond a “4-Year Old Kid” (registration is free).
Virtualization puts a new twist on consolidation, but storage virtualization methods are very under-developed compared to computing resource virtualization. Storage QoS and storage virtualization are necessarily closely related areas with a lot of overlap.
As I’m not an expert in storage technologies, it could very well be that I’ve missed something, so your comments are very welcome as usual.
2 Comments
These are exactly the things I commonly see in hosted environments, Alex! ‘Virtualisation’ of storage by using shared storage is done a lot in large(r) environments, but resource management of the storage (which is the term I use for what you refer to as ‘Storage QoS’) is never done. Knowledge about the inner workings and performance of SAN or NAS is extremely hard to find.
In fact, I once encountered a situation where I monitored the storage device (all central storage I have looked at has a way of externalising performance data via SNMP) for six months and saw the processor of the SAN 100% busy for the last month. Upon asking, the storage admins saw ‘no problems’ and suggested looking elsewhere for the performance problem.
With this in mind, operating system virtualisation is the next thing that is already creeping into the data centres.
Most companies I know are fighting performance problems. Quite a few of these performance problems are due to improper use of central storage, owing to a lack of knowledge about the technical implications (of which you gave an example).
The only way to combat these problems is by altering the architecture of the data center (meaning balancing the SAN throughput against the number of clients, which means reducing the number of clients in most cases), which is often rejected as a solution because it affects the cost model of central storage (making it more expensive).
Now, think about operating system virtualisation. This virtualisation means we can take a physical machine (with fixed bandwidth for IO, networking, system bus, etc.) and use it shared, JUST LIKE CENTRAL STORAGE. This means that, while our IO is still cumbersome in some cases, we are going to do the exact same thing with our physical machines. Just like the central storage/SANs I encounter, there is no way to guarantee IO throughput for a virtual machine, nor for network traffic, nor for system bus usage. (The resources which are manageable with the mainstream virtualisation software are CPU and the amount of memory.)
I would be very, very happy if there were a way to manage these resources, but I have come across quite a few data centres and clients, and have never encountered it.
Alex, very interesting idea. But I think you’re proposing storage QoS based on different end-user applications. The problem is, they all arrive at the storage tier as the same type of I/Os. Network QoS prioritizes traffic based on the network protocol, or anything that can be differentiated by examining packet headers, so that, for instance, FTP TCP segments get higher (or lower) priority than HTTP segments. But can you make a distinction between an I/O request for database A’s data and one for database B’s, or between access to schema HR vs. OE? I think it’s inherently impossible. — Yong Huang