A couple of weeks ago I wrote a short blog post about SAN storage failures and how people are blinded by all the bells and whistles that are supposed to make storage arrays 100% reliable and failsafe. My conclusion was that there is no way to avoid storage failures; a better approach is to anticipate those failures and be ready to handle them with minimal service impact.
I referenced a wake-up call from the CTO of an Australian hosting company. Let me quote it again:
The outage, blamed on an IBM storage array, saw the company’s chief technology officer promise “significant changes to the way we deploy and manage our storage environment”.
Today, I stumbled across another article that demonstrates their solution to the storage reliability problem. From Melbourne IT on $18m Oracle revamp:
… to improve the reliability of its operational support systems at a cost of $7 million over three years, which has also seen it switch storage vendors from IBM to EMC. Data corruption that had occurred on its IBM storage systems were blamed for a several day outage experienced at the company’s WebCentral web-hosting business.
So we see that, instead of learning the right lesson, they concluded, “This IBM storage stuff isn’t reliable. EMC sales folks convinced me that they are better. Now my storage will not fail.” The “significant changes to the way we deploy and manage our storage environment” amounted to a mere vendor change.
Well, data recovery services will be flourishing!
Reliability and operational support (or the effectiveness thereof) are good (well: OK-ish) with most vendors, although there will be small differences.
The real difference in results from modern storage now has to come from the storage engineers: can they get the most out of it? Can they make it work for you?
And why the H-ll does storage have to hide in another silo-department in the IT organisation?
In many situations, the promised benefits of storage are never realized because storage engineers act as “druids” and fail (#FAIL) to cooperate with sysadmins and DBAs.
Storage can be a huge enabler for backups, for test-copies, and to help with DR-implementation.
But somehow most organisations pay oodles of money for storage (to vendors and to engineers/operators) yet still use system CPU to run copies and backups.
And when trying to get back a mirror or snap-copy, there is a huge communication problem between “storage”, “system”, and the DBA, with the DBA sometimes left as the helpless outsider.
Many storage departments are now at the stage where DBAs were in the 1990s: they have beautiful tools with amazing capability and huge promise. But they have to start communicating with the rest of IT to make their systems work to full potential.
Unfortunately, many CIOs and architects either miss or ignore the importance of basic operational processes such as tested recovery plans, and would rather take the “important” decisions together with their commercial technology vendor contacts:
– a backup solution is provided, but no recovery tests are ever run.
– a disaster recovery solution is deployed, but never tested (we count on our skilled personnel to do the dirty job when “that” happens).
– we have a top-notch NAS solution with “guaranteed” storage reliability. It’s so good that we’re putting all our critical data on it. Capacity planning? We need capacity for DR? Really? Two years later the data is safe but cannot be restored…
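The first two bullets are also the cheapest to fix. For an Oracle shop like the one described, a minimal, regularly scheduled recovery check can be run with standard RMAN validation commands. The commands below are real RMAN syntax; the idea of wiring them into a periodic job is a sketch of the approach, not a prescription:

```
# Hypothetical periodic recovery check (run from cron or any scheduler).
# RESTORE ... VALIDATE reads the backup pieces and verifies they are usable,
# without actually restoring any files.
RMAN> RESTORE DATABASE VALIDATE;
RMAN> RESTORE ARCHIVELOG ALL VALIDATE;
```

A validation pass is no substitute for a full restore drill on separate hardware, but it catches missing or unreadable backup pieces long before “that” happens.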
What seems to me to be pure, raw, basic common sense appears to be completely absent from people who lose contact with the real world while managing large IT budgets.
Life is a song…
@Pete: Good point. I do see the opposite (though, I admit, not as often) – DBAs resisting a change rather than embracing it. Granted, it’s the job of a DBA to protect the data in the first place, but sometimes it’s taken to absurd levels, when the DBA team doesn’t want to hear anything about changing their backup strategy (like using some of these cool storage technologies that are already licensed and paid for, or moving from hot backups to RMAN, etc.).
@Alex: thanks for your comments. Totally in line.