Have you ever heard the one about throwing hardware at a software problem?
In one of my previous blog posts, I said something along these lines (I’ll just cut and paste):
In my experience, most problems (the ones the caller describes as “it’s running slow”) are not rooted in hardware, because hardware problems generally cause things not to run at all. It always baffles me when developers and architects prescribe hardware upgrades to make things run faster, because about 80% of the performance problems I’ve dealt with were resolved by tuning the application.
Well yes, I know you can buy new hardware, and it’s easier. But when it comes to hardware, how many of you have a ten-node RAC cluster running Enterprise Edition with 8GB of RAM on each node, running off a massive SAN?
I’ve been on so many systems that had been running poorly for years, and in a week we can take them apart and have them running without a hitch. We’ve even delivered fixes that became the business case for going from RAC back down to a single instance. How much did those customers save on licensing costs?
Back to the example at hand. I have this nifty RAC system that supports some very public and very mission-critical apps, and one day (a Sunday night, of course) it started choking. We were getting enqueue waits, and slowly they climbed until all ten nodes ground to a crashing halt. I have now seen a ten-node RAC cluster completely lock up.
Why, you ask?
A simple SQL statement:
DELETE FROM a WHERE b=c AND d=e;
That’s it. The statement itself used indexes and was rather boring. So why was it the one waiting on all the enqueues?
This table sat at the top of a four-level-deep parent-child hierarchy, and all the foreign keys were un-indexed. For every delete, Oracle was forced to do four full table scans on the child tables to make sure no child records would be orphaned. This foundational piece of data modeling was overlooked on this app, causing some serious downtime.
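To make the pattern concrete, here’s a minimal sketch (the table and column names are hypothetical, not from the actual app): a parent table, a child table whose foreign key column has no index, and the one-line fix.

```sql
-- Hypothetical parent table at the top of the hierarchy.
CREATE TABLE orders (
  order_id NUMBER PRIMARY KEY
);

-- Child table with an un-indexed foreign key. Every DELETE against
-- ORDERS forces a full scan of ORDER_ITEMS, because Oracle has to
-- verify that no child rows would be orphaned.
CREATE TABLE order_items (
  item_id  NUMBER PRIMARY KEY,
  order_id NUMBER NOT NULL REFERENCES orders (order_id)
);

-- The fix: index the foreign key column, so the orphan check
-- becomes a cheap index probe instead of a full table scan.
CREATE INDEX order_items_fk_i ON order_items (order_id);
```

In the real app this pattern repeated four levels deep, so each delete paid the full-scan price four times over.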
I first learned of this issue (well, not really an issue but a design oversight) when reading something by Steve Adams over at Ixora back in the day. (If anyone has a link to this or any article illustrating this, I’d appreciate it.) Back to basics, folks! We know Oracle is good, but that doesn’t mean we can forget the core of application design.
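If you want to check your own schema, something like the following dictionary query will flag the usual suspects. This is a rough sketch that assumes single-column foreign keys; multi-column keys need positional matching against the index columns, so treat any hits as candidates to investigate, not a definitive list.

```sql
-- List foreign key columns that have no index leading on that column.
SELECT c.table_name, cc.column_name, c.constraint_name
  FROM user_constraints  c
  JOIN user_cons_columns cc
    ON cc.constraint_name = c.constraint_name
 WHERE c.constraint_type = 'R'          -- referential (foreign key) constraints
   AND NOT EXISTS (
         SELECT NULL
           FROM user_ind_columns ic
          WHERE ic.table_name      = c.table_name
            AND ic.column_name     = cc.column_name
            AND ic.column_position = 1 -- column must lead the index
       );
```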
The point to take away from this is that we had a massive amount of hardware crippled by an easily avoidable design flaw. Since then, the work that a few of my fellow Pythian professionals and I have done has reduced the overall load on this cluster enough to justify permanently shrinking it. This is an app that peaked at over 3,000 queries per second before we touched it. It’s still doing the same amount of work, but I think this fix, along with a number of other design-level changes we’ve made, will let us drop from ten nodes (obviously overkill for most apps) to a more reasonable two.
There is a time and place for hardware upgrades. The problem is that most people find hardware easier and just aren’t aware of how expensive their app logic really is. I find tuning easier and more cost-effective.