Have you ever heard the one about throwing hardware at a software problem?
In one of my previous blog posts, I mentioned something along the lines of, well I’ll just cut and paste . . .
In my experience, the solution to most problems (the ones the caller refers to as “it’s running slow”) are not rooted in hardware, because hardware problems generally cause things to not run at all. It always baffles me when developers and architects prescribe hardware upgrades to make things run faster, because about 80% of the performance-related problems and subsequent solutions I’ve dealt with were resolved by tuning the application.
Well yes, I know you can buy new hardware, and it’s easier. But when it comes to hardware, how many of you have a ten-node RAC cluster running Enterprise Edition with 8GB of RAM on each node, running off a massive SAN?
I’ve been on so many systems that have been running for years–poorly–the way they were, and in a week we can take them apart and have them running without a hitch. We’ve even managed to fix problems that turned out to be the business case to go from RAC back down to a single instance. How much did those customers save on licensing costs?
Back to the example at hand. I have this nifty RAC system that supports some very public and very mission-critical apps, and one day (it was Sunday night) it starts choking. We’re getting enqueues. Slowly they start climbing. Ten nodes came to a crashing halt. I have now seen a ten-node RAC cluster come to crashing halt and completely lock up.
Why, you ask?
A simple SQL statement:
DELETE FROM a WHERE b=c AND d=e;.
That’s it. It was going off indexes and in itself was rather boring. Yes, indeed. So why was this statement the one waiting on all the enqueues?
This table was at the top of a four-levels-deep parent-child relationship, and all the foreign keys were un-indexed. For every delete, Oracle was forced to do four full tablescans on parent tables to make sure no child records were around to be orphaned. This oft-overlooked foundation of data-modeling was overlooked on this app, causing some serious downtime.
I first learned of this issue (well, not really an issue but a design oversight) when reading something by Steve Adams over at Ixora back in the day. (If anyone has a link to this or any article illustrating this, I’d appreciate it.) Back to basics, folks! We know Oracle is good, but that doesn’t mean we can forget the core of application design.
The point to take away from this is that we had a massive amount of hardware crippled by a design flaw that was easily avoidable. At this point, the work that I and a few of the other professionals at Pythian have done has reduced the overall load on this cluster to justify permanently reducing it. This is on an app that peaked at over 3000 queries per second prior to anything being done on it. It’s still doing the same amount of work, although I think we can count on this and a number of other changes that we’ve made at the design level, to drop it down from ten nodes (which is obviously overkill for most apps) to a more reasonable two.
There is a time and place for hardware upgrades. The problem is that most people find hardware easier and just aren’t aware of how expensive their app logic really is. I find tuning easier and more cost effective.
The article and query at ixora’s site is here – the year was 2000. We miss you Steve. :-)
I’m running a 250GB sized PG DB on a celeron 1.7G box and 768MB of ram which never really sees it’s load average hit < 3 and due to the design of it, it’s handling the load pretty well and users are happier with the response compared to the company’s own OLTP styled DB on mssql
right tools for the job.
Other references: Tom Kyte and Connor McDonald (on Jonathan Lewis’s site.
As well as full table scans, there’s a locking – sometimes deadlocking – issue (which full table scans are more likely to reveal, of course). See OraStory for one anecdote. The FK locking behaviour has changed over time (sorry, can’t find a good reference or version number quickly).
[…] point you at an excellent blog post by Shakir Sadikali at the Pythian Group which shows off a ten-node RAC cluster brought to its knees […]
[…] SQL is 5 LIOs per row per row source.” He cites a post by Pythian’s Shakir Sadikali: Good Database Design is Mightier than Hardware. (Incidentally, Shakir will be handling next week’s […]
Big hardware breeds lazy programmers…
“It’s cheaper to buy more hardware than to optimize software” is one of those cliches that is true just often enough that people leap to the conclusion that it is always true. But a little thought will show that that is absurd.
If you have a program that requires 600 MB of RAM and you want to run it on a machine that only has 500 MB installed, it’s surely cheaper and easier to just upgrade the memory. (I’m skimming over issues of virtual memory and all to make the point.) But I have had many occasions where I have found that, for example, a program requires exponentially more time to process more records. So our little test database of 100 records runs in 5 seconds, but when we increase it to 200 records it requires 15 seconds, 300 records requires 45 seconds, 400 requires 135 seconds, and it quickly becomes obvious that when we get to a real database with a million records there won’t be enough computer power in the world to do the job in our lifetimes. At that point you need to revisit the algorithms and come up with a fundamentally different approach.
By the way, while I realize that there are times when throwing more hardware at it is the right solution from a cost-benefit perspective, I find this very emotionally unsatisfying. I take pride in my work and I want to write the best program I can. When I have to stop and say, “Hey, it’s good enough; put it into production and get on to the next task,” that just offends my sense of pride. I suppose craftsmen go through this all the time: Yes, you’d really like to spend another few weeks making that piece of woodwork absolutely perfect, but if you do it will only increase the potential selling price by a couple of dollars, no way is it worth your time.
Hey, can I get in a shameless plug for my book? “A Sane Approach to Database Design”. Available at “electrictactics.com”. Probably on Amazon by the time anyone reads this post.
look into Big O Notation:
Analyzing the performance of algorithms processing varying amounts of data is a science. Often times, more hardware isn’t going to matter.
I recall a story of a database app a company had written. It was slow in production, so they added 10x the hardware at the server side, but it was still slow, so they hired a top level consultant to come in and help them figure out how to speed it up.
First question the consultant asked was, “Have you indexed the tables?”
“Index?” the client asked…
Very Nice post, quiet informative.
Database designing problems being the buzzwords these days, one of the most common designing mistakes is the “Poor Planning / Architecture” of the database or mart. Follow the link to know more…