Todd Hoff, who apparently learned a hell of a lot during a short stint at Yahoo followed by some startups has an extremely well-written and edutaining article about how scaling to a million or more users requires jettisoning more or less everything we know and love about relational modeling.
Even though he uses bigtable (Google’s distributed hash storage system) as his example, in reality this approach works well with relational datastores like MySQL and Oracle too, you just have to think about your data differently and use the databases differently. So I’m including this article in the MySQL and Oracle categories because I think it would be of interest.
Here’s a taste of how it reads:
How do you structure your database using a distributed hash table like BigTable? The answer isn’t what you might expect. If you were thinking of translating relational models directly to BigTable then think again. The best way to implement joins with BigTable is: don’t. You–pause for dramatic effect–duplicate data instead of normalize it. *shudder*
Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation. I don’t know how that decision was made, but it must have gone against every fiber in their relational bones…
But Flickr’s reasoning was genius. To scale you need to partition. User data must spread across the shards. So where do comments belong in a scalable architecture?
The answer is, in case you aren’t following yet, you store it everywhere you might need it and worry about keeping your multiple copies in sync later, if at all.
BigTable data ethics are more Mardi Gras than dinner with the in-laws. Data just wants to have fun. BigTable won’t stop you from hurting yourself. And to get the best results you may have to engage in some conventionally risky behaviors. But if those are the glass bead necklaces you have to give for a peak at scalability, why not take a walk on the wild side?
So anyway, this is awesome stuff and thanks Todd. For your reading and learning enjoyment: Todd Hoff’s “How I learned to stop worrying and use lots of disk space to scale”.
Actually I was learning from the people mentioned in post and inflicting it on the blog. I don’t have an easy time with this stuff either. The post was my way of trying to understand how to write my own code and then writing up how it seems to me.
And I think you are exactly correct about the ideas working on other systems.
Yes, exactly. In fact I was jamming with a giant in the internet-scale world (a very early leader) today about how one could use Oracle’s object interface on a federated architecture to exactly mimic the performance and scaling characteristics of bigtable.