Some decisions sound easy, but it's also easy to get them wrong. Today I had a choice of hanging around New York City, or working on my big data presentation for RMOUG. Sounds easy, and yet I spent the day working on that presentation.
Whenever I tell an experienced Oracle DBA about Hadoop and what companies are doing with it, the immediate response is “But I can do this in Oracle”.
It is true. Oracle is a marvelous generic database and can be used for many things.
You can process your web logs and figure out which paths customers are likely to take through your online store and which are more likely to lead to a sale. You can build a recommendation engine with Oracle. You can definitely do ETL in Oracle. I’m hard pressed to think of a use-case that is simply impossible with Oracle. I can’t say for certain that Google and Facebook could replace their entire data centers with Oracle, but perhaps it's possible.
Just because it is possible to do something doesn’t mean you should. There are good reasons to use Oracle as your default solution, especially when you are an experienced Oracle DBA.
But do you really want to use Oracle to store millions of emails and scanned documents? I have a few customers who do it, and I think it causes more problems than it solves. After you have stored them, do you really want to use your network and storage bandwidth so the application servers can keep reading the data from the database? Big data is… big. It is best not to move it around too much, and instead to run the processing on the servers that store the data. After all, the code takes fewer packets than the data. But Oracle makes cores very expensive. Are you sure you want to use them to run processing-intensive data mining algorithms?
Then there’s the issue of actually programming the processing code. If your big data is in Oracle and you want to process it efficiently, PL/SQL is pretty much the only option. PL/SQL is a nice language, but it lags behind Java and Python in the availability of tools, in ease of use, and most of all in its standard library. I don’t think anyone seriously considers writing their data mining programs in PL/SQL (except maybe Gerwin Hendriksen). If you write your code in Java or Python instead, and use it to access mostly unstructured data in Oracle, you go through many layers of abstraction that result in uglier and slower code. I fail to see the benefit in forcing either PL/SQL or Oracle to do something unnatural.
Christo says that this means that Big Data is actually a license issue. It is partially a license issue: Oracle Database is expensive and MySQL isn’t good at data warehouse stuff. It is partially a storage and network issue: as data volumes scale up, locality of data becomes more critical. But I see it mostly as using the right tool for the job, and just because Oracle can do something doesn’t make it the best way to do it. Sometimes you just need Python, a good file system and a nice distributed framework.
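To make that a bit more concrete, here is a minimal sketch of the web-log example from earlier (which paths through the store lead to a sale), written as a group-then-aggregate job in plain Python. The tab-separated log format, the sample data and the function names are all assumptions made up for this illustration, not any framework's API. It runs on a single machine; a distributed framework would run the same two steps on the servers that already hold each chunk of the log.

```python
from collections import Counter, defaultdict

# Hypothetical log format: "session_id<TAB>page<TAB>event", one line per page view.
SAMPLE_LOG = [
    "s1\t/home\tview",
    "s1\t/shoes\tview",
    "s1\t/checkout\tsale",
    "s2\t/home\tview",
    "s2\t/hats\tview",
    "s3\t/home\tview",
    "s3\t/shoes\tview",
    "s3\t/checkout\tsale",
]


def group_by_session(lines):
    """Grouping step: collect (page, event) pairs per session, in log order."""
    sessions = defaultdict(list)
    for line in lines:
        session_id, page, event = line.split("\t")
        sessions[session_id].append((page, event))
    return sessions


def score_paths(sessions):
    """Aggregation step: for each distinct path, the fraction of sessions with a sale."""
    seen = Counter()
    sold = Counter()
    for views in sessions.values():
        path = tuple(page for page, _ in views)
        seen[path] += 1
        if any(event == "sale" for _, event in views):
            sold[path] += 1
    return {path: sold[path] / seen[path] for path in seen}


def conversion_rates(lines):
    """Run both steps over an iterable of log lines."""
    return score_paths(group_by_session(lines))


if __name__ == "__main__":
    for path, rate in conversion_rates(SAMPLE_LOG).items():
        print(" -> ".join(path), f"{rate:.0%}")
```

Nothing here is beyond Oracle, of course. The point is only that the code is a few dozen lines that can be shipped to wherever the log files live, instead of pulling the logs across the network to the code.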