Some decisions sound easy, but it's also easy to get them wrong. Today I had a choice of hanging around New York City or working on my big data presentation for RMOUG. Sounds easy, and yet I spent the day working on that presentation.
Whenever I tell an experienced Oracle DBA about Hadoop and what companies are doing with it, the immediate response is “But I can do this in Oracle”.
It is true. Oracle is a marvelous generic database and can be used for many things.
You can process your web logs and figure out which paths customers are likely to take through your online store and which are more likely to lead to a sale. You can build a recommendation engine with Oracle. You can definitely do ETL in Oracle. I'm hard pressed to think of a use-case that is simply impossible with Oracle. I can't say for certain that Google and Facebook could replace their entire data centers with Oracle, but perhaps it's possible.
Just because it is possible to do something doesn't mean you should. There are good reasons to use Oracle as your default solution, especially when you are an experienced Oracle DBA.
But do you really want to use Oracle to store millions of emails and scanned documents? I have a few customers who do, and I think it causes more problems than it solves. After you've stored them, do you really want to use your network and storage bandwidth so that the application servers can keep reading the data back out of the database? Big data is… big. It is best not to move it around too much and to run the processing on the servers that store the data. After all, the code takes fewer packets than the data. But Oracle makes cores very expensive. Are you sure you want to use them to run processing-intensive data mining algorithms?
Then there's the issue of actually programming the processing code. If your big data is in Oracle and you want to process it efficiently, PL/SQL is pretty much the only option. PL/SQL is a nice language, but it lags behind Java and Python in the availability of tools, in ease of use and, most of all, in its standard library. I don't think anyone seriously considers writing their data mining programs in PL/SQL (except maybe Gerwin Hendriksen). And once you write your code in Java or Python, accessing mostly unstructured data in Oracle means going through many layers of abstraction that result in uglier and slower code. I fail to see the benefit in forcing both PL/SQL and Oracle to do something unnatural.
Christo says that this means big data is actually a license issue. It is partially a license issue: Oracle Database is expensive and MySQL isn't good at data warehouse workloads. It is partially a storage and network issue: as data volumes grow, data locality becomes more critical. But I see it mostly as using the right tool for the job. Just because Oracle can do something doesn't make it the best way to do it. Sometimes you just need Python, a good file system and a nice distributed framework.
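To make that last point concrete, here is a minimal sketch of the "Python plus a distributed framework" idea: a Hadoop-Streaming-style map and reduce over web-log lines, simulated locally. The log format and the page-count job are made up for illustration; in a real cluster, the mapper would run on the nodes that store the data, so only the small (page, count) pairs cross the network.

```python
from collections import defaultdict

def mapper(line):
    """Emit (page, 1) for each web-log line; on a cluster this runs where the data lives."""
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1  # fields[1] is the requested page in this made-up log format

def reducer(pairs):
    """Sum the counts per page; in Hadoop this runs after the shuffle phase."""
    counts = defaultdict(int)
    for page, count in pairs:
        counts[page] += count
    return dict(counts)

# Local stand-in for a distributed run: map over every line, then reduce.
log = [
    "10.0.0.1 /store/item42",
    "10.0.0.2 /checkout",
    "10.0.0.1 /checkout",
]
pairs = (pair for line in log for pair in mapper(line))
print(reducer(pairs))  # {'/store/item42': 1, '/checkout': 2}
```

The point is not the word-count itself but the shape of the solution: plain Python functions over files, with the framework handling distribution, instead of pushing terabytes through the database's client layers.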
With Exadata, the issue of network and storage scalability has been addressed to a great extent, and Oracle keeps improving there too.
So does that leave just the license issue?
Exadata helps a lot with storage/network speeds, but it's not the same as running code on the same box with direct-attached storage (Exadata does a bit of that, but not a lot).
The "coding in PL/SQL" problem is still a big issue. With Exadata, either you need to find someone to write data mining code in PL/SQL, or you move the data over a non-InfiniBand network for processing in the app server.
Exalytics and Exalogic may help with the code issue.
After all that, we are faced with the license cost issue.
Hey Gwen! Thanks for sharing your thoughts on the subject! I think many DBAs/architects are facing the same question over and over again. I hope to see more discussion on this topic.
From my experience, one of the main technical disadvantages of high-volume/big data processing within the Oracle database is the lack of an option that would allow us to avoid generating UNDO and the related REDO streams.
I admire Oracle's read consistency and recoverability features. However, for big data, it should be left to developers to decide whether those features (and their related overhead) should be on or off.
I think people would have fewer technical arguments about leaving binary data in the database if Oracle provided this option (I predict that we will see it in 12 or 13…).
Device => ASM => foreground process => ASM => Device would then become a faster path than any file system could provide. (I mention ASM just to make the point about a configuration where data goes directly between the device and the foreground process.)
Just my $0.02,
The main thing to me in considering any ACID-compliant RDBMS vs. Hadoop/NoSQL solutions is the quality and reliability of results. What often seems overlooked in this discussion is that result sets in Hadoop/NoSQL are allowed to be "lossy," meaning that not all "rows" which may satisfy a given query are guaranteed to be returned in the result set.
In certain applications, such as a web search engine or statistical analysis, a result set with "most" of the rows is acceptable. However, if used for accounting, inventory, logistics, medical, or many other important planning and operations applications, all rows are required, in which case Hadoop/NoSQL cannot be used, as all rows are not guaranteed to be returned.
Prices change, licensing models change, and Oracle RDBMS supports Java in addition to PL/SQL in the database engine, so many of the typical arguments for Hadoop/NoSQL and against Oracle RDBMS in the "big data" dialogue are either transient or inaccurate. The need for accuracy and consistency of data is not.
Been a year since I wrote this post. Cool to see that old stuff is still getting read!
I obviously didn't make myself clear enough: Hadoop and NoSQL are two different things, with different use-cases, benefits and trade-offs.
Most importantly, Hadoop is a file system and job scheduling framework. It's not a database. Nothing prevents someone from writing an ACID-compliant system on top of Hadoop, just like Oracle runs on top of Linux while Linux itself is not ACID.
That said, while ACID is critical for many applications, some businesses can decide that consistency (different from accuracy!) can be compromised in some cases. Even traditional systems like MS SQL Server allow dirty reads when requested, letting developers trade consistency and accuracy for speed when they decide it makes sense.
IMO, more options are always a good thing.
That said, I'd love to hear which pro-Hadoop or pro-NoSQL arguments are inaccurate or transient.
[…] Shapira wrote a very interesting article entitled Oracle Database or Hadoop?. What she (generally) says in not only interesting and joy to read but most of all true: whenever I […]
I'm a B.Tech student considering taking training on databases. Which do you think will be more appropriate in today's context: Hadoop or Oracle DBA? And why?
Hello Sadikshya. If you Google it, you will find that almost all companies are using Oracle. Hadoop is a newer concept that some companies have adopted. So go for Oracle, as you will get many opportunities in the IT market; after that you can learn Hadoop. The main thing is to imbibe the knowledge.
@vinay, I am an Oracle professional and now want to switch to Hadoop, but a big hindrance I am facing is that employers are asking for Hadoop experience. I wonder, since it has only been about 2-3 years since companies started using Hadoop, how they could find relevantly experienced candidates. I recently took training on Hadoop and completed some POCs, but I'm still not getting an entry into the Hadoop domain.
I am also in a dilemma about choosing Oracle or Hadoop. I have done my BSc in CS and am looking to take a DBA course. But I am confused about which would be the better platform to learn.
Sir, I just want to ask you: which is the better option for a secure and well-paid career, Oracle or Hadoop?
Thank you!
“If your big data is in Oracle and you want to process it efficiently, PL/SQL is pretty much the only option.”
Sorry, but no, it's not true. PL/SQL is not only about efficiency; I can pretty much run a single query against tables with millions of rows straight in Oracle.
The issue is not either/or. We need to present ourselves as data integration specialists and work with whatever technology does the job best. I simply cannot go and recommend Oracle for a mom-and-pop shop; it would be quite expensive. The same goes for big data and low-validity data. The right database for the right job, and so on. That was the dilemma most of these companies faced: neither Oracle nor SQL Server was a practical solution, so they had to improvise.
Oracle 13b (as in big data) should take this and other factors into consideration and decide what it really wants to become. So far, adding an extra "i", "g" or "c" to the release has made no difference.
With the current trend, my understanding is that every organization with significant licensing costs is looking for options. There are many variables at play; licensing cost is only one.
As Gwen stated, Hadoop is only the persistence layer, with a proven ability to scale to petabytes of data (or beyond).
Currently, IoT (outside of the hyperscaling segment: Google/Facebook/Amazon, etc.) is one of the driving factors for big data. Anyone who has dealt with the three-Vs model of big data knows how much effort is required to balance a solution against those factors with the RDBMS options.
Oracle has, without a doubt, provided many innovative options to tackle some of these factors. That doesn't mean we can solve all the problems effectively or efficiently.
There are frustrated stakeholders at all levels of the organization with the current (RDBMS) data management solutions. The Hadoop ecosystem provides some innovative options to those individuals. This flexibility is not available with the current established platforms (it is hard to imagine how it could be).
I am not sure whether it was market dynamics or fundamental technical limitations that restricted the traditional data management solutions. Either way, the current big data technologies are truly revolutionizing the data management space.
They are not adding partition limits or real-time SQL monitoring; they are guaranteeing real-time response (Spark/Flink), trying to resolve name node issues, and so on.
I am seeing many new data consumers with higher data volumes moving, by design, to the newer technologies, namely big data: the Hadoop ecosystem, Spark, MemSQL, SpliceMachine, etc.
Even though the new big data technologies look promising, they do have their share of gotchas: name nodes, storage, scaling options, etc. But this would be true for any new technology.
Unless Oracle (or the other players) comes up with a better plan, it is only a matter of time. I can only cite the famous Peter Drucker quote: "Innovate or die!"
Try to ponder the following:
From concept to time to market: how long?
Why do we need multiple places to store the same data in different structures (OLTP/OLAP/EDW, etc.)?
Disk space is cheap… oops, look what happened! All resources are precious and scarce…
Small is big!
In the big data world, can we survive with the backup and restore strategies of current technologies? (Do we really need to stick with this mode of thinking?)
ACID is important, but BASE is important too…
Fundamental technology changes (the von Neumann model?): disk vs. SSD, massive memory capacities, new memory technologies, reconfigurable computing, machine learning, CPU/GPU…