With all the buzz at OOW about the big data machine, there was also a lot of nonsense flying around. I love it that the Oracle community is finally interested in Hadoop and NoSQL, but I hate it when people sound authoritative without having an actual clue. I’ve left a few presentations with smoke coming out of my ears.
Here are a few things that people got all wrong:
Describing Hadoop as a NoSQL database:
Hadoop is not a database. It’s a filesystem (HDFS) and a distributed programming framework (MapReduce).
Oracle has a different product (on the same appliance) called Oracle NoSQL. This is a real NoSQL database – only key-value pairs, no joins, data is sharded and replicated, no single point of failure.
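To make "sharded and replicated key-value pairs" a bit more concrete, here is a minimal Python sketch of the idea. It is a toy, not how Oracle NoSQL is actually implemented: the class name TinyKVStore, the node count, and the "next node in line" replica placement are all assumptions made up for illustration.

```python
import hashlib


class TinyKVStore:
    """Toy key-value store: each key hashes to a shard and is copied to several nodes.

    Hypothetical illustration only; real products use far more careful
    placement, consistency, and recovery logic.
    """

    def __init__(self, num_nodes=4, replicas=2):
        # Each "node" is just an in-memory dict here.
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def _owners(self, key):
        # Hash the key to pick the first node, then use the following
        # nodes as replicas (an assumed, simplistic placement rule).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write the pair to every node that owns this key.
        for n in self._owners(key):
            self.nodes[n][key] = value

    def get(self, key, down=()):
        # Read from the first owner that is not marked as down.
        for n in self._owners(key):
            if n in down:
                continue
            return self.nodes[n][key]
        raise KeyError(key)


store = TinyKVStore()
store.put("user:42", "Alice")
owners = store._owners("user:42")
# Even with the primary owner "down", a replica still answers the read.
value_with_failure = store.get("user:42", down={owners[0]})
```

Note what is missing: there is no query language and no join, only put and get by key. That is the trade-off that makes spreading the data across nodes so simple.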
Hadoop can only be used for basic ETL transformations. Real data analysis has to be done in Oracle and BI tools.
There’s a grain of truth there – programming anything non-trivial on Hadoop is a real pain. There’s also a large dose of marketing – Oracle wants you to believe that Hadoop is just another way to get data into an Oracle database. However, the trend is toward friendlier languages and tools that work directly with Hadoop, and there are many smart people who already do advanced processing on Hadoop, processing that can’t be done on Oracle, or can’t be done fast enough. Categorizing Hadoop as ETL-only misses a lot of its value.
Hadoop is not accurate
Honestly, I’m not even sure what they mean by that.
If the idea is that Hadoop is normally used to store data that is “dirtier” than what normally goes into a DW system, then it’s true, but misleading. A DW by definition has cleaned-up data. There is no data cleanup magic going on – if you want clean data then you need to process it, in Hadoop or elsewhere. It is also true that Hadoop often stores data that needs more processing than your OLTP system. But this is not because Hadoop is inaccurate, it’s because Hadoop provides the only way for businesses to manage and process this type of data – application logs, social media, images.
On the other hand, if you think that Hadoop is inaccurate because some writes are lost, calculations work on just a sample of the data, analysis is done on inconsistent data, or data gets corrupted often, this is pretty much wrong.
Hadoop is real time
It isn’t. Hadoop is built for batch processing. You load the data, usually in bulk since Hadoop works best on large chunks of data. Then you create map-reduce jobs to process the data and harvest the results. Even the smallest, fastest job you can imagine creating takes several seconds to run. There are some attempts to build real-time Hadoop, but it isn’t a mainstream use case or product.
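If you have never seen the shape of a map-reduce job, here is a rough sketch in plain Python. It runs on one machine and does not use Hadoop's API at all; the function names map_phase, shuffle, reduce_phase, and run_job, and the sample log lines, are invented for illustration. The point is only the batch flow: load everything, map it, group by key, reduce, then collect the results.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


def run_job(lines):
    # The whole input is processed in one pass; results exist only at the end.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical sample input standing in for a bulk-loaded log file.
log_lines = ["GET /index", "GET /about", "POST /index"]
counts = run_job(log_lines)
print(counts)
```

On a real cluster the map and reduce steps are spread over many nodes and the job has to be scheduled and started on each of them, which is where those several seconds of minimum latency come from.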
Hadoop is high performance
It isn’t. Hadoop is a classic example of a system that scales without giving good performance on a per-node basis. The smallest usable cluster is around 20 nodes, and the clusters grow very large indeed: 1,700 nodes at LinkedIn, 20,000 at Yahoo. Compare that to Oracle, where one node often beats the performance of two and a 24-node RAC is considered huge. Obviously, a lot of automation is needed to manage a cluster of this size.
Because Hadoop makes it so easy to add nodes and scales so well this way, there is often little optimization done on a single server. First of all, it’s written in Java, with a long code path from the MapReduce job to the disk controller. It’s very hard to know if what you want to see is actually what happens. MapR is selling a faster file system for Hadoop, proving that there is still much room for improvement. But I think that much can be improved by paying attention to what the software does at the single-server level and tuning the hardware and OS to match. Of course, very few people can tune an entire stack from Java to disk controllers.
Any other myth you ran into? Any rumor you want me to verify? Please comment!
A very informative post. Thank you.
re: This is a real NoSQL database – only key-value pairs, no joins, data is sharded and replicated, no single point of failure.
Are these the defining characteristics of NoSQL databases or just Oracle NoSQL’s feature set?
The only requirement of NoSQL is to be non-relational, and therefore without joins.
The other attributes are common in NoSQL, but there are exceptions.
For example, MongoDB stores JSON documents, not key-value pairs.
The Apache Pig project https://pig.apache.org/ provides a very impressive query language for Hadoop.
Backtype/Twitter Storm is a real-time ETL platform for Hadoop https://www.slideshare.net/nathanmarz/storm-distributed-and-faulttolerant-realtime-computation
Take a look and maybe you will write a new Mythbusting for Mythbusting :)
I’m aware of both projects.
You should realize that while Pig looks very impressive to someone who is used to plain MapReduce, it still looks very basic and primitive to most SQL developers who are used to a richer library and toolset.
However, I did say that “Hadoop is only for basic processing” is a myth. You can definitely do impressive processing on Hadoop. It’s just very difficult for someone who is used to Oracle. Even with Pig.
Regarding Storm – it’s cool, and I did say that there are real-time Hadoop projects (I think Facebook has another one, with HBase). But it’s still far from a mature product with mainstream adoption. Even by Hadoop standards.
Thanks for pointing those out for my readers though!
“large doze of marketing” Should that be ‘dose’, or does it actually make you sleepy ?
Corrected the typo. And yes, Oracle marketing is often sleep inducing :)
But, mongoDB is web scale!
NoSQL = Not only SQL
Hope you’ll revisit this page (the “NoSQL”, “analysis” and “real time” items anyway) once Impala goes out. :-)
Very interesting and informative post. I’m very interested in learning more about Hadoop, NoSQL and Big Data in general. Thanks for helping find a good start to my path.