With all the buzz at OOW about the big data machine, there was also a lot of nonsense flying around. I love it that the Oracle community is finally interested in Hadoop and NoSQL, but I hate it when people sound authoritative without having an actual clue. I’ve left a few presentations with smoke coming out of my ears.
Here are a few things that people got all wrong:
Describing Hadoop as a NoSQL database:
Hadoop is not a database. It’s a filesystem (HDFS) and a distributed programming framework (MapReduce).
Oracle has a different product (on the same appliance) called Oracle NoSQL. This is a real NoSQL database – only key-value pairs, no joins, data is sharded and replicated, no single point of failure.
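To make "key-value pairs, sharded and replicated" a bit more concrete, here is a minimal in-memory sketch. It is a toy model and not Oracle NoSQL's actual API: the class `ToyShardedKV`, its methods, and the choice of MD5 hashing for shard placement are all assumptions made up for illustration.

```python
import hashlib


class ToyShardedKV:
    """Toy key-value store: keys are hash-sharded, each shard has several replicas.

    Hypothetical illustration only; real products differ in almost every detail.
    """

    def __init__(self, num_shards=3, replicas=3):
        # shards[i] is a list of replica dicts that all hold the same data
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        # Write to every live replica of the owning shard
        for replica in self.shards[self._shard_for(key)]:
            if replica is not None:
                replica[key] = value

    def get(self, key):
        # Read from the first live replica that has the key
        for replica in self.shards[self._shard_for(key)]:
            if replica is not None and key in replica:
                return replica[key]
        return None

    def fail_replica(self, shard, replica):
        # Simulate losing one copy of a shard
        self.shards[shard][replica] = None


store = ToyShardedKV()
store.put("user:42", "Alice")
store.fail_replica(store._shard_for("user:42"), 0)
value_after_failure = store.get("user:42")  # still served by another replica
```

Note that the only operations are `put` and `get` by key. There is nothing to join on, and losing one replica does not lose the data, which is the point of the "no single point of failure" claim.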
Hadoop can only be used for basic ETL transformations. Real data analysis has to be done in Oracle and BI tools.
There’s a grain of truth there – programming anything non-trivial on Hadoop is a real pain. There’s also a large dose of marketing – Oracle wants you to believe that Hadoop is just another way to get data into an Oracle database. However, the trend is toward friendlier languages and tools that work directly with Hadoop, and many smart people already do advanced processing on Hadoop, processing that can’t be done on Oracle, or can’t be done fast enough. Categorizing Hadoop as ETL-only misses a lot of its value.
Hadoop is not accurate
Honestly, I’m not even sure what they mean by that.
If the idea is that Hadoop is normally used to store data that is “dirtier” than what normally goes into a DW system, then it’s true, but misleading. A DW by definition holds cleaned-up data. There is no data cleanup magic going on – if you want clean data then you need to process it, in Hadoop or elsewhere. It is also true that Hadoop often stores data that needs more processing than the data in your OLTP system. But this is not because Hadoop is inaccurate, it’s because Hadoop provides the only way for businesses to manage and process this type of data – application logs, social media, images.
On the other hand, if you think that Hadoop is inaccurate because some writes are lost, calculations work on just a sample of the data, analysis is done on inconsistent data, or data gets corrupted often, that is pretty much wrong.
Hadoop is real time
It isn’t. Hadoop is built for batch processing. You load the data, usually in bulk since Hadoop works best on large chunks of data. Then you create MapReduce jobs to process the data and harvest the results. Even the smallest, fastest job you can imagine creating takes several seconds to run. There are some attempts to build real-time Hadoop, but it isn’t a mainstream use case or product.
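For readers who have not seen the batch model, here is a rough single-process sketch of the map, shuffle and reduce steps, written in plain Python. It is not Hadoop code: the function names and the sample log lines are invented for illustration, and a real job would run these phases across many nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single result
    return key, sum(values)


def run_job(lines):
    # The whole input is processed in one pass; results appear only at the end
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, standing in for a bulk-loaded log file
log_lines = ["error disk full", "warn disk slow", "error network down"]
counts = run_job(log_lines)
```

The shape is what matters here: nothing comes out until the whole input has been mapped, grouped and reduced, which is why this model suits bulk processing rather than answering a single request in real time.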
Hadoop is high performance
It isn’t. Hadoop is a classic example of a system that scales without giving good performance on a per-node basis. The smallest usable cluster is around 20 nodes, and the clusters grow very large indeed: 1700 nodes at LinkedIn, 20,000 at Yahoo. Compare that to Oracle, where one node often beats the performance of two and a 24-node RAC is considered huge. Obviously a lot of automation is needed to manage a cluster of this size. Because Hadoop makes it so easy to add nodes and scales so well this way, there is often little optimization done on a single server. For one thing, it’s written in Java, with a long code path from the MapReduce job to the disk controller. It’s very hard to know if what you want to see is actually what happens. MapR sells a faster file system for Hadoop, proving that there is still much room for improvement. But I think that much can be improved by paying attention to what the software does at the single-server level and tuning the hardware and OS to match. Of course, very few people can tune an entire stack from Java to disk controllers.
Any other myths you’ve run into? Any rumors you want me to verify? Please comment!