What Should We Store on Hadoop?


Last year at Oracle OpenWorld, the most frequent questions about Hadoop and Big Data were either “What is it?” or “Will Hadoop replace Oracle?”
The consistent message, both from Oracle and from independent data architects, has been: “Hadoop will not replace Oracle. Each system has its strengths, and they can be used side by side to offer a wider range of data storing and processing possibilities”.

It seems that most professionals understood and agreed with this message, since the most frequent question this year is: “Which kind of data should we store in Hadoop and in Oracle?”

I can’t claim to have the definitive answer, but I can offer some pointers and start the discussion.

Let’s start with size. Hadoop typically doesn’t make much sense for clusters of fewer than six nodes; that is roughly the point where the maintenance overhead and the headache of adding another system start to pay off. Most production clusters have 20 to 30 nodes. Each server is likely to have at least 1 TB of storage, and data is replicated three times. So I’d say that if you have less than one terabyte of data, Hadoop is likely to be more of a problem than a solution. We may not know how big data should be to count as Big Data, but 1 TB seems like a lower bound.
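As a rough sketch of the arithmetic behind these numbers: usable capacity is approximately raw disk divided by the replication factor. The function name and the example cluster sizes below are illustrative assumptions, and the estimate ignores space needed for the operating system and intermediate job output.

```python
def usable_capacity_tb(nodes: int, tb_per_node: float, replication: int = 3) -> float:
    """Approximate usable capacity in TB: raw disk divided by the replication factor.

    Ignores space needed for the OS, temporary files, and intermediate job output.
    """
    if nodes <= 0 or tb_per_node <= 0 or replication <= 0:
        raise ValueError("nodes, tb_per_node and replication must be positive")
    return nodes * tb_per_node / replication


# Illustrative figures only: a 6-node and a 20-node cluster with 1 TB per node.
for nodes in (6, 20):
    print(f"{nodes} nodes -> about {usable_capacity_tb(nodes, 1.0):.1f} TB usable")
```

Even the smallest cluster that is worth running leaves room for a couple of terabytes of replicated data, which is one way to see why datasets well under 1 TB rarely justify the extra system.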

If you have images, videos, sound files, or other non-text unstructured data and plan on actually processing it, Hadoop is definitely the solution. Storing 1 TB of image files in a database or on a storage device and copying them over the network for processing on application servers is far less efficient than using Hadoop to process the data where it is stored.

If you store images and videos without processing them, Hadoop is still a fairly cost-effective solution, but at this point you need to calculate the cost per terabyte and see which data store makes the most sense.

If you store text files that are truly unstructured, such as blog posts, and plan on processing them using natural language processing tools, Hadoop is a good solution. It allows you to store unstructured text and process it at the point of storage.

If you just want to search your text files and don’t plan on processing them, a text-index solution such as Solr makes more sense.

Semi-structured text such as log files or XML is the most difficult to place. Since it has structure, it can be stored in a relational database. The decision is whether you want to structure the data when you save it or impose the structure later, when you retrieve the data.
If at all possible, it is preferable to structure the data when you first store it. This way, you only structure it once instead of multiple times by each application retrieving the data. You also reduce the possibility of errors due to undocumented structure. Typically, it also reduces disk space usage.
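To make the two options concrete, here is a minimal sketch contrasting structure-on-write with structure-on-read. The log format, the `LogRecord` type, and the function names are hypothetical, chosen only for illustration.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Hypothetical log format: "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(\S+) (\S+) ([A-Z]+) (.*)$")


class LogRecord(NamedTuple):
    date: str
    time: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Return a structured record, or None if the line does not fit the format."""
    match = LOG_PATTERN.match(line.strip())
    return LogRecord(*match.groups()) if match else None


raw_lines = [
    "2012-10-08 15:28:01 ERROR disk full on node7",
    "2012-10-08 15:28:05 INFO job 42 finished",
]

# Structure on write: parse once at load time and store typed records.
stored_records = [rec for rec in map(parse_line, raw_lines) if rec is not None]
errors_from_records = sum(1 for rec in stored_records if rec.level == "ERROR")


# Structure on read: keep the raw lines; every reader repeats the parsing.
def count_errors_on_read(lines: Iterable[str]) -> int:
    count = 0
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec.level == "ERROR":
            count += 1
    return count
```

Both paths give the same answer here, but in the second one every application must carry its own copy of the parsing logic, which is where undocumented structure tends to cause errors.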

However, as you may recall, database experts caution against using the database as a “bucket of bits”. The performance of a relational database depends on having the right data model. If you don’t know how the data will be used and don’t have a good data model, it is better to store the data unstructured on Hadoop until you know how it will be used and can create a good relational data model for it.

Finally, it can also make financial sense to use Hadoop to store large amounts of structured data that is queried very infrequently. If the business demands storing 15 years of data but only ever queries the last two years, Hadoop could be cost-efficient storage for the other 13 years, with the added bonus that you can still query the data if needed. Again, calculate the cost per terabyte to see if this is a good fit for you.
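The comparison can be sketched in a few lines. Everything numeric here is a placeholder: the yearly data volume and both prices per terabyte are assumptions in arbitrary cost units, not real quotes, so substitute your own figures.

```python
def storage_cost(hot_tb: float, cold_tb: float,
                 hot_price_per_tb: float, cold_price_per_tb: float) -> float:
    """Total cost of keeping hot_tb on the expensive store and cold_tb on the cheap one."""
    return hot_tb * hot_price_per_tb + cold_tb * cold_price_per_tb


YEARS_KEPT = 15      # retention period from the example above
YEARS_QUERIED = 2    # only the most recent years are queried regularly
TB_PER_YEAR = 1.0    # assumed data volume per year
DB_PRICE = 10.0      # placeholder cost units per TB on the database
HADOOP_PRICE = 1.0   # placeholder cost units per TB on Hadoop

hot_tb = YEARS_QUERIED * TB_PER_YEAR
cold_tb = (YEARS_KEPT - YEARS_QUERIED) * TB_PER_YEAR

# Option 1: keep all 15 years in the database.
all_in_database = storage_cost(hot_tb + cold_tb, 0.0, DB_PRICE, HADOOP_PRICE)
# Option 2: keep 2 years in the database and move 13 years to Hadoop.
split_storage = storage_cost(hot_tb, cold_tb, DB_PRICE, HADOOP_PRICE)

print(f"all in database: {all_in_database}, split: {split_storage}")
```

The result depends entirely on the ratio of the two prices, so the split only pays off if your real Hadoop cost per terabyte, including replication and administration, is meaningfully lower.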

Note that I was only discussing the data stored. There is also the question: “What types of data processing should be done with Hadoop?”
I will try to answer this question at a later point. For now, if you have determined that a certain report is best done on Hadoop, you should store all the data used by the report in Hadoop.

I hope this is helpful. More suggestions and corrections are always welcome.


10 Comments

Deepak Sharma
October 8, 2012 3:28 pm


I attended your session “Building an Integrated Data Warehouse with Oracle Database and Hadoop”. Where can I find the slides?


Giorgio Chiappone
November 6, 2012 6:27 am

I would like to know the benefits of Hadoop compared with an appliance, and about technology benchmarks for Hadoop.



I would like to know about the difficulties of adopting Hadoop. Hadoop is not always the better option. It cannot visualize data or give you interactive solutions to the problems companies face. It has backup problems. Since only one name node knows the full data distribution, it takes a long time to load a new application into it, and while this is going on, other programs running under Hadoop do not work. The hype around Hadoop makes it look bigger than it is, but I think it has its own limitations. So focus on the limitations and on how they can be solved by working in different environments such as Cloudera.


I would like to know whether there is any way to see the locations of data blocks.

I mean, where exactly is the data stored on the slaves?
Which data is stored on which machine?

How can I check the locations?

Kindly tell me something about data storage in Hadoop.


Gwen Shapira
April 15, 2013 6:47 pm

To find the locations of blocks of a file, use FSCK. You can also use it on entire directories.

For example, /user/cloudera/passwd has one block:

$ hdfs fsck /user/cloudera/passwd -files -blocks -locations
Connecting to namenode via https://localhost.localdomain:50070
FSCK started by cloudera (auth:SIMPLE) from / for path /user/cloudera/passwd at Mon Apr 15 19:45:23 EDT 2013
/user/cloudera/passwd 2132 bytes, 1 block(s): OK
0. BP-1809851170- len=2132 repl=1 []
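
If you want to run the same check from a script, a minimal Python sketch could look like the following. It only wraps the command shown above; the helper names `build_fsck_command` and `run_fsck` are hypothetical, and it assumes the `hdfs` client is installed and on the PATH.

```python
import subprocess
from typing import List


def build_fsck_command(path: str) -> List[str]:
    """Build the fsck invocation shown above for a file or directory path."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]


def run_fsck(path: str) -> str:
    """Run fsck and return its raw text output (assumes `hdfs` is on the PATH).

    The exit code is not checked here; inspect the output for the status.
    """
    result = subprocess.run(
        build_fsck_command(path), capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `run_fsck` on a directory path instead of a single file should report every file under it, since fsck also accepts directories.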



Actually, I want to know the locations of data stored on a Hadoop cluster.
For example, say I have a cluster of 4 machines:
1 master node and 3 slave nodes.
Suppose I upload some files from the master by running the command ./hadoop fs -copyFromLocal /home/data/ /inp/
It will copy all the files in the data folder to the inp folder on the Hadoop cluster.
Now my question is: where is this data stored?
I have set the replication factor to 3.
Where are the three copies of each file?
Where are the blocks of the files stored on my slaves?
Is there any method to see this,

or at least to show it?

Please reply.

Thanks for the last reply.
It also helped me a little.



Thanks for the discussion about storing data in Hadoop. I am doing research on content-based video processing using Hadoop. Is it possible to read video content using Hadoop? I can split the video by frames or by time, but how can I store videos in HDFS? Is there a specific tool or API for this?

Rashed Mustafa
March 20, 2014 10:26 pm

Thanks for the discussion about storing data in Hadoop. I am doing research on content-based video processing using Hadoop. Is it possible to read video content using Hadoop? I can split the video by frames or by time, but how can I store videos according to specific content?


Thanks for the above, it was really helpful. I am currently doing research on modeling and simulation in big data. One of the first things I need is for my model to take unstructured data from different sources as input, be it image, video, text, etc., and structure it (the output of the model) using a tool. Is there any Hadoop tool, or any other tool, that I can use to achieve this? Thank you.

