Can you run Hadoop in the cloud?

Posted in: Business Insights, Technical Track

As a solutions architect at Pythian, I often get questions from clients about the many solutions available to them to address their big data needs. Between Hadoop, cloud-based, and hybrid solutions, finding the best option for their unique needs can be a daunting challenge.

Sometimes Hadoop is the answer, and sometimes a cloud solution is a better fit. Determining the right direction requires us to consider multiple factors, including the specific need and use cases, budget, available resources within the organization, and data volume, to name a few.

With this post, I’m going to try to answer some of the most common questions clients present to us: At what point is big data big enough to require a system like Hadoop? When is it wise to run big data workloads in the cloud? What are Hadoop’s limitations, and how does it compare to cloud alternatives? And can you run Hadoop in the cloud?

The Costs And Benefits Of Hadoop

Hadoop is ideal for batch processing at terabyte to petabyte scale. Because Hadoop can store and process any type of data, from plain text files to binary files such as images, or different versions of data collected over time, it’s ideal if your use case requires you to store large volumes of unstructured data.
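The batch model Hadoop popularized is classic MapReduce. As a toy, purely local illustration (this is not Hadoop code, and the input data is made up), the map, shuffle, and reduce phases of a word count can be sketched in plain Python:

```python
from collections import defaultdict

# Toy illustration of the MapReduce batch model Hadoop popularized.
# Real Hadoop distributes these phases across cluster nodes; this runs locally.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(values) for word, values in groups.items()}

lines = ["big data needs big tools", "big data in the cloud"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The value of the real framework is that the map and reduce tasks run in parallel across many machines, next to the data blocks HDFS stores on each node, which is what makes the model work at petabyte scale.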

On the surface, it appears to be a cost-effective way to handle these big data workloads because it runs on commodity hardware and can scale from a single node to thousands of nodes. And the software itself is open source, so it’s free.

But on the other hand, Hadoop is not a single solution—it’s a framework that requires you to build your data warehouse from the ground up. That means your solution can cost a lot of time and require specialized engineering resources. But once it’s built, you can continue to scale to suit your needs.

How Hadoop Has Helped Evolve Big Data In The Cloud

Many Hadoop ecosystem projects are easily reusable in the cloud and have been widely adopted in cloud architectures. Most notably, Spark and Kafka were developed as part of the Hadoop ecosystem, but they have adapted even better to the cloud than to the environment for which they were originally built.

Spark and Kafka work very well in the cloud alongside other scalable message bus systems like Google Pub/Sub, Amazon Kinesis, or Azure Event Hubs.

There are also some good distributed SQL engines built for the cloud, including Amazon Redshift, Google BigQuery and Azure SQL Data Warehouse. These are better alternatives for running SQL workloads than Hadoop.

Can You Run Hadoop In The Cloud?

Technically yes, but cloud and Hadoop architectures were created for different purposes. Hadoop was really meant for physical data centers, while cloud environments were built to provide better elasticity and faster resource provisioning.

Maintaining a large permanent cluster in the cloud is expensive, especially at scale. So instead of running a permanent HDFS cluster, you are usually better off keeping data in cloud storage and using one of the cloud SQL engine alternatives.

Another challenge is that there is really no universal solution for security. While you might have a preferred cloud solution from one vendor, your Hadoop vendor may use another cloud platform. Using a mix of solutions can result in a steep learning curve, and possibly switching costs. And when it comes to ephemeral clusters in the cloud, governance and security become even more challenging. There is not yet an ideal solution to these challenges.

So yes, it is possible, but the costs and barriers that come with running Hadoop in the cloud generally make alternatives more appropriate. Additionally, in many cases you lose the ability to embrace the fully managed nature of cloud offerings and the capability they provide. Running Hadoop in the cloud should therefore not be viewed as a simple technical choice, but as a choice based on overall capability and long-term strategic vision.

The Bottom Line: Going Straight To The Cloud For Big Data Analytics

While Hadoop on its own is a reliable, scalable solution, the engineering costs and the inefficiencies of batch-processing queries make it inappropriate for most use cases.

And as for implementing Hadoop in the cloud, you may end up needlessly building something from scratch that was already available as a service from one of the major cloud platform providers.

This is something we experienced first hand during a recent client project where we migrated a large on-premises Hadoop cluster with petabytes of data to Google Cloud Platform. Once the migration was complete, not much was left of the original Hadoop architecture. There were some basic Spark jobs that we could run on elastic clusters in the cloud, but there was no need to maintain a permanent cluster for the long term.

“When it comes to #bigdata, many companies are skipping Hadoop and going straight to cloud.”


Companies are increasingly looking to adopt big data solutions, but many of them are skipping Hadoop entirely and jumping straight to a cloud solution for these reasons. So when you start planning for your next big data project, it is important that you consider cloud-based solutions as an alternative to Hadoop—you may just find that it meets your needs faster and at a lower cost.

Want to talk with an expert? Schedule a call with our team to get the conversation started.

3 Comments

Great article, Danil. However, I see you did not mention Hadoop cloud solutions, such as Azure HDInsight and Amazon EMR. I believe those are the services we need to compare to local Hadoop clusters. DWs serve a slightly different purpose, although I agree that companies should prefer a DW most of the time.

Danil Zburivsky
February 26, 2019 8:56 am

Hi Daniel,
I guess the key point here is that while you certainly can run a traditional Hadoop workload in the cloud and there are services to help you do so, this goes against some of the foundational cloud ideas. Running a permanent cluster on EMR would usually also mean you store data on the VM nodes using HDFS, and this doesn’t allow you to elastically scale your workload up and down on demand. You can only realize the full value of the cloud when it comes to data processing if you completely separate your storage from your compute and use EMR or Google Dataproc as a transient compute cluster, just for the duration of the task, not running the cluster 24/7.
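To make the transient-cluster pattern concrete, here is a rough sketch of what it can look like with the Google Cloud CLI. The bucket, cluster, region, and script names are made-up examples, not from a real project:

```shell
# Hypothetical sketch: storage and compute fully separated.
# Data lives permanently in object storage, not in HDFS on cluster nodes.
gsutil cp events.json gs://example-analytics-bucket/raw/

# Spin up a Dataproc cluster only for the duration of the job...
gcloud dataproc clusters create transient-etl \
    --region=us-central1 --num-workers=2

# ...run the Spark job against the data in Cloud Storage...
gcloud dataproc jobs submit pyspark process_events.py \
    --cluster=transient-etl --region=us-central1 \
    -- gs://example-analytics-bucket/raw/events.json

# ...and delete the cluster as soon as the job finishes.
# The data in the bucket persists; only the compute goes away.
gcloud dataproc clusters delete transient-etl --region=us-central1 --quiet
```

The same flow works on AWS with EMR and S3; the point is that the cluster is disposable because none of the state lives on it.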



