Reduce costs by adding a data lake to your cloud data warehouse

Posted in: Business Insights, Technical Track

By Lynda Partner, VP Analytics as a Service and Danil Zburivsky, Director of Engineering

When it comes to data warehouse modernization, we’re big fans of moving to the cloud. The cloud brings unprecedented flexibility that allows you to easily accommodate the growing velocity, variety, volume, veracity and value of today’s data. It also allows you to take advantage of cost elasticity, meaning you pay only for what you use.

To date, concerns over cloud costs have not ranked high among decision makers’ priorities, because a) there are obvious cost savings over an on-premises solution, and b) the benefits of agility are the primary drivers of the decision to move data warehouses to the cloud.

That said, we’re predicting that once you’ve modernized and you’re in the cloud, corporate eyes will turn to an examination of cloud costs with a view to cost optimization. IT managers will need to be ready to demonstrate that their platforms have been designed to optimize cloud costs, which are driven by a number of factors, mostly storage and data processing.

When we design solutions for data warehouse modernization in the cloud, we spend a lot of time thinking about how to optimize costs (see this blog post for a higher-level discussion of building cost considerations into data platform architectures). It takes time, but it pays off in the long run. While there are many ways to incorporate cost considerations into a design, one way is to use a data lake with your data warehouse (a data lake has many other benefits too, but let’s stick with cost optimization for the purposes of this blog post).

Here is a simple analysis we did for one of our projects. It demonstrates that adding a cloud data lake to your cloud data warehouse can keep your costs lower than a data warehouse alone, even as it brings other benefits.

Our simple data platform example looked like this:

The components that will contribute to the overall cost of this simple but quite typical data platform solution running in Azure are the following:

  1. Storing data on Azure Blob Storage
  2. Executing pipelines with Data Factory
  3. Running multiple Databricks Spark cluster nodes
  4. Running an Azure SQL instance

We will omit the first two items for simplicity and because the cost of storage and Data Factory pipeline execution will be insignificant compared to the cost of running Databricks and Azure SQL. You are welcome to estimate the costs of the first two components yourself using the Azure pricing documentation.

NOTE: All prices shown below are as per Azure documentation as of February 2019 and are in USD.

Databricks Spark cluster pricing is driven by how many nodes you run and what type of nodes they are. According to the Azure Databricks pricing site, a single DS3 v2 VM costs $0.692 / hour. Our simple example used 9 VMs for a cluster (8 worker VMs and one driver VM), so the total cost is $6.228 per hour.
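The cluster arithmetic above can be sketched in a few lines of Python. The $0.692/hour DS3 v2 rate is the February 2019 price cited above; the helper function name is ours, not an Azure or Databricks API:

```python
# Hourly cost of a Databricks Spark cluster: one driver VM plus N workers.
# The $0.692/hour DS3 v2 rate is the February 2019 Azure price cited above.
DS3_V2_HOURLY_USD = 0.692

def cluster_cost_per_hour(worker_nodes, vm_hourly_rate=DS3_V2_HOURLY_USD):
    """Total hourly cost: all worker VMs plus one driver VM."""
    return (worker_nodes + 1) * vm_hourly_rate

print(f"${cluster_cost_per_hour(8):.3f}/hour")  # 9 VMs -> $6.228/hour
```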

If we were to use a data warehouse without a data lake, we’d need to look at the cost of an Azure SQL instance that allows us to get similar results as the data lake in terms of query performance. According to Microsoft’s SQL Database pricing website, the Premium tier, 1750 DTUs instance will cost $9.4088 / hour. It’s clearly more expensive than a Spark cluster, but not significantly so.

The cost difference becomes more obvious when you start utilizing the cloud’s ability to provision resources on demand and only pay for them when you use them.

To demonstrate, let’s outline the following use case: imagine that our query needs to be executed only once every hour, and the results are saved back to cloud storage for further consumption by some other process. In the meantime, let’s assume all other analytical needs can be satisfied by a Spark cluster half the size of our large one, using 5 DS3 v2 VMs. In the cloud data lake environment, the total cost is easy to calculate. If we assume a typical 8-hour business day, we need one small 5-node cluster running 8 hours a day for, on average, 20 days a month. The total monthly cost for the small cluster is then $0.692 * 5 * 8 * 20 = $553.60.

In addition, we need a separate large cluster with 9 nodes, but we only need it for ~5 minutes every hour to run our large query. Over 8 business hours a day and 20 days a month, this results in a total monthly cost of ($0.692 / 60) * 5 * 9 * 8 * 20 = $83.04. So you can execute your long-running query for only about $80 extra per month! This cost flexibility is only possible with the data lake design because there is a clear separation between the storage and compute resources. The data itself is stored on Azure Blob Storage and can be accessed by multiple clusters at the same time, completely independently of each other.
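As a rough sketch, the two-cluster monthly bill works out as follows. The VM rate is the February 2019 price quoted earlier, and the 8-hour / 20-day schedule is the assumption from the use case above:

```python
# Monthly cost of the two on-demand Spark clusters described above,
# using the February 2019 DS3 v2 rate of $0.692/hour.
VM_HOURLY_USD = 0.692
HOURS_PER_DAY = 8     # typical business day (assumption from the use case)
DAYS_PER_MONTH = 20   # average working days per month

# Small 5-node cluster running the whole business day.
small_cluster = VM_HOURLY_USD * 5 * HOURS_PER_DAY * DAYS_PER_MONTH

# Large 9-node cluster spun up for ~5 minutes out of every business hour.
hourly_runs = HOURS_PER_DAY * DAYS_PER_MONTH   # 160 runs per month
large_cluster = (VM_HOURLY_USD / 60) * 5 * 9 * hourly_runs

total = small_cluster + large_cluster
print(f"small ${small_cluster:.2f}, large ${large_cluster:.2f}, "
      f"total ${total:.2f}")
```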

Now, the question is: Can we achieve something similar with Azure SQL alone?

Theoretically, it’s possible, because Azure also allows you to scale an instance up and down on demand. The practical challenge is that Azure SQL doesn’t offer separation of storage and compute, so you can’t have multiple instances of different sizes working with the same data simultaneously. If you want to scale up your instance every hour to run a heavy query, you need to take the whole Azure SQL instance offline for the duration of the scaling process. This can take 30 minutes or more, during which no one can access the data. For obvious reasons, this approach is usually unacceptable.

The alternative is to either keep a large instance running 8 hours a day and scale it down at night and on weekends, or to settle on a smaller instance size and sacrifice the performance of our long-running query. Let’s say performance is really important and we’ve decided to run a Premium tier Azure SQL instance for 8 hours a day. The resulting cost would be $9.4088 * 8 * 20 = $1,505.41. You can see that this is significantly more expensive than running two on-demand Spark clusters, which would cost $636.64.
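Putting the two options side by side makes the gap concrete. Both prices are the February 2019 figures quoted above, and the data lake total is the sum of the small and large cluster costs worked out earlier:

```python
# Business-hours Azure SQL Premium instance vs. the on-demand data lake
# clusters, using the February 2019 prices quoted above.
AZURE_SQL_HOURLY_USD = 9.4088          # Premium tier, 1750 DTUs
sql_monthly = AZURE_SQL_HOURLY_USD * 8 * 20   # 8 hours/day, 20 days/month

lake_monthly = 553.60 + 83.04          # small + large Spark clusters

print(f"Azure SQL:  ${sql_monthly:.2f}/month")
print(f"Data lake:  ${lake_monthly:.2f}/month")
print(f"Azure SQL costs {sql_monthly / lake_monthly:.2f}x more")
```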

While cost certainly isn’t the only reason for considering a cloud data lake, it should be on your list.


Want to talk with a technical expert? Schedule a tech call with our team to get the conversation started.


About the Author

Lynda Partner is a self-professed data addict and experiences the power of data every day as Pythian’s Vice President of Analytics-as-a-Service. The author of Pythian’s Love Your Data mantra, Lynda understands very well how data can transform companies into competitive winners and she was the driving force in adding an analytics practice to Pythian’s database focus. Lynda works with companies around the world and across industries to turn data into insights, predictions and products, and is the co-author of Designing Data Lakes in the Cloud.
