When it comes to data warehouse modernization, we’re big fans of moving to the cloud. The cloud brings unprecedented flexibility that allows you to easily accommodate the growing velocity, variety, volume, veracity and value of today’s data. It also allows you to take advantage of cost elasticity; meaning you only pay for what you use.
Initially, decision-makers don’t always put cloud costs high on their priority lists. Instead, the decision to move data warehouses to the cloud is primarily driven by a) the obvious cost savings over an on-premises solution, and b) the benefits of agility.
That said, we’re predicting that once you’ve modernized and you’re in the cloud, corporate eyes will turn to an examination of cloud costs with a view to cost optimization. IT managers will need to be ready and be able to demonstrate that their platforms have been designed to optimize cloud costs which are driven by a number of factors; mostly storage and data processing.
When we design solutions for data warehouse modernization in the cloud, we spend a lot of time thinking about how to optimize costs (see this blog post for a higher-level discussion of building cost considerations into data platform architectures). It takes time but it pays off in the long run. While there are many ways to incorporate cost considerations into a design, one way is to use a data lake with your data warehouse (it also has many other benefits, but let’s stick with cost optimization for purposes of this blog post).
What follows is a simple analysis we did for one of our projects that demonstrates how adding a cloud data lake to your cloud data warehouse can keep your costs lower than a data warehouse alone, even as it brings other benefits.
Our simple data platform example looked like this:
The components contributing to the overall cost of this simple, but quite typical, data platform solution running in Azure are the following:
- Storing data on Azure Blob Storage
- Executing pipelines with Data Factory
- Running multiple Databricks Spark cluster nodes
- Running an Azure SQL instance
We’ll omit the first two items for simplicity, and because the cost of storage and Data Factory pipeline execution will be insignificant when compared to the cost of running Databricks and Azure SQL. You’re welcome to estimate the costs of the first two components yourself using Azure pricing documentation.
NOTE: All prices shown below are as per Azure documentation as of February 2019 and are in USD.
Databricks Spark cluster pricing is driven by how many nodes you need, and what types of nodes you’re using. According to the Azure Databricks pricing site, the cost of a single DS3v2 type of a VM is $0.692 / hour. Our simple examples used nine VMs for a cluster (eight worker VMs and one driver VM). So the total cost per hour is $6.228.
If we were to use a data warehouse without a data lake, we’d need to look at the cost of an Azure SQL instance that allows us to get similar results to the data lake in terms of query performance. According to Microsoft’s SQL Database pricing website, the premium tier, 1750 DTUs instance will cost $9.4088 / hour. It’s clearly more expensive than a Spark cluster, but not significantly so.
The cost difference becomes more obvious when you start utilizing the cloud’s ability to provision resources on demand and only pay for them when you use them.
To demonstrate, let’s outline the following use case — imagine we need to execute our query only once every hour, and we’ll save the results back to cloud storage for further consumption by some other process. In the meantime, let’s assume we can satisfy all our other analytical needs with a Spark cluster that’s half the size of our large cluster and uses 5 DS3v2 VMs. In the cloud data lake environment, it’s easy to calculate the total cost. If we assume a typical eight-hour business day, we need one small five-node cluster to run eight hours a day for an average of 20 days a month. The total monthly cost for the small cluster would then be $0.692 * 5 * 8 *20 = $553.6.
In addition, we need a separate large cluster with nine nodes, but we only need it for less than five minutes every hour to run our large query. This will result in a total monthly cost of ($0.692 / 60) * 5 * 9 * 20 = $83.04. So you can execute your long-running query for only about $80 extra per month! This cost flexibility is only possible with the data lake design because there’s a clear separation between the storage and compute resources. The data itself is stored on Azure Blob Storage and can be accessed by multiple clusters at the same time, completely independent of each other.
Now, the question is, can we achieve something similar with Azure SQL alone?
Theoretically, it’s possible because Azure allows you to also scale the instance up and down on demand. The practical challenge is that Azure SQL doesn’t offer separation of storage and compute. This means you can’t have multiple instances of different sizes working with the same data simultaneously. So if you want to scale up your instance every hour to run a heavy query, you need to take the whole Azure SQL instance offline for the duration of the scaling process. This can take 30 minutes or more and means no one can access the data during this time. For obvious reasons, this approach is usually unacceptable.
The alternative is to either keep a large instance running eight hours a day and scale it down at night and on weekends, or to settle on a smaller instance size and sacrifice the performance of our long-running query. Let’s say performance is really important and we’ve decided to run a Premium tier Azure SQL instance for eight hours a day. The resulting cost would be: $9.4088 * 8 *20 = $1505.40. You can see that this is significantly more expensive than running two on-demand Spark clusters which would cost $636.64.
While cost certainly isn’t the only reason to consider a cloud data lake, it should be on your list.
Learn how Pythian can help you plan wisely and spend smartly so you can win in the cloud.