Big data analytics can cost big money.
But being reactive instead of proactive about optimizing your analytics in the cloud can, unfortunately, cost you even more money. This may sound obvious, but making the right decisions when configuring your cloud analytics platform and related processes now can save buckets of cash down the road.
Before you begin your journey to cloud analytics cost optimization, however, you must self-assess. You need to be honest with yourself and get a clear understanding of where you currently are, where you want to go and – maybe most importantly – what kind of costs you’re dealing with both now and in the future.
The data silo problem
If you’ve got a data silo problem – and it’s relatively easy to recognize if you do – then it’s time to face facts: your organization is spending way too much money on analysis that isn’t adding enough value. That’s because it’s almost certainly based on incomplete data sets.
In fact, Database Trends and Applications reports that poor data quality hurts productivity by up to 20 percent and prevents 40 percent of business initiatives from achieving their targets. And a recent Gartner survey found that poor data quality costs businesses an average of $15 million per year.
Not only that, but you’re also incurring a ton of hidden costs:
- Lost employees (and clients): Good employees hate dealing with bad data. They’ll eventually grow frustrated and leave. Bad data can also lead to wrong decisions and embarrassing client mishaps
- Lost time: The more time lost fumbling with incomplete data, the less effective your employees will be (and the more frustrated they’ll get). Not to mention the needless cost of all that wasted time
- Lost opportunities: Analysis based on flawed modeling is often worse than no analysis at all. With no central ownership, groups end up working with siloed data they believe to be complete – a recipe for disaster
Unfortunately, data silos are most prevalent in older, more traditional data warehouses that don’t have strong data integration tools to help them out. Along with this, the more obvious costs associated with running a traditional, on-prem data warehouse lie in scaling the system – which typically requires expensive equipment investments and upgrades – along with finding specialized expertise to keep it online.
First things first: run a TCO analysis
Clearly, there are real costs associated with traditional data warehouses that may not immediately show up on a balance sheet. The same goes for modern data platforms in the cloud that haven’t been cost optimized. But these sometimes hidden costs are just one part of your overall cost analysis.
This also requires a cost comparison of on-premises versus cloud systems. The high costs of on-prem systems have already been mentioned and are well documented: they require a large CapEx investment out of the gate, are expensive to upgrade and need all sorts of cooling and fire suppression add-ons. Cloud systems typically require only a smaller, monthly OpEx.
So while cloud users end up receiving a regular bill, they aren’t hobbled by gigantic initial investments. Google Cloud Platform and Microsoft Azure have even released their own cost calculators to drive this point home.
The business case can be broken down like this:
- Large CapEx expenses associated with an on-prem system can take significant money away from other areas of the organization
- A monthly OpEx paid as a subscription fee is much easier on the corporate wallet, keeping organizations more nimble
- If performance or costs aren’t up to standard, cloud users can always cancel
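As a rough illustration, the CapEx-versus-OpEx trade-off above can be framed as a simple break-even calculation. All the dollar figures below are hypothetical placeholders, not real vendor pricing:

```python
def cumulative_cost_on_prem(months, capex, monthly_ops):
    """Total spend on an on-prem system: big upfront CapEx plus ongoing ops."""
    return capex + monthly_ops * months

def cumulative_cost_cloud(months, monthly_subscription):
    """Total spend in the cloud: pay-as-you-go OpEx only."""
    return monthly_subscription * months

# Hypothetical numbers, for illustration only.
CAPEX = 500_000          # upfront hardware, cooling, fire suppression
ON_PREM_OPS = 10_000     # monthly staff/maintenance for on-prem
CLOUD_SUB = 25_000       # monthly cloud subscription

for months in (12, 36, 60):
    on_prem = cumulative_cost_on_prem(months, CAPEX, ON_PREM_OPS)
    cloud = cumulative_cost_cloud(months, CLOUD_SUB)
    print(f"{months:>2} months: on-prem ${on_prem:,} vs cloud ${cloud:,}")
```

With these placeholder figures the cloud is cheaper in the early years, with a crossover point that depends entirely on the real numbers – which is exactly why the vendor cost calculators mentioned above are worth running against your own workload.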
But this raises the question: If the cloud is the way to go in terms of optimizing costs, how does one optimize costs further within the cloud itself – especially for analytics purposes?
Optimizing processing costs in the cloud
Whether you’re using Google Cloud Platform, Microsoft Azure or AWS, optimizing cloud analytics costs essentially comes down to two things: optimizing data processing costs and optimizing data storage costs.
A good first step is to look at running data transformations outside the data warehouse. It may seem counter-intuitive, but after ingesting and integrating all of your data into the data warehouse, it’s more efficient to pull that data back out and process it in something like Apache Spark – a framework that, when combined with Google’s Dataproc, can be up to 30 percent more cost-efficient than similar alternatives. This minimizes your cloud data warehouse processing costs.
Organizations can also use ephemeral Spark clusters, which let you run several different jobs in parallel and control cluster idle time. You can provision your Spark clusters to terminate after they’ve been idle for a set amount of time, or to come online automatically when fresh data arrives, increasing cluster efficiency and minimizing paid idle time.
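As a sketch, here is roughly what an idle-terminating Dataproc cluster request looks like, shown as a plain Python dict in the shape of the REST API. The `idleDeleteTtl` field under `lifecycleConfig` is the setting that tears the cluster down after a period with no running jobs; the cluster name and machine sizes are placeholders:

```python
# Rough shape of a Dataproc cluster request with scheduled deletion on idle.
# Names and sizes are placeholders; lifecycleConfig.idleDeleteTtl is the
# field that deletes the cluster after a period with no running jobs.
cluster = {
    "clusterName": "ephemeral-etl",   # hypothetical name
    "config": {
        "masterConfig": {"numInstances": 1, "machineTypeUri": "n1-standard-4"},
        "workerConfig": {"numInstances": 2, "machineTypeUri": "n1-standard-4"},
        "lifecycleConfig": {
            "idleDeleteTtl": "600s",  # delete after 10 idle minutes
        },
    },
}
```

Bringing a cluster online automatically when fresh data arrives is handled outside the cluster config itself, typically by a workflow trigger or scheduler that submits the creation request when new files land.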
Organizations can also use something called preemptible virtual machines (VMs), which are essentially low-cost, short-lived instances. They’re cheap but not especially reliable, which makes them great (and cost-effective) for fault-tolerant workloads and batch jobs. You can use preemptible VMs for exploration, machine learning algorithm training and development work.
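Because preemptible instances can be reclaimed at any time, fault-tolerant batch jobs should budget for some rework after preemptions. A minimal sketch of the cost trade-off, with hypothetical hourly rates rather than real cloud pricing:

```python
def expected_batch_cost(hours, hourly_rate, retry_overhead=0.0):
    """Expected cost of a batch job, inflating runtime by a retry overhead
    to account for work redone after preemptions."""
    return hours * (1 + retry_overhead) * hourly_rate

# Hypothetical per-hour rates, for illustration only.
STANDARD_RATE = 1.00
PREEMPTIBLE_RATE = 0.20   # preemptible instances are heavily discounted

standard = expected_batch_cost(10, STANDARD_RATE)
preemptible = expected_batch_cost(10, PREEMPTIBLE_RATE, retry_overhead=0.25)
print(f"standard: ${standard:.2f}, preemptible: ${preemptible:.2f}")
```

Even assuming 25 percent of the work gets redone after preemptions, the discounted rate wins comfortably – but only for jobs that can checkpoint and restart cleanly.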
And if preemptible VMs aren’t right for your use case, data exploration involving heavy processing can also be conducted in the data lake instead of the data warehouse. This can also lead to cost savings.
Optimizing storage costs in the cloud
Optimizing your cloud processing costs, however, is just half the battle. Deploying the right cloud storage options can also go a long way.
Remember the data lake we just mentioned? It can also be used for storage to minimize cloud data warehouse storage costs. But there are other, relatively simple, approaches you can take to keep cloud storage costs down:
- Avro vs JSON: Instead of storing data as JSON files, it’s smart to institute a standard conversion to Avro, a more size-efficient binary format
- Compression equals savings: Similarly, compressing all your data files as a matter of process helps keep storage costs down
- Consider cold storage: Cloud platforms like Azure and GCP offer cold storage options, such as Azure’s Cool Blob storage and Google’s Nearline and Coldline, which are less expensive options for storing large datasets and archived information. Under some conditions, cold-tier storage can cut storage costs by up to 50 percent
- Evaluate data retention policies: In a perfect world, you’d keep all your data. But if you have so much that even keeping it in cold storage is cost-prohibitive, you can always change your retention policies to delete very old raw data (you always have the option of keeping the aggregate data around, which takes up less storage space). Watch our video, Data hoarding in the age of machine learning, to learn more.
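The tiering and retention ideas above can be sketched as a simple age-based policy: keep recent raw data hot, move older data to a cold tier, and eventually keep only the aggregates. The thresholds below are arbitrary placeholders – tune them to your actual access patterns and budget:

```python
from datetime import date

# Placeholder thresholds -- tune to your access patterns and budget.
HOT_DAYS = 90       # raw data younger than this stays in the warehouse
COLD_DAYS = 730     # older raw data sits in cold storage until this age

def tier_for(created: date, today: date) -> str:
    """Decide where a raw-data object belongs based on its age."""
    age = (today - created).days
    if age <= HOT_DAYS:
        return "hot"
    if age <= COLD_DAYS:
        return "cold"                    # e.g. Nearline/Coldline or Cool Blob
    return "delete-keep-aggregates"      # drop raw data, retain rollups

today = date(2020, 1, 1)
print(tier_for(date(2019, 12, 1), today))  # recent data stays hot
print(tier_for(date(2019, 1, 1), today))   # about a year old: cold tier
print(tier_for(date(2016, 1, 1), today))   # very old: aggregates only
```

In practice the same logic is usually expressed declaratively as an object lifecycle rule on the storage bucket rather than in application code, but the decision table is the same.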
Get cost control baked into your analytics platform
By including the best of cloud services, open source software and automated processes, Pythian’s cloud-native analytics platform, Kick AaaS, has cost controls built in. It starts with an infrastructure layer that uses Spark on Kubernetes, which simplifies cluster management and makes resource utilization more efficient for Spark workloads by letting you spin them up and down as needed. Kick AaaS also uses size-efficient Avro files and data file compression, and makes the most of available cloud cost control features such as cold storage and upper limits on data processing for queries. The Kick AaaS platform, along with Pythian professional services, ensures your cloud analytics costs are always optimized and under control.
Whatever your cloud strategy, Pythian expertise is there to help you every step of the way, including helping you make the most of the cost control features offered by many of the cloud service platforms.
Learn how to break down data silos and achieve true integration with cloud-based master data management platforms with our latest ebook, The book of data integration zen.