Low hanging fruit items that alleviate DataOps operational pressures


Data is hard. It has always been hard, and it’s not getting easier. There is always more of it, there are more sources to integrate, it changes all the time, its quality is questionable, and the business wants it all right away. Working at this pace requires a sound operational mindset so that your teams are not driven crazy once the business starts using the data. This mindset needs to start very early in every data project so that you keep operational costs to a minimum and, most importantly, have teams that are happy to maintain these data products moving forward. Four areas offer low-hanging fruit:

  • Resiliency
  • Instrumentation
  • Proper Alerting hygiene
  • Client visibility


Resiliency

One of the most important things you can do is ensure that what you create is built with resiliency in mind. In this context, I am not talking about infrastructure redundancy or autoscaling but rather the end data product that the business is using. The easiest way to define it is that the data should always be in a usable state. You might not always have the latest data, but what you have is complete and accurate.

One common example is a daily traditional batch full refresh of a data source.  I don’t know how many times I’ve seen this scenario:

| Job Steps | Business Impact |
|---|---|
| Truncate the target data set | No data available until the load is finished. Completely unusable |
| Bulk load the data…can be minutes to hours | |
| Outcome: Success | Usable data again |
| Outcome: Error | Empty or inconsistent data set. |

The better way to do this is to ensure the business can always use what they had before the reload and only show them the new data if it has been successfully refreshed.  It will reduce the pressure on the DataOps team to scramble to have something in there and allow the business to continue with most of their functionality except for anything that requires the last 24 hours of data.
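One way to get this behavior is to load into a staging table and swap it in only after the load has succeeded. The sketch below is a minimal illustration using SQLite from the Python standard library, not a recommendation for a particular platform. The table names `sales` and `sales_staging`, the column layout and the `min_rows` sanity check are all assumptions made for the example.

```python
import sqlite3


def refresh_with_swap(conn, rows, min_rows=1):
    """Full refresh that keeps the old data visible unless the new load succeeds.

    Assumes `conn` was opened with isolation_level=None so that the
    transaction is controlled explicitly with BEGIN / COMMIT / ROLLBACK.
    Table and column names are hypothetical.
    """
    conn.execute("BEGIN")
    try:
        # Build the new copy next to the live table instead of truncating it.
        conn.execute("DROP TABLE IF EXISTS sales_staging")
        conn.execute(
            "CREATE TABLE sales_staging (id INTEGER PRIMARY KEY, amount REAL)"
        )
        conn.executemany(
            "INSERT INTO sales_staging (id, amount) VALUES (?, ?)", rows
        )

        # Basic sanity check before exposing anything to the business.
        (loaded,) = conn.execute("SELECT COUNT(*) FROM sales_staging").fetchone()
        if loaded < min_rows:
            raise ValueError(
                f"only {loaded} rows loaded, expected at least {min_rows}"
            )

        # Swap: the old table is replaced only at this point.
        conn.execute("DROP TABLE IF EXISTS sales")
        conn.execute("ALTER TABLE sales_staging RENAME TO sales")
        conn.execute("COMMIT")
    except Exception:
        # Any failure leaves yesterday's data exactly as it was.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    return loaded
```

If the load fails or the sanity check rejects it, the rollback leaves the previous `sales` table untouched, so the business keeps working with the last good refresh while the team investigates.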

The other part of resiliency that is fairly easy to implement is automatic retry on failure. The goal is to have the pipeline try to correct itself X times before involving a human. Many times something transient happens during, say, a file transfer, and simply re-running the step resolves the problem. Why wake someone up in the middle of the night when a little extra development effort can resolve it?
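A retry wrapper can be very small. The following is a minimal sketch, assuming the pipeline step is a plain Python callable and that every exception is worth retrying; the function name `run_with_retry` and its parameters are hypothetical, and a real pipeline would likely retry only on errors known to be transient.

```python
import time


def run_with_retry(step, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step() up to max_attempts times before giving up.

    Waits base_delay, then 2x, then 4x ... between attempts. The `sleep`
    parameter exists only so the wait can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: now it is worth involving a human.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Only the final failure propagates, so the on-call person hears about the problem after the pipeline has already tried to fix itself.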

The above scenarios are not new. I’ve seen them in the early days of databases, then in data warehouses, and now in modern data engineering platforms. The old adage still applies: you either learn from history or you repeat it.


Instrumentation

The more metadata you have about pipeline execution and the quality of the data being loaded, the better. It lets you build proper metrics and business KPIs into the pipeline code, speeds up troubleshooting of current issues, enables trend analysis and, once you are really good, lets you predict failures with ML modeling. This should provide a better experience for the business and allow your DataOps team to focus on data changes or new data sources instead of spending an inordinate amount of time fixing existing pipelines.
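As an illustration of the kind of run metadata worth capturing, here is a minimal sketch that records one entry per pipeline execution. The `RunRecord` fields, the `run_instrumented` helper and the in-memory `history` list are assumptions for the example; in practice the records would go to a table or a metrics system.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    pipeline: str
    started_at: float      # epoch seconds
    duration_s: float
    status: str            # "success" or "error"
    rows_loaded: int
    error: Optional[str] = None


def run_instrumented(pipeline, load, history, clock=time.time):
    """Run load() (which returns a row count) and append a RunRecord to history.

    The exception is captured in the record rather than re-raised, so the
    caller decides what to do with a failed run.
    """
    started = clock()
    try:
        rows = load()
        record = RunRecord(pipeline, started, clock() - started, "success", rows)
    except Exception as exc:
        record = RunRecord(
            pipeline, started, clock() - started, "error", 0, repr(exc)
        )
    history.append(record)
    return record
```

With a history like this, questions such as "is this job getting slower?" or "how often does it fail on Mondays?" become simple queries rather than log archaeology.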

Proper Alerting hygiene

You don’t want to be “alert happy”. If you are, you will not retain your DataOps team for very long. You want to get their attention only when it matters. As such, you should not alert them to every error or even look for errors, per se. You want to get their attention when something affects the business. For example, a failure of a job that runs an incremental update every five minutes does not warrant an alert if the business users only need that job to succeed once an hour. Those errors should be investigated from a trending point of view, but you should not look at each of them as it happens.
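The five-minute job example can be expressed as a check on staleness instead of a check on individual failures. This is a minimal sketch; the function name `should_alert` and the one-hour default are assumptions taken from the example above, not a general rule.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_alert(
    last_success: Optional[datetime],
    now: datetime,
    max_staleness: timedelta = timedelta(hours=1),
) -> bool:
    """Alert only when the business freshness target is breached.

    Individual failed runs are ignored as long as some run has succeeded
    within max_staleness. A job that has never succeeded always alerts.
    """
    if last_success is None:
        return True
    return now - last_success > max_staleness
```

A job that failed at 09:05 and 09:10 but succeeded at 09:15 stays quiet; the failures still land in the run history for trend analysis.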

When you do alert, ensure that the alert is clear, to the point, easy to understand and of course actionable.  There is nothing worse than getting a massive trace dump, then expecting your DataOps team member to dissect it and resolve it while the business user is continually asking you “is it fixed yet?”

The most important alerting advice for a good relationship with the business is to ensure you know about a problem before they do. There is nothing worse than having a senior business executive call to tell you that their data is not available, only to find out that you were blissfully unaware. Always set up your alerting with the business in mind.

Client visibility

This one is relatively easy but goes a long way toward building trust with your business users. Ensure that the people who use the data to make decisions, and the ones who use it to build reports, data visualizations, etc., know the state of the data at all times. Metadata about the recency and quality of the data should be made available and incorporated into the semantic layers used by the various reporting, BI and data viz tools. An additional benefit is fewer requests to DataOps asking whether the data is up to date. If people trust data, they will use it. Otherwise they won’t. It’s that simple, and it happens quickly.
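What that metadata looks like depends on the tools in use, but a small, consistent status record per data set is a reasonable starting point. The sketch below assumes a hypothetical `freshness_status` helper, a 24-hour freshness target and a rejected-row count as the quality signal; none of these come from a specific BI product.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(dataset, last_refreshed, rows_rejected, now,
                     max_age=timedelta(hours=24)):
    """Build a small status record that a report or dashboard can display."""
    age = now - last_refreshed
    return {
        "dataset": dataset,
        "last_refreshed": last_refreshed.isoformat(),
        "age_hours": round(age.total_seconds() / 3600, 1),
        "is_stale": age > max_age,
        "rows_rejected": rows_rejected,  # simple data quality signal
    }


status = freshness_status(
    "daily_sales",
    last_refreshed=datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc),
    rows_rejected=3,
    now=datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc),
)
print(status["age_hours"], status["is_stale"])  # 3.5 False
```

Showing a record like this next to each report answers "is the data up to date?" before anyone has to ask.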

We are just scratching the surface here. We could talk about scenarios and techniques for days. Hopefully, this gives you a leg up if you are starting your first data pipeline project.

