How to Implement Airflow Best Practices from a Data Scientist’s perspective – Part 1

Posted in: Big Data, Cloud, Technical Track

This blog post is a compilation of suggestions for best practices drawn from my personal experience as a data scientist building Airflow DAGs and installing and maintaining Airflow.

Let’s begin by explaining what Airflow is and what it is not. According to the official documentation (https://airflow.readthedocs.io/en/stable/index.html), Airflow is a platform to programmatically author, schedule and monitor workflows. The documentation recommends using it to build directed acyclic graphs (DAGs) of tasks. The solution is composed of workers, a scheduler, web servers, a metadata store and a queueing service. In my own words, Airflow schedules tasks and is responsible for triggering other services and applications. The workers should not perform any complex operations themselves; they should coordinate and distribute operations to other services. That way, workers don’t need many resources.
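To make the "coordinate, don’t compute" idea concrete, here is a minimal sketch of what a task’s callable could look like: it submits a job to an external service and polls for the result instead of doing the heavy work on the worker. Everything in it is an assumption for illustration. `FakeJobService`, `trigger_and_wait` and the state names are hypothetical and not part of Airflow; in a real DAG the service would be your cluster, query engine or training platform.

```python
import time


class FakeJobService:
    """Hypothetical stand-in for an external service that runs the heavy work."""

    def __init__(self, polls_until_done=2):
        self._polls_until_done = polls_until_done
        self._remaining = {}
        self._next_id = 0

    def submit(self, job_name):
        """Register a job and return its identifier."""
        self._next_id += 1
        job_id = f"{job_name}-{self._next_id}"
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def status(self, job_id):
        """Report RUNNING a fixed number of times, then SUCCEEDED."""
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return "RUNNING"
        return "SUCCEEDED"


def trigger_and_wait(service, job_name, poll_seconds=0.01, max_polls=100):
    """Submit a job to an external service and poll until it leaves RUNNING.

    The worker only waits and reports; the computation happens elsewhere.
    """
    job_id = service.submit(job_name)
    for _ in range(max_polls):
        state = service.status(job_id)
        if state != "RUNNING":
            return job_id, state
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} still running after {max_polls} polls")
```

A function like this could be handed to whatever operator runs Python callables in your setup. The point of the design is that the worker’s memory and CPU footprint stays small no matter how large the job is.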

On the other hand, according to the official documentation, Airflow is not a data streaming or data flow solution. Data must not flow between steps of the DAG. I’ll go further: Airflow is not a data pipeline tool. Avoid building pipelines that use a secondary service such as an object store (S3 or GCS) to hold intermediate state for the next task. Airflow is also not an interactive, dynamic DAG-building solution. Avoid changing the DAG frequently. Workflows are expected to be mostly static or slow-changing.

But wait a second, this is exactly the opposite of how I see data engineers and data scientists using it. Indeed, you may find the previous statements easy to dispute. However, after the first month of working with deployed DAGs, you start getting stressed. Every time the code changes, you need to change something in the DAG, and you risk breaking it or having to wait for the next available window when no DAG that could be affected by the change is running. Believe me, I have waited a couple of hours to finalize a 5-minute fix in the code.

Well, another recommendation is to keep the code elegant and pythonic, and to program defensively (https://www.pluralsight.com/guides/defensive-programming-in-python). Take advantage of Jinja templating. And please, implement custom exceptions and logging in the code that runs at every step of the DAG. The Airflow Web UI is going to print every single logging message. Also, don’t forget to encode all your assumptions in the code by using assert, and to check data boundaries.
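As a rough illustration of that advice, here is a small sketch that combines a custom exception, logging, an assert for an assumption and a boundary check. The names `DataBoundaryError` and `validate_scores`, and the [0, 1] range, are hypothetical choices for this example; adapt them to whatever your own task code expects.

```python
import logging

logger = logging.getLogger(__name__)


class DataBoundaryError(ValueError):
    """Raised when an input value falls outside its expected range."""


def validate_scores(scores, lower=0.0, upper=1.0):
    """Return scores unchanged if every value lies in [lower, upper]."""
    # Assumption made explicit: the caller never passes an empty batch.
    assert len(scores) > 0, "expected at least one score"
    for index, value in enumerate(scores):
        # The chained comparison is also False for NaN, so NaN is rejected.
        if not lower <= value <= upper:
            logger.error(
                "score %r at index %d is outside [%s, %s]",
                value, index, lower, upper,
            )
            raise DataBoundaryError(
                f"score {value!r} at index {index} is outside [{lower}, {upper}]"
            )
    logger.info("validated %d scores", len(scores))
    return scores
```

Because the messages go through the standard logging module, they should end up in the task log alongside everything else the step prints, and the custom exception type makes the failure reason obvious at a glance.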

Next, be careful with the operators that you are using. Don’t assume they are maintained to keep up with every update to the third-party services they wrap. For example, think about how frequently the Google Cloud SDK and the AWS SDK evolve: do you really think that Airflow operators are evolving as fast as they do? Probably not. Therefore, test the operators you rely on and implement your own versions.

The last experience I would like to share in this first part of the series is about time and timezones. Airflow core uses UTC by default. Specify your default timezone in airflow.cfg. After that, use the pendulum package from PyPI to define the timezone in which your DAGs should be scheduled. Here is an example: https://airflow.readthedocs.io/en/stable/timezone.html .
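To see why this matters, here is a small sketch of how a local wall-clock time maps to the UTC instant the scheduler works with. Note the assumptions: it uses only the standard library `datetime` module rather than pendulum so that it runs without Airflow installed, and `LOCAL_TZ` is a hypothetical fixed UTC-3 offset with no daylight saving rules. A real timezone object, such as the one pendulum gives you, also handles daylight saving transitions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone: a fixed UTC-3 offset, no daylight saving time.
LOCAL_TZ = timezone(timedelta(hours=-3), name="UTC-03")


def to_utc(local_wall_clock):
    """Attach LOCAL_TZ to a naive datetime and convert it to UTC."""
    if local_wall_clock.tzinfo is not None:
        raise ValueError("expected a naive datetime")
    return local_wall_clock.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)


# A run meant for 06:00 local time happens at 09:00 UTC.
run_local = datetime(2020, 1, 15, 6, 0)
run_utc = to_utc(run_local)
print(run_utc.isoformat())  # 2020-01-15T09:00:00+00:00
```

If you read a schedule as local time while the system interprets it as UTC, every run lands three hours away from where you expected in this example, which is exactly the kind of surprise an explicit timezone avoids.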

Some final advice: the date and time you see in the header of the Airflow Web UI is not the one being used by the system. The Airflow Web UI naively uses a static version of jQuery Clock (https://github.com/JohnRDOrazio/jQuery-Clock-Plugin) to print UTC time. Holy cow, I spent half an hour on this before I realized the flaw.

That is all I have to start with. In the following posts, I’m going to go over more specific best practices for scheduling machine learning pipelines.


Interested in working with Carlos? Schedule a tech call.
