Controlling Cloud Composer Costs and Performance
Managing and balancing cloud cost against performance is an ongoing challenge for cloud architects and administrators. The variety and complexity of the available tools can be daunting, so much so that many companies outsource these tasks to a managed service provider; an excellent example is Pythian’s own FinOps Cost Management and Optimization service. However, some cloud managers prefer a more hands-on approach to balancing cloud cost and performance and are therefore eager to learn about the relevant cloud-native tools.
This blog post details several options for managing GCP Cloud Composer cost and performance. Cloud Composer is a Google Cloud managed service built on top of Apache Airflow, a job-scheduling and orchestration tool originally built at Airbnb.
GCP’s Composer is a good tool for scheduling and orchestrating tasks within GCP, and it’s especially well suited to large tasks that take a considerable amount of time (20 minutes or more) to run. Unfortunately, when dealing with hundreds of jobs or thousands of tasks, Composer struggles, and there may not be an obvious fix.
To resolve these performance issues, we see companies turn to:
- Increasing the amount of CPU (this will raise the cost of Composer).
- Moving to event-driven processes (this removes the Composer UI and ease of use).
- Migrating off Composer to another orchestration tool (this can be complex and time-consuming).
Rather than making those more drastic changes, we recommend several intermediate steps that have worked in the past to improve Composer performance without increasing costs.
Pause unused DAGs
The Composer scheduler process runs in a loop, scanning the DAGs and tasks that need to run. As more DAGs and tasks are added to Composer, this loop becomes larger, and the larger the loop, the longer it takes to start the next task in a job. I have seen DAGs pause for 10 minutes or more just waiting for the next task to start.
The key here is that only active DAGs are scanned, and setting a DAG to a paused state will remove it from the loop. A Cloud Scheduler task can then be used to start the DAG on a schedule.
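A minimal sketch of that pattern, assuming the environment runs Airflow 2’s stable REST API and that you have a web server URL and auth token available (both placeholders here); a Cloud Scheduler HTTP job can make the same two calls on whatever schedule the DAG needs:

```python
def dag_url(base_url: str, dag_id: str) -> str:
    """Endpoint for patching DAG attributes such as is_paused."""
    return f"{base_url}/api/v1/dags/{dag_id}"


def dag_runs_url(base_url: str, dag_id: str) -> str:
    """Endpoint for creating a new DAG run."""
    return f"{dag_url(base_url, dag_id)}/dagRuns"


def start_paused_dag(base_url: str, dag_id: str, token: str) -> None:
    """Unpause the DAG so the scheduler will pick it up, then trigger a run."""
    import requests  # imported here so the URL helpers above stay dependency-free

    headers = {"Authorization": f"Bearer {token}"}
    # A run triggered on a paused DAG stays queued, so unpause first.
    requests.patch(dag_url(base_url, dag_id),
                   json={"is_paused": False}, headers=headers)
    requests.post(dag_runs_url(base_url, dag_id),
                  json={"conf": {}}, headers=headers)
```

Pausing the DAG again after the run completes (the same PATCH with `"is_paused": True`) keeps it out of the scheduler loop between runs.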
The dags_are_paused_at_creation configuration option can be set to ensure new DAGs don’t gradually slow down task runs.
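In a self-managed Airflow install this lives in airflow.cfg; in Composer, the same section and key (core / dags_are_paused_at_creation) are set as an Airflow configuration override on the environment:

```ini
[core]
; New DAGs start paused and stay out of the scheduler loop
; until explicitly unpaused.
dags_are_paused_at_creation = True
```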
Note that you must manually unpause the airflow_monitoring DAG to ensure monitoring continues to work!
Autoscale the environment
If the Composer instance doesn’t have jobs running 24×7, growing the instance and then letting it scale back down throughout the day is a good way to control Composer costs and performance.
You can set up a Cloud Build trigger to define and apply an autoscaling method. Pythian recommends never removing more than one node at a time, to avoid cliff edges. The trigger can then be scheduled for any time it’s needed, disabled, or invoked by other tools. The Composer instance will grow to its maximum size and slowly shrink as processes complete throughout the period.
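The one-node-at-a-time scale-down can be sketched as a small helper that a Cloud Build step (or any scheduled job) calls against the GKE node pool backing the environment. The project, cluster, and pool names are placeholders, and the use of the google-cloud-container client is an assumption for illustration:

```python
def next_node_count(current: int, target: int) -> int:
    """Shrink toward the target no more than one node per invocation,
    so running tasks drain gradually rather than over a cliff edge."""
    if current > target:
        return current - 1
    # At or below target: leave the pool alone (scale-up is handled by
    # the environment's normal growth, not by this helper).
    return current


def resize_node_pool(project: str, location: str, cluster: str,
                     pool: str, node_count: int) -> None:
    """Apply a new size to the environment's GKE node pool.

    Assumes the google-cloud-container library is installed and the
    caller is authenticated against the project.
    """
    from google.cloud import container_v1  # optional dependency, loaded lazily

    client = container_v1.ClusterManagerClient()
    name = (f"projects/{project}/locations/{location}"
            f"/clusters/{cluster}/nodePools/{pool}")
    client.set_node_pool_size(request={"name": name, "node_count": node_count})
```

Running the job repeatedly (for example, every 15 minutes during off-peak hours) walks the pool down one node per run until it reaches the target size.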
Grow the scheduler
By default, it isn’t possible to grow just the GKE (Google Kubernetes Engine) node running the scheduler process. It’s important to note that this workaround isn’t supported by Google.
However, you can give the scheduler its own node size by creating a new node pool in the GKE cluster, then migrating the scheduler process to it. That node can then be set to any size required. This will increase costs, but only by the size of the scheduler node.
Of course, during an upgrade of the Composer instance, this change will be reverted by Google Cloud.
These are some ideas we’ve implemented for our clients to control Composer costs and performance without major changes to their job architectures. They’re simple enough to implement that they’re always worth a discussion, and usually at least a test run.
If you have any questions or thoughts about the information in this post, please leave them in the comments.