With the current focus on data science, data engineering and the game-changing advantages of doing data lakes or warehouses in the cloud, all the major cloud providers continue to strengthen their data-related services and tools.
On this front, Microsoft has done an incredible job at revamping and expanding its Azure Data Factory (ADF) service from the time that it was first released. Version 2 of ADF expanded the orchestration capabilities, amount of sources and sinks supported, scheduling flexibility and more. Now, they are introducing a new exciting feature into ADF called Mapping Data Flows (MDF).
What are Mapping Data Flows?
Mapping Data Flows (MDFs) are a new way to do data transformation activities inside Azure Data Factory (ADF) without the use of code. Before MDFs, ADF did not really have transformation capabilities inside the service, it was more ELT than ETL. The idea was to use ADF to move data around and then use the compute power of the data sinks to perform transformation.
With MDFs, we now have a way to do real ETL with the transformation defined and executed from inside ADF. Behind the scenes, ADF uses the Azure Databricks service to spin clusters up and down as needed to perform the transformations. These data flows can be integrated as one activity of a bigger pipeline, expanding the possibilities of what can be built and managed from inside ADF without dealing with more external services.
The main value proposition here is that no coding is required so data engineers can just do quick work through a friendly interface and the actual transformations can be understood and manipulated by another developer without having to know a common programming language.
Interesting Features of MDFs
With MDFs now in preview, we have a good idea of the features and way that the feature is going to work, and there are some interesting notable characteristics of this feature:
- Debug Mode: This is an interactive way to run the transformation using data samples. It doesn’t generate the actual data output so it’s useful for iterative development with no cleanup.
- Schema Drift: This feature offers some options to handle schema drift from sources to sink, including a column pattern matching option.
- Inspect Pane: This feature offers a way to track how your data schema is changing as you progress through the data flow, as well as visualizing data previews showing how the data is changing.
- Partitioning optimization: This feature allows you to configure partitioning schemes and values for your transformation so there is data locality in your Databricks cluster nodes and you get better performance.
- Expression Builder: Allows for creating and transforming columns through a simple “Excel-like” function language.
These are just some of the main ones and I’m sure as the offering matures, more interesting capabilities will be added to this feature.
The cost of using MDFs is the cost of the underlying Databricks cluster that will be actually doing the transformation computations. These clusters are created and destroyed dynamically by the service when a transformation is in “Debug” mode or when it is actually being executed inside a pipeline. These charges are by the minute, with a fixed cost per vcore in the cluster.
There are three different types of nodes offered at the moment: Compute Optimized, General Purpose or Memory Optimized. Choosing which one to select will depend on the specific transformation and whether it would benefit from more CPU, more RAM or if a simple General Purpose configuration would be enough.
All right, let’s go into the demo and I will walk through the data flows interface and the process of creating a simple data flow transformation. Let’s check it out!