Earlier this year Microsoft released the next generation of its data pipeline product Azure Data Factory. The first release of Data Factory did not receive widespread adoption due to limitations in terms of scheduling, execution triggers and lack of pipeline flow control. Microsoft took this feedback to heart and came back with a more feature-rich version that can now cover a larger percentage of production scenarios and can be a good fit for many projects.
With this in mind, I recently had to load about 1TB worth of data into Azure SQL Data Warehouse and thought that this was a perfect opportunity to test Data Factory on higher volumes. I ran into a few issues that are worthy of documenting publicly and share my current workarounds so others can benefit from my experience.
The importance of Polybase
Before we dive into the two specific issues I faced, it’s important to touch on Polybase and how it relates to loading data into Azure SQL Data Warehouse (ASDW). There are two ways you can load data into ASDW:
- Through the Control Node: This is a “trickle INSERT” scenario where you do some individual INSERTS through your connection or if you use a tool like BCP or a regular SSIS load through the SQL driver into ASDW.
- Through Polybase: Polybase is a highly parallel and efficient data loading module inside ASDW. It bypasses the Control Node and loads directly from storage into the different distributions using the compute capacity of the multiple Compute Nodes. It can be used directly through T-SQL or with tools that are Polybase-aware like Azure Data Factory or the Polybase target task in SSIS.
Clearly, if we want to load 1TB of data, we need to be loading through Polybase and a fallback into loading through the Control Node will yield significantly lower throughput and slower performance. When designing a pipeline that will load lots of data into ASDW, this is the most important characteristic; Polybase must always be used under the covers.
The next two scenarios will discuss the issues I found and what my workarounds were to always have Polybase do the data load.
Using a wildcard to select files
When you are developing a pipeline that uses a Blob Storage source (very common), you can specify a file wildcard to match the files you wish to load. This is very handy, of course, since Polybase itself does not support wildcards yet; it requires you to specify a full folder to load or a specific single file.
Data Factory will then let you add the ASDW sink target and you are allowed to select to load with Polybase and deploy your pipeline with no issues. The problem here, though, is that once the pipeline is executing, Data Factory will not use Polybase, it will fall back into doing a Control Node load. The reason is that you used a wildcard in the source and since Polybase does not support it, then Data Factory gives up on using Polybase. I thought that under the covers Data Factory could map the wildcard files into a series of EXTERNAL table definitions so that Polybase was still used, but that is not the case. So, lesson learned. Using a wildcard on the Blob Storage source will disable Polybase at the ASDW sink. This is a no-go for any high volume data loading.
The workaround here is to implement the wildcard using Data Factory parameters and then do the load into Polybase with each individual file. This is done by combining a For Each loop with a Copy Data activity so that you iterate through the files that match your wildcard and each one is further loaded as a single operation using Polybase. With this approach, the performance of my data load was about 100x faster than the “fall back” where ADF just ditched Polybase and loaded through the Control Node.
If you want to see how this configuration looks like in practice, check the video at the bottom of the post where I walk through it.
Let’s move on to the next issue I faced.
Dealing with custom date formats
The files I was loading were using a not-too-common date format of “YYYY-MMM-DD,” for example, 2018-JUL-01. When I simply configured the pipeline and ran it, I thought maybe ADF would be smart enough to automatically recognize the format, but it just failed with an error. Ok, fair enough, I thought I just need to specify it’s a custom format somewhere. And indeed, ADF does support specifying custom date formats based on what is available in the .NET framework. The full list is HERE.
Now, if you through the list, you will see that my particular date format is actually not there, I was unlucky enough that out of the seemingly dozens of formats, this one is just not available. However, the formats supported by Polybase are actually not the same and it’s a different list you can find HERE. And this time, I did get lucky, my YYYY-MMM-DD format IS supported directly by Polybase!
The solution here is then very simple; for the files with the date format that is not supported by ADF but supported by Polybase, I can still use ADF if I want it as a scheduling and flow tool, but I need to do the actual loading through Polybase and T-SQL. Or if you have a very simple or one-time load, just simply ditch ADF and load with Polybase on T-SQL.
There is also another possibility that I want to mention for the sake of completeness but was not a good fit for me. ADF also allows doing Staging copies that can change the data schema; however, doing this means copying the data to manipulate it and then finally doing the real table load. When dealing with 1TB+ of data, this data copying step was just too much work when Polybase already supported the date format.
All right, I hope these solutions will help you if you face similar issues and you can take advantage of my experience when working with Azure Data Factory and Azure SQL Data Warehouse.
Now that we have gone through the scenarios, let’s do a demo of the Wildcard loading issue and the difference in performance on Azure Data Factory when Polybase is not used and when it is. I will also show how the pipeline is configured to be able to do wildcard loading and still keep Polybase active. Check out the video below!