As the amount of data generated around us continues to grow exponentially, organizations have to keep coming up with the solutions of our new technological landscape. Data integration has been part of this challenge for many years now and there are many tools that have been developed specifically for these needs. Some tools are geared specifically from moving data from point A to point B, other tools provide a full ETL (Extract-Transform-Load) solution that can work with many products using all kinds of different drivers.
For many years, the first party tool of choice for SQL Server professionals has been SSIS. Interestingly, even though it’s called SQL Server Integration Services, SSIS is really a general purpose ETL tool. If you want to extract data from Oracle, transform it with the full expressive capabilities of .NET and then upload it to a partner’s FTP as a flat file, you can do it in SSIS!
As we continue our journey into the cloud and hybrid environments, more tools will start coming up that will work as an ETL PaaS offering. You won’t have to manage the pipeline’s OS, hardware or underlying software, you’ll just create your data pipelines and be off to the races.
What is it?
Azure Data Factory (ADF) is Microsoft’s cloud offering for data integration and processing as a service. You don’t have to install any bits or manage any software, you’re only responsible of creating the pipelines. Since it’s developed to run inside the Azure the tool also has some pre-made hooks that make it really easy to interoperate with other Azure services such as blob storage, HDInsight or Azure Machine Learning.
On premises you would need a machine (VM or physical), you would need a license for your ETL tool (let’s say SSIS), then you would need to keep SSIS patched up, the machine up to date, think about software and hardware refreshes and so on. Using ADF, you can focus on the pipeline itself and not have to worry about what underlying sofware and hardware is actually making it work. The service supports a wide array of sources and targets (and continues to grow) and also robust options for scheduling the pipeline or running continuously to look for new slices of data.
When should you use it?
If you’re thinking about creating a new SSIS package and find that your sources are all web or cloud based then ADF is a good choice. Build a prototype of your pipeline, make sure that it supports your expected transformations and then you can operationalize it on the cloud. As a PaaS offering, it takes away the time, cost and effort of having to deal with the underlying bits and you can just focus on delivering quality data pipelines in a shorter timeframe.
Like all new things in Azure, there are still some service limitations. The biggest one at the moment is that the service is only available in the West US and North Europe regions. If you don’t have resources in those regions and will be moving a lot of data then I would advise to start learning the service and prototyping but not put in production the pipelines. The reason for that is that any data movement from outside the region will have an outbound transfer cost. If your resources are in those regions then there’s no charge and you can ignore this warning.
In the Demo video we’ll look at the user interface of Azure Data Factory, how to add a source and target, scheduling and checking the status of the pipeline. Enjoy!