Azure Data Factory Best Practices: Part 2


Hey there! In this second part of my Azure Data Factory best practices series I’ll be talking about controlling the flow of your tasks.

ADF is great at running tasks in parallel: not only can you run different activities at the same time, you can also process multiple date slices at once by setting the concurrency of an activity. However, as useful as this is, when using ADF for a data warehouse ETL you will usually want some control over how the loading takes place.
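As a rough illustration (the values here are just an example, not taken from a real pipeline), the concurrency setting lives in an activity’s policy block; a value of 5 would let ADF work on up to five outstanding slices at the same time:

"policy": {
    "timeout": "01:00:00",
    "concurrency": 5,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3
},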

If we look at a straightforward ETL process for a data warehouse the load order is usually something like:

  • Extract data from source
  • Transform the data
  • Load Dimensions
  • Load Facts
  • Load Snapshots

Without any control flow these tasks would all run in parallel, and the likelihood is that your business keys would be broken: facts would be missing dimension values and your data would be no good for reporting.

By chaining activities and pipelines in ADF you can implement a control flow that will prevent your loads going out of sync.

Grouping activities into a Pipeline

The way I usually group activities is one pipeline per table I am working with. Whilst this can mean a lot of pipelines, it gives you the flexibility to re-load data for just one table at a time; if it were all in one large pipeline then every step would always have to run.
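As a sketch of that structure, a per-table pipeline might look something like the outline below. The pipeline and activity names are made up for illustration, and each activity’s typeProperties, inputs/outputs, policy and scheduler are omitted to keep the outline short:

{
    "name": "PL-Load-Customer",
    "properties": {
        "description": "Everything that loads the Customer table",
        "activities": [
            { "type": "SqlServerStoredProcedure", "name": "1-TruncateStaging-Customer" },
            { "type": "Copy", "name": "2-LoadStaging-Customer" },
            { "type": "SqlServerStoredProcedure", "name": "3-LoadDimCustomer" }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-12-31T00:00:00Z"
    }
}

If something goes wrong with the Customer load you can re-run just this pipeline without touching any other table.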

Chaining Activities

Chaining activities is a way to make one activity dependent on another’s successful completion. We define a chained activity by using the ‘inputs’ and ‘outputs’ settings within an activity: an output is specified on one activity so that it can be used as an input to another, e.g.:

A simple Truncate table task:

"activities": [
    {
        "type": "SqlServerStoredProcedure",
        "typeProperties": {
            "storedProcedureName": "[stg].[TruncateCustomer]",
            "storedProcedureParameters": {}
        },
        "outputs": [
            {
                "name": "SPOutput-Truncate-Customer"
            }
        ],
        "policy": {
            "timeout": "01:00:00",
            "concurrency": 1,
            "executionPriorityOrder": "OldestFirst",
            "retry": 3
        },
        "scheduler": {
            "frequency": "Day",
            "interval": 1
        },
        "name": "1-TruncateStaging-Customer"
    },

The output was specified as:

"outputs": [
    {
        "name": "SPOutput-Truncate-Customer"
    }
],

This means that my next task, ‘Load into Staging’, can use SPOutput-Truncate-Customer as an input. Once that input is specified, the Truncate activity has to complete successfully before the ‘Load into Staging’ task can run.
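To make that concrete, here is a hedged sketch of what the ‘Load into Staging’ activity could look like. The Copy activity’s typeProperties (source and sink) are left out, and the output dataset name OutputDataset-Staging-Customer is invented for the example; the inputs entry pointing at SPOutput-Truncate-Customer is the part that creates the dependency:

{
    "type": "Copy",
    "name": "2-LoadStaging-Customer",
    "inputs": [
        {
            "name": "SPOutput-Truncate-Customer"
        }
    ],
    "outputs": [
        {
            "name": "OutputDataset-Staging-Customer"
        }
    ],
    "policy": {
        "timeout": "01:00:00",
        "concurrency": 1,
        "executionPriorityOrder": "OldestFirst",
        "retry": 3
    },
    "scheduler": {
        "frequency": "Day",
        "interval": 1
    }
},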

Pipelines work in exactly the same way. Remember that a pipeline can have multiple inputs, which is how you ensure all the dimensions are loaded before the fact data is loaded.
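For example, the first activity of a fact-load pipeline could take the final output dataset of each dimension pipeline as its inputs, so it will not start until every one of those dimension slices is ready (the dataset names here are illustrative):

"inputs": [
    { "name": "Output-DimCustomer" },
    { "name": "Output-DimProduct" },
    { "name": "Output-DimDate" }
],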

Click here to read the first part of this series on Azure Data Factory best practices.
