Azure Data Factory Best Practices: Part 2

Hey there! In the second part of my Azure Data Factory best practices I’ll be talking about controlling the flow of your tasks. If you missed part one you can see it here.

ADF is great and by running tasks in parallel not only can you run different activities but you can also run multiple date slices when you set the concurrency of the activity. However, as good as this can be, when using ADF for a data warehouse ETL you may want to have control around how the loading takes place.

If we look at a straightforward ETL process for a data warehouse the load order is usually something like:

Extract data from source
Transform the data
Load Dimensions
Load Facts
Load Snapshots

Without any control flow these tasks would all run in parallel and the likelihood is that your business keys would be broken. Facts would be missing dimension values and your data will be no good for reporting.

By chaining activities and pipelines in ADF you can implement a control flow that will prevent your loads going out of sync.

Grouping activities into a Pipeline

The way I usually group activities is a group per table I am working with. Whilst this can get very large it gives you the flexibility to only re-load data for one table at a time. If it was all in one large pipeline then every step would always have to run.

Chaining Activities

Chaining activities is a way to make one activity dependant on another’s successful completion. The way we can define a chained activity is by using the ‘Inputs’ and ‘Outputs’ settings within activity. An output specified so that it can be used as an input to another activity eg:

A simple Truncate table task:

"activities": [

{

"type": "SqlServerStoredProcedure",

"typeProperties": {

"storedProcedureName": "[stg].[TruncateCustomer]",

"storedProcedureParameters": {}

},

"outputs": [

{

"name": "SPOutput-Truncate-Customer"

}

],

"policy": {

"timeout": "01:00:00",

"concurrency": 1,

"executionPriorityOrder": "OldestFirst",

"retry": 3

},

"scheduler": {

"frequency": "Day",

"interval": 1

},

"name": "1-TruncateStaging-Region"

},

The output was specified as

"outputs": [

{

"name": "SPOutput-Truncate-Customer"

}

],

This means that my next task, ‘Load into Staging’ can use “SPOutput – Truncate – Customer” as an input. Once specified the Truncate activity has to successfully complete before the ‘Load into Staging’ task can run.

Pipelines work in exactly the same way, Remember that a pipeline can have multiple inputs which is how you ensure all the dimensions load before the factual data is loaded.

Click here to read the first of this series on Azure Data Factory Best Practices.

Azure Data Factory Best Practices: Part 2

The Coeo Blog

Subscribe to Email Updates

Related posts

Parallelising Python on Spark: Options for concurrency with Pandas

Predictive Analytics – Do you know your data?

Azure Data Factory & XML: Loading unsupported file types

Do I need a Data Strategy?