Recently I have been working on several projects that have made use of Azure Data Factory (ADF) for ETL. During these projects it became very clear to me that I would need to implement and follow certain key principles when developing with ADF. This is the first of a series of posts which will cover the principles that I have discovered so far.
Today we are going to look at naming conventions. The first time I used Azure Data Factory, I gave my activities generic ‘copy data’ and ‘load data’ style titles. When troubleshooting these and tracking progress through the monitor, I found it impossible to know which task had run in which order. Whilst this was still manageable with a small number of activities, I knew there would be times when a pipeline contained a large number of activities and tracking them all would become impossible. Sifting through over four hundred lines of JSON is not a suitable way to work out which activity ran in which order!
So where should we apply a naming convention?
When it comes to deciding on a naming convention there is not a ‘one size fits all’ scenario. What I do is come up with a convention that meets the development standards of the team I'm working with. When applying a naming convention in ADF there are four areas of consideration.
- Linked Services
- Datasets (tables in Visual Studio)
- Pipelines
- Activities within pipelines
The naming convention will largely come down to how you design your ETL process. In the following example I have chosen to have a single pipeline to process a single table.
The naming convention used here was:
- Pipeline name = ETL-Pipeline-<TableName>
- Activity = ActivityNumber-Task-table
The activities were prefixed with a number to indicate the run order. ADF determines the actual run order from the dependencies between activities rather than from their position in the JSON code, so it always helps to have the intended order listed out very clearly in the names. I’ve then added a descriptive word for the task being carried out, e.g. ‘Copy’ for a copy activity.
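To keep names consistent across a team, the convention above can be generated rather than typed by hand. The following is a minimal sketch in Python, not part of ADF itself: the function names `pipeline_name` and `activity_name` are hypothetical, and the zero-padded two-digit activity number is my own assumption, made so that names sort correctly in the monitor once a pipeline has ten or more activities.

```python
def pipeline_name(table: str) -> str:
    """Build a pipeline name following ETL-Pipeline-<TableName>."""
    return f"ETL-Pipeline-{table}"


def activity_name(number: int, task: str, table: str, width: int = 2) -> str:
    """Build an activity name following ActivityNumber-Task-table.

    The zero padding (width=2) is an assumption, not part of the
    original convention; set width=1 to get unpadded numbers.
    """
    return f"{number:0{width}d}-{task}-{table}"


if __name__ == "__main__":
    table = "Customer"
    print(pipeline_name(table))
    # Name each activity in its intended run order.
    for position, task in enumerate(["Copy", "Load"], start=1):
        print(activity_name(position, task, table))
```

Generating the names from one place means that a change to the convention only has to be made once.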
As you can see, in this example the Linked Services and Datasets are not visible. Whilst it is important to name them I find that it is much more important to name your pipelines and activities as these are what you will see when using the monitor and doing initial troubleshooting for any issues.
The naming rules for Azure Data Factory can be found here: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-naming-rules.
- Maximum number of characters in a table name: 260
- Object names must start with a letter, number or an underscore
- The following characters are not allowed: “.”, “+”, “?”, “/”, “<”, “>”, “*”, “%”, “&”, “:”, “\”
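The rules listed above are easy to check before a name ever reaches ADF. Below is a rough sketch of such a check in Python. The function `validate_adf_name` is a hypothetical helper, not an ADF API, and it only covers the three rules quoted in this post. It also assumes the 260-character limit applies to every object name, whereas the rule above states it for table names only, so check the linked documentation for the limits on other object types.

```python
from typing import List

# Rules as quoted in this post; see the linked naming rules for the full set.
MAX_NAME_LENGTH = 260  # assumption: applied to all object names, not just tables
FORBIDDEN_CHARS = set('.+?/<>*%&:\\')


def validate_adf_name(name: str) -> List[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    if not name:
        return ["name is empty"]

    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"name is longer than {MAX_NAME_LENGTH} characters")

    first = name[0]
    if not (first.isalnum() or first == "_"):
        problems.append("name must start with a letter, number or underscore")

    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("name contains forbidden characters: " + " ".join(bad))

    return problems


if __name__ == "__main__":
    for candidate in ["01-Copy-Customer", "ETL-Pipeline-<TableName>"]:
        print(candidate, validate_adf_name(candidate))
```

Note that the literal template `ETL-Pipeline-<TableName>` fails the check because of the angle brackets. They are only a placeholder for the real table name.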
Below are two types of naming convention that I tend to use.
If you are likely to just have one type of activity in a pipeline, then this simple naming convention is easy to apply.
If you are using a pipeline to group multiple activities together perhaps on a per table basis for an ETL process, then you may find this naming convention useful:
In summary, deciding on a naming convention that works for you is one of the most important considerations when it comes to using ADF. Without one, not only will you lose track of which dataset belongs to which pipeline, but when you hand this off to your support teams they will have no idea what is being run. A small amount of upfront effort will ensure that you produce an easy-to-read, consistent Azure Data Factory that is easy to support.
If you want to read more about ADF Best Practices please continue to part 2: