Welcome to part two of my shallow dive into the PDW. In this section we take a look at the options available for ETL with the appliance. Here we will cover the following topics:
The DWLoader application is a bit like the PDW version of BCP, it is a command line application that is all about very fast import and export of flat files that are logically structured. The tool can be used via a script or the command line to import flat files to the appliance or export data to flat files off the appliance. Example syntax of the DWLoader application:
dwloader.exe -i ${loadfile} -m -fh 1 -M {mode} -b 100000 -rt value -rv 1000 -R ${loadfile}.rejects -E -e ASCII -t ‘,‘ -r \r\n -S $server -U
$user -P $password -T {targetdatabase}${table}
Key switches:
Once the import or export is started there is not much to see however you can track progress through the PDW dashboard.
A common problem with the DWLoader tool is that it isn’t able to work out the file format of the flat file (ie. Unicode or UTF-8). If you aren’t sure about the file type open up the file in SSIS Import/Export wizard and it will allow you to work out the format.
First let’s remind ourselves how where SSIS fits in the PDW world:
The important point the diagram shows is that SSIS runs on the landing zone server NOT on the PDW directly. This is quite critical because, whereas in traditional ETL solutions you will look to SSIS to manage all of the transformations, when you have a PDW appliance as your database you will want to do that work on the actual appliance. This is also important to limit the effect on memory by pulling large data volumes into the SSIS data flow. It is much better to use SSIS as the control flow for the ETL solution and doing the advanced transformations of the data on the PDW.
To use SSIS with the PDW you use the same tools and processes as you would against normal SQL Server. Visual Studio with SSDT is used to create an SSIS project. Within the project you can specify a source adapter that will be an OLE DB source pointing to the PDW; however you will see that SSIS thinks this is a normal SQL Server source because the icons for the tables do not show as distributed or replicated, just tables. This is by design and it is good practice to build your source query as a stored procedure on the PDW so that you can test it before using in the SSIS project. The big difference in SSIS is the destination adapter, this is PDW specific and an example is shown in the diagram below:
Using the PDW destination adapter it is critical to make sure the source columns match the data type of the destination columns; the adapter is not flexible or forgiving like other destination adapters in SSIS (i.e. SQL Server Destination), it won’t do any automatic type conversion. Important settings when using the PDW destination adapter include:
Below is an example of the package running to import a simple file:
You can then follow the performance of the load through the PDW dashboard just as with the DWLoader:
In summary, when dealing with ETL on the PDW, push as much of the data transformations to the PDW as possible. Use tools not normally available in SQL Server or SSIS such as CTAS (Create Table As Select) and try and limit the use of the Update statement. Understanding that SSIS works by pulling data from the data source adapter into the memory on the SSIS server and runs it there, is critical to manage memory and performance of SSIS.
My suggestion for using SSIS with PDW is to use it for the control flow and for any basic data preparation for types not supported on the appliance, for example XML data type. Then use the power of the PDW to do the advanced transformations. In a recent POC for a customer we were loading around 300gbs of data in less than one hour using DWLoader and then transforming 2 million XML documents into a relational data warehouse using SSIS and PDW stored procedures. We did have an issue with no Merge statement support in PDW so to upsert into a dimension table we created a “mega-merge” statement using the CTAS function. In Part 3 of this series I will cover how that worked and also how partitions are managed on the PDW.
If you have any comments or questions please feel free to post them here for me or tweet me at @mattasimpson…