13th June 2017

BimlScript and Automated ETL Part 2

Implementing the Pattern

Now that we have the requisite metadata it’s time to use it to create our Biml files.

In order to handle the requirements of different feeds, i.e.  different target dimension and fact tables, some files will require data to be unpivoted and some will contain multiple demographic attributes so we will need to write quite a bit of code.

If you are not up to speed with C# and you are interested in Biml development, you can find a number of good guides to getting started on training sites such as Pluralsight 

All the code for the article is available on GitHub and you are free to use it for your own projects.

Using Custom C# Libraries

It is common practice in a Biml Script project to have a C# include file with all your custom code in it. This is a bad idea because it ignores the issues in a real production system such as being able to independently test the code and the management of secrets such as connection strings, passwords and so on.

By creating a separate C# project, these limitations can be overcome. We can also include a test project so that every piece of functionality can be tested and debugged independently. We can use user secrets files to for local development to hold the database connection strings and passwords so that developers can’t accidentally check them into source control.

Once we have working code, we can import the resulting libraries (.dll files) into our BimlScript as shown below. Notice the two step way it is done, first by declaring the assembly name and then importing the namespace, but that’s all there is to it. You can see it in the picture below or better still have a look at the code on GitHub

Use Tiers

In order to keep your BimlScript files clearer and easier to manage, you should split them into smaller files. For instance the code for things like script tasks can be quite large and having everything in one file soon becomes very difficult to navigate. In order to make sure everything compiles properly you should use tiers. You can see the tier directive in the first line of the Biml file

An in depth explanation of tiers is beyond the scope of this blog, but essentially they control the order in which the Biml compiler compiles the files. This is important so that resources required by lower tier files have already been compiled and are available to their parents, or children depending on how you want to look at it.

I have settled on a two tier approach, Tier zero is the main block and tier one is used to abstract out the flat file formatters  and script tasks. I also have a common variables file for, well, common variables.

The pattern I have used is to initially load the flat file data to staging tables and then process the staging data into the data warehouse, applying validation and matching rules as agreed with the client. A strong auditing/reporting regime is important here in order to be able to satisfy client queries. Finally the data ends up in our single customer view.

This blog post will just detail the Staging load, but you can adapt the code to create your own dimension and fact table loading

Depending on your business and clients you might also want to store the supplied data files somewhere else, exactly as they were received, either in SQL tables, SSIS Raw files or just store the original files in the file system. You will definitely want to archive the supplied data somehow in case of client queries.

As an aside, if you do retain original files, I recommend you encrypt them with a strong encryption like GPG. This provides two benefits: One it protects you in case of a data breach, and two, it has the benefit of compressing the data, so reducing storage costs.

Tier One

In tier one I have two files: Tier1ScriptProjects.biml contains script projects, while Tier1FileFormats.biml is responsible for creating the flat file connection, database connection and Flat File Format based in our file header and column metadata.

Tier Zero

Tier 0 is responsible for generating a separate package for each file in the MappingFile metadata table. This is where the hard work is done as each flat file may have a different combinations of dimensions and facts that will be loaded and all these have to be determined at compile time.

The Flat Files

In order to do the initial staging, we need to dynamically generate the flat file connections and formatters using the metadata discussed in the previous post. We use convention, naming the flat file connection with the same name as its formatter, to ensure the connector and formatter are correctly wired up in the generated Biml.

In the example BimlScript below we create the necessary file formatters:

We generate it by calling first GetFileFormatHeader()

This pattern of using a query to populate a data table object and then loop through it to produce the BimlScript is used all the way through.

In this case it uses the file metadata to construct the header descriptor on the <FlatFileFormat> element.

Then we call GetFileFormatColumns() to generate the <Columns> elements.  The resulting BimlScript for a single file formatter comes back something like this.

Have a good look at the code in Tier1FileFormats.biml and the functions it calls in the MappingUtils.cs class to see how this works

Leave a Reply