Or a prescriptive pattern for automating the building of sophisticated data warehouses based upon a simple metadata store
There are lots of Biml articles that discuss strategies for generating SSIS packages to load data warehouses that are essentially copies of the production DB. These strategies usually leverage the built in Biml Extensions methods.
There are fewer examples where the target data warehouse is a Kimball model based upon a number of disparate data sources where the objective is a single version of truth for the domain being modelled.
The purpose of this article is to suggest a prescriptive pattern that enables fast development of useful marketing focused data marts from multiple sources of customer data (although the same pattern could be applied to other domains, e.g. Manufacturing) using a metadata driven approach in conjunction with BimlScript and custom C# code classes.
All the code for the article is available on GitHub and you are free to use it for your own projects.
Step One: Metadata and Design Choices
The pattern we are aiming for is the traditional one where we assume that we have a bunch of flat file feeds which we are going to stage first before we cleanse, validate and match and finally load into our data mart.
The example data warehouse we will target contains dimensions for customer, address, email and phone along with fact tables for sales and email and demographic activity
Designing the Metadata
First off we need a metadata store to model the flat files that we intend to import. To do this I have created one as shown below:
This table describes the flat file headers and will be used by the code to construct the <FlatFileFormat /> Biml elements needed to generate the flat file formatters and connections.
The column mappings serves to enable building the <Columns> collections as well as customising the <Transformations> elements and deciding where in the data warehouse to map the incoming data.
The isAttribute flag along with the AttributeTypeId enable us to map attribute data that doesn’t fit into the core entities of customer, address, email or phone. For example customer demographic data that may be useful in marketing segmentation later on
When the flat file data that we are modelling comes in it will usually have column headings. We don’t really care about these because they are often unreliable, which is why the ColumnNumber field is there. It’s that which will drive which column gets mapped to which in the target table.
A sample column mapping
The column number is zero based to match C#’s zero based array indexes which we will be using a lot later.
The IsPivot and IsAttribute columns are used to drive special processing, either channelling data into attribute tables or unpivoting entities that repeat in in a single feed, e.g. home phone and work phone number, into a specific table.
A note about phone numbers
The processing of phone numbers ignores the type of phone number at the point of initial load. This is because supplied data is almost always unreliable. It is far more reliable to simply import the data and determine whether a number is a fixed line number or a mobile number during the transform stage. This applies in Europe at least. American phone numbers do not distinguish between mobile and fixed line numbers through the numbering so you may need to devise a different strategy for these.
Next up – implementing the pattern. In which we see how we can use our meta data in conjunction with BimlScript and C# to generate the Biml we need.