Design Tip #147 Durable “Super-Natural” Keys

Pessoal, mais uma dica sobre Data Warehousing, elaborada  por Warren Thornthwaite da Kimball University.  Vamos aproveitar para continuar a treinar o nosso inglês… 

One of the tasks of the ETL system’s customer dimension manager as described in the Kimball Group Reader is to “assign a unique durable key to each customer.” By durable key, we mean a single key value that uniquely and reliably identifies a given customer over time. In most cases, this unique durable key is the natural business key from the operational systems, and all we have to do is copy it over as an attribute in the dimension table. However, there are cases where the natural key changes, and when it does, the dimension manager has to step in.

Some common causes for natural key changes include business reasons, duplicate entries, and integration of data from multiple sources. All of these require the creation and management of a unique durable key, also known as a super-natural key, during the ETL process.

A good example of a natural key change for business reasons comes from the credit card industry. Credit card account numbers are natural keys; they appear in transactions, and are mapped to surrogate keys in the dimension table. If a credit card is stolen, a new account number is issued. Without awareness of this change, the new account number would look like a totally new account and be entered as a new entity in the account dimension. The full history of the account would be lost because it now has two natural keys.

When this kind of business-driven change happens, the transaction system must generate a notification record telling the ETL process that a new account has been created to replace an old account. This could be as simple as a table with the old account number, the new account number, and the effective date. The ETL system must then create a new row with the new account number and a separate durable key column that ties the old and new accounts together. Figure 1 shows how this durable key would look in an account dimension.

Account_Key

Account_ID

Durable_Account_ID

Account Holder

State

Eff_Date

End_Date

3

8765

3

    Smith CA

2011-02-01

2011-05-10

7

8765

3

    Smith OR

2011-05-11

2011-10-23

23

9251

3

    Smith OR

2011-10-24

2011-12-31

55

9251

3

    Smyth OR

2012-01-01

9999-12-31

Figure 1 – A durable key in an account dimension.

In Figure 1, the Account_Key is the surrogate key assigned by the ETL system to uniquely identify each row. The Account_ID is the ETL substitute for the natural key from the transaction system because you would not usually load a sensitive element such as a credit card account number directly into the data warehouse. The third key column, called the Durable_Account_ID, is the durable key assigned by the ETL system to tie all related rows together. Figure 1 shows four rows for the same account because there were two Type 2 changes to track, one state change and one last name change, in addition to the Account_ID change.

Another useful design pattern is to add the durable account key to the fact table in addition to the dimension’s surrogate key. This joins back to the current rows in the dimension to make it easier to report all of history by the current dimension attributes.

Dealing with duplicate entries in a dimension, or integrating disparate sources into a single dimension, involves more complex business logic that relies on durable keys. The end result is similar to Figure 1, but the integration process has to generate the list of related items rather than relying on the transaction system. Figure 2 shows the integration of products from multiple source systems. The MDM subsystem in the ETL process has identified these three products as the same and assigned them to the same durable key.

Product_
Key

Product_ID

Durable_
Product_
Key

Product_
Name

Product_
Group

Match_
Date

Match_
Score

5

37285

5

Wrench

Tools

2011-02-01

1.00

9

39101

5

Wrench

Hand tools

2011-10-24

0.93

25

17195

5

Wrench

Tools

2012-05-11

1.00

Figure 2 – Duplicate product entries.

In this case, you might need to mark the natural keys from the different sources so they don’t collide with each other. Try using a character data type for the source natural key, prepended by a source code. For example, if the products in Figure 2 came from SAP or the CRM system, the Product_ID column might contain the following values: SAP|37285, CRM|39101, and SAP|17195. This table then becomes input to the dimension and fact managers.

Durable keys are mandatory for dealing with ambiguities in the source system natural keys. Creating and assigning durable keys allows you to work around business changes to the natural keys, or to integrate duplicate or disparate data. But durable keys are just the start; there is a lot more to deduplicating and data integration.

Até a próxima!!!

Deixe uma resposta

Preencha os seus dados abaixo ou clique em um ícone para log in:

Logotipo do WordPress.com

Você está comentando utilizando sua conta WordPress.com. Sair / Alterar )

Imagem do Twitter

Você está comentando utilizando sua conta Twitter. Sair / Alterar )

Foto do Facebook

Você está comentando utilizando sua conta Facebook. Sair / Alterar )

Foto do Google+

Você está comentando utilizando sua conta Google+. Sair / Alterar )

Conectando a %s

%d blogueiros gostam disto: