Although dimension tables are typically much smaller than fact tables, the dimension tables are the true drivers of the data warehouse. Dimension tables provide the descriptive context for all the measurements recorded in the fact tables. Although it’s something of an obvious point, without the dimensions the data warehouse would be a meaningless ocean of numbers. Ralph Kimball on February 5, 2014.
Dimensions should not be built on a individual, case-by-case basis for each data source coming into the data warehouse because dimensions are a strategic resource that provide coherence and consistency across the data warehouse. The descriptions of typical enterprise entities including calendars, customers, locations, products, services, providers, and competitors should be attached wherever possible to each data source available for querying and analysis across the enterprise.
This strategic view of enterprise dimensions has powerful architecture and governance implications. Let’s summarize these powerful ideas. Be warned that in the rest of this short design tip, we’ll be reminding you of the standard dimensional modeling techniques, about which we have written extensively.
Enterprise dimensions should be logically centralized and available on a publish/subscribe basis. Each enterprise dimension should have a single authoritative source that makes that dimension available to “fact table providers” across the enterprise. This guarantees that all queries and reports that use an enterprise dimension automatically constrain and group that dimension in a consistent way. Merely providing a centralized dimension is a big data governance step. Don’t get too bogged down in implementing a high tech publish/subscribe architecture. The point is to have everyone using the same data, even if it is based on simple file transfers.
Enterprise dimensions should track time variance using the slowly changing dimension Type 2 (SCD2) technique: add a new row. Any dimension that has changeable content should be built around the standard SCD machinery: 1) single column surrogate primary key; 2) surrogate natural key (sometimes called the supernatural key); 3) administrative fields describing the reason and time stamps for each change; and 4) original raw natural keys from each data source contributing to the dimension. (For instance, in a bank, the customer dimension may be built from multiple lines of business, each with its idiosyncratic version of the natural key for the same customer). If the concepts of this paragraph are unfamiliar, please follow up with the resources mentioned above because SCD techniques are part of the bedrock of modern data warehouses.
Build similar data pipelines for each fact table. Once the enterprise dimensions are available, you should build identical ETL pipelines for processing incoming fact table data. Using the dimension tables themselves, look up the raw natural keys in each dimension, and replace the fact table keys with the correct dimension table surrogate primary keys. Although different ETL environments may code this processing step differently, all the data warehouse teams should have this logic as part of their basic skill set and vocabulary.
Maintain shrunken rollup dimensions as part of the logically centralized enterprise dimension resource. Shrunken rollup dimensions must be physically separate tables apart from the corresponding atomic base dimensions. In general, it is impossible to implement a rollup dimension as a dynamic view on a base dimension. Shrunken rollup dimensions must have their own surrogate primary keys.
Merge enterprise dimension content with local dimension content at the time of publishing. Having an enterprise dimension does not mean that the dimension is a suffocating replacement of local dimension content. For instance, a line of business in a bank, such as credit card, may have private descriptive attributes of each customer that are not exposed as part of the overall enterprise description of that customer. When the credit card data warehouse team receives periodic updates of the enterprise dimension, they merge their local descriptive attributes into the enterprise dimension by matching on the raw natural key, as described above. These steps are at the root of processing “conformed dimensions” in the data warehouse. Please see the white paper Essential Steps for the Integrated EDW on our web site.
I hope with these perspectives you see the power and leverage of building a ready-to-go resource for enterprise dimensions. It gets easier and easier to add data sources from across the enterprise as you build up your dimension “library” because so much of the hard work in each dimension has already been done. As an extra bonus, you have made good progress on establishing data governance, everyone’s favorite topic.