A company that has a huge volume of data builds a data warehouse so that it can generate reports, perform analytics and make informed decisions. The process of building a data warehouse from a transactional or source system is important and the process that a company selects depends on its long term vision and goal of its business intelligence systems. There are two schools of thought on how a data warehouse system should be built. In this article we discuss one of those. The method that we discuss here owe’s its existence to Ralph Kimball and can also be called Dimensional modeling.
Before discussing what dimensional modeling is, let us understand what is expected out of a data warehouse. The requirements of a data warehouse are:
1) It should allow everyone in the organization to effectively access information.
2) It must present a ‘single version of truth’ of data.
3) It must be able to incorporate changes quickly.
4) It must protect data and manage data access.
Components of a Data Warehouse:
Let’s discuss the components of the data warehouse topology.
- Operational source system : This source systems are the transactional systems that the company uses to capture its operational data. The source systems may range from a relational database to a flat file. The source systems are characterized by high availability and high performance. However, they may not store historical information.
- Data Staging Area: The data from the source systems need to be fed to the data warehouse. However, the data may undergo various transformations before it is loaded into the data warehouse. A process called ETL (Extract, Transform and Load) may be carried out during the data staging area. Ralph Kimball advocates that the data in data warehouse should be stored in dimensional form . However, data in the staging area may be in normal form, but should not be avilable to the end user for querying.
- Data Presentation: This is the area where the end users submit their query. It consists of a series of data marts. A data mart contains data in a dimensional form. The data in a data mart corresponds to one business process. Note that the data mart does not contain data from a business department but a business process. There may be multiple processes in an organization and hence multiple data marts. However the processes and hence the data marts may contain information that is shared accross the organization. For example, the customer information may be present in all data marts. Therefore all data marts may contain a dimension called customer. This dimension may be shared (physically or logically) accross all the data marts and is called a conformed dimension.
- The presentation area is used in the following way:
- 1) Creating predefined reports.
- 2) Creating adhoc reports.
- 3) As an input to forecasting.
- 4) As an input to other analytics tool.
The dimensional schema for a relational database is called a star schema. For a multidimensional system the data may be stored in OLAP cubes.
Besides the above the organization may also contain a metadata repository. The metadata is information about the tables, its indexes, structure Etc. It would be beneficial for the organization to let the users perform query using the metadata rather than the tables themselves. This makes the system flexible since any changes in the underlying physical tables does not effect the query.
The metadata can be looked as physical table -> physical model -> business model – > business table <- user query
- Fact Table: The dimensional schema consists of fact tables and one or more dimensional tables. The fact table is the heart of the schema and contains the transactional information. It stores the numeric information that can be aggregated. The performance metrics and other KPI are generally stored in the fact table. one characteristic of the fact table is that it contains data at the same grain. i.e. all rows in the fact table contain data at the same level of granularity. Also this level of granularity should ideally by as low as possible. This gives the user the ability to perform drill down at any level. The textual values may not be suitable for the fact table and are moved to the dimension table. The fact tables have foreign keys(FK) that correspond to the primary key in the dimensional table. Dimensional Tables: This contain the dimensions upon which the user can perform drill through and use in their queries. The dimensional table mostly contains textual values. A very important dimension table is the Date table. The fact table along with the dimensional table form a star join sch ma. Example : Lets take an example of a retail garment store. The store has its sales information and it wants to set up a data warehouse that allows them to generate adhoc reports. The information that it can put in the fact table is the sales amount, the discount offered , the price after discount etc. The information in its dimensional table could be the size of shirt or trousers sold, the colour of the garment. It could have a data dimension that stores the data of the transaction and also stored other information such as holidays and other special days. The schema looks like below.
- The process of dimension design is as follows:
- 1) The first step is selecting the business process. As discussed earlier the business process is distinct from a business department or organization. Typical business process are orders, invoicing, inventory management etc. The business departments may be procurement, store management etc.
- 2) The second step is identifying the granularity of data that is to be stored in the fact table. Care should be taken at this step since it is difficult to change the grain at a later stage. It is advisable to select as low a grain as possible to allow user to perform adhoc queries. Separate fact tables may be used to store accumulated data at a higher grain.
- 3) The third step is choosing the dimension. The dimensions may be date, product etc. The dimension table should contain descriptive values rather than code values. This allows the user to quickly use these columns in their query without performing any lookups. Also the dimension table should contain all derived information as separate columns. For example in a data dimension it is beneficial to store separate columns for day, month, day of year etc. Although this information can be obtained from a ‘date’ object, the user should not be burdened with performing the transformation for use in the query/report. Creating report from the dimensional schema should be as simple as dragging and dropping the columns in the reports.
- 4) The fourth step is to identify the numeric information that goes into the fact table. The numeric information forms the part of the KPI and hence the organization dashboard. Once again derived information should be stored as separate columns. For example if the data contains weight and if the user may need weight in either pounds or Kg then it is beneficial to have two columns in the fact table, once containing the weight in pounds and the other the weight in Kg.
Some important concepts that can be helpful while designing a dimensional model are:
- Degenerate Dimensions : Certain values such as the invoice number or the order number can be used in the fact table without a corresponding dimension table. This values are known as degenerate dimension.
- Dimensional normalization- snowflaking: Although normalization is in general not recommended, in certain cases the dimensional tables may be normalized. This is know as snowflaking. This is typically done when the number of attributes is high. The redundant attributes may be put in another table.
- Surrogate Key : The fact table is joined to the dimensional table using a key. This key should not be a natural key but rather a key generated by the system specifically for this purpose. For example a product dimension may be joined to the fact table using the product number, however there are problems to this approach and it as recommended to generate a unique number for each row in the dimensional table and join the fact table to the dimensional table using this unique number.
- Data Warehouse Bus Architecture:
- Data Warehouse bus architecture can be used to implement an integrated data warehouse in an enterprise. The different data marts can be plugged together using the bus architecture. As seen in the picture, the individual data marts are the rows of the matrix, whereas the columns of the matrix are the conformed dimensions. While designing the data mart, the row is added and the corresponding conformed dimension can be looked up on the columns. It is beneficial to enumerate all conformed dimensions at the start of data warehouse building process. This helps the designer to quickly identify dimensions that can be reused.
- Conformed dimension: Conformed dimensions have been described earlier. Some of the characteristics of conformed dimensions are:
1) They have consistent column names, definition and values.
2) A subset of the entire dimensional table may be used.
3) Separate physical tables may be used. But the physical tables are well synchronized.
4) Dimension tables may be handled by a separate authority, called a dimensional authority.
- Slowly Changing Dimensions: The attributes of the dimensional table may change over time. A suitable technique is needed to incorporate the changed dimensions in the dimensional table without breaking the reports and also making the changed attributes available to be joined to the fact table for reporting.
There are three strategies to handle slowly changing dimensions. The change is described as slow since it is not very often that a dimension attribute changes. If the change is dynamic then the design needs to be rechecked.
Type 1: In the first strategy the new value overrides the old value. So for example, if a product moves from Category A to Category B the category column value for product A is simply changed from category A to category B. This is a simple, fast and easy way to handle the change. The fact table does not change. However, note that the historical information is lost. There is no way to find the performance of product A under category A items. Also historical data of category A may start differing since Product A is not shown under it any more.
Type 2: The second strategy is very widely used. In this strategy an additional row is added to the dimensional table. For example, for product A an additional row is added where the category is now category B. All other values remain same. A new key is generated for the row. This method also highlights the advantage of using a system generated primary key(surrogate key) instead of a natural key as a primary key. The second advantage is that historical reports may still contain the older definition of product A. A date value may be added to show the effective/expiration date.
Type 3: A new column can be added. For product A, two columns can be maintained. one column shows old category and the other shows new category. This strategy can be useful when changes are minor. The historical information is also preserved. However, it may not always be possible to add new columns.
- Dimension Role Playing : Sometimes a single dimension table may be linked to the fact table multiple times (ways). In this case the dimension table may be a single physical table but may be represented in different views. The dimension table therefore plays multiple roles.
- Junk Dimension : A junk dimension is sometimes used to group low cardinality columns such as flags.
- Fact table types: The fact table may be broadly of three types:
- Transaction fact table: These store data at the transaction grain. Each row may correspond to a single transaction.
- Periodic snapshot: These show cumulative data for a particular time frame. Each row may contain aggregated data for the day, week or month. These tables can be used to generate reports for performance metrics (KPI)
- Accumulating snapshot: The accumulating snapshot stores data for a transaction or entity that may take an indeterminate time. For example an accumulating snapshot of customer purchases may show all purchases till date. Note that in this kind of tables the values may need to be updated regularly.