Sérgio Sousa Fernandes PMP – Opens a debate, on one of the many DW/BI sites I follow, about the best way to design a Dimensional DW: Surrogate Keys, or Business Keys together with Dates.
Let’s follow the debate…
Sérgio Sousa Fernandes PMP • I’m currently designing a DW following Kimball and using SAS 9.3 technology. I’m facing a decision on whether or not to use surrogate keys. My past DWs were always based on the business key plus a date when joining with dimensions, but SAS now provides the ability to design everything using surrogate keys. I studied the problem and found the following benefits and disadvantages of SKs:
– Querying the DW will be faster because everything will be based on SKs;
– The DW load process will be slower because it has to find the SK for each fact;
– In snowflake cases it will generate more rows in dimensions. For example, consider a customer dimension with a snowflake on country. Customer 1 is registered in DIM Customer and belongs to Country 1. If there is any change to Country 1, it will generate a new row and SK for country, so a new row must also be generated for Customer with the new country SK. The problem is that SAS doesn’t implement this off the shelf;
– How do you deal with missing data in dimensions for each fact?
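The load-time overhead mentioned above comes from looking up the surrogate key for every incoming fact. A minimal sketch of that lookup step, in plain Python with hypothetical table contents and column names (a reserved “unknown” SK is one common answer to the missing-data question):

```python
# Sketch of a surrogate-key lookup during a fact load.
# Table contents and column names are illustrative, not from the thread.

# Dimension rows keyed by business key -> surrogate key
dim_customer = {
    "CUST-001": 1,
    "CUST-002": 2,
}

UNKNOWN_SK = -1  # reserved SK for facts whose dimension row is missing


def resolve_sk(business_key, dim_lookup):
    """Find the surrogate key for a fact's business key.

    Falls back to a reserved 'unknown' member so the fact is never
    dropped when its dimension row is missing.
    """
    return dim_lookup.get(business_key, UNKNOWN_SK)


facts = [
    {"customer_bk": "CUST-001", "amount": 100.0},
    {"customer_bk": "CUST-999", "amount": 50.0},  # no dimension row yet
]

# Every fact pays the lookup cost -- this is the slower-load trade-off.
loaded = [
    {"customer_sk": resolve_sk(f["customer_bk"], dim_customer),
     "amount": f["amount"]}
    for f in facts
]
```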
Richard Gowan • The primary reason for using your own keys is independence from the source system, as well as the ability to handle “Slowly Changing Dimensions” (Kimball’s terminology) in various ways. I would not worry initially about speed… while it is important not to build anything that completely sabotages future performance, overall it is flexibility (in my view) that matters more in IT than speed.
In many cases speed can be addressed if/where it turns out to be a problem.
In terms of load handling… Kimball’s early articles on type 1–3 Slowly Changing Dimensions (SCDs) are a useful classification. Those articles are easily found online – and there’s now a reference on Wikipedia. Admittedly, I don’t think they illustrate the concept for snowflake schemas.
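For reference, a Type 2 SCD keeps history by closing the current row and inserting a new version with a fresh surrogate key. A minimal sketch in plain Python, with illustrative field names not taken from the thread:

```python
from datetime import date

# Minimal Type 2 SCD sketch: history is kept by versioning rows.
# Field names are illustrative, not from the thread.

dim_customer = [
    {"sk": 1, "bk": "CUST-001", "city": "Lisbon",
     "valid_from": date(2010, 1, 1), "valid_to": None, "current": True},
]


def apply_scd2(dim, bk, new_city, change_date):
    """Close the current row for `bk` and insert a new version
    with a new surrogate key."""
    next_sk = max(r["sk"] for r in dim) + 1
    for row in dim:
        if row["bk"] == bk and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    dim.append({"sk": next_sk, "bk": bk, "city": new_city,
                "valid_from": change_date, "valid_to": None,
                "current": True})


# A change to the customer's city creates a second version.
apply_scd2(dim_customer, "CUST-001", "Porto", date(2012, 6, 1))
```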
Pijush Chatterjee • If you are not designing a DW to be implemented on an RDBMS, you could denormalize the dimensions and maintain a history table and a current table to tackle SCDs. In today’s MPP databases, and even on Oracle Exadata, I have found that single-table fact analysis with denormalized dimensions works much better from a consumption and maintenance standpoint. Nowadays, with the negligible cost of disk space and high compression rates, I guess we can safely navigate away from a decades-old philosophy.
Ronald Kunenborg • The main issue is maintenance. After that: how often do you load your data mart? And how often do you query it? Whichever you do more (probably querying) should be the focus of your optimization.
As for the snowflake example, you indeed can’t choose on the fly whether to query the situation “as is” or “as was reported”. In that case you could use a surrogate key that is only created on the first insert of the business key and point to that one, in combination with the date, but I’m not sure if SAS supports that, and it also somewhat defeats the purpose of using surrogate keys in the first place.
In a situation where you needed that sort of flexibility, I’d say you could use:
a) business keys + dates – easy inserts; a slight performance drop in querying, which may be a significant issue. Has one major advantage: rebuilding a dimension with the same business key does not force you to rebuild the fact tables. This becomes a HUGE issue if your historical storage is dimensional, not so much if your storage is in a different historical storage layer, regardless of modelling method.
b) surrogate key + dates, with a surrogate key that never changes over time – load performance is lower, but query performance is higher in a number of databases (at least Oracle) where the dates are numerical. If you ever need to rebuild a fact or dimension, you need to rebuild the whole star unless you have the SK stored somewhere else.
c) two surrogate keys in both fact and dimension: one SK for the situation at the time you inserted the customer dimension, which is different each time you insert a line in the dimension, and one SK pointing to the “global ID” of the dimension key, which you can use in combination with a “current” flag to find the latest version of the dimension. More complex loading procedures (but not much). Rebuilding the star is about as complex as with option (b).
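Option (c), pairing a version-specific SK with a durable “global” SK plus a current flag, can be sketched as follows in plain Python. All names here are illustrative, not from the thread:

```python
# Sketch of option (c): each dimension row carries a version SK
# (unique per insert) and a durable SK (stable per business key),
# plus a current flag. Names are illustrative.

dim_customer = [
    {"version_sk": 1, "durable_sk": 100, "bk": "CUST-001",
     "city": "Lisbon", "current": False},
    {"version_sk": 2, "durable_sk": 100, "bk": "CUST-001",
     "city": "Porto", "current": True},
]

# A fact stores both keys at load time.
fact = {"customer_version_sk": 1, "customer_durable_sk": 100,
        "amount": 100.0}

# "As was reported": join on the version SK to see the dimension
# row exactly as it stood when the fact was loaded.
as_was = next(r for r in dim_customer
              if r["version_sk"] == fact["customer_version_sk"])

# "As is": join on the durable SK plus the current flag to see the
# latest version of the same customer.
as_is = next(r for r in dim_customer
             if r["durable_sk"] == fact["customer_durable_sk"]
             and r["current"])
```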
Given the maintenance hassle with surrogate keys on stars, I’d say spend a bit more money on hardware, save a lot of money on maintenance, and use the business keys + dates.
Sérgio Sousa Fernandes PMP • Yeap, got the ideas. They match my expectations. Thank you, guys.
Richard Kemp • Have you considered denormalizing the snowflake down to a star schema, and just avoiding the snowflake issue altogether? Will SAS then support that?
Siva Gudavalli • Fernandes, you need to ask a couple of questions before deciding which option you are choosing.
Since you said you are using the Kimball approach, I believe you are going to create data marts and connect them using conformed dimensions in order to set up an Enterprise Data Warehouse.
In the first place, try to keep the schema simple; a star is preferable, and surrogates are really nice to have here.
Snowflakes are a bit more complicated! Surrogates lead to disassociation, which creates unnecessary ETL burdens and degrades performance. Please be very careful if you are using Type 2 SCDs; I have seen issues working with surrogates and Type 2 SCDs.
Also take extra care with conformed dimensions; surrogates are highly recommended here, as you are going to integrate several systems.
Sérgio Sousa Fernandes PMP • I understand the advantages of dimensional modelling, which in fact I’ve been using for years. It works perfectly when you are developing/designing data marts, but when you consider the big picture it has some limitations. The first limitation is that you can’t avoid snowflakes, and the country/geographic dimension is a good example. You may classify facts based on geographic dimensions, but you may also classify other dimensions, for example the customer dimension. Also, you may need one standardized geographic dimension in order to do, for example, some geographic analysis, such as the distribution of customers, or transactions, per geography…
Regarding technology: in this case I’m using SAS which, as you may know, is an open architecture with some (not all) features out of the box. So I have to design in accordance with project timelines, business requirements and, of course, technology features. In fact, I know some other technologies, like Oracle, SQL Server or Talend, and none of those implements this kind of requirement out of the box either. That’s why I’m asking the question: to get some experiences/knowledge… There always exist several possibilities. For example, if you buy the SAS xIS (Banking, Telco or Insurance) solutions, the first thing you get is a data model (and you pay for it!!) that is not truly dimensional.
Alan Musnikow • If you are “following Kimball” you must use surrogate keys. Surrogate keys are absolutely required by Ralph Kimball’s approach.
Adeoye Omoboya • I agree surrogate keys are important in the Ralph Kimball approach for the dimensional model. SCDs and RCDs could be considered, depending on what you want to achieve.
Sérgio Sousa Fernandes PMP • @Alan and @Adeoye, when you talk about SKs, do you mean incremental SKs or SKs + dates?
Alan Musnikow • A simple surrogate key without dates or any other meaningful value – a surrogate key that is an arbitrary number that uniquely identifies a row in a table.
Hal Crawford • By definition, surrogate keys don’t carry any value other than to link to other data. That way, if a group of values change for a given SK, you only have to change it in one place (rather than across the entire DW domain).
Some Kimball purists believe everything should be tied to a surrogate, including date values. I’m not one of those purists. But remember that you are building a storage solution designed to deliver fast responses to queries and analysis.
If SAS will let you get away with denormalization of your fact data, I say go for it. Snowflaking will cost you in performance and maintenance down the road.
Sérgio Sousa Fernandes PMP • @Alan, you are right, but then tell me how you solve the geography problem in your data. I mean, you have geography in the customer and also in transactions, so the easier way is to build a Geog. dimension. With incremental (unique) SKs?
Richard Kemp • I assume from the business case you’re describing that the fact has a reference to geography (where a product was sold, perhaps?), and that the customer has a geography hierarchy logically distinct from that. But it’s the same hierarchy of country/state/city/whatever. Is that correct?
If so, I see it as two different dimensions, each with an SK from fact to dimension just like any other. To build those dimensions you have a common table in a staging area representing the geography and hierarchy, and you use that to build both dimensions.
But from a query perspective, and an actual data modeling perspective, they are two different things.
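Richard’s suggestion, building two role-playing geography dimensions from one shared staging table, can be sketched as follows in plain Python (all table and column names are illustrative, not from the thread):

```python
# Sketch of two role-playing dimensions built from one staging table.
# Names are illustrative, not from the thread.

staging_geography = [
    {"geo_bk": "PT-LIS", "city": "Lisbon", "country": "Portugal"},
    {"geo_bk": "PT-OPO", "city": "Porto", "country": "Portugal"},
]


def build_dimension(staging, role):
    """Assign role-specific surrogate keys over the shared staging rows."""
    return [
        {"sk": i + 1, "role": role, **row}
        for i, row in enumerate(staging)
    ]


# Two logically distinct dimensions built from the same hierarchy.
dim_customer_geography = build_dimension(staging_geography, "customer")
dim_sales_geography = build_dimension(staging_geography, "sales")

# The fact carries a separate SK for each geography role.
fact = {"sales_geo_sk": 1, "customer_geo_sk": 2, "amount": 10.0}
```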
Until next time!!!!