Big data gets a lot of press in technology and IT circles these days, both because it is a disruptive technology very different from traditional relational databases, and because it opens up many new forms of analysis. In this Design Tip, I’ll tackle some questions many IT people may be worried about. Is big data a new IT theme that has nothing to do with the data warehouse? Do the data warehouse skills and perspectives we have developed over the years help us in any way with big data? And does big data perhaps belong in end-user departments, outside the scope of IT altogether? For an in-depth treatment of big data, please visit our website for a link to my white paper, “The Evolving Role of the EDW in the Era of Big Data Analytics.”
Big data fits within the mission of the data warehouse. The mission of the data warehouse has always been to marshal the data assets of an organization and expose those assets in the most effective way to facilitate decision making by a broad range of business users. Big data clearly fits within this mission.
The dimensional modeling foundations of the data warehouse can be found in every big data use case. The dimensional modeling approach to data warehousing starts with measurement events (observations). In the relational world, these events are captured in fact tables, and these fact tables are linked to the natural entities of the organization, which we structure as dimension tables. It is not a stretch to interpret virtually every big data use case as collecting a set of observations whose context requires linking to natural entities. The observations and entities don’t need to be cast as literal relational fact tables and dimension tables. For example, a tweet coming from Twitter is itself an observation that carries with it a number of obvious dimension-like entities, including the sender, the recipients, the subject, the origin server, the date and time, and the causal factors in the environment that the tweet may be responding to. Digging out some of these dimensions is not unlike what we do when constructing causal promotion dimensions in the data warehouse.
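To make the tweet example concrete, here is a minimal Python sketch that treats one tweet-like record as an observation and pulls out its dimension-like entities. The field names (`sender`, `mentions`, `hashtags`, `origin_server`, `created_at`, `text`) are hypothetical and are not taken from any real Twitter payload; the point is only the split between the measurement and its context.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TweetObservation:
    """One measurement event plus references to its dimension-like entities."""
    sender: str            # entity: who sent it
    recipients: tuple      # entities: who it was addressed to
    subjects: tuple        # entities: what it is about
    origin_server: str     # entity: where it came from
    observed_at: datetime  # entity: date and time
    char_count: int        # the only numeric measurement in this sketch


def to_observation(raw: dict) -> TweetObservation:
    """Map a raw record (hypothetical field names) onto the observation shape."""
    return TweetObservation(
        sender=raw["sender"],
        recipients=tuple(raw.get("mentions", [])),
        subjects=tuple(raw.get("hashtags", [])),
        origin_server=raw.get("origin_server", "unknown"),
        observed_at=datetime.fromisoformat(raw["created_at"]),
        char_count=len(raw["text"]),
    )


raw_tweet = {
    "sender": "alice",
    "mentions": ["acme"],
    "hashtags": ["gadgets"],
    "origin_server": "web-01",
    "created_at": "2012-09-01T10:15:00",
    "text": "Loving the new phone from @acme #gadgets",
}
obs = to_observation(raw_tweet)
print(obs.sender, obs.recipients, obs.subjects, obs.observed_at.date())
```

Nothing here is a relational fact table, yet the shape is the familiar one: a measurement surrounded by references to entities. The causal factors mentioned above are left out because they would have to be derived from outside the record itself.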
Data integration requires conformed dimensions. If we agree that big data entities are just dimensions, and if we are committed to integrating diverse big data sources, then we can’t avoid the central step of integration: conforming. Stepping back from relational databases, conforming dimensions means that we establish common descriptive context for a given entity when that entity appears in more than one big data source. In other words, if we have a common User entity associated with Twitter, Facebook, and LinkedIn observations, and we intend to tie these data sources together, we must have a common data thread consisting of descriptive attributes administered identically across these three data sources. In the data warehouse we know a lot about conformed dimensions. This knowledge is spot-on relevant to integrating big data sources.
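The conforming step can be sketched in a few lines of Python. Everything source-specific below is an assumption made for illustration: the per-source field names in `FIELD_MAPS`, the choice of `email` and `country` as the shared attributes, and the cleanup rules. What matters is that each source's User record is mapped onto one attribute set administered the same way, so observations from different sources can be grouped on the same labels.

```python
from collections import defaultdict

# Hypothetical mapping from each source's field names to the conformed names.
FIELD_MAPS = {
    "twitter": {"email": "email", "country": "location_country"},
    "facebook": {"email": "contact_email", "country": "country_code"},
    "linkedin": {"email": "emailAddress", "country": "geoCountry"},
}


def conform_user(source: str, record: dict) -> dict:
    """Return the record expressed in the shared, identically administered attributes."""
    field_map = FIELD_MAPS[source]
    conformed = {target: record.get(src) for target, src in field_map.items()}
    # Apply the same cleanup rules regardless of where the record came from.
    conformed["email"] = (conformed["email"] or "").strip().lower()
    conformed["country"] = (conformed["country"] or "unknown").strip().upper()
    return conformed


def drill_across(observations) -> dict:
    """Count observations per conformed country, broken out by source."""
    report = defaultdict(lambda: defaultdict(int))
    for source, user_record in observations:
        user = conform_user(source, user_record)
        report[user["country"]][source] += 1
    return {country: dict(counts) for country, counts in report.items()}


observations = [
    ("twitter", {"email": "A@Example.com", "location_country": "us"}),
    ("facebook", {"contact_email": "a@example.com ", "country_code": "US"}),
    ("linkedin", {"emailAddress": "b@example.com", "geoCountry": " de "}),
]
report = drill_across(observations)
print(report)  # {'US': {'twitter': 1, 'facebook': 1}, 'DE': {'linkedin': 1}}
```

Without the shared cleanup rules, "us" and "US" would land in separate rows and the cross-source comparison would quietly break, which is the practical reason the attributes have to be administered identically.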
Proper tracking of time variance in big data requires durable keys and probably surrogate keys. If a big data use case requires correct historical tracking, then a mechanism must be supplied for keeping the old versions of the dimensional entities. At the very least, an entity like User requires a durable identifier that remains constant across varying versions of the dimension member. And, sooner or later, big data practitioners will discover a lesson we learned in the data warehouse nearly twenty years ago: you need to create your own surrogate keys for the members of a dimension because natural keys created by source systems are plagued with operational problems. Thus, once again, fundamental lessons learned in the data warehouse can be applied to big data. Or, to put it more strongly, sooner or later these fundamental lessons MUST be applied in the big data world.
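A minimal Python sketch of this idea follows. It assumes a User entity with a single tracked attribute (`city`) and an in-memory list standing in for the dimension; the class and method names are hypothetical. The durable key stays constant for a given user, while each new version of that user receives its own surrogate key assigned by us rather than by the source system.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Optional


@dataclass
class UserVersion:
    surrogate_key: int               # assigned here, unique per version
    durable_key: str                 # constant across all versions of one user
    city: str                        # the tracked attribute in this sketch
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


class UserDimension:
    def __init__(self):
        self._next_key = count(1)
        self.rows = []

    def current(self, durable_key: str) -> Optional[UserVersion]:
        for row in self.rows:
            if row.durable_key == durable_key and row.valid_to is None:
                return row
        return None

    def upsert(self, durable_key: str, city: str, as_of: date) -> int:
        """Keep the old version and add a new one whenever the attribute changes."""
        row = self.current(durable_key)
        if row is not None:
            if row.city == city:
                return row.surrogate_key  # nothing changed, reuse the version
            row.valid_to = as_of          # close out the old version
        new_row = UserVersion(next(self._next_key), durable_key, city, as_of)
        self.rows.append(new_row)
        return new_row.surrogate_key

    def key_as_of(self, durable_key: str, when: date) -> Optional[int]:
        """Find the version that was in effect on a given date."""
        for row in self.rows:
            if (row.durable_key == durable_key
                    and row.valid_from <= when
                    and (row.valid_to is None or when < row.valid_to)):
                return row.surrogate_key
        return None


dim = UserDimension()
k1 = dim.upsert("user-42", "Boston", date(2012, 1, 1))
k2 = dim.upsert("user-42", "Denver", date(2012, 6, 1))
print(k1, k2, dim.key_as_of("user-42", date(2012, 3, 15)))  # 1 2 1
```

An observation recorded in March carries surrogate key 1 and is therefore still described by the Boston version of the user, even after the move to Denver. The durable key is what lets us gather all versions of the same user when the history does not matter.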
IT and big data must get married someday. A lot of the action in big data today is taking place outside of IT. “Data scientists” are building their own data analysis sandboxes, developing their own ad hoc analytic applications, and taking their results directly to senior management. As exciting as this is, it is not a sustainable model. Where do I start? No data governance, no sharing of IT resources across sandboxes, poor communication among data scientists, no culture for producing production-hardened applications, no end-user training or support, and the list goes on. For IT and big data to get married, a significant effort will have to be made by end-user management, which must recognize that the current model is at best a prototyping effort, and by IT, which has to invest in big data technical skills and business content.
I am cautiously optimistic that big data is real and that it will grow to be a significant theme for the data warehouse and IT. This will happen sooner if big data users recognize the valuable legacy that the data warehouse brings to the party.