The enterprise data warehouse must stay relevant to the business. As the value and visibility of big data analytics grow, the data warehouse must encompass the new culture, skills, techniques, and systems that big data analytics requires.
For example, big data analysis encourages exploratory sandboxes for experimentation. These sandboxes are copies or segments of the massive data sets being sourced by the organization. Individual analysts or very small groups are encouraged to analyze the data with a wide variety of tools, ranging from serious statistical packages such as SAS,
MATLAB, or R, to predictive models, and many forms of ad hoc querying and visualization through advanced BI graphical interfaces. The analyst responsible for a given sandbox is allowed to do anything with the data, using any tool they want, even if those tools are not corporate standards. The sandbox phenomenon has enormous energy, but it carries a significant risk to the IT organization and the EDW architecture because it can create isolated and incompatible stovepipes of data. This point is amplified in the section on organizational changes, below.
Exploratory sandboxes usually have a limited duration, lasting weeks or at most a few months. Their data can be a frozen snapshot, or a window on a certain segment of incoming data. The analyst may have permission to run an experiment that changes a feature of the product or service in the marketplace, and then perform A/B testing to see how the change affects customer behavior. Typically, if such an experiment produces a successful result, the sandbox experiment is terminated and the feature goes into production. At that point, tracking applications that may have been implemented in the sandbox using a quick and dirty prototyping language are usually reimplemented by other personnel in the EDW environment using corporate standard tools. In several of the e-commerce enterprises interviewed for this white paper, analytic sandboxes were extremely important, and in some cases hundreds of sandbox experiments were ongoing simultaneously. As one interviewee commented, “newly discovered patterns have the most disruptive potential, and insights from them lead to the highest returns on investment.”
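As a concrete illustration of the A/B testing step, the sketch below compares conversion rates between a control and a treatment group with a simple two-proportion z-test. The visitor and conversion counts are invented for illustration; a real sandbox experiment would use the analyst's statistical tool of choice.

```python
import math

def ab_test_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test comparing conversion rates of variants A and B."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled proportion under the null hypothesis that A and B convert equally
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / se

# Hypothetical counts: a z-score beyond roughly +/-1.96 suggests the new
# feature's lift is significant at the 5% level
z = ab_test_z(conversions_a=480, visitors_a=10_000,
              conversions_b=560, visitors_b=10_000)
```

If the experiment clears the significance bar, the feature goes into production and the quick prototype is reimplemented with corporate standard tools, as described above.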
Architecturally, sandboxes should not be brute force copies of entire data sets, or even of major segments of these data sets. In dimensional modeling parlance, the analyst needs much more than just a fact table to run the experiment. At a minimum the analyst also needs one or more very large dimension tables, and possibly additional fact tables for complete “drill across” analysis. If 100 analysts create brute force copies of the data for their sandboxes, enormous disk space and resources will be wasted on the redundant copies. Remember that the largest dimension tables, such as customer dimensions, can have 500 million rows! The recommended architecture for a serious sandbox environment is to build each sandbox on conformed (shared) dimensions, incorporated into each sandbox as relational views, or their equivalent under Hadoop applications.
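The view-based approach can be sketched with Python's built-in sqlite3 module. The table and column names are invented for illustration; the point is that the sandbox owns only its fact snapshot, while the shared customer dimension is reached through a view rather than a redundant copy.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Shared (conformed) customer dimension, maintained centrally -- never copied
cur.execute("CREATE TABLE dim_customer "
            "(customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT)")
cur.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                [(1, "Alice", "retail"), (2, "Bob", "wholesale")])

# The sandbox owns only its own fact snapshot...
cur.execute("CREATE TABLE sandbox_fact (customer_key INTEGER, amount REAL)")
cur.executemany("INSERT INTO sandbox_fact VALUES (?, ?)",
                [(1, 19.99), (2, 250.0), (1, 5.0)])

# ...and sees the dimension through a view, not a 500-million-row copy
cur.execute("CREATE VIEW sandbox_customer AS SELECT * FROM dim_customer")

rows = cur.execute("""
    SELECT c.segment, SUM(f.amount)
    FROM sandbox_fact f JOIN sandbox_customer c USING (customer_key)
    GROUP BY c.segment ORDER BY c.segment
""").fetchall()
```

Every sandbox defined this way automatically picks up corrections and additions made to the central dimension, which is exactly the conformance benefit the view buys.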
An elementary mistake when gathering business requirements during the design of a data warehouse is to ask the business user whether they want “real time” data. Users are likely to say “of course!” Although this answer has perhaps been somewhat gratuitous in the past, a good business case can now be made in many situations that more frequent updates of data delivered to the business at lower and lower latencies are justified. Both RDBMSs and MapReduce/Hadoop systems struggle to load gigantic amounts of data and make that data available within seconds of its creation. But the marketplace wants this, and regardless of a technologist’s doubts about the requirement, the requirement is real and over the next decade it must be addressed.
An interesting angle on low latency data is the desire to begin serious analysis on the data as it streams in, possibly long before the data collection process terminates. There is significant interest in streaming analysis systems that allow SQL-like queries to process the data as it flows into the system. In some use cases, when the results of a streaming query surpass a threshold, the analysis can be halted without running the job to the bitter end. An academic effort, known as Continuous Query Language (CQL), has made impressive progress in defining the requirements for streaming data processing, including clever semantics for dynamically moving time windows on the streaming data. Look for CQL language extensions and streaming data query capabilities in the load programs for both RDBMSs and HDFS-deployed data sets. An ideal implementation would allow streaming data analysis to take place while the data is being loaded at gigabytes per second.
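The moving time window and early-termination semantics described above can be sketched in a few lines of Python. The event stream, window size, and threshold are invented for illustration; real CQL engines are far more sophisticated, but the shape of the computation is the same.

```python
from collections import deque

def streaming_window_count(events, window_seconds, threshold):
    """Count events per sliding time window as they stream in; halt as soon
    as any window's count crosses the threshold, without draining the stream."""
    window = deque()  # timestamps currently inside the moving window
    for ts in events:
        window.append(ts)
        # Slide the window forward: evict timestamps that fell out of range
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) >= threshold:
            return ts, len(window)  # early termination, CQL-style
    return None  # stream ended without the threshold being crossed

# Hypothetical event timestamps (seconds); a burst arrives around t=100
stream = iter([1, 5, 40, 100, 100.5, 101, 101.2, 200])
hit = streaming_window_count(stream, window_seconds=10, threshold=4)
```

Note that the function never sees the whole stream at once: it consumes events one at a time, which is what makes the approach compatible with loading data at gigabytes per second.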
The availability of extremely frequent and extremely detailed event measurements can drive interactive intervention. The use cases where this intervention is important span many situations, ranging from online gaming to product offer suggestions to financial account fraud responses to the stability of networks.
Continuous thirst for more exquisite detail
Analysts are forever thirsting for more detail in every marketplace observation, especially of customer behavior. For example, every webpage event (a page being painted on a user’s screen) spawns hundreds of records describing every object on the page. In online games, where every gesture enters the data stream, as many as 100 descriptors are attached to each of these gesture micro-events. For instance, in a hypothetical online baseball game, when the batter swings at a pitch, everything describing the positions of the players, the score, the runners on the bases, and even the characteristics of the pitch is stored with that individual record. In both of these examples, the complete context must be captured within the current record, because it is impractical to compute this detailed context after the fact from separate data sources.
The lesson for the coming decade is that this thirst for exquisite detail will only grow. It is possible to imagine thousands of attributes being attached to some micro-events, and the categories and names of these attributes will grow in unpredictable ways. This makes the data bag approach discussed earlier in the paper much more important. It means that positionally dependent schemas, in which the keys (the names of the data) are pre-declared as column names, are an unworkable design.
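A minimal sketch of the data bag idea: each micro-event carries its own set of key/value pairs, and queries filter on whatever keys happen to be present. All field names below are invented for illustration, echoing the baseball and webpage examples above.

```python
# Each micro-event carries its own bag of key/value pairs; no fixed column list
events = [
    {"event": "swing", "batter_id": 17, "pitch_mph": 93.5, "runners_on": [1, 3]},
    {"event": "page_view", "page_id": "home", "objects_rendered": 212},
    {"event": "swing", "batter_id": 17, "pitch_mph": 88.0, "score": "3-2"},
]

def select(bag, **criteria):
    """Filter data-bag records on whatever keys happen to be present.
    Records lacking a requested key simply fail the match -- no schema
    change is needed when new attribute names appear."""
    return [e for e in bag if all(e.get(k) == v for k, v in criteria.items())]

swings = select(events, event="swing", batter_id=17)
```

Contrast this with a positionally dependent table: adding a new descriptor to the bag requires no ALTER TABLE, no reload, and no agreement in advance about the attribute's name.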
Finally, a perfect historical reconstruction of interesting events such as webpage exposures needs to be more than just a list of attributes on the webpage when it was displayed, even if that list is enormously detailed. A perfect historical reconstruction of the webpage needs to be seen through a multimedia user interface, i.e., a browser.
Light touch data waits for its relevance to be exposed
Light touch data is an aspect of the exquisite detail data described in the previous section. For example, if a customer browses a website extensively before making a purchase, a great deal of micro-context is stored in all the webpage events prior to the purchase. When the purchase is made, some of that micro-context suddenly becomes much more important, and is elevated from “light touch data” to real data. At that point the sequence of exposures to the selected product, or to competitive products in the same space, can be sessionized. These micro-events are largely meaningless before the purchase event, because there are so many conceivable and irrelevant threads that would be dead ends for analysis. This requires oceans of light touch data to be stored, waiting for the relevance of selected threads of these micro-events to eventually be exposed. Conventional seasonality thinking suggests that at least five quarters (15 months) of this light touch data needs to be kept online. This is one instance of a remark made consistently during interviews for this white paper: analysts want “longer tails,” meaning more significant histories than they currently get.
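The after-the-fact sessionization step might look like the sketch below: once a purchase elevates a product thread to relevance, its prior page views are pulled from the ocean of light touch data and broken into sessions on gaps of inactivity. The click records, field names, and 30-minute session gap are illustrative assumptions.

```python
def sessionize_on_purchase(events, purchased_product, gap_seconds=1800):
    """After a purchase makes one thread relevant, extract that product's
    page-view timestamps and split them into sessions on inactivity gaps."""
    thread = sorted(e["ts"] for e in events if e.get("product") == purchased_product)
    sessions, current = [], []
    for ts in thread:
        if current and ts - current[-1] > gap_seconds:
            sessions.append(current)   # gap exceeded: close the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Hypothetical light touch clicks; only the "camera" thread matters after purchase
clicks = [
    {"ts": 0,    "product": "camera"},
    {"ts": 300,  "product": "camera"},
    {"ts": 310,  "product": "laptop"},   # irrelevant thread, a dead end
    {"ts": 9000, "product": "camera"},   # new session after a long gap
]
sessions = sessionize_on_purchase(clicks, "camera")
```

The "laptop" click is exactly the kind of dead-end thread that makes the raw data meaningless before the purchase event; the purchase is what tells us which thread to sessionize.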
Simple analysis of all the data trumps sophisticated analysis of some of the data
Although data sampling has never been a popular technique in data warehousing, surprisingly the arrival of enormous petabyte-sized data sets has not increased interest in analyzing only a subset of the data. On the contrary, a number of analysts point out that monetizable insights can be derived from very small populations that would be missed by sampling only some of the data. Of course this is a somewhat controversial point, since the same analysts admit that if you have 1 trillion behavior observation records, you may be able to find any behavior pattern if you look hard enough.
Another somewhat controversial point raised by some analysts is the concern that any form of data cleaning on the incoming data could erase interesting low-frequency “edge cases.” Ultimately, both misleading rare behavior patterns and misleading corrupted data need to be gently filtered out.
Assuming that the behavior insights from very small populations are valid, there is widespread recognition that micro-marketing to the small populations is possible, and doing enough of this can build a sustainable strategic advantage.
A final argument in favor of analyzing complete data sets is that these “relation scans” do not require indexes or aggregations to be computed in advance of the analysis. This approach fits well with the basic MapReduce distributed analysis architecture.
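A minimal sketch of this relation-scan style in MapReduce form: every record is mapped to key/value pairs and reduced per key, with no indexes or precomputed aggregates consulted. The behavior records are invented for illustration; note that the rare "NZ" population survives because nothing was sampled away.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Full relation scan in MapReduce style: map every record to
    (key, value) pairs, then reduce the values per key. The scan touches
    all the data -- no index or pre-built aggregate is required."""
    groups = defaultdict(list)
    for record in records:          # the "relation scan"
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# Hypothetical behavior records: count events per country, rare ones included
records = [{"country": "US"}, {"country": "US"}, {"country": "NZ"}]
counts = map_reduce(records,
                    mapper=lambda r: [(r["country"], 1)],
                    reducer=sum)
```

In a real MapReduce deployment the scan, shuffle, and reduce phases run in parallel across many nodes, but the logical contract is exactly what this single-process sketch shows.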
Data structures should be declared at query time, not at data load time
A number of analysts interviewed for this white paper said that the enormous data sets they were trying to analyze needed to be loaded in a queryable state before their structure and content were completely understood. Again, think of the data bag kind of marketplace observation: within a well-structured dimensional measurement process, the actual observation is a disorderly and potentially unpredictable set of key-value pairs. The structure of this data bag may need to be discovered, and alternate interpretations of the structure may need to be possible without reloading the database. One respondent remarked that “yesterday’s fringe data is tomorrow’s well-structured data,” implying that we need exceptional flexibility as we explore new kinds of data sources.
A key differentiator between the RDBMS approach and the MapReduce/Hadoop approach is the deferral of the data structure declaration until query time in the MapReduce/Hadoop systems. An objection from the RDBMS community is that requiring every MapReduce job to declare its own target data structure promotes a kind of chaos, because every analyst can do their own thing. But that objection seems to miss the point that a standard data structure declaration can easily be published as a library module that can be picked up by every analyst when implementing an application.
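The published-library idea can be sketched as follows: the raw bytes stay raw at load time, and a shared declaration of field names and parsers is applied only when someone queries. The schema, file format, and field names below are illustrative assumptions.

```python
# A published, shared declaration: field names and parsers the team agrees on.
# Nothing is enforced at load time -- raw lines stay raw until queried.
CLICK_SCHEMA = [("ts", int), ("user_id", str), ("url", str)]

def parse_at_query_time(raw_line, schema=CLICK_SCHEMA):
    """Apply a schema to a raw tab-delimited line at query time.
    A different analyst could apply a different schema to the same bytes,
    or everyone can import the shared CLICK_SCHEMA for consistency."""
    fields = raw_line.rstrip("\n").split("\t")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

raw = "1714070000\tu42\t/products/camera\n"
record = parse_at_query_time(raw)
```

Publishing CLICK_SCHEMA as an importable module gives the standardization the RDBMS community wants, while preserving the freedom to reinterpret yesterday's fringe data without reloading it.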
The EDW supporting big data analytics must be magnetic, agile, and deep
Cohen and Dolan in their seminal but somewhat controversial paper on big data analytics argue that EDWs must shed some old orthodoxies in order to be “magnetic, agile, and deep.” A magnetic environment places the least impediments on the incorporation of new, unexpected, and potentially dirty data sources. Specifically, this supports the need to defer declaration of data structures until after the data is loaded.
According to Cohen and Dolan, an agile environment eschews long-range careful design and planning! And a deep environment allows running sophisticated analytic algorithms on massive data sets without sampling, or perhaps even cleaning. We have made these points elsewhere in this white paper but Cohen and Dolan’s paper is a particularly potent, if unusual, argument. Read this paper to get some provocative perspectives! A link to Cohen and Dolan’s paper is provided in the references section at the end of this white paper.
The conflict between abstraction and control
In the MapReduce/Hadoop world, Pig and Hive are widely regarded as valuable abstractions that allow the programmer to focus on database semantics rather than programming directly in Java. But several analysts interviewed for this paper remarked that too much abstraction and too much distancing from where the data actually is stored can be disastrously inefficient. This seems like a reasonable concern when dealing with the very largest data sets, where a bad algorithm could result in runtimes measured in days. For the breaking wave of the biggest data sets, programming tools will need to allow considerable control over the storage strategy, and the processing approaches, but without requiring programming using the lowest level code.
Data warehouse organization changes in the coming decade
The growing importance of big data analytics amounts to something between a midcourse correction and a revolution for enterprise data warehousing. New skill sets, new organizations, new development paradigms, and new technology will need to be absorbed by many enterprises, especially those facing the use cases described in this paper. Not every enterprise needs to jump into the petabyte ocean, but it is this author’s prediction that the upcoming decade will see a steady growth in the percentage of large enterprises recognizing the value of big data analytics.
Most observers would agree that big data analytics falls within “information management,” but the same observers may quibble about whether this affects the “data warehouse.” Rather than worrying about whether the box on the organization chart labeled EDW has responsibility for big data analytics, we take the perspective that enterprise data warehousing without the capital letters absolutely encompasses big data analytics. Having said that, there will be many different organizational structures and management perspectives as industries expand their information management.
This kind of tinkering and adjusting to the new paradigm is normal and expected. We went through a very similar phase in the mid 1980s when data warehousing itself was a new paradigm for IT and the business. Many of the most successful early data warehousing initiatives started in the business organizations and were eventually incorporated into those IT organizations that then made major commitments to being business relevant. It is likely the same evolution will take place with big data analytics.
The challenge before information managers in large enterprises is how to encourage three separate data warehouse endeavors: conventional RDBMS applications, MapReduce/Hadoop applications, and advanced analytics.
Technical skill sets required
It is worth repeating here the message of the very first sentence of this white paper. Petabyte-scale data sets are of course a big challenge, but big data analysis is often about difficulties other than data volume. You can have fast-arriving data, complex data, or complex analyses that are very challenging even if all you have are terabytes of data!
The care and feeding of RDBMS-oriented data warehouses involves a comprehensive set of skills that is pretty well understood: SQL programming, ETL platform expertise, database modeling, task scheduling, system building and maintenance skills, one or more scripting languages such as Python or Perl, UNIX or Windows operating system skills, and business intelligence tool skills. SQL, which is at the core of an RDBMS implementation, is a declarative language, which contrasts with the procedural mindset needed for MapReduce/Hadoop programming, at least in Java. The data warehouse team also needs good partnerships with other areas of IT, including storage management, security, networking, and support of mobile devices. Finally, good data warehousing also requires extensive involvement with the business community, and with the cognitive psychology of end users!
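The declarative-versus-procedural contrast can be made concrete in a few lines. Below, the same per-region total is computed declaratively (SQL via Python's sqlite3: say what you want, let the engine decide how) and procedurally (spell out every step, the MapReduce mindset). The sales rows are invented for illustration.

```python
import sqlite3
from collections import defaultdict

sales = [("east", 100.0), ("west", 50.0), ("east", 25.0)]

# Declarative: describe WHAT you want; the engine chooses how to compute it
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", sales)
declarative = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Procedural: spell out HOW, step by step (the MapReduce/Java mindset)
totals = defaultdict(float)
for region, amount in sales:
    totals[region] += amount
procedural = dict(totals)
```

Both produce the same answer; the skill-set gap the paragraph describes is precisely the gap between writing the first form fluently and writing the second form correctly at scale.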
The care and feeding of MapReduce/Hadoop data warehouses, including any of the big data analytics use cases described in this paper, involves a set of skills that only partially overlaps traditional RDBMS data warehouse skills. Therein lies a significant challenge. These new skills include lower-level programming languages such as Java, C++, Ruby, and Python, and MapReduce interfaces most commonly available via Java.
Although the requirement to program in procedural lower-level languages will be reduced significantly during the upcoming decade in favor of Pig,
Hive, and HBase, it may be easier to recruit MapReduce/Hadoop application developers from the programming community rather than the data warehouse community, if the data warehouse job applicants lack programming and UNIX skills. If MapReduce/Hadoop data warehouses are managed exclusively with open source tools, then Zookeeper and Oozie skills will be needed too. Keep in mind that the open-source community innovates quickly. Hive, Pig and HBase are not the last word in high-level interfaces to Hadoop for analysis. It is likely that we will see much more innovation in this decade including entirely new interfaces.
ETL platform providers have a big opportunity to provide much of the glue that will tie together the big data sources, MapReduce/Hadoop applications, and existing relational databases. Developers with ETL platform skills will be able to leverage a great deal of their experience and instincts in system building when they incorporate MapReduce/Hadoop applications.
Finally, the analysts whom we have described as often working in sandbox environments will arrive with an eclectic and unpredictable set of skills starting with deep analytic expertise. For these people it is probably more important to be conversant in SAS, Matlab, or R than to have specific programming language or operating system skills. Such individuals typically will arrive with UNIX skills, and some reasonable programming proficiency, and most of these people are extremely tolerant of learning new complex technical environments. Perhaps the biggest challenge with traditional analysts is getting them to rely on the other resources available to them within IT, rather than building their own extract and data delivery pipelines. This is a tricky balance because you want to give the analysts unusual freedom, but you need to look over their shoulders to make sure that they are not wasting their time.
New organizations required
At this early stage of the big data analytics revolution, there is no question that the analysts must be part of the business organization, both to understand the microscopic workings of the business and to be able to conduct the kind of rapid turnaround experiments and investigations we have described in this paper. As we have described, these analysts must be heavily supported technically, with potentially massive compute power and data transfer bandwidth. So although the analysts may reside in the business organizations, this is a great opportunity for IT to gain credibility and presence with the business. It would be a significant mistake and a lost opportunity for the analysts and their sandboxes to exist as rogue technical outposts in the business world without recognizing and taking advantage of their deep dependence on the traditional IT world.
In some organizations we interviewed for this white paper, we saw separate analytic groups embedded within different business organizations, but without very much cross communication or common identity established among the analytic groups. In some noteworthy cases, this lack of an “analytic community” led to lost opportunities to leverage each other’s work, and led to multiple groups reinventing the same approaches, and duplicating programming efforts and infrastructure demands as they made separate copies of the same data.
We recommend that a cross divisional analytics community be established mimicking some of the successful data warehouse community building efforts we have seen in the past decade. Such a community should have regular cross divisional meetings, as well as a kind of private LinkedIn application to promote awareness of all the contacts and perspectives and resources that these individuals collect in their own investigations, and a private web portal where information and news events are shared. Periodic talks can be given, hopefully inviting members of the business community as well, and above all the analytics community needs T-shirts and mugs!
New development paradigms required
Even before the arrival of big data analytics, data warehousing had been transforming itself to respond more rapidly to new opportunities and to be more in touch with the business community. Some of the practices of the agile software development movement have been successfully adopted by the data warehouse community, although realistically this has not been a highly visible transformation. In particular, the agile approach supports the data warehouse by organizing development around small teams driven by the business, not by IT. An agile development effort also produces frequent tangible deliveries, deemphasizes documentation and formal development methodologies, and tolerates midcourse corrections and the incremental acceptance of new requirements. The most sensitive ingredient for the success of agile development projects is the personality and skills of the business leader who is ultimately in charge. The agile business leader needs to be a thoughtful and sophisticated observer of the development process and the realities of the information world. Hopefully the agile business leader is a pretty good manager as well.
Big data analytics certainly opens the door to business involvement since the central analysis is probably done in the business environment directly. But it is probably unlikely that the professional analyst is the right person to be the overall agile data warehouse project leader. The agile project leader needs to be well skilled in facilitating short effective meetings, resolving issues and development choices, determining the truth of progress reports from individual developers, communicating with the rest of the organization, and getting funding for initiatives.
Traditional data warehouse development has discovered the attractiveness of building incrementally from a modest start, but with a good architectural foundation that provides a blueprint for where future development will go. This author has described in many papers the techniques for “graceful modification” of dimensional data warehouse schemas. In a dimensionally modeled data warehouse, new measurement facts, new dimensional attributes, and even new dimensions can be added to existing data warehouse applications without changing, invalidating, or rolling over existing information delivery pipelines to the end users. Many of the use cases we have described in this paper for big data analytics suggest that new facts, new attributes, and new dimensions will routinely become available.
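Graceful modification can be demonstrated in miniature with Python's sqlite3: a new dimensional attribute is added in place, existing rows receive a default, and a delivery query written before the change still runs unchanged. The table and attribute names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_product "
            "(product_key INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO dim_product VALUES (1, 'camera')")

# An existing information delivery query, written before any schema change
old_query = "SELECT name FROM dim_product WHERE product_key = 1"
before = con.execute(old_query).fetchone()

# Graceful modification: add a new dimensional attribute in place.
# Existing rows get a default value; nothing is reloaded or invalidated.
con.execute("ALTER TABLE dim_product "
            "ADD COLUMN eco_rating TEXT DEFAULT 'unknown'")

# The old delivery query still runs unchanged after the schema evolves
after = con.execute(old_query).fetchone()
```

New facts, attributes, and dimensions arriving from big data sources can be absorbed the same way: additively, without rolling over the pipelines already delivering information to end users.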
Integration of new data sources into a data warehouse has always been a significant challenge, since often these new data sources arrive without any thought to integration with existing data sources. This will certainly be the case with big data analytics. Again for dimensionally modeled data warehouses, this author has described techniques for incremental integration, where “enterprise dimensional attributes” are defined and planted in the dimensions of the separate data sources. We call these conformed dimensions. The development and deployment of conformed dimensions fits the agile development approach beautifully, since this kind of integration can be implemented one data source at a time, and one dimensional attribute at a time, again in a way that is nondestructive to existing applications. Please see the references section at the end of this white paper for more information on conformed dimensions.
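A toy sketch of drilling across two separately sourced processes that share a conformed attribute (here a hypothetical customer "segment"): each source is summarized on the shared attribute, and the per-source results are merged row by row on its values.

```python
# Two separately sourced measurement processes, each summarized on the same
# conformed attribute value ("retail" / "wholesale"); numbers are invented
web_visits = {"retail": 1200, "wholesale": 90}            # clickstream source
store_sales = {"retail": 55000.0, "wholesale": 210000.0}  # point-of-sale source

def drill_across(*fact_results):
    """Merge per-source results on the conformed attribute value.
    Sources can be integrated one at a time -- a new source simply
    contributes another column, nondestructively."""
    keys = sorted(set().union(*fact_results))
    return {k: tuple(t.get(k) for t in fact_results) for k in keys}

report = drill_across(web_visits, store_sales)
```

The merge only works because both sources planted the identical enterprise attribute in their dimensions; that agreement, one attribute at a time, is the incremental integration the paragraph describes.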
Finally, at least one organization interviewed for this white paper has taken agility to its logical extreme. Individual developers are given complete end-to-end responsibility for a project, all the way from original sourcing of the data, through experimental analysis, re-implementing the project for production use, and working with the end-users and their BI tools in supportive mode. Although this development approach remains an experiment, early results are very interesting because these developers feel a significant sense of responsibility and pride for their projects.
Lessons from the early data warehousing era
It took most of the 1990s for organizations to understand what a data warehouse was and how to build and manage those kinds of systems. Interestingly, at the end of the 1990s, data warehousing was effectively relabeled as business intelligence. This was a very positive development because it reflected the need for the business to own and take responsibility for the uses of data.
The earliest data warehouse pioneers had no choice but to do their own systems integration, assembling best-of-breed components and coping with the inevitable incompatibilities of dealing with multiple vendors. By the end of the 1990s, the best-of-breed approach gave way to vendor stacks of integrated products, a trend which continues today. At this point, there are only a few independent vendors in the data warehouse space, and those vendors have succeeded by interfacing with nearly every conceivable format and interface, thereby providing bridges between the more limited proprietary vendor stacks.
With the benefit of hindsight gained from the traditional data warehouse experience, the big data analytics version of data warehousing is likely to consolidate quite quickly. Only the bravest organizations with very strong software development skills should consider rolling their own big data analytics applications directly on raw MapReduce/Hadoop. For information management organizations wishing to focus on the business issues rather than on the breaking wave of software development, a packaged Hadoop distribution (e.g., Cloudera) makes a lot of sense. The leading ETL platform vendors likely will also introduce packaged environments for handling many of the phases of MapReduce/Hadoop development.
Analytics in the cloud
This white paper has not discussed cloud implementations of big data analytics. Most of the enterprises interviewed for this white paper were not using public cloud implementations for their production analytics. Nevertheless, cloud implementations may be very attractive in the startup phase for an analytics effort. A cloud service can provide instant scalability during this startup phase, without committing to a massive legacy investment in hardware. Data analysis projects can be turned on and turned off on short notice. Recall that typical analytic environments may involve hundreds of separate sandboxes and parallel experiments.
Many of the organizations interviewed for this paper stated that mature analytics should be brought in-house, perhaps implemented technically as a cloud but within the confines of the organization. Of course, such an in-house cloud may reduce fears of security and privacy breaches (fairly or not).
A remote cloud implementation raises issues of network bandwidth, especially in a broadly integrated application with multiple very large data sets in different locations. Imagine solving the big join problem where your trillion row fact table is out on the cloud, and your billion row dimension table is located in-house.
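Back-of-the-envelope arithmetic makes the bandwidth concern vivid. The figures below are purely illustrative assumptions (a billion-row dimension at 500 bytes per row, shipped over a fully utilized 1 Gb/s WAN link), not measurements from any interviewed organization.

```python
def transfer_hours(rows, bytes_per_row, link_gbps):
    """Hours needed to ship a table across a network link at full utilization
    (ignoring protocol overhead, contention, and retransmission)."""
    total_bits = rows * bytes_per_row * 8
    return total_bits / (link_gbps * 1e9) / 3600

# Illustrative: a billion-row dimension at 500 bytes/row over a 1 Gb/s WAN
hours = transfer_hours(rows=1_000_000_000, bytes_per_row=500, link_gbps=1.0)
```

Even under these optimistic assumptions, merely moving the dimension to the fact table takes over an hour before the join can begin; real links are shared, so the practical cost is considerably higher.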
Although the best performing systems try to achieve a three-way balance among CPU, disk speed, and bandwidth, most organizations interviewed for this paper predicted that bandwidth would emerge as the number one limiting factor for big data analytics system performance.
The enterprise data warehouse must expand to encompass big data analytics as part of overall information management. The mission of the data warehouse has always been to collect the data assets of the organization and structure them in a way that is most useful to decision-makers. Although some organizations may persist with a box on the org chart labeled EDW that is restricted to traditional reporting activities on transactional data, the scope of the EDW should grow to reflect these new big data developments. In some sense there are only two functions of IT: getting the data in (transaction processing), and getting the data out. The EDW is getting the data out.
The big choice facing shops with growing big data analytics investments is whether to choose an RDBMS-only solution, or a dual RDBMS and MapReduce/Hadoop solution.
This author predicts that the dual solution will dominate, and in many cases the two architectures will not exist as separate islands but rather will have rich data pipelines going in both directions. It is safe to say that both architectures will evolve hugely over the next decade, but this author predicts that both architectures will share the big data analytics marketplace at the end of the decade.
Sometimes when an exciting new technology arrives, there is a tendency to close the door on older technologies as if they were going to go away. Data warehousing has built an enormous legacy of experience, best practices, supporting structures, technical expertise, and credibility with the business world. This will be the foundation for information management in the upcoming decade as data warehousing expands to include big data analytics.