The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics, Part II

On to Part II.

Making sense of big data analytic use cases

The purpose of developing this list of use cases is to convince the reader that the use cases come in all shapes, sizes, and formats, and require many specialized approaches to analyze. Until very recently, all of these use cases existed as separate endeavors, often involving special purpose-built systems. But industry awareness of the “big data analytics challenge” is motivating everyone to look for the architectural similarities and differences across these use cases. Any given enterprise is increasingly likely to encounter one or more of them, and that realization is driving interest in system architectures that address the big data analytics problem in a general way. Please study the following table.

[Table: the big data analytic use cases mapped against the system requirements (columns) each one imposes]

The sheer density of this table makes it clear that systems supporting big data analytics have to look very different from the classic relational database systems of the 1980s and 1990s. The original RDBMSs were not built to handle any of the requirements represented as columns in this table!

Big data analytics system requirements

Before discussing the exciting new technical and architectural developments of the 2010s, let’s summarize the overall requirements for supporting big data analytics, keeping in mind that we are not requiring a single system or a single vendor’s technology to provide a blanket solution for every use case. From the perspective of 2011, we have the luxury of standing back from all the use cases gathered over the last few years, and we are now in a position to state the requirements with some confidence.

The development of big data analytics has reached a point where it needs an overall mission statement and identity independent of a list of use cases. Many of us have lived through earlier instantiations of advanced analytics that went by the names of advanced statistics, artificial intelligence and data mining. None of these earlier waves became a coherent theme that transcended the individual examples, as compelling as those examples were.

Here is an attempt to step back and define the characteristics of big data analytics at the highest level. In the following, the term “UDF” is used in the broadest sense of any user-defined function, program, or algorithm that may appear anywhere in the end-to-end analysis architecture.

In the coming 2010s decade, the analysis of big data will require a technology or combination of technologies capable of:

  • scaling to easily support petabytes (thousands of terabytes) of data
  • being distributed across thousands of processors, potentially geographically dispersed and potentially heterogeneous
  • subsecond response time for highly constrained standard SQL queries
  • embedding arbitrarily complex user-defined functions (UDFs) within processing requests
  • implementing UDFs in a wide variety of industry-standard procedural languages
  • assembling extensive libraries of reusable UDFs spanning most or all of the use cases
  • executing UDFs as “relation scans” over petabyte-sized data sets in a few minutes (a minimal sketch of such a scan follows this list)
  • supporting a wide variety of data types growing to include images, waveforms, arbitrarily hierarchical data structures, and data bags
  • loading data to be ready for analysis, at very high rates, at least gigabytes per second
  • integrating data from multiple sources during the load process at very high rates (GB/sec)
  • loading data before declaring or discovering its structure
  • executing certain “streaming” analytic queries in real time on incoming load data
  • updating data in place at full load speeds
  • joining a billion row dimension table to a trillion row fact table without pre-clustering the dimension table with the fact table
  • scheduling and executing complex multi-hundred-node workflows
  • being configured without being subject to a single point of failure
  • failover and process continuation when processing nodes fail
  • supporting extreme mixed workloads, including thousands of geographically dispersed on-line users and programs executing requests ranging from ad hoc queries to strategic analysis, all while loading data in batch and streaming fashion
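To make the UDF-as-relation-scan requirement concrete, here is a minimal sketch in Python of applying a user-defined function across the partitions of a relation in parallel. Every name in it (sessionize, scan_partition, the toy partitions) is illustrative and assumes nothing about any particular vendor’s UDF framework; a real system would distribute the scan across thousands of nodes rather than local processes.

```python
# Minimal sketch: applying a user-defined function (UDF) as a "relation scan".
# All names are illustrative, not tied to any vendor's UDF framework.
from multiprocessing import Pool

def sessionize(row):
    """Hypothetical UDF: derive a coarse session bucket from a click timestamp."""
    return (row["user_id"], row["ts"] // 1800)   # 30-minute buckets

def scan_partition(partition):
    """Apply the UDF to every row of one physical partition of the relation."""
    return [sessionize(row) for row in partition]

if __name__ == "__main__":
    # Toy partitions standing in for the segments of a petabyte-scale fact table.
    partitions = [
        [{"user_id": 1, "ts": 1000}, {"user_id": 1, "ts": 2900}],
        [{"user_id": 2, "ts": 5000}],
    ]
    with Pool() as pool:
        results = pool.map(scan_partition, partitions)  # one scan per partition
    print([r for part in results for r in part])
```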

Two architectures have emerged to address big data analytics: extended RDBMS, and MapReduce/Hadoop. These architectures are being implemented as completely separate systems and in various interesting hybrid combinations involving both architectures. We will start by discussing the architectures separately.

Extended relational database management systems

All of the major relational database management system vendors are adding features to address big data analytics from a solid relational perspective. The two most significant architectural developments have been the takeover of the high end of the market by massively parallel processing (MPP) and the growing adoption of columnar storage. When MPP and columnar storage techniques are combined, a number of the system requirements in the above list can start to be addressed, including:

  • scaling to support exabytes (thousands of petabytes) of data
  • being distributed across tens of thousands of geographically dispersed processors
  • subsecond response time for highly constrained standard SQL queries
  • updating data in place at full load speeds
  • being configured without being subject to a single point of failure
  • failover and process continuation when processing nodes fail

Additionally, RDBMS vendors are adding some complex user-defined functions (UDFs) to their syntax, but the kind of general-purpose procedural language computing required by big data analytics is not being satisfied in relational environments at this time.

In a similar vein, RDBMS vendors are allowing complex data structures to be stored in individual fields. These kinds of embedded complex data structures have been known as “blobs” for many years. It is important to understand that relational databases have a hard time providing general support for interpreting blobs, since blobs do not fit the relational paradigm. An RDBMS does provide some value by hosting the blobs in a structured framework, but much of the complex interpretation and computation on the blobs must be done with specially crafted UDFs or BI application-layer clients. Blobs are related to the “data bags” discussed elsewhere in this paper; see the section entitled Data structures should be declared at query time.
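As a concrete illustration of the blob point, the following sketch uses SQLite purely as a stand-in RDBMS: the engine happily hosts an opaque JSON-encoded waveform, but any computation on it has to come from application code or a hand-registered UDF. The table, payload format, and peak function are all hypothetical.

```python
# Minimal sketch: an RDBMS (SQLite here, purely for illustration) can host a
# blob, but interpreting it is left to application code or a crafted UDF.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, payload BLOB)")

# The blob is opaque to the relational engine: here, a JSON-encoded waveform.
waveform = json.dumps({"unit": "mV", "samples": [0.1, 0.4, 0.35, 0.2]})
conn.execute("INSERT INTO readings VALUES (?, ?)", (1, waveform))

# The engine can store and retrieve the blob, but computing on it (peak
# amplitude, say) happens outside the SQL layer, in the BI client...
payload = conn.execute(
    "SELECT payload FROM readings WHERE sensor_id = 1").fetchone()[0]
print(max(json.loads(payload)["samples"]))

# ...or via a specially crafted UDF registered with the engine, which is the
# flavor of extension the RDBMS vendors are adding.
conn.create_function("peak", 1, lambda p: max(json.loads(p)["samples"]))
print(conn.execute("SELECT peak(payload) FROM readings").fetchone()[0])
```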

MPP implementations have never satisfactorily addressed the “big join” issue, in which a billion-row dimension table must be joined to a trillion-row fact table without resorting to clustered storage. The big join crisis occurs when an ad hoc constraint is placed against the dimension table, resulting in a potentially very large set of dimension keys that must be physically shipped to every one of the physical segments of the trillion-row fact table stored separately across the MPP system. Since the matching dimension keys are scattered randomly across the separate segments of the trillion-row fact table, it is very hard to avoid a lengthy download of the very large constrained dimension to every one of the fact table storage partitions. To be fair, the MapReduce/Hadoop architecture has not been able to address the big join problem either.
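The following toy sketch, at a scale of thousands of rows rather than billions, is meant only to illustrate the shape of the problem: once an ad hoc constraint is applied to the dimension, the full set of qualifying keys has to be made available to every fact table partition, because matching fact rows can live anywhere. All table contents and the constraint are invented for the example.

```python
# Toy illustration of the "big join" problem on an MPP layout.
dimension = {key: {"brand": "acme" if key % 3 == 0 else "other"}
             for key in range(1_000)}            # stand-in for a billion rows

fact_partitions = [
    [{"product_key": k, "sales": k * 0.1} for k in range(p, 1_000, 4)]
    for p in range(4)                            # stand-in for thousands of segments
]

# Step 1: constrain the dimension; the qualifying key set can still be huge.
qualifying_keys = {k for k, row in dimension.items() if row["brand"] == "acme"}

# Step 2: every partition needs the full qualifying key set, since its fact
# rows reference keys from anywhere in the dimension -- this is the costly
# "download to every segment" step described above.
total = sum(row["sales"]
            for partition in fact_partitions      # key set shipped to each one
            for row in partition
            if row["product_key"] in qualifying_keys)
print(round(total, 1))
```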

Columnar data storage fits the relational paradigm, and especially dimensionally modeled databases, very well. Besides the significant advantage of high compression of sparse data, columnar databases allow a very large number of columns compared to row-oriented databases, and place little overhead on the system when columns are added to an existing schema. The most significant Achilles’ heel, at least in 2011, is the slow loading speed of data into the columnar format. Although impressive load speed improvements are being announced by columnar database vendors, they have still not achieved the gigabytes-per-second requirement listed above.
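A minimal sketch of the columnar idea, using plain Python structures rather than any real columnar engine: pivoting rows into one list per column makes a sparse column collapse under simple run-length encoding, and adding a new column touches nothing that already exists. The tables and encoder are illustrative only.

```python
# Minimal sketch of why columnar layouts suit wide, sparse dimensional schemas.
row_store = [
    {"order_id": 1, "region": "US", "promo_code": None},
    {"order_id": 2, "region": "US", "promo_code": None},
    {"order_id": 3, "region": "EU", "promo_code": "SPRING"},
]

# Pivot to columnar: one contiguous list per column.
column_store = {col: [row[col] for row in row_store] for col in row_store[0]}

def run_length_encode(values):
    """Cheap stand-in for the per-column compression a columnar engine applies."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

print(run_length_encode(column_store["promo_code"]))  # the sparse column collapses

# Adding a column never rewrites existing data, which is why columnar schemas
# tolerate very wide, evolving designs.
column_store["loyalty_tier"] = [None] * len(column_store["order_id"])
```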

The standard RDBMS architecture for implementing an enterprise data warehouse based on dimensional modeling principles is simple and well understood, as shown in Figure 1. Recall that throughout this white paper, the EDW is defined in the comprehensive sense to include all back room and front room processes including ETL, data presentation, and BI applications.

Figure 1. The standard RDBMS-based architecture for an enterprise data warehouse. Source: The Data Warehouse Lifecycle Toolkit, 2nd edition, Kimball et al. (2008).

In this standard EDW architecture the ETL system is a major component that sits between the source systems and the presentation servers responsible for exposing all data to business intelligence applications. In this view, the ETL system adds significant value by cleaning, conforming, and arranging the data into a series of dimensional schemas, which are then stored physically in the presentation server. A crucial element of this architecture is the preparation of conformed dimensions in the ETL system, which serve as the basis of integration for the BI applications. It is the strong conviction of this author that deferring the building of the dimensional structures, and the work of integration, until query time is the wrong architecture. Such a “deferred computation” approach requires an unduly expensive query optimizer to correctly query complex non-dimensional models every time a query is presented, and calculating integration at query processing time generally requires complex application logic in the BI tools, which may also have to be executed for every query.
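To illustrate what preparing conformed dimensions in the ETL system means in the simplest possible terms, here is a sketch that merges customer records from two hypothetical sources into a single conformed dimension keyed by a surrogate key. The matching rule (a shared email address) and all field names are invented for the example, not a prescribed design.

```python
# Minimal sketch of conforming a dimension during ETL rather than at query time.
# Source-specific customer records are matched to one conformed row, so every
# downstream BI query can join on the same surrogate key.
crm_customers = [{"crm_id": "C-17", "email": "ann@example.com", "name": "Ann Li"}]
web_customers = [{"web_id": 904,    "email": "ann@example.com", "segment": "gold"}]

conformed, by_email, next_key = [], {}, 1

def conform(email, attrs):
    """Merge one source record into the conformed customer dimension."""
    global next_key
    if email not in by_email:
        by_email[email] = {"customer_key": next_key, "email": email}
        conformed.append(by_email[email])
        next_key += 1
    by_email[email].update(attrs)            # enrich the conformed row
    return by_email[email]["customer_key"]

for c in crm_customers:
    conform(c["email"], {"name": c["name"], "crm_id": c["crm_id"]})
for w in web_customers:
    conform(w["email"], {"segment": w["segment"], "web_id": w["web_id"]})

print(conformed)   # one integrated row, ready for the presentation server
```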

The extended RDBMS architecture to support big data analytics preserves the standard architecture with a number of important additions, shown below in Figure 2 with large arrows:

 

Figure 2. The extended RDBMS-based architecture for an enterprise data warehouse.

The charm of the extended RDBMS approach to big data analytics is that the high-level enterprise data warehouse architecture is not materially changed by the introduction of new data structures, a growing library of specially crafted user-defined functions, or procedural language-based programs acting as powerful BI clients. The major RDBMS players are able to marshal their enormous legacy of millions of lines of code, powerful governance capabilities, and system stability built over decades of serving the marketplace.

However, it is the opinion of this author that extended RDBMS systems cannot be the only solution for big data analytics. At some point, tacking non-relational data structures and non-relational processing algorithms onto the basic, coherent RDBMS architecture becomes unwieldy and inefficient. The Swiss Army knife analogy comes to mind. Another analogy closer to the topic is the programming language PL/1. Originally designed as an overarching, multipurpose, powerful programming language for all forms of data and all applications, it ultimately became a bloated and sprawling corpus that tried to do too many things in a single language. Since the heyday of PL/1 there has been a wonderful evolution of more narrowly focused programming languages, with many new concepts and features that simply could not be tacked onto PL/1 after a certain point. Relational database management systems do so many things so well that they are in no danger of suffering the same fate as PL/1. But the big data analytics space is growing so rapidly, and in such exciting and unexpected new directions, that a lighter-weight, more flexible, and more agile processing framework operating alongside RDBMS systems may be a reasonable alternative.

On to Part III!
