Folks, I am going to present a text by Ralph Kimball in which he discusses large analytic environments. The text is long, so I will split it into several parts. Here is the first.
Before that, though, it doesn’t hurt to introduce Dr. Ralph Kimball to those who don’t know him:
Ralph Kimball founded the Kimball Group. Since the mid 1980s, he has been the data warehouse/business intelligence (DW/BI) industry’s thought leader on the dimensional approach and trained more than 10,000 IT professionals. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented the Star workstation at Xerox’s Palo Alto Research Center (PARC). Ralph has his Ph.D. in Electrical Engineering from Stanford University.
In this white paper, we describe the rapidly evolving landscape for designing an enterprise data warehouse (EDW) to support business analytics in the era of “big data.” We describe the scope and challenges of building and evolving a very stable and successful EDW architecture to meet new business requirements. These include extreme integration, semi-structured and unstructured data sources, petabytes of behavioral and image data accessed through MapReduce/Hadoop as well as massively parallel relational databases, and then structuring the EDW to support advanced analytics. This paper provides detailed guidance for designing and administering the necessary processes for deployment. This white paper has been written in response to a lack of specific guidance in the industry as to how the EDW needs to respond to the big data analytics challenge, and what design elements are needed to support these new requirements.
What is big data? Its bigness is actually not the most interesting characteristic. Big data is structured, semi-structured, unstructured, and raw data in many different formats, in some cases looking totally different from the clean scalar numbers and text we have stored in our data warehouses for the last 30 years. Much big data cannot be analyzed with anything that looks like SQL. But most important, big data is a paradigm shift in how we think about data assets: where we collect them, how we analyze them, and how we monetize the insights from the analysis. The big data revolution is about finding new value within and outside conventional data sources. An additional approach is needed because the software and hardware environments of the past have not been able to capture, manage, or process the new forms of data within reasonable development times or processing times. We are challenged to reorganize our information management landscape to extend a remarkably stable and successful EDW architecture to this new era of big data analytics.
In reading this white paper please bear in mind that the consistent view of this author has always been that the “data warehouse” comprises the complete ecosystem for extracting, cleaning, integrating and delivering data to decision makers, and therefore includes the extract-transform-load (ETL) and business intelligence (BI) functions considered as outside of the data warehouse by more conservative writers. This author has always taken the view that data warehousing has a very comprehensive role in capturing all forms of enterprise data, and then preparing that data for the most effective use by decision-makers all across the enterprise. This white paper takes the aggressive view that the enterprise data warehouse is on the verge of a very exciting new set of responsibilities. The scope of the EDW will increase dramatically.
Also, in this white paper, although we consistently use the term “ETL” to describe the movement of data within the enterprise data warehouse, the conventional use of this term does not do justice to the much larger responsibility of moving data across networks and between systems and between profoundly different processes in the world of big data analytics. ETL is a portion of a much larger technology called data integration (DI). Since we have used ETL consistently in our books and classes for many years, we will keep that terminology in this paper, bearing in mind that ETL is meant in the larger sense of DI.
This white paper stands back from the marketplace as it exists in early 2011 to highlight the clearly emerging new trends brought by the big data revolution. And a revolution it is. As James Markarian, Informatica’s Executive Vice President and Chief Technology Officer, remarked: “the database market has finally gotten interesting again.” Because many of the new big data tools and approaches are version 1 or even version 0 developments, the landscape will continue to change rapidly. However, there is growing awareness in the marketplace that new kinds of analysis are possible and that key competitors, especially e-commerce enterprises, are already taking advantage of the new paradigm. This white paper is intended to be a guide to help business intelligence, data warehousing and information management professionals and management teams understand and prepare for big data as a complementary extension to their current EDW architecture.
Data is an asset on the balance sheet
Enterprises increasingly recognize that data itself is an asset that should appear on the balance sheet in the same way that traditional assets from the manufacturing age, such as equipment and land, have always appeared. There are several ways to determine the value of the data asset, including:
cost to produce the data
cost to replace the data if it is lost
revenue or profit opportunity provided by the data
revenue or profit loss if data falls into competitors’ hands
legal exposure from fines and lawsuits if data is exposed to the wrong parties
But more important than the data itself, enterprises have shown that insights from data can be monetized. When an e-commerce site detects an increase in favorable click throughs from an experimental ad treatment, that insight can be taken to the bottom line immediately. This direct cause-and-effect is easily understood by management, and an analytic research group that consistently demonstrates these insights is looked upon as a strategic resource for the enterprise by the highest levels of management. This growth in business awareness of the value of data-driven insights is rapidly spreading outward from the e-commerce world to virtually every business segment.
Data warehousing, of course, has been demonstrating the value of data-driven insights for at least 20 years. But until quite recently data warehousing has been focused on historical transaction data. During the past decade from 2000 to 2009, three major seismic shifts occurred in data warehousing. The first, early in the decade, was the decisive introduction of low latency operational data into the data warehouse together with the existing historical data. Of course, many of these new operational data use cases benefited from real-time data, in some cases demanding instantaneous delivery. The second seismic shift growing increasingly throughout the decade was the gathering of customer behavior data, which not only included traditional transactions such as purchases and click throughs but added huge volumes of “sub transactions” that represented measurable events leading up to the transactions themselves. For example, all the webpage events a customer engaged in prior to the final transaction event became a record of customer behavior. “Good paths” through these webpage event histories gave lots of insight into productive (i.e., monetizable) customer behavior.
The third seismic event, which is gathering enormous momentum as we transition into the current decade, is the extraction of product preferences and customers’ sentiments from social media, especially the massive quantities of machine-generated unstructured data produced by the new business paradigms of dot-com companies. It is this final seismic shift that has pushed many enterprises into looking seriously at unstructured data for the first time, and asking “how on earth do we analyze this stuff?” The point here is not that unstructured data is some new thing recently discovered, but rather that the analysis of unstructured data has gone mainstream just recently.
Raising the curtain on big data analytics
Use cases for big data analytics
Big data analytics use cases are spreading like wildfire. Here is a set of use cases reported recently, including a benchmark set of “Hadoop-able” use cases proposed by Jeff Hammerbacher, Chief Scientist for Cloudera. Following these brief descriptions is a table summarizing the salient structure and processing characteristics of each use case. Note that none of these use cases can be satisfied with scalar numeric data, nor can any be properly analyzed by simple SQL statements. All of them can be scaled into the petabyte range and beyond with appropriate business assumptions.
Search ranking. All search engines attempt to rank the relevance of a webpage to a search request against all other possible webpages. Google’s PageRank algorithm is, of course, the poster child for this use case.
Ad tracking. E-commerce sites typically record an enormous river of data including every page event in every user session. This allows for very short turnaround of experiments in ad placement, color, size, wording, and other features. When an experiment shows that such a feature change in an ad results in improved click through behavior, the change can be implemented virtually in real time.
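As a rough illustration of how such an ad experiment might be evaluated, the sketch below compares the click-through rate of a control ad against an experimental treatment using a two-proportion z-score. The function names, the sample counts, and the choice of a z-test are assumptions made for this example; the white paper does not prescribe a particular statistical method.

```python
import math


def click_through_rate(clicks, impressions):
    """Fraction of impressions that resulted in a click."""
    return clicks / impressions if impressions else 0.0


def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-score for the difference in click-through rate (B minus A).

    Uses the pooled-proportion standard error. A value above roughly
    1.96 suggests treatment B beats A at the usual 5% level (two-sided).
    """
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


# Hypothetical counts: control ad versus an experimental ad treatment.
z = two_proportion_z(clicks_a=200, n_a=10_000, clicks_b=260, n_b=10_000)
print(f"control CTR={click_through_rate(200, 10_000):.3f}, z={z:.2f}")
```

In a real deployment the counts would come from the session event stream, and repeated looks at a running experiment would call for a more careful stopping rule than a single z-score.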
Location and proximity tracking. Many use cases add precise GPS location tracking, together with frequent updates, in operational applications, security analysis, navigation, and social media. Precise location tracking opens the door for an enormous ocean of data about other locations near the GPS measurement. These other locations may represent opportunities for sales or services.
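A minimal sketch of the proximity idea, assuming positions arrive as (latitude, longitude) pairs in degrees: the haversine formula gives the great-circle distance, and a simple filter returns the places within a radius of the GPS reading. The place names and the spherical-earth radius are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical-earth approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(position, places, radius_km):
    """Names of places within radius_km of position, in insertion order."""
    lat, lon = position
    return [name for name, (p_lat, p_lon) in places.items()
            if haversine_km(lat, lon, p_lat, p_lon) <= radius_km]


# Hypothetical places around a GPS reading at (0.0, 0.0).
places = {"cafe": (0.001, 0.0), "warehouse": (1.0, 0.0)}
print(nearby((0.0, 0.0), places, radius_km=1.0))
```

At big data scale a linear scan like this would be replaced by a spatial index, but the distance test itself stays the same.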
Causal factor discovery. Point-of-sale data has long been able to show us when the sales of a product go sharply up or down. But searching for the causal factors that explain these deviations has been, at best, a guessing game or an art form. The answers may be found in competitive pricing data, competitive promotional data including print and television media, weather, holidays, national events including disasters, and virally spread opinions found in social media. See the next use case as well.
Social CRM. This use case is one of the hottest new areas for marketing analysis. The Altimeter Group has described a very useful set of key performance indicators for social CRM that include share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact. The calculation of these KPIs involves in-depth trolling of a huge array of data sources, especially unstructured social media.
Document similarity testing. Two documents can be compared to derive a metric of similarity. There is a large body of academic research and tested algorithms, for example latent semantic analysis, that is just now finding its way to driving monetized insights of interest to big data practitioners. For example, a single source document can be used as a kind of multifaceted template to compare against a large set of target documents. This could be used for threat discovery, sentiment analysis, and opinion polls. For example: “find all the documents that agree with my source document on global warming.”
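To make the "source document as template" idea concrete, here is a small sketch that ranks target documents by cosine similarity of raw term counts. Note that this is a much simpler stand-in than the latent semantic analysis mentioned above, and the tokenizer and function names are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts; a deliberately naive tokenizer."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(source, targets):
    """Targets paired with their similarity to source, best match first."""
    src = term_counts(source)
    scored = [(cosine_similarity(src, term_counts(t)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


source = "global warming is real and urgent"
targets = ["the stock market fell today", "global warming is real"]
for score, doc in rank_by_similarity(source, targets):
    print(f"{score:.2f}  {doc}")
```

Word overlap measures shared vocabulary, not agreement; finding documents that "agree" with a source would need the richer semantic techniques the text alludes to.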
Genomics analysis: e.g., commercial seed gene sequencing. A few months ago the cotton research community was thrilled by a genome sequencing announcement that stated in part “The sequence will serve a critical role as the reference for future assembly of the larger cotton crop genome. Cotton is the most important fiber crop worldwide and this sequence information will open the way for more rapid breeding for higher yield, better fiber quality and adaptation to environmental stresses and for insect and disease resistance.” Scientist Ryan Rapp stressed the importance of involving the cotton research community in analyzing the sequence, identifying genes and gene families and determining the future directions of research. (SeedQuest, Sept 22, 2010). This use case is just one example of a whole industry that is being formed to address genomics analysis broadly, beyond this example of seed gene sequencing.
Discovery of customer cohort groups. Customer cohort groups are used by many enterprises to identify common demographic trends and behavior histories. We are all familiar with Amazon’s cohort groups when they say other customers who bought the same book as you have also bought the following books. Of course, if you can sell your product or service to one member of a cohort group, then all the rest may be reasonable prospects. Cohort groups are represented logically and graphically as links, and much of the analysis of cohort groups involves specialized link analysis algorithms.
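The "customers who bought this also bought" pattern can be sketched as a co-purchase count over customer baskets. This is only a toy version of link analysis, with hypothetical customer and product names; production systems use far more specialized algorithms.

```python
from collections import Counter


def also_bought(purchases, item):
    """Count how often other items share a basket owner with `item`.

    `purchases` maps a customer name to the set of items that customer
    bought. Returns (item, count) pairs, most frequent first.
    """
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            counts.update(set(basket) - {item})
    return counts.most_common()


# Hypothetical purchase histories.
purchases = {
    "ann": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "cy": {"book_c"},
}
print(also_bought(purchases, "book_a"))
```

Each shared purchase acts as a link between two customers, which is why cohort discovery is naturally expressed as a graph problem.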
In-flight aircraft status. This use case as well as the following two use cases are made possible by the introduction of sensor technology everywhere. In the case of aircraft systems, the in-flight status of hundreds of variables on engines, fuel systems, hydraulics, and electrical systems is measured and transmitted every few milliseconds. The value of this use case lies not just in the engineering telemetry data that could be analyzed at some future point in time, but also in driving real-time adaptive control, fuel usage, part failure prediction, and pilot notification.
Smart utility meters. It didn’t take long for utility companies to figure out that a smart meter can be used for more than just the monthly readout that produces the customer’s utility bill. By drastically cranking up the frequency of the readouts to as much as one readout per second per meter across the entire customer landscape, many useful analyses can be performed including dynamic load-balancing, failure response, adaptive pricing, and longer-term strategies for incenting customers to utilize the utility more effectively (either from the customers’ point of view or the utility’s point of view!).
Building sensors. Modern industrial buildings and high-rises are being fitted with thousands of small sensors to detect temperature, humidity, vibration, and noise. Like the smart utility meters, collecting this data every few seconds 24 hours per day allows many forms of analysis including energy usage, unusual problems including security violations, component failure in air-conditioning and heating systems and plumbing systems, and the development of construction practices and pricing strategies.
Satellite image comparison. Images of regions of the earth are captured on every pass of certain satellites, at intervals typically separated by a small number of days. Overlaying these images and computing the differences allows the creation of hot spot maps showing what has changed. This analysis can identify construction, destruction, changes due to disasters like hurricanes and earthquakes and fires, and the spread of human encroachment.
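The overlay-and-difference step can be sketched in a few lines, assuming the two images are already aligned, equally sized grids of grayscale intensities. The threshold value and the nested-list representation are assumptions for this example; real imagery would need registration, calibration, and an array library.

```python
def hot_spot_map(before, after, threshold):
    """Mark cells whose intensity changed by more than `threshold`.

    `before` and `after` are equally sized 2-D lists of numbers, assumed
    to be aligned images of the same region from two satellite passes.
    """
    same_shape = len(before) == len(after) and all(
        len(row_b) == len(row_a) for row_b, row_a in zip(before, after)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return [
        [abs(pixel_a - pixel_b) > threshold
         for pixel_b, pixel_a in zip(row_b, row_a)]
        for row_b, row_a in zip(before, after)
    ]


# Hypothetical 2x2 images: one cell changes sharply between passes.
before = [[10, 10], [10, 10]]
after = [[10, 50], [12, 10]]
print(hot_spot_map(before, after, threshold=5))
```

The small change from 10 to 12 stays below the threshold, so only the cell that jumped to 50 is flagged as a hot spot.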
CAT scan comparisons. CAT scans are stacks of images taken as “slices” of the human body. Large libraries of CAT scans can be analyzed to facilitate the automatic diagnosis of medical issues and their prevalence.
Financial account fraud detection and intervention. Account fraud, of course, has immediate and obvious financial impact. In many cases fraud can be detected by patterns of account behavior, in some cases crossing multiple financial systems. For example, “check kiting” requires the rapid transfer of money back and forth between two separate accounts. Certain forms of broker fraud involve two conspiring brokers selling a security back-and-forth at ever increasing prices, until an unsuspecting third party enters the action by buying the security, allowing the fraudulent brokers to quickly exit. Again, this behavior may take place across two separate exchanges in a short period of time.
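As a rough sketch of pattern-based detection, the code below flags pairs of accounts that pass money back and forth rapidly, the signature described for check kiting. The record layout, the time window, and the reversal count are all hypothetical parameters chosen for illustration; this is a crude heuristic, not a real fraud model.

```python
from collections import defaultdict


def find_back_and_forth(transfers, window, min_reversals):
    """Flag account pairs with rapid direction reversals.

    `transfers` is an iterable of (timestamp, source, destination, amount).
    A reversal is two consecutive transfers between the same pair of
    accounts that go in opposite directions within `window` time units.
    Returns a sorted list of flagged (account, account) pairs.
    """
    by_pair = defaultdict(list)
    for timestamp, source, destination, _amount in transfers:
        by_pair[frozenset((source, destination))].append((timestamp, source))

    flagged = []
    for pair, events in by_pair.items():
        if len(pair) < 2:  # ignore transfers from an account to itself
            continue
        events.sort()
        reversals = sum(
            1
            for (t1, src1), (t2, src2) in zip(events, events[1:])
            if src1 != src2 and t2 - t1 <= window
        )
        if reversals >= min_reversals:
            flagged.append(tuple(sorted(pair)))
    return sorted(flagged)


# Hypothetical transfers: A and B ping-pong money; C pays D twice.
transfers = [
    (0, "A", "B", 900), (1, "B", "A", 950), (2, "A", "B", 1000),
    (3, "B", "A", 1050), (0, "C", "D", 40), (100, "C", "D", 40),
]
print(find_back_and_forth(transfers, window=2, min_reversals=3))
```

Because the accounts may live in separate financial systems, the transfer records would first have to be integrated into one stream, which is exactly the kind of cross-system problem the use case highlights.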
Computer system hacking detection and intervention. System hacking in many cases involves an unusual entry mode or some other kind of behavior that in retrospect is a smoking gun but may be hard to detect in real-time.
Online game gesture tracking. Online game companies typically record every click and maneuver by every player at the most fine-grained level. This avalanche of “telemetry data” allows fraud detection, intervention for a player who is getting consistently defeated (and therefore discouraged), offers of additional features or game goals for players who are about to finish a game and depart, ideas for new game features, and experiments for new features in the games. This can be generalized to television viewing. Your DVR box can capture remote control keystrokes, recording events, playback events, picture-in-picture viewing, and the context of the guide. All of this can be sent back to your provider.
Big science including atom smashers, weather analysis, space probe telemetry feeds. Major scientific projects have always collected a lot of data, but now the techniques of big data analytics are allowing broader access and much more timely access to the data. Big science data, of course, is a mixture of all forms of data: scalar, vector, complex structures, analog wave forms, and images.
“Data bag” exploration. There are many situations in commercial environments and in the research communities where large volumes of raw data are collected. One example might be data collected about structure fires. Beyond the predictable dimensions of time, place, primary cause of fire, and responding firefighters, there may be a wealth of unpredictable anecdotal data that at best can be modeled as a disorderly collection of name-value pairs, such as “contributing weather = lightning.” Another example would be the listing of all relevant financial assets for a defendant in a lawsuit. Again, such a list is likely to be a disorderly collection of name-value pairs, such as “shared real estate ownership = condominium.” The list of examples like this is endless. What they have in common is the need to encapsulate the disorderly collection of name-value pairs, which is generally known as a “data bag.” Complex data bags may contain both name-value pairs and embedded sub data bags. The challenge in this use case is to find a common way to approach the analysis of data bags when the content of the data may need to be discovered after the data is loaded.
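One way to picture a data bag is as a nested dictionary whose names are not known in advance. The sketch below, under that assumption, flattens embedded sub bags into dotted paths and then counts which names actually occur, which is one simple form of discovering content after load. The sample bags and the dotted-path convention are hypothetical.

```python
from collections import Counter


def flatten_bag(bag, prefix=""):
    """Yield (dotted_name, value) pairs, descending into embedded sub bags."""
    for name, value in bag.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            yield from flatten_bag(value, prefix=path + ".")
        else:
            yield path, value


def discover_names(bags):
    """Count how many bags mention each flattened name."""
    counts = Counter()
    for bag in bags:
        counts.update(path for path, _value in flatten_bag(bag))
    return counts


# Hypothetical structure-fire records with unpredictable attributes.
bags = [
    {"contributing weather": "lightning",
     "structure": {"type": "barn", "floors": 1}},
    {"contributing weather": "wind"},
]
print(discover_names(bags).most_common())
```

A name inventory like this could guide which attributes are common enough to promote into conventional dimensions and which should stay in the bag.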
The final two use cases are old and venerable examples that even predate data warehousing itself. But new life has been breathed into these use cases because of the exciting potential of ultra-atomic customer behavior data.
Loan risk analysis and insurance policy underwriting. In order to evaluate the risk of a prospective loan or a prospective insurance policy, many data sources can be brought into play, including payment histories, detailed credit behavior, employment data, and financial asset disclosures. In some cases the collateral for a loan or the insured item may be accompanied by image data.
Customer churn analysis. Enterprises concerned with churn want to understand the predictive factors leading up to the loss of a customer, including that customer’s detailed behavior as well as many external factors including the economy, life stage and other demographics of the customer, and finally real-time competitive issues.