Data warehouse archiving has traditionally been a low-priority topic for data warehouse architects, but the landscape is changing rapidly. The passive, offline archive of carefully selected data sets is being replaced by an active, online archive of all possible data assets, including hyper-granular data previously not even considered for long-term retention.

Ralph Kimball, April 2, 2014
Technically, the big story is the continuing drop in the cost of online storage on spinning disk drives. It is not unusual to configure individual nodes of a local Hadoop cluster with 24 to 36 terabytes of disk storage, and the cost of cloud storage is dropping spectacularly these days with a price war going on between Google and Amazon.
Legally, the requirements for retaining data for very long periods are increasing. Intellectual property data, drug trial data, safety records, and financial records of all types need to be retained for decades. I was surprised when my father, an orthopedic surgeon, retired and was told that he had to maintain detailed patient treatment records for all former patients who had not yet turned 21 years of age. Since he had treated infants, that meant keeping these records for almost 21 years!
Operationally, moving all archiving to online spinning disk drives, and away from other “permanent” media such as CDs, DVDs, and tape systems, means avoiding all the discussions of whether those media will still be viable at various points in the future. I have an 8-inch floppy disk in my desk drawer that reminds me of this issue every time I look at it.
Migrate and Refresh
Keeping archived data continuously available online makes the tried-and-true migrate-and-refresh archiving strategy particularly simple. The idea behind migrate-and-refresh is to verify, every few years, that the data is physically available on modern media (migrate) and that the data can still be interpreted in a usable way (refresh).
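As a minimal sketch of what these two periodic checks might look like in practice, consider the hypothetical Python routines below. The manifest of recorded checksums, the function names, and the choice of JSON as the archived format are all illustrative assumptions, not part of any standard tooling: the migrate check confirms the bytes are still intact on the current media, and the refresh check confirms the files can still be parsed.

```python
import hashlib
import json
from pathlib import Path


def migrate_check(manifest: dict[str, str], root: Path) -> list[str]:
    """Migrate step: return files whose bytes no longer match the
    SHA-256 checksum recorded when they were archived."""
    damaged = []
    for rel_path, expected in manifest.items():
        data = (root / rel_path).read_bytes()
        if hashlib.sha256(data).hexdigest() != expected:
            damaged.append(rel_path)
    return damaged


def refresh_check(root: Path) -> list[str]:
    """Refresh step: return files that can no longer be interpreted
    (here, JSON stands in for whatever format was archived)."""
    unreadable = []
    for path in sorted(root.rglob("*.json")):
        try:
            json.loads(path.read_text())
        except (ValueError, UnicodeDecodeError):
            unreadable.append(str(path))
    return unreadable
```

Run every few years, a script like this turns migrate-and-refresh from a manual ritual into a routine scan; any file it flags gets migrated to fresh media or converted to a currently readable format.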
Traditional archiving has often meant that the data is stored in a more or less inaccessible location, to be restored only if there is a legitimate request. Obviously this creates a barrier and a delay in restoring the data, and the data cannot be used until the restore process is complete. An active archive, by contrast, not only serves legal and operational archiving requirements, but keeps the data continuously usable. When the barriers and delays in accessing the archived data go away, all sorts of analyses become feasible that would otherwise not be considered.
As the cost of spinning disk storage continues to drop, the whole notion of what can be archived changes. In the past, only carefully chosen subsets of the data were archived, and frequently the most atomic hyper-granular data was discarded. For example, in a communications switching network, each switch generates an enormous amount of detailed data that may not be archived. Similarly, every commercial airplane flight generates gigabytes of operational data. When the cost of active archiving of these data sets approaches zero, our thinking changes fundamentally. We can think of lots of reasons to keep this data.
Raw Data Formats
Finally, most data currently being collected does not start out as highly curated and well-behaved relational data. All of us are aware that the big data revolution embraces a much wider gamut of unstructured, semi-structured, and uniquely structured data. Rather than carefully preparing this data in advance for archiving, it makes much more sense to capture the data in its original formats, and then make the data continuously available for later analysis. Certainly we can expect ongoing advances in image processing and complex event processing (as two examples), where more sophisticated future analyses of the original raw data would be valuable.
In summary, the archiving landscape has already changed. In particular, the Hadoop open source project and the Hadoop Distributed File System (HDFS) have opened the door to many of these ideas. We’ll explore some of these ideas in future Design Tips.