Jim Damoulakis is chief technology officer at GlassHouse Technologies Inc., a consulting and IT services company in Southborough, Mass. In an interview with SearchDataManagement, Damoulakis shared his views on planning and managing data integration projects, including advice on dealing with big data challenges as part of integration initiatives.
Damoulakis warned that data integration efforts are complicated by the addition of external sources of information, such as demographic data and text-based data collected from social networks. In addition to ratcheting up the technical challenges of integrating data, he said external information can create data quality, security and privacy issues for IT managers and integration teams. Excerpts from the interview follow:
SearchDataManagement: What are the prerequisites for establishing data integration processes that can handle large data sets from different sources, including Hadoop clusters and other big data systems?
Jim Damoulakis: Really, the biggest prerequisite is something that everyone should do: have a basic understanding of what you’re trying to accomplish. That means everything from the service levels you want to provide to the things you’re trying to integrate and a sense of the business outcomes you need to achieve. Can you define the audience [of end users] you’re trying to satisfy and what their needs are? This is something you need to do at least at a fairly high level. From a planning standpoint, there are underlying things to think about, too, once you get more into the specifics of the kinds of data that you need to focus on. But, in essence, you need to have a good sense of your goal.
Once you have that general understanding and direction, what comes next?
Damoulakis: You might need to adjust as you go along. If your focus is on integrating disparate data for multiple purposes, that’s going to shape things; it matters, for instance, whether your focus is on market analysis, security analysis, demographic analysis or some other type of need. And, of course, the data sources that are being [integrated] are critical, whether it’s structured data in a database, unstructured data that resides in text fields or some kind of externally generated data. It’s also important, especially with external data, to determine how much you trust the data. With external data where you don’t have much insight into quality, that could influence what sort of data cleansing or normalization might need to be done and also what kinds of tools you will need as part of the integration.
Where should you start? What kind of team is needed and what kind of management involvement or commitment?
Damoulakis: Part of it depends on what the organization has done in the past. Many larger organizations have data warehouses and have a lot of experience with [integrating data for] analytics. Others may be starting from scratch. Even if you have a data warehouse, if you are now dealing with a new type of data, that is going to impact how integration is approached. There may be a learning curve; you may need to bring in outside expertise. From a team standpoint, you need to convene groups that represent who the data users are, who is responsible for owning and managing the data and who is responsible for certain infrastructure, such as storage. You may need to have security people involved and also database and data management people. It could be all of those or some subset of them.
Is regulatory compliance also an issue that needs to be addressed as part of a data integration program involving large volumes of data and corresponding big data challenges?
Damoulakis: Potentially, yes. I was lumping compliance in with security, but that’s a great point. Depending on the types of data you’re handling, there can be, for example, privacy concerns along with those other specific security concerns. That’s especially important when you’re pulling in external types of data.
Is there a way to future-proof investments in tools for handling something new, like big data integration with data warehouses, so they provide value for the long term?
Damoulakis: Probably one of the biggest concerns is how to get value from your investments. The more traditional products and more established vendors, I think, all have pretty well-defined and reasonable roadmaps for leading you into the future. They are pretty well on top of the fact that unstructured or new data types need to be addressed. On the other hand, if you are considering a startup as a vendor, stick as much as possible with standards. They may not be formal standards, because so much is new, but at least de facto standards. There are technologies or classes of technologies, such as Hadoop, where there may be many versions supported by different vendors. But, as with the widespread adoption of Linux, going with a widely adopted technology like that is a pretty safe path.