Executive Summary

Companies today face significant pressure to take advantage of Big Data; for many, it has become a competitive differentiator. They are making substantial investments in solutions and initiatives to collect and analyze the ever-growing volume of data available in the digital world. That same pressure, however, has led many companies to make decisions and buy tools that hamstring their ability to mine the data they collect. In this short paper we explain some of these critical pitfalls and how to avoid them.

The Peril

The drive to collect, analyze, and profit from these large data sets often leads companies to buy new tools without fully understanding the implications for their business processes. For example, a large telecommunications company started ingesting its operational and transactional data into several Hadoop “Data Lakes”. It did so for good operational reasons, including scalability, speed, and the simplicity of ingesting numerous disparate formats. Unfortunately, the traditional Data Management and Data Governance practices implemented around relational database management systems (RDBMS) were largely ignored: data stewardship, quality control, normalization against an enterprise data model (EDM), and consistent metadata were not in place.

  • Files were duplicated across multiple Hadoop clusters, making it difficult to define a single source of truth for the business community and wasting IT resources.
  • Multiple distributions of Hadoop prevented file sharing across the organization.
  • Inconsistent field definitions and limited format controls introduced instability in data processing jobs. Extract, Transform, and Load (ETL) jobs frequently failed because developers changed the data type or content of a specific field (see the sketch after this list for one way to catch this at ingestion).
  • There was no data provenance or lineage. It was nearly impossible to trace the origin of a field back to its source. This is critical in regulated industries like finance.
  • There was limited security and minimal access control auditing.
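
One way to reduce this kind of instability is to declare an explicit schema at ingestion rather than relying on inference. The sketch below is a minimal illustration, assuming PySpark and a hypothetical landing file and field names, not the telecommunications company's actual pipeline; with FAILFAST mode, a record that violates the declared types fails the job immediately instead of silently corrupting downstream ETL.

    # Minimal sketch: schema-enforced ingestion with PySpark.
    # The file path and field names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (
        StructType, StructField, LongType, StringType, TimestampType
    )

    spark = SparkSession.builder.appName("schema-enforced-ingest").getOrCreate()

    # Declare the expected field names and types up front.
    expected_schema = StructType([
        StructField("customer_id", LongType(), nullable=False),
        StructField("event_type", StringType(), nullable=False),
        StructField("event_ts", TimestampType(), nullable=True),
    ])

    # FAILFAST makes the job fail on the first record that does not match
    # the declared schema, so type drift is caught at ingestion time
    # rather than in a downstream ETL job.
    events = (
        spark.read
        .option("header", "true")
        .option("mode", "FAILFAST")
        .schema(expected_schema)
        .csv("/landing/customer_events.csv")
    )

    events.write.mode("overwrite").parquet("/lake/customer_events/")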

The result is what is colloquially called a “Data Swamp”. Without any data controls enforced prior to ingestion into the Hadoop cluster, the value of the data in the Data Lake slowly (or quickly!) degraded. Data processing jobs became prone to failure, and the business community lost much of its ability to leverage the data to make informed decisions or run predictive analytics.

The Antidote

The lesson learned is not to forgo traditional Data Governance and Data Management practices just because the underlying technologies in the data platform have changed. Whether you are using an RDBMS, Hadoop, NoSQL, or the next new technology, the data management disciplines and procedures that matured around RDBMSes still apply and should evolve to support the new platforms. CapTech recommends that organizations:

  • Have good Data Management practices in place prior to introducing Hadoop and NoSQL into the enterprise,
  • Expand the scope of the Information Architect role(s) to include ALL data resources, including Hadoop and NoSQL,
  • Define schemas for data in both the NoSQL and Hadoop ecosystems when possible,
  • Capture metadata about files before they are ingested into the Hadoop ecosystem (see the sketch following this list),
  • Identify Data Stewards for the Hadoop ecosystem and NoSQL data stores.
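
As a concrete illustration of the fourth recommendation, the sketch below captures basic provenance metadata for files staged in a landing directory before they are pushed into Hadoop. It uses only the Python standard library; the directory layout, catalog location, and metadata fields are assumptions for illustration, not a prescribed standard.

    # Minimal sketch: record provenance metadata before ingestion.
    # LANDING_DIR and CATALOG_FILE are hypothetical locations.
    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    LANDING_DIR = Path("/data/landing")
    CATALOG_FILE = Path("/data/catalog/ingest_log.jsonl")

    def describe(path: Path) -> dict:
        """Collect basic provenance metadata for one file prior to ingestion."""
        return {
            "file_name": path.name,
            "source_path": str(path),
            "size_bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            "captured_at": datetime.now(timezone.utc).isoformat(),
        }

    def catalog_landing_files() -> None:
        """Append one metadata record per landed file to a simple catalog."""
        CATALOG_FILE.parent.mkdir(parents=True, exist_ok=True)
        with CATALOG_FILE.open("a", encoding="utf-8") as catalog:
            for path in sorted(LANDING_DIR.glob("*")):
                if path.is_file():
                    catalog.write(json.dumps(describe(path)) + "\n")

    if __name__ == "__main__":
        catalog_landing_files()

Even a lightweight catalog like this gives Data Stewards a starting point for lineage: every file that lands in the lake has a recorded source, checksum, and timestamp.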

These steps will go a long way toward making sure your Data Lake does not turn into a Data Swamp.