Believe it or not, almost every company has some bad data somewhere in its databases. They probably have a report that has been ‘vetted’ as absolutely correct, yet presents an incorrect picture of the truth. As a former developer for a major transportation company, and trainer/consultant for an international software company, I’ve seen some bad stuff out there. And these weren’t ‘Mom & Pop’ shops, either. Many were Fortune 500 companies (5 in the Fortune 25), federal departments and agencies, and the military. The widespread existence of ‘dirty data’ wasn’t obvious to me at first. But as time went by, I realized that everyone had bad data. Everyone.

True Stories

When trying to build an OLAP cube from a bank’s data on car loans, I learned first-hand that there was more than one way to spell ‘Chevrolet’. The database held values like ‘Chevrolt’, ‘Cheverolay’, ‘chvy’, ‘Chevolet’, and more. And let’s not even talk about ‘Mitsubishi’! That same database had at least a dozen different ways to spell ‘Virginia Beach’; people typing into data capture screens messed up both the ‘Virginia’ and the ‘Beach’ parts. This was but the tip of the iceberg. There were no pick lists for input selections and no field validation in place. The business decided it did not want to fix the data, so I addressed the problem programmatically in the modeling tool, an all too common solution.
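I no longer have that modeling-tool logic, but the idea is easy to sketch in Python: a canonical list, an alias table for abbreviations, and fuzzy matching for the near-misses. The make names, aliases, and cutoff below are illustrative, not from the original project:

```python
from difflib import get_close_matches

# Canonical spellings (illustrative; the real list came from the business).
CANONICAL_MAKES = ["Chevrolet", "Mitsubishi", "Ford", "Toyota"]

# Abbreviations that fuzzy matching alone won't catch.
ALIASES = {"chvy": "Chevrolet", "chev": "Chevrolet"}

def normalize_make(raw: str) -> str:
    """Map a free-form vehicle make to its canonical spelling, or flag it."""
    cleaned = raw.strip()
    if cleaned.lower() in ALIASES:
        return ALIASES[cleaned.lower()]
    # Fuzzy matching catches near-misses like 'Chevrolt' and 'Chevolet'.
    match = get_close_matches(cleaned.title(), CANONICAL_MAKES, n=1, cutoff=0.7)
    return match[0] if match else f"UNRESOLVED: {raw}"

for raw in ("Chevrolt", "Cheverolay", "chvy", "Chevolet"):
    print(f"{raw!r} -> {normalize_make(raw)}")
```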

Another issue arose at a paper company whose software let employees enter their own product codes. Imagine the fun in trying to make sense of that! It took a lot of work (read: time and money) to track down what each code meant, and only then could mappings be built to produce a report that bore any resemblance to the true sales numbers.
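Once the research was done, the fix itself was mechanically simple: a lookup table from every code we could identify to a canonical product, with unknowns routed to a research queue rather than silently skewing the totals. A minimal sketch, with invented codes and products:

```python
# Illustrative lookup table; the real codes and products had to be
# reconstructed from interviews and old paperwork.
CODE_MAP = {
    "NP-20": "Newsprint, 20 lb",
    "news20": "Newsprint, 20 lb",   # same product, a different employee's code
    "KRAFT40": "Kraft paper, 40 lb",
}

def map_code(raw_code: str) -> str:
    """Translate an employee-invented code into a canonical product name."""
    try:
        return CODE_MAP[raw_code.strip()]
    except KeyError:
        # Unknown codes go to a research queue instead of polluting the report.
        return f"NEEDS RESEARCH: {raw_code}"

print(map_code("news20"))       # -> Newsprint, 20 lb
print(map_code("joes-special")) # -> NEEDS RESEARCH: joes-special
```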

There was a company that reported, among other things, distances from one place to another. The problem was that, due to the record structures involved, all of the distances were doubled when the cube was built.
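The usual culprit in cases like this is a fan trap: a measure stored at one grain gets joined to detail rows at a finer grain, so each value is counted once per detail row. A toy reconstruction in pandas, assuming (hypothetically) two leg records per route:

```python
import pandas as pd

# One row per route, but two leg records per route; a naive join fans out
# and every distance is counted twice.
routes = pd.DataFrame({"route_id": [1, 2], "miles": [100, 250]})
legs = pd.DataFrame({"route_id": [1, 1, 2, 2],
                     "leg": ["out", "back", "out", "back"]})

joined = routes.merge(legs, on="route_id")
print(joined["miles"].sum())  # 700 -- every distance doubled

# Fix: aggregate each measure at its own grain, not after the join.
print(routes["miles"].sum())  # 350 -- the true total
```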

The first company I worked for hired a vendor consultant to create a report for the senior executives. The report showed a revenue roll-up of around “XX” million, but I knew it should have been “X” billion. A misplaced decimal point in the consultant’s code would have torpedoed my career from the start had I not caught it.
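A cheap guard against this whole class of error is an order-of-magnitude check against an independently known figure, such as last year’s audited revenue. A sketch with invented numbers:

```python
def check_magnitude(report_total: float, known_ballpark: float) -> None:
    """Fail loudly when a total is wildly off a known ballpark figure.

    A 100x or 1000x decimal-placement error sails right past an eyeball
    review of a formatted report, but not past this comparison.
    """
    if not (known_ballpark / 2 <= report_total <= known_ballpark * 2):
        raise ValueError(
            f"Report total {report_total:,.0f} is nowhere near the "
            f"expected {known_ballpark:,.0f} -- check the code."
        )

try:
    # Invented figures: the report said tens of millions, reality was billions.
    check_magnitude(report_total=48_000_000, known_ballpark=4_800_000_000)
except ValueError as err:
    print(err)
```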

A state agency asked me to create a replacement report for them. When they ran the old and new reports in parallel (a very good practice), the results did not jibe. Of course, they thought the new report was defective. Looking into the matter, I discovered that it was, in fact, the old report that was incorrect.
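Parallel runs are far more useful when the comparison is mechanical rather than eyeballed. A toy sketch in pandas, with invented figures:

```python
import pandas as pd

# Toy parallel run: the same accounts through the old and the new report.
old = pd.DataFrame({"account": ["A", "B", "C"], "total": [100, 250, 75]})
new = pd.DataFrame({"account": ["A", "B", "C"], "total": [100, 250, 80]})

both = old.merge(new, on="account", suffixes=("_old", "_new"))
print(both[both["total_old"] != both["total_new"]])
# Account C disagrees. The diff says WHERE the reports differ, not WHICH
# one is right; trace mismatched rows back to the source data before
# assuming anything.
```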

Why hasn’t this been fixed yet?

In the past, a user would request a report from “MIS”. The programmer would recognize the bad data and programmatically fix it in the report. Since IT did not own the data, they could not correct it at the source. And since the end users got correct reports, they never realized how bad their data was.

Now that end users are empowered by BI tools that pull data from the database with no IT assistance, they can see the bad data firsthand. Since they are the business, they own the data, and they can (and certainly should) take steps to correct it. As a result, business decisions based on that data get better.

Lessons learned:

  • Bad data can lead to bad decisions, increased cost and complexity, and unhappy users
  • Follow good design practices:
    • never let data entry users create their own product codes
    • minimize the use of free-form text fields in data entry screens
    • provide either pick lists or check boxes for input
    • validate all data before committing a transaction to the database (see the sketch after this list)
  • Always check the accuracy of any report you are responsible for, especially when built by others
  • Don’t assume the old report is the correct one
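On that last design point, validation doesn’t require heavy machinery; even a small gate between the entry screen and the database commit would have prevented several of the stories above. A minimal sketch, with invented field names and rules:

```python
VALID_MAKES = {"Chevrolet", "Mitsubishi", "Ford", "Toyota"}

def validate_loan_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row may be committed."""
    problems = []
    if row.get("make") not in VALID_MAKES:
        problems.append(f"unknown make: {row.get('make')!r}")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        problems.append(f"bad amount: {amount!r}")
    return problems

row = {"make": "Chevrolt", "amount": 18500}
errors = validate_loan_row(row)
if errors:
    print("rejected:", errors)  # reject (or roll back) instead of committing
else:
    print("ok to commit")
```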