
Bad Data is Hazardous

In a 2001 study, The Data Warehousing Institute (TDWI) estimated that poor-quality customer data costs U.S. businesses a staggering $611 billion a year in postage, printing, and staff overhead (Eckerson 2002). Low-quality data not only slows decision making but can also lead to incorrect decisions that are even more costly. Perfect data isn't a requirement for good decision making, but the higher the data quality, the better the results and the lower the risk. Too much time is spent cleaning data and not enough time truly analyzing it. Many projects ignore data quality in the planning phase and ultimately fail because of it.

Typically, 50-80% of the time spent on data work goes to finding, cleansing, and preparing data. According to a Forbes survey, 60% of a data scientist's time is spent cleaning and organizing data, with another 19% spent collecting it; nearly 80% of their time goes to getting data into a workable form. Paying these highly compensated professionals to do data clean-up is extremely costly to an organization, and in the same survey Forbes found that 76% of data scientists view that part of their work as the least enjoyable (Press 2016). In companies without proper data governance policies, many feel they are data janitors rather than data scientists. High-quality data would cut costs through time savings, increase productivity and employee satisfaction, and lead to better results. Poor-quality data, by contrast, creates missed opportunities and mistaken decisions, and bad decisions can erode consumer confidence in ways that greatly impact an organization's bottom line.

One of the highest-profile examples of poor data causing a disastrous loss of consumer confidence, on top of the time and money lost, is Apple Maps. Instead of building a solution of its own to compete in the emerging navigation maps market, Apple decided to buy data from 24 suppliers and mash it together (Cohan 2012). The data quality was terrible, and it was not properly cleansed and aligned before roll-out. The results were so bad that people looking for directions to Washington, DC's Dulles Airport were sent to a spot that "could get a driver arrested and possibly run over by a 747" (Cohan 2012). Apple CEO Tim Cook had to issue a formal apology and recommended that consumers use competitors' products until the bugs were ironed out. Years later, Apple Maps is still recovering from the revenue and market share lost in that disaster.

Apple was able to make great improvements by eventually owning the data it used, which let it implement a proper data governance system that made information easier to correct and update when needed. Data governance is often an overlooked aspect of new data systems, but it is critical for success. A good example of data governance is, instead of letting users type a city name into their address free-form, offering a drop-down list of cities in their selected state, or auto-matching the city from their ZIP code entry, as sketched below. This sounds simple, and it is, but these kinds of controls often aren't set up initially, and poor data quality becomes unmanageable. Apple also brought in human editors to correct errors when needed and to run validations on smaller datasets to ensure things were lining up properly. Using humans to clean up poor-quality data adds costs that better data governance at the start of a project would have made unnecessary.
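To make that ZIP-code control concrete, here is a minimal Python sketch of entry-time validation. The ZIP-to-city lookup table, field names, and the `validate_address` helper are illustrative assumptions for this post, not Apple's implementation; a real system would draw on an authoritative postal dataset.

```python
# Minimal sketch: derive the canonical city/state from the ZIP code instead of
# trusting free-form text. The lookup table below is a tiny hypothetical sample.
ZIP_TO_CITY = {
    "98101": ("Seattle", "WA"),
    "99201": ("Spokane", "WA"),
    "97201": ("Portland", "OR"),
}

def validate_address(city: str, state: str, zip_code: str) -> dict:
    """Return a cleaned address record, flagging anything a human editor should review."""
    match = ZIP_TO_CITY.get(zip_code)
    if match is None:
        # Unknown ZIP: keep the user's input but flag it rather than storing silently bad data.
        return {"city": city, "state": state, "zip": zip_code, "needs_review": True}
    canonical_city, canonical_state = match
    return {
        "city": canonical_city,    # overrides typos like "Seatle"
        "state": canonical_state,
        "zip": zip_code,
        # A state that disagrees with the ZIP code gets routed to human review.
        "needs_review": canonical_state != state.strip().upper(),
    }

print(validate_address("Seatle", "wa", "98101"))
# {'city': 'Seattle', 'state': 'WA', 'zip': '98101', 'needs_review': False}
```

The point of the sketch is that the correction happens at entry time, so downstream analysts never see the typo in the first place.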

Data quality is a critical piece of successful data projects. Proper data governance policies up front can greatly reduce data quality issues, saving time and money downstream. The less time spent cleaning data, the more time can be spent finding the next insights and innovations in it.

Author: Logan Callen

References

Cohan, Peter. 2012. "Apple Maps' Six Most Epic Fails." Forbes, September 27, 2012. Accessed February 18, 2020. https://www.forbes.com/sites/petercohan/2012/09/27/apple-maps-six-most-epic-fails/#28a3b755df9d

Eckerson, Wayne. 2002. "Data Warehousing Special Report: Data Quality and the Bottom Line." ADT Mag, May 1, 2002. Accessed February 18, 2020. https://adtmag.com/articles/2002/05/01/data-warehousing-special-report-data-quality-and-the-bottom-line_633729392210484545.aspx

Moreno, Hugo. 2017. "The Importance Of Data Quality — Good, Bad Or Ugly." Forbes, June 5, 2017. Accessed February 18, 2020. https://www.forbes.com/sites/forbesinsights/2017/06/05/the-importance-of-data-quality-good-bad-or-ugly/#7c339c5f10c4

Press, Gil. 2016. "Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says." Forbes, March 23, 2016. Accessed February 18, 2020. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#31d3c6d76f63
