Before you get started with a data-cleansing project, you have to be sure of what you want to accomplish and how you will accomplish it. Keeping these best practices in mind will give you a good conceptual footing.
- Take a holistic view of your data. Think about who will be using the results, not just the person doing the analysis.
- Increase controls on inputs. Make sure that only the cleanest data enters the system.
- Identify and resolve bad data. Stop bad data before it becomes problematic by choosing software that gives you that capability.
- Limit the sample size. With large datasets, oversized samples unnecessarily increase prep time and slow performance.
- Run spot checks. Catch errors before they can be replicated throughout the data; the sketch after this list shows how the last two practices combine.
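To make the last two practices concrete, here is a minimal pandas sketch that draws a fixed-size random sample and runs a few spot-check rules against it. The file name, columns, and validation rules are all hypothetical; substitute whatever your own quality plan defines.

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
df = pd.read_csv("customers.csv")

# Limit the sample size: spot-check a fixed-size random sample
# instead of scanning the full dataset on every pass.
sample = df.sample(n=min(1_000, len(df)), random_state=42)

# Run spot checks: simple validation rules that flag suspect rows.
issues = {
    "missing_email": sample["email"].isna(),
    "bad_age": ~sample["age"].between(0, 120),
    "duplicate_id": sample["customer_id"].duplicated(keep=False),
}

for name, mask in issues.items():
    print(f"{name}: {mask.sum()} of {len(sample)} sampled rows flagged")
```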
The next list is also a set of best practices, but it doubles as a step-by-step checklist for your cleansing project; a short code sketch after the list walks through the steps.
- Set up a quality plan before you begin
- Fill in missing values where they can be reliably inferred
- Remove rows whose missing values cannot be recovered
- Fix structural errors such as inconsistent types, stray whitespace, and mislabeled categories
- Reduce the data (deduplicate, drop irrelevant fields) so it is easier to handle
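Here is a rough pandas sketch of steps two through five, assuming a hypothetical orders.csv with made-up column names. What counts as "recoverable" or "irrelevant" depends entirely on your quality plan, so treat each line as a placeholder for your own rules.

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical input

# Fill in missing values that can be reasonably inferred
# (here: a numeric column imputed with its median).
df["quantity"] = df["quantity"].fillna(df["quantity"].median())

# Remove rows whose missing values cannot be recovered.
df = df.dropna(subset=["order_id", "customer_id"])

# Fix structural errors: normalize types and stray whitespace.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["status"] = df["status"].str.strip().str.lower()

# Reduce the data: drop exact duplicates and irrelevant columns.
df = df.drop_duplicates().drop(columns=["internal_notes"], errors="ignore")
```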
Let AI Do the Hard Work
Most large data-cleansing projects are now performed with Artificial Intelligence, or, to be more precise, Machine Learning. Here’s why.
The amount of data that organizations collect grows significantly every year. When cleansing data without the aid of Machine Learning, the data scientist or data analyst has to write a new rule for every kind of bad data the system encounters. Because new data (including bad data) enters the system at an ever-increasing rate, the data scientist is constantly playing catch-up.

Consider the scope of the problem. Data forms patterns, and bad data is found by recognizing anomalies in those patterns. The more data there is, the more complex the patterns become and the harder they are for humans to analyze. Some anomalies are simply never found unless the data scientist spends enormous amounts of time hunting for them among increasingly complex patterns.
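To make the pattern-and-anomaly idea concrete, here is a minimal sketch using scikit-learn's IsolationForest, one common off-the-shelf anomaly detector. The source doesn't name a specific algorithm, and the file and feature names here are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("transactions.csv")  # hypothetical input
features = df[["amount", "items", "account_age_days"]].fillna(0)

# The model learns the "shape" of normal records; rows that do not
# fit the learned patterns are scored as anomalies (-1).
detector = IsolationForest(contamination=0.01, random_state=0)
df["anomaly"] = detector.fit_predict(features)

suspects = df[df["anomaly"] == -1]
print(f"Flagged {len(suspects)} of {len(df)} rows for human review")
```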
Every minute the data scientist spends cleaning up data is a minute they are not using the data to make the organization more productive. And if they can’t catch up with the bad data, they are effectively running in place, like a hamster on a wheel. This problem is not as uncommon as you might think.
Machine Learning frees the data scientist from the trap. Instead of writing rule after rule, the data scientist builds a learning model that predicts matches, and as more data enters the system, Machine Learning keeps fine-tuning that model.
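As a sketch of what a "model to predict matches" can look like, here is a toy record-linkage classifier: each pair of records is turned into similarity features, and a model trained on labeled pairs predicts whether they refer to the same entity. Everything here (the schema, the features, the training pairs) is illustrative, not a prescribed implementation.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(a: dict, b: dict) -> list[float]:
    """Similarity features for a pair of records (hypothetical schema)."""
    return [
        SequenceMatcher(None, a["name"], b["name"]).ratio(),
        1.0 if a["zip"] == b["zip"] else 0.0,
    ]

# Labeled training pairs: 1 = same entity, 0 = different.
pairs = [
    ({"name": "Acme Corp", "zip": "10001"}, {"name": "ACME Corporation", "zip": "10001"}, 1),
    ({"name": "Acme Corp", "zip": "10001"}, {"name": "Apex Ltd", "zip": "94107"}, 0),
    ({"name": "Globex Inc", "zip": "60601"}, {"name": "Globex, Inc.", "zip": "60601"}, 1),
    ({"name": "Globex Inc", "zip": "60601"}, {"name": "Initech", "zip": "73301"}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = LogisticRegression().fit(X, y)

# As new data enters the system, newly labeled pairs are appended
# and the model is refit, so it keeps improving with volume.
candidate = ({"name": "ACME Corp.", "zip": "10001"},
             {"name": "Acme Corporation", "zip": "10001"})
print(model.predict_proba([pair_features(*candidate)])[0, 1])
```

A real system would use far richer features and far more labeled pairs, but the shape is the same: the human labels examples once, and the model, not a hand-written rule, decides the next million cases.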
With manual data cleaning using standard, rule-based programs, the problem gets worse as more data is added. With Machine Learning, more data makes the model better. The key is to focus the machine learning on systematically improving how it analyzes, rates, and utilizes data, so that it becomes the expert at correcting, updating, repairing, and improving data.
That puts the data scientist back to doing what they were hired to do.