Data Cleaning Steps and Techniques

 

 

- Overview

In machine learning, data scientists broadly agree that better data matters more than more powerful algorithms, because a model can only perform as well as the data it was trained on. If you train a model on bad data, the results are not just untrustworthy; they can be actively harmful to your organization.

The following six-step data cleaning process will help make sure your data is ready to go.

  • Step 1: Remove irrelevant data
  • Step 2: Deduplicate your data
  • Step 3: Fix structural errors
  • Step 4: Deal with missing data
  • Step 5: Filter out data outliers
  • Step 6: Validate your data

 

- Step 1. Remove Irrelevant Data

First, you need to figure out which analyses you will be running and what your downstream needs are. What question do you want to answer or what problem do you want to solve?

Take a close look at your data to see what's relevant and what you might not need. Filter out data or observations that are not relevant to downstream requirements.

For example, if you're analyzing SUV owners, but your dataset contains data on sedan owners, this information is irrelevant to your needs and will only skew your results. 

You should also consider removing hashtags, URLs, emojis, HTML tags, and similar noise, unless they need to be part of your analysis.
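As a rough illustration, here is a minimal sketch in Python with pandas; the owners DataFrame, its column names, and the regex patterns are hypothetical examples, not part of the original text.

    # Hypothetical example: keep only SUV owners and strip text noise.
    import re
    import pandas as pd

    owners = pd.DataFrame({
        "vehicle_type": ["SUV", "sedan", "SUV"],
        "comment": ["Love it! https://example.com", "Too small #sedan", "Great <b>car</b>"],
    })

    # Keep only the observations relevant to the downstream question (SUV owners).
    suv_owners = owners[owners["vehicle_type"] == "SUV"].copy()

    # Strip URLs, hashtags, and HTML tags from free-text fields.
    def strip_noise(text: str) -> str:
        text = re.sub(r"https?://\S+", "", text)  # URLs
        text = re.sub(r"#\w+", "", text)          # hashtags
        text = re.sub(r"<[^>]+>", "", text)       # HTML tags
        return text.strip()

    suv_owners["comment"] = suv_owners["comment"].map(strip_noise)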

 

- Step 2. Deduplicate Your Data

If you collect data from multiple sources or departments, analyze scraped data, or receive multiple surveys or customer responses, you will often encounter duplicate records.

Duplicate records slow down analysis and require more storage space. More importantly, however, if you train a machine learning model on a dataset with duplicate results, the model may give more weight to the duplicates, depending on how many times they are replicated. Therefore, they need to be removed to obtain balanced results. 

Even simple data cleaning tools can help deduplicate data, because duplicate records are easy to detect automatically.
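For instance, here is a minimal deduplication sketch with pandas; the responses DataFrame and its columns are hypothetical.

    import pandas as pd

    responses = pd.DataFrame({
        "email": ["a@x.com", "a@x.com", "b@x.com"],
        "score": [4, 4, 5],
    })

    # Drop rows that are exact duplicates across every column.
    deduped = responses.drop_duplicates()

    # Or treat rows sharing a key (e.g. email) as duplicates, keeping the first.
    deduped_by_key = responses.drop_duplicates(subset=["email"], keep="first")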

 

- Step 3. Fix Structural Errors

Structural errors include spelling mistakes, inconsistent naming conventions, incorrect capitalization, and incorrect word usage. These can affect analysis because, while they may be obvious to a human, most machine learning applications will not recognize them, and your analysis will be biased.

For example, if you are running an analysis on two datasets - one with a "Women" column and the other with a "women" column - you must normalize the headers. Likewise, dates, addresses, phone numbers, etc. need to be standardized so that computers can understand them.
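To illustrate, here is a minimal sketch of header and date normalization with pandas; the DataFrames and the date format are hypothetical.

    import pandas as pd

    df_a = pd.DataFrame({"Women ": [120]})
    df_b = pd.DataFrame({"women": [95]})

    # Normalize headers: trim whitespace and lowercase so the frames align.
    for df in (df_a, df_b):
        df.columns = df.columns.str.strip().str.lower()

    combined = pd.concat([df_a, df_b], ignore_index=True)

    # Standardize dates into one machine-readable format.
    dates = pd.Series(["01/06/2021", "01/07/2021"])
    parsed = pd.to_datetime(dates, format="%m/%d/%Y")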

 

- Step 4. Handling Missing Data

Scan your data or run it through a cleaning tool to find missing cells, gaps in text, unanswered survey questions, and so on. These gaps may be due to incomplete data or human error. You need to decide whether to discard everything related to the missing data (whole columns or rows, whole surveys, etc.), fill in individual cells manually, or leave them as-is.

The best course of action for dealing with missing data will depend on the analysis you want to perform and how you plan to preprocess the data. Sometimes you can even restructure the data so missing values don't affect your analysis.
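The common options look something like this in pandas; the survey DataFrame is a hypothetical example.

    import pandas as pd

    survey = pd.DataFrame({
        "age": [34, None, 29],
        "income": [52000, 61000, None],
    })

    # Option 1: discard rows (or columns) containing missing values.
    dropped = survey.dropna()

    # Option 2: fill in gaps, e.g. impute numeric columns with their median.
    imputed = survey.fillna(survey.median(numeric_only=True))

    # Option 3: leave the NaN values as-is and handle them downstream.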

 

- Step 5. Filter Out Data Outliers

Outliers are data points far outside the normal range that can skew your analysis too much in one direction. For example, if you average the test scores for your class and one student refuses to answer any questions, their 0% will make a big difference to the overall average. In this case, you should consider removing that data point entirely, which may give a result closer to the true class average.

However, just because one value is much smaller or larger than the others doesn't mean the final analysis is inaccurate, and the mere presence of an outlier doesn't mean it should be discarded. You must consider what kind of analysis you are running and how removing or retaining outliers will affect your results.
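As one common approach among several, here is a sketch that flags outliers using the 1.5x interquartile range rule of thumb; the scores are hypothetical.

    import pandas as pd

    scores = pd.Series([78, 85, 91, 88, 0, 82])

    # Flag values outside 1.5x the interquartile range.
    q1, q3 = scores.quantile([0.25, 0.75])
    iqr = q3 - q1
    within_range = scores.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    filtered = scores[within_range]
    print(scores.mean(), filtered.mean())  # compare averages with and without the 0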

 

- Step 6. Verify Your Data

Data validation is the final data cleaning technique: it confirms that your data is high quality, consistent, and well-formed for downstream processes. Ask yourself:

  • Do you have enough data to meet your needs?
  • Is it uniformly formatted in a design or language that your analysis tools can use?
  • Does your cleaned data already tend to support or contradict your working theory, even before formal analysis?

Verify that your data is consistently structured and clean enough for your needs. Cross-check related data points to make sure there are no omissions or inaccuracies.

Machine learning and AI tools can be used to verify that your data is valid and ready to use. Once you've completed the data cleaning steps above, data wrangling techniques and tools can help automate the process.
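A minimal sketch of such validation checks in pandas might look like this; the DataFrame, column names, and thresholds are hypothetical.

    import pandas as pd

    df = pd.DataFrame({
        "email": ["a@x.com", "b@x.com"],
        "signup_date": pd.to_datetime(["2021-01-05", "2021-02-10"]),
        "age": [34, 29],
    })

    # Basic validation checks before handing the data downstream.
    assert len(df) >= 2, "not enough rows for the planned analysis"
    assert df["email"].is_unique, "duplicate keys remain"
    assert df["age"].between(0, 120).all(), "age outside plausible range"
    assert not df.isna().any().any(), "unhandled missing values"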

 

 

[More to come ...]

 

 

 