In a Nutshell: Automating Data Cleaning.

To the junior data scientist, every new dataset seems to bring a fresh set of difficulties, and data cleaning can look impossible to automate. Decision-makers and stakeholders are often far removed from the issue and unaware of how much money is wasted on it. Even without seeing the difficulties firsthand, they may assume that automation is the answer; yet despite the significant cost, data cleaning rarely gets top priority in most enterprises.

Still, data cleaning involves a few dozen recurring problems at most: not a few hundred, not a few thousand. With a bit of coding, you can catalog them and address them all at once, in a reusable routine that you apply every time a fresh batch of data arrives. Here, I outline the principal problems and suggest solutions. Your data scientists will thank you for automating the tedious, repetitive parts of data cleaning; it can save a considerable amount of time.

Taking a Quick Look at Your Data.

Start by building a summary table that examines each feature independently: its type (numerical, categorical, text, or mixed) and its top 5 values along with their frequencies. The top values might reveal an incorrect or unassigned zip code such as 99999. Other unusual values to look out for include NaN (not a number), N/A, improper date formats, blank values, and special characters: accented characters, dollar signs, percentage signs, and the like can interfere with regular expressions and text parsing. For numerical features, compute the minimum, maximum, median, and other percentiles.
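As a rough sketch of what that profiling step might look like, assuming the data already sits in a pandas DataFrame (the function name and output layout are mine, not a fixed recipe):

```python
import pandas as pd

def profile_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build a one-row-per-feature summary: type, top values, missing count, percentiles."""
    rows = []
    for col in df.columns:
        s = df[col]
        top5 = s.value_counts(dropna=False).head(5)   # most frequent values with counts
        row = {
            "feature": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            "top5_values": dict(top5),                 # value -> frequency
        }
        if pd.api.types.is_numeric_dtype(s):
            row.update({
                "min": s.min(),
                "p25": s.quantile(0.25),
                "median": s.median(),
                "p75": s.quantile(0.75),
                "max": s.max(),
            })
        rows.append(row)
    return pd.DataFrame(rows)

# Example usage:
# summary = profile_features(pd.read_csv("mydata.csv"))
# print(summary.to_string())
```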

The next step is to examine feature interactions. Compute the cross-correlation between every pair of features and look for redundant or duplicate features that can be discarded. Both steps are simple to automate.
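A minimal sketch of that redundancy check with pandas; the 0.95 correlation cutoff is an arbitrary placeholder, and the correlation part only covers numeric columns:

```python
import pandas as pd

def find_redundant_features(df: pd.DataFrame, threshold: float = 0.95):
    """Flag exact-duplicate columns and highly correlated numeric column pairs."""
    # Exact duplicates: columns whose values are identical row by row.
    duplicates = []
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if df[a].equals(df[b]):
                duplicates.append((a, b))

    # Highly correlated numeric pairs (absolute Pearson correlation).
    corr = df.select_dtypes("number").corr().abs()
    correlated = [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if corr.loc[a, b] >= threshold
    ]
    return duplicates, correlated
```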

The Principal Problems.

The previous section touched on a few issues that are easy to spot if you look for them. I add more here, and wrap up with advice on how to handle them, once again through automation.

Look for values that fall outside the expected range. Before beginning your analysis, try to obtain the expected range for each field from your client.
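A minimal range check might look like the following sketch; the column names and bounds in EXPECTED_RANGES are illustrative and would come from the client:

```python
import pandas as pd

# Expected ranges agreed on with the client (illustrative values only).
EXPECTED_RANGES = {"age": (0, 120), "price": (0.0, 10_000.0)}

def flag_out_of_range(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where any monitored column falls outside its expected range."""
    mask = pd.Series(False, index=df.index)
    for col, (lo, hi) in EXPECTED_RANGES.items():
        if col in df.columns:
            mask |= (df[col] < lo) | (df[col] > hi)
    return df[mask]
```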

For encoded data, such as credit card numbers or ID fields, use checksums to validate values whenever practical.
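Credit card numbers, for example, carry a built-in check digit validated by the Luhn algorithm, so a quick pass can flag mistyped or truncated values. A small sketch:

```python
def luhn_valid(number: str) -> bool:
    """Validate a credit card number with the Luhn check-digit algorithm."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 2:
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

# Example (a commonly published test number):
# luhn_valid("4539 1488 0343 6467") -> True
```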

Search for identical or nearly identical keys or IDs. A single person may also be associated with two distinct IDs. These situations often point to errors in your data. A lookup table of typical mistakes can be helpful.
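One simple way to surface near-identical IDs is fuzzy string matching; here is a sketch using only the Python standard library, with an arbitrary similarity cutoff and made-up example IDs:

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicate_ids(ids, cutoff: float = 0.9):
    """Return pairs of distinct IDs whose string similarity exceeds the cutoff."""
    pairs = []
    for a, b in combinations(sorted(set(ids)), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= cutoff:
            pairs.append((a, b, round(ratio, 3)))
    return pairs

# Example with made-up IDs:
# near_duplicate_ids(["cust-00123", "cust-0123", "cust-99999"])
# -> [("cust-00123", "cust-0123", 0.947)]
```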

When possible, use pre-populated fields in web forms to gather data, rather than having users manually type information such as date, zip code, state, or city.

Check for mismatched or misaligned fields. This is a common occurrence in NLP problems, where data containing free text or URLs is processed and saved in CSV files before being uploaded into databases.
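A cheap automated check is to scan the raw file and flag rows whose field count differs from the header; a sketch, with the file name as a placeholder:

```python
import csv

def find_misaligned_rows(path: str, delimiter: str = ","):
    """Yield (line_number, field_count) for rows that don't match the header width."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        expected = len(header)
        for line_no, row in enumerate(reader, start=2):
            if len(row) != expected:
                yield line_no, len(row)

# Example usage:
# for line_no, n in find_misaligned_rows("raw_dump.csv"):
#     print(f"line {line_no}: {n} fields, expected {n} to match the header")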

Are dates encoded in different formats? This typically happens when combining datasets from various sources. Standardize the dates to deal with this issue.
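A sketch of one way to standardize them, using dateutil's parser (shipped as a pandas dependency); the dayfirst flag is an assumption about whether ambiguous dates like 07/04 are month-first:

```python
import pandas as pd
from dateutil import parser

def standardize_dates(values: pd.Series) -> pd.Series:
    """Parse dates written in assorted formats and re-emit them as ISO 8601 strings."""
    def parse_one(v):
        if pd.isna(v) or str(v).strip() == "":
            return None
        try:
            return parser.parse(str(v), dayfirst=False).date().isoformat()
        except (ValueError, OverflowError):
            return None   # unparseable value: leave it for manual review
    return values.map(parse_one)

# Example:
# standardize_dates(pd.Series(["2023-07-04", "07/04/2023", "4 July 2023"]))
# -> all three become "2023-07-04"
```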

Once cleaned, your data can be standardized. After that, your machine learning system can process it with your regular algorithms.

Other Concerns.

Occasionally, the data has problems that are out of your control. During my time at Wells Fargo, I noticed that the Tealeaf software broke Internet session IDs into several short sessions, resulting in incorrect userIDs and implausibly brief Internet sessions. This happened whenever a user switched servers, which is to say, frequently. The mystery was solved, and accurate assessments produced, by manually replicating these sessions and examining how they were tracked in the database.

Every now and then, an entire segment of the population is missing from the database. In the Covid data, for example, people who never got tested and recovered on their own, the great majority of the population in the early days, did not appear in any database. This led to a reported fatality rate of 6% instead of the more accurate 1%, with costly ramifications for public policy. Use common sense and creative problem-solving to identify such issues, and notify the relevant parties. When feasible, consult multiple data sources as well; in this instance, sewage data, a proxy dataset, provides the answer.

Addressing the Issues.

Some problems are simple to fix. For example, replace NaN with a blank if that is how you represent missing data, or better, compile a table listing every encoding used for missing data. A dictionary can be used to correct typos, misspelled names, and wrong zip code to city associations. Such a dictionary can be built from keyword associations and frequencies.
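A sketch of both ideas; the token list and the typo dictionary below are toy examples you would replace with the ones compiled from your own data:

```python
import numpy as np
import pandas as pd

# Tokens that various sources use to encode "missing" (illustrative list).
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "99999", "?"}

# Toy correction dictionary built from keyword associations and frequencies.
TYPO_DICT = {"New Yrok": "New York", "Sna Francisco": "San Francisco"}

def normalize_missing(s: pd.Series) -> pd.Series:
    """Map every known missing-data token to a single representation (NaN)."""
    return s.map(lambda v: np.nan
                 if str(v).strip().lower() in MISSING_TOKENS else v)

def fix_typos(s: pd.Series) -> pd.Series:
    """Replace known misspellings using the correction dictionary."""
    return s.replace(TYPO_DICT)
```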

Misaligned fields can be realigned using regular expressions or other methods that detect split fields, such as those caused by a stray comma being treated as a field separator. Merge duplicate IDs and features. Not all errors are reversible, however. In that case you can discard the affected portion of the dataset, though doing so may introduce bias. Sometimes the best option is to request a fresh, corrected data dump from your provider.
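As an illustration of the realignment idea for one specific failure mode, a stray comma inside a free-text column, here is a sketch; the column index and example row are assumptions:

```python
TEXT_COL = 3  # index of the free-text column that sometimes contains stray commas (assumption)

def realign_row(row, expected_width):
    """If a row has extra fields, re-join the spill-over into the free-text column."""
    extra = len(row) - expected_width
    if extra <= 0:
        return row
    merged = ",".join(row[TEXT_COL:TEXT_COL + extra + 1])   # glue the split pieces back
    return row[:TEXT_COL] + [merged] + row[TEXT_COL + extra + 1:]

# Example: the header has 5 fields, but a comment field contained a comma:
# realign_row(["1", "2023-07-04", "NY", "great", " fast shipping", "5"], 5)
# -> ["1", "2023-07-04", "NY", "great, fast shipping", "5"]
```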
