
Mastering Categorical Data Cleaning and Text Data Handling in Python



I. Handling Categorical Variables


A. Introduction to Categorical Variables


Definition of categorical variables and examples


A dataset is described by its variables, also called attributes. These variables can be broadly divided into two kinds: numerical and categorical. While numerical variables represent counts or measurements, categorical variables represent groupings or 'categories'. For instance, if you have a dataset of students, their ages would be a numerical variable, while their genders (male, female, non-binary, etc.) would be a categorical variable.


Challenges in dealing with categorical data


While categorical data is vital, it comes with challenges. These include inconsistencies (the same category written in different ways, such as 'Male', 'male', and 'M'), values that are simply incorrect, and an excess of categories that could be represented more succinctly.


Reason for inconsistencies in categorical data


Inconsistencies in categorical data often arise due to human error, lack of standardized data entry processes, or sometimes, system glitches. For instance, a person entering data might denote a customer's gender as 'M' in one entry and 'Male' in another. This could lead to complications when analyzing the data.


B. Approaches for Treating Categorical Data Problems


Dropping incorrect categories


The simplest way to deal with incorrect or inconsistent categories is to drop the affected rows. However, this should be done carefully, as it can discard valuable information.


Remapping incorrect categories to correct ones


An alternative and often better approach is to 'clean' the data by remapping incorrect or inconsistent categories to the correct ones. For example, 'Male', 'M', and 'male' could all be remapped to 'Male' to ensure consistency.
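
As a minimal sketch of such a remapping, the snippet below uses a made-up gender column and an illustrative (not exhaustive) dictionary of variants; both are assumptions for demonstration only.

```python
import pandas as pd

# Hypothetical column with several spellings of the same two categories
gender = pd.Series(['Male', 'M', 'male', 'Female', 'F', 'female'])

# Illustrative dictionary from each variant to its canonical label
canonical = {'M': 'Male', 'male': 'Male', 'F': 'Female', 'female': 'Female'}

# replace() swaps whole values found in the dictionary and leaves the rest untouched
gender_clean = gender.replace(canonical)
print(gender_clean.unique())
```

Values that are already canonical ('Male', 'Female') pass through unchanged, so the dictionary only needs to list the variants.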


C. Practical Example: Handling Categorical Data


A dataset containing blood types


Let's consider a dataset of blood donors with a column for blood type. However, this column has been filled inconsistently with entries like 'A', 'a', 'Type A', 'type a', 'B', 'b', 'Type B', 'type b', etc.


import pandas as pd

data = {'Name': ['Donor1', 'Donor2', 'Donor3', 'Donor4'],
        'Blood Type': ['A', 'a', 'Type A', 'type a']}
df = pd.DataFrame(data)
print(df)

Output:


    Name Blood Type
0  Donor1          A
1  Donor2          a
2  Donor3     Type A
3  Donor4     type a


Identifying inconsistencies in the dataset


From a quick glance at the data, you can identify that there's inconsistency in the way the blood types are denoted.


D. Understanding Data Joins


Explanation of Anti-joins and Inner joins


In data processing, join operations are used to combine rows from two or more tables based on a related column. An inner join returns only the rows that have a match in both tables. An anti-join, on the other hand, returns the rows of the left table that have no match in the right table.


Practical application of Anti-join and Inner join on dataset to treat inconsistent data


In the context of cleaning categorical data, inner join could be used to ensure that only consistent data is retained, while anti-join could be used to identify inconsistent data.
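
pandas has no dedicated anti-join, so one common way to express both joins is with merge. The sketch below assumes a small donor table and a hypothetical reference table of valid blood types; the anti-join is built from a left join with indicator=True.

```python
import pandas as pd

donors = pd.DataFrame({'Name': ['Donor1', 'Donor2', 'Donor3', 'Donor4'],
                       'Blood Type': ['A', 'a', 'Type A', 'B']})

# Hypothetical reference table listing the only accepted categories
valid_types = pd.DataFrame({'Blood Type': ['A', 'B', 'AB', 'O']})

# Inner join: keep only donors whose blood type appears in the reference table
consistent = donors.merge(valid_types, on='Blood Type', how='inner')

# Anti-join: left join with an indicator column, then keep rows found only on the left
merged = donors.merge(valid_types, on='Blood Type', how='left', indicator=True)
inconsistent = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(consistent)
print(inconsistent)
```

The inconsistent frame is the set of rows to inspect, remap, or drop; the consistent frame is what survives if you simply discard them.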


E. Python Approach for Finding and Treating Inconsistent Categories


How to identify inconsistent categories in Python


To identify inconsistent categories, you can use the unique() method of a pandas Series. For instance, in our blood donor data:


print(df['Blood Type'].unique())

Output:


['A' 'a' 'Type A' 'type a']


How to drop inconsistent rows in Python


You can drop inconsistent rows by subsetting the dataframe to retain only the rows you want. For instance, to drop all rows with blood type 'a':


df_clean = df[df['Blood Type'] != 'a']
print(df_clean)

Output:


    Name Blood Type
0  Donor1          A
2  Donor3     Type A
3  Donor4     type a
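
Dropping one spelling at a time does not scale well. As an alternative sketch (an assumption about how one might proceed, not the only option), the variants in this particular column can be normalized with string methods so that no rows are lost.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Donor1', 'Donor2', 'Donor3', 'Donor4'],
                        'Blood Type': ['A', 'a', 'Type A', 'type a']})

# Lower-case, remove the literal word 'type', trim spaces, then upper-case
normalized = (df_demo['Blood Type']
              .str.lower()
              .str.replace('type', '', regex=False)
              .str.strip()
              .str.upper())

df_demo['Blood Type'] = normalized
print(df_demo['Blood Type'].unique())
```

All four spellings collapse to a single category while every donor row is kept.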




II. Dealing with Different Types of Errors in Categorical Variables


A. Types of Errors in Categorical Data


Value Membership Constraint


A categorical variable's value should ideally belong to a defined set of categories. Any value not part of these categories violates the value membership constraint. For example, in a dataset of animal species, an entry like 'Blue Whale' under the column 'Bird Species' is clearly a violation.
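
A membership check can be sketched with a set difference and isin. The species values and the allowed set below are hypothetical.

```python
import pandas as pd

# Hypothetical column and the set of categories it is allowed to contain
species = pd.Series(['Sparrow', 'Eagle', 'Blue Whale', 'Robin'], name='Bird Species')
allowed = {'Sparrow', 'Eagle', 'Robin', 'Owl'}

# Values present in the data but missing from the allowed set
violations = set(species) - allowed

# Boolean mask selecting the rows that satisfy the constraint
valid_rows = species[species.isin(allowed)]

print(violations)
print(valid_rows.tolist())
```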


Value Inconsistency


As previously mentioned, value inconsistency happens when the same category is represented differently. This could be due to capitalization, leading/trailing spaces, or even spelling errors.


Presence of too many categories that could be collapsed into one


Sometimes, data might have too many categories that could be more effectively represented with fewer categories. For example, a dataset containing country names might include 'USA', 'United States', 'United States of America', etc., all of which could be represented by one category.


Ensuring data is of the right type


Finally, the data must be of the correct type. For example, a column of age groups (like '10-20', '20-30', etc.) should be stored as a string or as a pandas category type, not as a numeric type.
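
One way to enforce the type in pandas, offered here as a sketch rather than a required step, is to convert the column to the category dtype. The age-group values are made up, and marking them as ordered is an assumption that suits ranges like these.

```python
import pandas as pd

age_group = pd.Series(['10-20', '20-30', '10-20', '30-40'])

# Declare the allowed categories and their order explicitly
group_dtype = pd.CategoricalDtype(['10-20', '20-30', '30-40'], ordered=True)
age_group = age_group.astype(group_dtype)

print(age_group.dtype)
print(age_group.max())  # works because the categories are ordered
```

A value outside the declared categories would become NaN on conversion, which doubles as a quick membership check.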


B. Dealing with Value Inconsistency


Explanation and example of value inconsistency due to capitalization


In a dataset, you might find that the same category is represented with different capitalization. For instance, 'Male', 'male', and 'MALE' are inconsistent due to capitalization.


How to treat capitalization inconsistency with Python


The easiest way to treat this inconsistency is by converting all the entries to a standard form, like lower case:


df['Gender'] = df['Gender'].str.lower()


Explanation and example of value inconsistency due to leading or trailing spaces


A common source of inconsistency in data can be trailing or leading spaces. For example, ' Male' and 'Male ' are the same categories but treated differently because of the extra space.


How to remove leading or trailing spaces with Python


The pandas .str.strip() method removes leading and trailing whitespace from string data:


df['Gender'] = df['Gender'].str.strip()
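
The two fixes are often applied together. A small sketch on a hypothetical Gender column:

```python
import pandas as pd

df_gender = pd.DataFrame({'Gender': [' Male', 'male ', 'MALE', 'Female ', ' female']})

# Trim surrounding whitespace first, then standardize capitalization
df_gender['Gender'] = df_gender['Gender'].str.strip().str.lower()

print(df_gender['Gender'].unique())
```

Five distinct raw strings reduce to two categories, 'male' and 'female'.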


C. Collapsing Data into Categories


How to create categories from data


You can create categories from data using binning or grouping methods. For example, numerical age data can be grouped into categories like 'child', 'teenager', 'adult', etc.


Using qcut function in Python


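In pandas, the qcut function divides numerical data into quantile-based bins, so each bin holds roughly the same number of rows. Because the bin edges depend on the data rather than on fixed age ranges, neutral quartile labels are safer than names such as 'child' or 'adult':


quartile_labels = ['Q1', 'Q2', 'Q3', 'Q4']
df['Age Quartile'] = pd.qcut(df['Age'], q=4, labels=quartile_labels)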


Using cut function in Python


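The cut function divides data into bins defined by value ranges. Passing an integer gives equal-width bins spanning the observed minimum and maximum; to get bins that match the labels below, pass the edges explicitly:


age_labels = ['0-20', '21-40', '41-60', '61-80']
# Edges are right-inclusive; include_lowest=True keeps an age of 0 in the first bin
df['Age Group'] = pd.cut(df['Age'], bins=[0, 20, 40, 60, 80], labels=age_labels, include_lowest=True)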
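
To see how the two functions differ, here is a small sketch on made-up ages. The labels and bin edges are illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([5, 12, 17, 25, 33, 41, 58, 79], name='Age')

# qcut: bin edges come from the data's quantiles, so counts per bin are balanced
quartile = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# cut: bin edges are fixed by us, so counts per bin can differ
age_group = pd.cut(ages, bins=[0, 20, 40, 60, 80],
                   labels=['0-20', '21-40', '41-60', '61-80'],
                   include_lowest=True)

print(quartile.value_counts().sort_index())
print(age_group.value_counts().sort_index())
```

With these eight ages, qcut places two values in every quartile, while cut places three in '0-20' and only one in '61-80'.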


Mapping categories to fewer ones


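Sometimes, you may need to map a large number of categories to a smaller number. This can be done with a dictionary and the replace method, which leaves values that are not in the dictionary unchanged (map would turn them into NaN, including rows that already read 'United States'):


mapping = {'USA': 'United States', 'United States of America': 'United States'}
df['Country'] = df['Country'].replace(mapping)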
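
One caveat is worth checking on a small example: map and replace treat values that are missing from the dictionary differently. The country values below are made up for illustration.

```python
import pandas as pd

country = pd.Series(['USA', 'United States', 'United States of America', 'Canada'])
mapping = {'USA': 'United States', 'United States of America': 'United States'}

# map: values without a dictionary entry become NaN
with_map = country.map(mapping)

# replace: values without a dictionary entry are kept as they are
with_replace = country.replace(mapping)

print(with_map.tolist())
print(with_replace.tolist())
```

If map is preferred, the dictionary has to list every value that should survive, including the canonical ones.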




III. Cleaning Text Data


A. Introduction to Text Data


What is text data and examples


Text data, or string data, is any data that's stored as a sequence of characters. It might include names, addresses, descriptions, etc. For instance, a column storing phone numbers in a customer database is a form of text data.


Common problems with text data


Text data can be tricky. Common issues include inconsistent formatting, typos, special characters, and unstructured data.


B. Practical Example: Cleaning Text Data


Suppose we have a dataset containing customer information with a column for phone numbers. The phone numbers are entered in various formats (e.g., "+1-555-555-5555", "1 555 5555555", "(555) 555-5555"), which makes it hard to process and analyze this data uniformly.


C. Fixing the Phone Number Column


How to replace a certain symbol in the dataset


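You can replace symbols or characters in a column with the pandas .str.replace() method. Here's how to remove the '+' symbol; regex=False makes pandas treat '+' as a literal character rather than a regular-expression quantifier:


df['Phone Number'] = df['Phone Number'].str.replace('+', '', regex=False)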


How to remove certain symbols from the dataset


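Similarly, to remove multiple symbols like '-' and '(', we can chain .str.replace() calls. Each call needs the .str accessor; a plain .replace() on the Series would only match whole values, not characters inside them:


df['Phone Number'] = (df['Phone Number']
                      .str.replace('+', '', regex=False)
                      .str.replace('-', '', regex=False)
                      .str.replace('(', '', regex=False)
                      .str.replace(')', '', regex=False))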


Replacing phone numbers below a certain length


If you notice some phone numbers are too short to be valid, you can replace them with NaN (this needs NumPy for np.nan):


import numpy as np

df.loc[df['Phone Number'].str.len() < 10, 'Phone Number'] = np.nan


Writing assert statements to test the column


To ensure that all phone numbers now have the correct format, you can write an assert statement:


assert df['Phone Number'].str.contains(r'^\d{10}$').all(), "Not all phone numbers are in the correct format!"

This will raise an AssertionError if any phone number doesn't match the format.
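
Putting the steps of this section together, here is an end-to-end sketch on three made-up phone numbers. Removing spaces and keeping only the last ten digits (which drops a leading country code) are assumptions added for the example, not steps from the text above.

```python
import pandas as pd

phones = pd.DataFrame({'Phone Number': ['+1-555-555-5555', '(555) 555-5555', '555-1234']})

# Remove each symbol as a literal character
cleaned = phones['Phone Number']
for symbol in ['+', '-', '(', ')', ' ']:
    cleaned = cleaned.str.replace(symbol, '', regex=False)

# Assumption: keep the last 10 digits, discarding any country code
cleaned = cleaned.str[-10:]

# Numbers shorter than 10 characters become NaN
cleaned = cleaned.where(cleaned.str.len() >= 10)
phones['Phone Number'] = cleaned

# Every remaining (non-missing) number must be exactly 10 digits
valid = phones['Phone Number'].dropna()
assert valid.str.fullmatch(r'\d{10}').all(), "Not all phone numbers are in the correct format!"
print(phones)
```

The first two numbers end up as '5555555555' and the third, being too short, is set to NaN.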


D. Handling Complex Text Data with Regular Expressions


How regular expressions can be used in handling text data


Regular expressions, or regex, are a powerful tool for searching, matching, and manipulating text data. They allow you to define search patterns for complex string manipulation.


Practical example of using regular expressions to clean a complex dataset


Suppose the phone number data is more complex, with entries like "+1-555-abc-5555" or "555.555.555", and we want to extract only the numeric parts.

We can use a regex pattern to achieve this:


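df['Phone Number'] = df['Phone Number'].str.replace(r'\D', '', regex=True)


The \D pattern matches any non-digit character, and each match is replaced with an empty string, effectively removing it. Passing regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal string by default.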
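
A quick sketch on made-up entries shows what the pattern leaves behind. Note that stripping non-digits does not make a number valid, so a length check is still needed afterwards.

```python
import pandas as pd

raw = pd.Series(['+1-555-abc-5555', '555.555.555', '(555) 555-5555'])

# Remove every character that is not a digit
digits_only = raw.str.replace(r'\D', '', regex=True)

# Length of what remains, to spot entries that are still too short
lengths = digits_only.str.len()

print(digits_only.tolist())
print(lengths.tolist())
```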



IV. Visualizing Categorical Data


Visualizing your data is an important step in understanding it, and it's particularly useful when dealing with categorical data because it can help identify patterns, trends, and outliers that may not be apparent in a tabular format.


A. Introduction to Categorical Data Visualization


Categorical data, as we discussed, is data whose values fall into a limited set of categories. Nominal categories such as blood type, nationality, or product category have no inherent order or priority.


Visualizations can help us understand the distribution of these categories better and even compare them against each other or against a numerical variable.


B. Python Libraries for Visualization


Python offers multiple libraries for data visualization, such as Matplotlib, Seaborn, and Plotly. Each has its own strengths and weaknesses, so the choice of library often depends on the specific needs of your project.

For this tutorial, we'll use Seaborn because it's powerful, versatile, and it integrates well with pandas dataframes.


Bar plot


Bar plots are the most common tool for visualizing categorical data. They display the count (or some other aggregation) of the data points for each category.


import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.countplot(x='Blood Type', data=df)
plt.title('Distribution of Blood Types')
plt.show()

This would create a bar plot showing the count of each blood type in our data.


Box plot


Box plots are useful when you want to compare a categorical variable against a numerical variable.


plt.figure(figsize=(10,6))
sns.boxplot(x='Blood Type', y='Age', data=df)
plt.title('Age distribution by Blood Type')
plt.show()


This would display the distribution of ages for each blood type, showing the median (the line inside the box), the interquartile range (the box), and possible outliers (the dots outside the whiskers).


Remember, visualizing your data is a key step in any data science project. It helps you understand the data, find patterns and outliers, and communicate your findings effectively.


That concludes our comprehensive tutorial on handling and visualizing categorical data in Python. We've explored what categorical data is, how to clean and preprocess it, and how to visualize it. I hope this tutorial has been informative and useful to you. Happy coding!
