Welcome to this in-depth guide on handling categorical variables in pandas. Through this tutorial, we aim to provide you with a complete understanding of manipulating categorical variables using the powerful Python library, pandas. The methods and code snippets covered here will enable you to handle real-world datasets and extract meaningful insights.
We will be working with a dataset named 'Adoptable Dogs' in this tutorial, and focus on one of its features, 'coat', to illustrate our examples.
I. Handling Categorical Variables in Pandas
Categorical variables are a critical component of many datasets and are prevalent across a variety of fields. Understanding how to manipulate these variables in pandas is essential for data analysis and modeling.
Example: Consider a supermarket dataset that includes a 'Product Category' feature. This column could include values such as 'Groceries', 'Electronics', 'Clothing', etc. The column as a whole is the categorical variable, and each distinct value is one of its categories.
Explanation of categorical data types in pandas
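pandas offers a 'category' data type for categorical variables. It stores each distinct value once and represents the column as small integer codes, which typically reduces memory usage and can speed up operations such as grouping when a column has few distinct values relative to its length.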
import pandas as pd
# Creating a pandas series
series = pd.Series(["small", "medium", "large", "medium", "small"])
# Converting the series into 'category' data type
series = series.astype('category')
print(series.dtype)
Output:
category
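To make the memory claim concrete, here is a minimal sketch that compares the footprint of the same values stored as plain strings and as 'category'. The data is made up for illustration, and the exact byte counts depend on your pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times.
sizes = pd.Series(["small", "medium", "large"] * 10_000)
as_category = sizes.astype("category")

# deep=True also counts the memory held by the string values themselves.
plain_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"plain strings: {plain_bytes} bytes")
print(f"category:      {category_bytes} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, so the conversion is most useful for columns with many repeats.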
Introduction to the 'Adoptable Dogs' dataset
Let's introduce the dataset we will be working with.
# Loading the dataset
dogs = pd.read_csv('adoptable_dogs.csv')
Please remember to replace 'adoptable_dogs.csv' with the path where your dataset is located. After loading the data, let's check the first few rows.
# Preview the dataset
print(dogs.head())
Assume the output is:
coat age weight
0 short 3 30
1 medium 2 20
2 long 1 10
3 medium 4 40
4 short 3 30
In this tutorial, we will primarily work with the 'coat' column, which describes the length of the dogs' coat.
Converting 'coat' variable into a categorical data type
To convert a pandas Series to the 'category' data type, we use the 'astype' method.
# Convert 'coat' column to 'category'
dogs['coat'] = dogs['coat'].astype('category')
print(dogs.dtypes)
Output:
coat category
age int64
weight int64
dtype: object
Utilizing the .cat accessor object in pandas to manipulate categories
The '.cat' accessor in pandas provides several useful methods to work with categorical data.
# Access the 'coat' categorical variable
print(dogs['coat'].cat.categories)
Output:
Index(['long', 'medium', 'short'], dtype='object')
The .cat.categories attribute gives us the categories in our data.
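Besides 'categories', the '.cat' accessor also exposes the integer codes and the 'ordered' flag. The sketch below uses a small stand-in Series rather than the real dataset.

```python
import pandas as pd

# Stand-in for the 'coat' column (sample values only).
coat = pd.Series(["short", "medium", "long", "medium", "short"]).astype("category")

# Categories are the distinct labels, sorted alphabetically by default.
print(list(coat.cat.categories))   # ['long', 'medium', 'short']
# Codes give the position of each value within the categories.
print(coat.cat.codes.tolist())     # [2, 1, 0, 1, 2]
# Categories are unordered unless you say otherwise.
print(coat.cat.ordered)            # False
```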
Note: Code snippets provided here are examples. If you're following along with a different dataset or column, make sure to replace the dataset name and column name with yours.
II. Setting Categories in a Pandas Series
Ordering categories and setting specific categories can bring better structure to your data. In this section, we will cover the process of setting and ordering categories in pandas.
Example: Imagine you are analyzing a restaurant dataset where the 'Meal Size' feature includes 'small', 'medium', and 'large'. Here, there is an inherent order in the categories, making it an ordered categorical variable.
How to set categories for a series using 'set_categories'
The 'set_categories' method allows you to explicitly set the categories in your data.
# Set new categories
dogs['coat'] = dogs['coat'].cat.set_categories(["short", "medium", "long"])
print(dogs['coat'].cat.categories)
Output:
Index(['short', 'medium', 'long'], dtype='object')
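Here we've listed the categories in a specific sequence: 'short', 'medium', 'long'. On its own this only changes how the categories are stored and displayed; the variable is still unordered until we also pass ordered=True, as shown later in this section.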
Explanation of the effects of not including a category in 'set_categories'
If a category present in your data is not included when using 'set_categories', pandas replaces every value in that category with NaN (missing). The 'value_counts' method excludes missing data by default, so those rows disappear from the counts. To keep the original column intact, we assign the result to a new variable here instead of overwriting dogs['coat'].
# Set categories excluding 'short'
coat_without_short = dogs['coat'].cat.set_categories(["medium", "long"])
print(coat_without_short.value_counts())
Output:
medium 2
long 1
dtype: int64
The 'short' values are now NaN in 'coat_without_short', so they are not counted. Because the result was stored in a new variable, dogs['coat'] itself still contains 'short'.
Assigning order to categories using the 'ordered' parameter
You can give your categories an order by passing ordered=True to 'set_categories', or by passing a 'pd.CategoricalDtype' created with ordered=True to 'astype'.
# Set categories with order
dogs['coat'] = dogs['coat'].cat.set_categories(["short", "medium", "long"], ordered=True)
print(dogs['coat'].cat.categories)
Output:
Index(['short', 'medium', 'long'], dtype='object')
The order is now set as 'short' < 'medium' < 'long'.
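Since 'cat.categories' looks the same for ordered and unordered data, it can help to check the 'ordered' flag and try an operation that relies on it. A minimal sketch on a stand-in Series:

```python
import pandas as pd

# Stand-in for the 'coat' column (sample values only).
coat = pd.Series(["short", "medium", "long", "medium", "short"]).astype("category")
coat = coat.cat.set_categories(["short", "medium", "long"], ordered=True)

print(coat.cat.ordered)              # True
# min and max are defined for ordered categoricals.
print(coat.min(), coat.max())        # short long
# Comparisons against a category label work element by element.
longer_than_short = coat > "short"
print(longer_than_short.tolist())    # [False, True, True, True, False]
# Sorting follows the category order, not alphabetical order.
print(coat.sort_values().tolist())   # ['short', 'short', 'medium', 'medium', 'long']
```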
Explanation of missing categories and ways to handle them
As we saw earlier, values whose category is left out when setting categories become missing (NaN). 'value_counts' drops missing values by default; pass dropna=False to count them as well.
# Value counts including missing values
coat_without_short = dogs['coat'].cat.set_categories(["medium", "long"])
print(coat_without_short.value_counts(dropna=False))
Output (rows with tied counts may appear in a different order):
medium 2
NaN 2
long 1
dtype: int64
The NaN row counts the values whose category ('short') was left out of 'set_categories'.
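Once values have become missing, two common ways to handle them are to drop those rows or to fill them with a placeholder category. The sketch below assumes a placeholder label of 'unknown', which is a hypothetical name chosen for this example.

```python
import pandas as pd

# Stand-in for the 'coat' column (sample values only).
coat = pd.Series(["short", "medium", "long", "medium", "short"]).astype("category")

# Leaving 'short' out of the categories turns those values into NaN.
trimmed = coat.cat.set_categories(["medium", "long"])
print(trimmed.isna().sum())          # 2

# Option 1: drop the rows with missing values.
dropped = trimmed.dropna()
print(len(dropped))                  # 3

# Option 2: register a placeholder category first, then fill with it.
# 'unknown' is a hypothetical label, not something the dataset defines.
filled = trimmed.cat.add_categories(["unknown"]).fillna("unknown")
print(filled.tolist())               # ['unknown', 'medium', 'long', 'medium', 'unknown']
```

The placeholder has to be added as a category before filling, because a categorical only accepts values from its category list.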
Now that we've covered setting and ordering categories, let's move to adding and removing categories in a pandas series.
III. Adding and Removing Categories in a Pandas Series
Handling categories in a dataset isn't always about what you have; it's also about what you might need to add or remove. This part will take you through the process of adding and removing categories in a pandas Series.
Example: Let's consider the 'Product Category' in a supermarket dataset again. As the supermarket expands, they might start selling 'Toys', which needs to be added as a new category. On the other hand, if they stop selling 'Electronics', that category would need to be removed.
Adding new categories with 'add_categories'
The 'add_categories' method allows us to add new categories without changing the existing ones.
# Adding new categories
dogs['coat'] = dogs['coat'].cat.add_categories(['extra_short', 'extra_long'])
print(dogs['coat'].cat.categories)
Output:
Index(['short', 'medium', 'long', 'extra_short', 'extra_long'], dtype='object')
You can see the new categories 'extra_short' and 'extra_long' have been added.
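A quick way to confirm that 'add_categories' leaves the existing data alone is to compare the values before and after. This sketch uses a stand-in Series and only one of the new labels.

```python
import pandas as pd

# Stand-in for the 'coat' column (sample values only).
coat = pd.Series(["short", "medium", "long"]).astype("category")
extended = coat.cat.add_categories(["extra_long"])

# The values are unchanged; only the list of allowed labels grew.
print(extended.tolist())                       # ['short', 'medium', 'long']
print(list(extended.cat.categories))           # ['long', 'medium', 'short', 'extra_long']
# The new category exists but has no rows yet.
print(extended.value_counts()["extra_long"])   # 0
```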
Updating the categories with new values
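Sometimes, you might want to change the list of categories in your dataset. This can be done using the 'set_categories' method. Any category left out of the new list is dropped, and any values that used it become NaN, so this is best suited to swapping categories that no rows use.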
# Updating categories
dogs['coat'] = dogs['coat'].cat.set_categories(['short', 'medium', 'long', 'very_short', 'very_long'])
print(dogs['coat'].cat.categories)
Output:
Index(['short', 'medium', 'long', 'very_short', 'very_long'], dtype='object')
In the output, 'extra_short' and 'extra_long' have been dropped and 'very_short' and 'very_long' have been added. No data is lost here because no rows used the dropped categories.
Removing categories using 'remove_categories' method
To remove categories, we use the 'remove_categories' method.
# Removing categories
dogs['coat'] = dogs['coat'].cat.remove_categories(['very_short', 'very_long'])
print(dogs['coat'].cat.categories)
Output:
Index(['short', 'medium', 'long'], dtype='object')
As you can see, the categories 'very_short' and 'very_long' have been removed.
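The example above removes categories that no rows use. If a removed category is still in use, its values become NaN. A minimal sketch on a stand-in Series:

```python
import pandas as pd

# Stand-in for the 'coat' column (sample values only).
coat = pd.Series(["short", "medium", "long"]).astype("category")

# Removing a category that is still in use replaces its values with NaN.
no_long = coat.cat.remove_categories(["long"])
print(no_long.tolist())                  # ['short', 'medium', nan]
print(list(no_long.cat.categories))      # ['medium', 'short']
```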
A summary of methods covered
So far, we've seen how to add, update, and remove categories. Remember, using the correct method for your requirements will help you efficiently modify your categorical data.
IV. Updating and Collapsing Categories in a Pandas Series
Up next, we'll discuss how to rename categories and deal with the resulting data type changes.
Example: Suppose you're analyzing a company's employee dataset, and the 'Department' feature has categories like 'HR', 'Marketing', 'Finance', etc. Later, if the company decides to merge the 'HR' and 'Admin' departments into one 'HR & Admin', we need to update our categories accordingly.
Renaming categories with 'rename_categories'
To rename categories, we use the 'rename_categories' method. The new names must be unique, so this method cannot merge several categories into one; collapsing is done by mapping values instead, as shown later in this section.
# Renaming categories
dogs['coat'] = dogs['coat'].cat.rename_categories({'short': 'short_coat', 'medium': 'medium_coat', 'long': 'long_coat'})
print(dogs['coat'].cat.categories)
Output:
Index(['short_coat', 'medium_coat', 'long_coat'], dtype='object')
You can see the categories are renamed as specified.
Common issues while renaming categories
When you pass a dictionary, any category that is not listed in it simply keeps its current name. When you pass a list instead, it must contain exactly one new name per existing category, in the same order, or pandas raises an error. In both cases the new names must be unique.
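The sketch below shows two ways of calling 'rename_categories' on a stand-in Series: a dictionary that renames a single category, and a callable that is applied to every category.

```python
import pandas as pd

# Stand-in for the 'coat' column (sample values only).
coat = pd.Series(["short", "medium", "long"]).astype("category")

# Only 'short' is listed; 'medium' and 'long' keep their names.
partial = coat.cat.rename_categories({"short": "short_coat"})
print(partial.tolist())   # ['short_coat', 'medium', 'long']

# A callable is applied to each category in turn.
upper = coat.cat.rename_categories(str.upper)
print(upper.tolist())     # ['SHORT', 'MEDIUM', 'LONG']
```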
Creating a new categorical column from existing categories
You can create new categories by mapping existing ones to new values.
# Create new category column
dogs['coat_length'] = dogs['coat'].map({'short_coat': 'short', 'medium_coat': 'medium', 'long_coat': 'long'})
dogs['coat_length'] = dogs['coat_length'].astype('category')
print(dogs.dtypes)
Output:
coat category
age int64
weight int64
coat_length category
dtype: object
Here, we have created a new categorical column 'coat_length' from the 'coat' column.
Handling data type changes after updating categories
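Keep an eye on the data type after updating categories. Methods on the '.cat' accessor, such as 'rename_categories', keep the 'category' dtype, but other operations may not. For example, 'map' with a many-to-one mapping, which is how categories are collapsed, returns an 'object' column (the label 'not_short' below is only illustrative).
collapsed = dogs['coat'].map({'short_coat': 'short', 'medium_coat': 'not_short', 'long_coat': 'not_short'})
print(collapsed.dtype)
Output:
object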
In this case, you need to convert the 'object' data type back to 'category' as we did when creating the 'coat_length' column.
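Collapsing several categories into one is a typical case where the dtype changes. This is a minimal sketch on a stand-in Series; the merged label 'not_short' is a hypothetical name chosen for the example.

```python
import pandas as pd

# Stand-in for the 'coat' column (sample values only).
coat = pd.Series(["short", "medium", "long", "medium"]).astype("category")

# Many-to-one mapping: 'medium' and 'long' collapse into one label.
# 'not_short' is a hypothetical label used only in this sketch.
collapse = {"short": "short", "medium": "not_short", "long": "not_short"}
collapsed = coat.map(collapse)

# The result of a many-to-one map is no longer categorical.
dtype_after_map = collapsed.dtype
print(isinstance(dtype_after_map, pd.CategoricalDtype))   # False

# Convert back to 'category' once the collapse is done.
collapsed = collapsed.astype("category")
print(list(collapsed.cat.categories))                     # ['not_short', 'short']
```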
V. Reordering Categories in a Pandas Series
Sometimes, the order of your categories might have significant implications on your analysis. We are going to explore situations where reordering might be needed and how to go about it.
Example: Let's take our supermarket dataset. The 'Product Category' might include 'Bakery', 'Dairy', 'Meat', and 'Produce'. If we want to analyze the data in terms of product shelf life, we might need to reorder our categories as 'Bakery', 'Meat', 'Dairy', and 'Produce'.
Situations where reordering of categories might be needed
Reordering might be needed when the order of categories has inherent meaning. For instance, in our example, 'Bakery' items have the shortest shelf life, followed by 'Meat', 'Dairy', and 'Produce'.
How to reorder categories in pandas
You can reorder categories using the 'reorder_categories' method. Here's how you can do it:
# Reorder categories
dogs['coat'] = dogs['coat'].cat.reorder_categories(['medium_coat', 'short_coat', 'long_coat'])
print(dogs['coat'].cat.categories)
Output:
Index(['medium_coat', 'short_coat', 'long_coat'], dtype='object')
The categories have been reordered as per the sequence mentioned in the 'reorder_categories' method.
Understanding how grouping works with ordered and unordered categories
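Grouping and sorting follow the category order rather than alphabetical order, whether or not the categorical is marked as ordered. The 'ordered' flag additionally enables comparisons such as < and >, as well as 'min' and 'max'.
# Grouping by a categorical column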
grouped_dogs = dogs.groupby('coat').size()
print(grouped_dogs)
Output:
coat
medium_coat 75
short_coat 95
long_coat 50
dtype: int64
The groups appear in the category order set above ('medium_coat', 'short_coat', 'long_coat') rather than alphabetically or by size.
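Grouping by a categorical column can also report categories that have no rows. The sketch below uses made-up data and passes observed=False explicitly, since the default for that parameter differs between pandas versions.

```python
import pandas as pd

# Made-up data; 'long' is a valid category but no row uses it.
dogs = pd.DataFrame({"coat": ["short", "medium", "medium", "short", "short"]})
dogs["coat"] = pd.Categorical(dogs["coat"], categories=["medium", "short", "long"])

# observed=False keeps categories with no rows in the result.
counts = dogs.groupby("coat", observed=False).size()
print(counts.to_dict())   # {'medium': 2, 'short': 3, 'long': 0}
```

Passing observed=True instead would leave 'long' out of the result.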
VI. Cleaning and Accessing Categorical Data
Cleaning and accessing data is an integral part of any data analysis process. In this section, we will discuss potential issues with categorical data and how to handle them.
Example: Let's say we are working on a survey dataset where one of the questions is about the respondent's 'Education Level' with categories like 'Primary', 'Secondary', 'Bachelors', 'Masters', and 'PhD'. There might be inconsistencies like different respondents referring to 'Bachelors' as 'Bachelor's', 'bachelors', 'Bach.', etc. Such inconsistencies need to be fixed for accurate analysis.
Highlighting potential issues in categorical data
Some potential issues in categorical data could be:
Inconsistent categorization: As mentioned in our example.
Spelling issues: Typos can create unnecessary categories.
Case sensitivity: 'Bachelors' and 'bachelors' would be treated as two different categories.
Identifying issues in categorical data using the '.cat' accessor and the 'value_counts' method
To identify such issues, you can use the 'value_counts' method, which gives a count of each category.
# Identifying issues
print(dogs['coat'].value_counts())
Output:
short_coat 95
medium_coat 75
long_coat 50
dtype: int64
You can inspect the output for any anomalies in your categories.
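On messy data, the anomalies show up as extra, near-duplicate labels. The sketch below uses made-up labels with stray spaces and mixed capitalization, and flags the ones that change when normalized.

```python
import pandas as pd

# Made-up labels with stray spaces and inconsistent capitalization.
raw = pd.Series(["short", "Short", " short", "medium", "long ", "medium"])

# More distinct labels than expected is a first warning sign.
print(raw.nunique())   # 5
print(raw.value_counts())

# Labels that differ from their stripped, lowercased form are suspects.
normalized = raw.str.strip().str.lower()
suspects = sorted(raw[raw != normalized].unique())
print(suspects)        # [' short', 'Short', 'long ']
```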
Let's continue cleaning our categorical data and see how to handle various issues.
Fixing issues such as white spaces and inconsistent capitalization
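We can use the 'str' accessor along with string methods to fix issues like leading/trailing whitespace and inconsistent capitalization. String methods return plain strings rather than a categorical, so we convert the result back to 'category' afterwards. This also resets any custom category order.
# Removing white spaces and converting to lowercase
dogs['coat'] = dogs['coat'].str.strip().str.lower().astype('category')
print(dogs['coat'].cat.categories)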
Output:
Index(['long_coat', 'medium_coat', 'short_coat'], dtype='object')
Now, our categories are more consistent.
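The dtype change caused by string methods is easy to miss, so here is a small sketch on made-up labels that checks the dtype before and after converting back.

```python
import pandas as pd

# Made-up labels with whitespace and capitalization problems.
coat = pd.Series([" Short", "medium ", "LONG", "short"]).astype("category")

# .str methods return plain strings, not a categorical.
cleaned = coat.str.strip().str.lower()
was_categorical = isinstance(cleaned.dtype, pd.CategoricalDtype)
print(was_categorical)                    # False

# Converting back restores the 'category' dtype.
cleaned = cleaned.astype("category")
print(list(cleaned.cat.categories))       # ['long', 'medium', 'short']
```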
Handling typos in categories
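Typos can be tricky to handle. One common method is to manually map the misspelt categories to the correct ones using the 'replace' method, then convert the result back to 'category'. The mapping below swaps underscores for spaces; a real typo fix works the same way.
# Fixing typos
typo_fix = {'long_coat': 'long coat', 'medium_coat': 'medium coat', 'short_coat': 'short coat'}
# Replace on plain values, then restore the 'category' dtype
dogs['coat'] = dogs['coat'].astype('object').replace(typo_fix).astype('category')
print(dogs['coat'].cat.categories)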
Output:
Index(['long coat', 'medium coat', 'short coat'], dtype='object')
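For actual misspellings the same pattern applies. In this sketch the misspelt labels 'shrot' and 'meduim' are invented for illustration.

```python
import pandas as pd

# Made-up data containing two invented misspellings.
raw = pd.Series(["short", "shrot", "medium", "meduim", "long"])

# Map each misspelt label to its intended spelling.
typo_fix = {"shrot": "short", "meduim": "medium"}
fixed = raw.replace(typo_fix).astype("category")

print(fixed.tolist())                  # ['short', 'short', 'medium', 'medium', 'long']
print(list(fixed.cat.categories))      # ['long', 'medium', 'short']
```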
Checking and changing the data type of the columns
It's good practice to check the data types of your columns, especially after making changes to them.
# Checking data type
print(dogs['coat'].dtype)
Output:
category
Using 'str' accessor object for updating and filtering categories
Pandas provides a powerful 'str' accessor for string manipulation. This can be particularly useful for categorical data. For example, you can easily filter out categories that contain a specific word.
# Filtering coats that are short
short_coats = dogs[dogs['coat'].str.contains('short')]
print(short_coats.head())
Output (for the five sample rows shown earlier):
coat age weight
0 short coat 3 30
4 short coat 3 30
Accessing categorical data with 'loc' and 'iloc'
Just like any other pandas Series, you can access categorical data using 'loc' (label-based location) and 'iloc' (integer-based location).
# Accessing the first five 'coat' entries using 'iloc'
print(dogs['coat'].iloc[:5])
Output:
0 short coat
1 medium coat
2 long coat
3 medium coat
4 short coat
Name: coat, dtype: category
Categories (3, object): ['long coat', 'medium coat', 'short coat']
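'loc' also accepts a boolean mask built from the categorical column, which is a common way to pull out rows for one or more categories. The sketch below rebuilds the five sample rows shown earlier as a small DataFrame.

```python
import pandas as pd

# The five sample rows from earlier in the tutorial.
dogs = pd.DataFrame({
    "coat": ["short coat", "medium coat", "long coat", "medium coat", "short coat"],
    "age": [3, 2, 1, 4, 3],
    "weight": [30, 20, 10, 40, 30],
})
dogs["coat"] = dogs["coat"].astype("category")

# Integer-position access to a single value.
print(dogs["coat"].iloc[2])        # long coat

# Label-based access with a boolean mask on the categorical column.
short_ages = dogs.loc[dogs["coat"] == "short coat", "age"]
print(short_ages.tolist())         # [3, 3]

# isin selects several categories at once.
mask = dogs["coat"].isin(["short coat", "medium coat"])
not_long_weights = dogs.loc[mask, "weight"]
print(not_long_weights.tolist())   # [30, 20, 40, 30]
```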
As you can see, categorical data can be manipulated in various ways to meet your specific analytical needs. Understanding how to clean, access, and manipulate categorical data is a key skill for any data scientist.
That brings us to the end of our tutorial. We hope you found it insightful and that it's helped you understand how to handle categorical variables in pandas better. Happy data wrangling!