Handling Missing Data in Data Science
Understanding the Impact of Missing Data
Data, akin to the pieces of a jigsaw puzzle, forms the basis of any analytical or predictive model in data science. Missing pieces, or in this case, missing data, can lead to distorted or incorrect pictures. When data is missing, it can affect the distribution and statistical properties of the dataset. For instance, suppose we were analyzing the average height of a group of people, and several entries were missing. The calculated average height would only reflect the people with recorded heights, potentially leading to a skewed result.
Just like an artist might misinterpret the subject of their portrait if certain features are obscured, a data scientist might draw incorrect conclusions if data is missing. Missing data might lead to biased or incorrect results, influencing critical decisions in a business context.
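To make this concrete, here is a minimal sketch using made-up heights; pandas skips missing values when computing a mean, so the result reflects only the recorded entries:
# A small made-up example of how missing values change a computed average
import pandas as pd
import numpy as np
heights = pd.Series([170, 182, np.nan, 165, np.nan, 190])
# pandas ignores NaN by default, so this mean (176.75) is based on only 4 of the 6 people
print(heights.mean())
# Count how many values the mean is actually based on
print(heights.count(), "of", len(heights), "heights recorded")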
Demonstrating Missing Data Impacts with Data Professionals' Dataset
To understand this better, let's dive into a dataset of data professionals. The dataset includes several variables such as the year the data was collected, job title, experience level, type of employment, location, company size, amount of remote work, and salary in US dollars.
# Import the necessary libraries
import pandas as pd
# Load the dataset
df = pd.read_csv('data_professionals.csv')
# Display the first 5 rows of the dataframe
df.head()
Output
   Year  Job Title                  Experience Level  Type of Employment  Location       Company Size  Remote Work   Salary (USD)
0  2023  Data Scientist             Mid-level         Full-Time           San Francisco  100-500       Occasionally        120000
1  2023  Data Analyst               Entry-level       Part-Time           New York       <50           Never                50000
2  2023  Machine Learning Engineer  Senior            Full-Time           Los Angeles    500-1000      Always              150000
3  2023  Data Engineer              Mid-level         Full-Time           Chicago        1000-5000     Never               100000
4  2023  Data Analyst               Entry-level       Full-Time           San Francisco  50-100        Occasionally         70000
Impact of Missing Values on Salary Analysis
Let's now focus on the "Salary (USD)" column, specifically the average salary by experience level. If salary data is missing for many "Senior" level roles, and especially if the highest-paid seniors are the ones not reporting, the calculated average salary for this group could be well below the true average. This is akin to estimating the average age of a group of adults and children, but forgetting to include the adults; the result would be significantly skewed.
# Group data by Experience Level and calculate the mean of the Salary (USD)
df.groupby('Experience Level')['Salary (USD)'].mean()
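The code above simply averages whatever salaries are present. To see why that can mislead us, here is a small made-up example in which the two highest senior salaries were never reported:
# Hypothetical example: the two best-paid seniors did not report their salaries
import pandas as pd
import numpy as np
toy = pd.DataFrame({
    'Experience Level': ['Senior', 'Senior', 'Senior', 'Senior', 'Entry-level', 'Entry-level'],
    'Salary (USD)': [140000, 150000, np.nan, np.nan, 50000, 60000]
})
# Suppose the missing senior salaries were 180000 and 200000: the true senior mean would be 167500,
# but the computed senior mean below is only 145000
print(toy.groupby('Experience Level')['Salary (USD)'].mean())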
Checking for Missing Values with Python's Pandas
We can check for missing values in our dataset using the Pandas library. It's like shining a flashlight into a dark room, looking for any missing furniture. In the code snippet below, isnull().sum() returns the number of missing values for each column.
# Check for missing values
df.isnull().sum()
Strategies to Handle Missing Data
Dealing with missing data is much like dealing with a hole in a boat; you can't ignore it. We have several strategies to handle this situation:
Removing observations: Like removing a rotten apple from a basket to prevent it from affecting the other apples.
Imputation using summary statistics: If a student misses an exam, we might fill in that score with the average score of the class.
Subgroup imputation: If the student who missed the exam was an A student, it might make more sense to fill in their score with the average of other A students, not the entire class. This is an example of subgroup imputation.
And remember, each strategy comes with its own assumptions and impacts on the data distribution; the short sketch below illustrates all three on a toy example.
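Here is a minimal sketch of the three strategies on a toy exam-score table (entirely made-up data), mirroring the student analogy above:
# Toy illustration of the three strategies (hypothetical data)
import pandas as pd
import numpy as np
scores = pd.DataFrame({
    'student_tier': ['A', 'A', 'A', 'B', 'B', 'B'],
    'exam_score': [95, np.nan, 92, 70, 68, np.nan]
})
# 1. Removing observations: drop any row with a missing score
dropped = scores.dropna(subset=['exam_score'])
# 2. Imputation with a summary statistic: fill with the class-wide mean
class_mean_filled = scores['exam_score'].fillna(scores['exam_score'].mean())
# 3. Subgroup imputation: fill with the mean of the student's own tier
tier_mean_filled = scores.groupby('student_tier')['exam_score'].transform(lambda x: x.fillna(x.mean()))
print(dropped)
print(class_mean_filled)
print(tier_mean_filled)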
Implementing Strategies: Dropping and Imputing Missing Values
Let's set a threshold of 20% for missing data. If a column has more than 20% missing values, we'll drop that column, just like how a baker might throw away a batch of dough that is more than 20% undercooked.
# Set the threshold for missing values
threshold = 0.2
# Drop columns that have more than 20% missing values
# (i.e. keep only columns with at least 80% non-missing values)
df = df.dropna(thresh=int(len(df) * (1 - threshold)), axis=1)
Now, we'll impute the remaining missing values in the numeric columns with the median of each column, like filling in the holes of a Swiss cheese with more cheese.
# Fill missing values in numeric columns with the median of each column
df = df.fillna(df.median(numeric_only=True))
Let's check for any remaining missing values. The room should be well-lit now with our flashlight.
# Check for missing values
df.isnull().sum()
Imputing Missing Values by Sub-group
Now, let's try the third strategy, subgroup imputation. We'll group the data by 'Experience Level' and calculate the median salary for each group. It's like splitting students into different groups (freshman, sophomore, junior, senior) and using the average score of each group to fill in missing test scores.
# Group by 'Experience Level' and impute missing values in 'Salary (USD)' with the median salary of each group
df['Salary (USD)'] = df.groupby('Experience Level')['Salary (USD)'].transform(lambda x: x.fillna(x.median()))
Finally, let's do a last check for missing values. With all the strategies we've applied, the missing data should be well handled.
# Check for missing values
df.isnull().sum()
Working with Categorical Data
Overview of Working with Categorical Data
Data science isn't always about numbers. Often, we encounter non-numeric data, such as words, categories, or yes/no responses. This type of data is termed 'categorical data', and in the data professional dataset, 'Job Title' is one such example. If we imagine our data as a bookstore, then dealing with categorical data is like arranging books by genres.
Exploring Job Titles in the Dataset
Let's delve into the 'Job Title' column of our dataset. Imagine this column as a bag full of job title tags. We can start by checking the frequency of each tag or counting the number of unique job titles.
# Count the frequency of each job title
job_freq = df['Job Title'].value_counts()
# Print the job frequency
print(job_freq)
# Count the unique job titles
unique_jobs = df['Job Title'].nunique()
# Print the number of unique job titles
print(f"Number of unique job titles: {unique_jobs}")
Extracting Information from Categorical Data
Sometimes we might be interested in finding specific categories. Let's use pandas' str.contains() method on the series to search for job titles that include 'Data Scientist'. It's similar to using a search bar in an online store to find specific items.
# Filter for job titles containing 'Data Scientist'
df_data_scientist = df[df['Job Title'].str.contains('Data Scientist')]
# Display the first 5 rows of the filtered dataframe
df_data_scientist.head()
Filtering Rows Containing One or More Phrases
Often, we might want to filter for rows containing one of several phrases. For instance, we might be interested in finding 'Data Scientist' or 'Machine Learning Engineer' roles. Here, the pipe symbol (|) acts like an 'or' operator. It's like searching for mystery or thriller books in an online bookstore.
# Filter for job titles containing 'Data Scientist' or 'Machine Learning Engineer'
df_selected_jobs = df[df['Job Title'].str.contains('Data Scientist|Machine Learning Engineer')]
# Display the first 5 rows of the filtered dataframe
df_selected_jobs.head()
Categorizing Job Titles
Now, let's create some categories of data roles and assign them to a new column in the DataFrame. It's like creating sub-genres for our books.
# Define function to categorize job titles
def categorize_job(title):
    if 'data scientist' in title.lower():
        return 'Data Scientist'
    elif 'machine learning engineer' in title.lower():
        return 'Machine Learning Engineer'
    elif 'data analyst' in title.lower():
        return 'Data Analyst'
    elif 'data engineer' in title.lower():
        return 'Data Engineer'
    else:
        return 'Other'
# Create a new column 'Job Category'
df['Job Category'] = df['Job Title'].apply(categorize_job)
# Display the first 5 rows of the dataframe
df.head()
Visualizing Categorical Data
Let's visualize our newly created job categories using Seaborn's countplot. A countplot can be thought of as a histogram across a categorical, instead of quantitative, variable. It's like creating a bar chart of book genres to see which ones are most popular.
# Import Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
# Create a countplot of 'Job Category'
sns.countplot(data=df, y='Job Category')
plt.title('Count of Job Categories')
plt.show()
In this part, we learned how to handle, extract information from, categorize, and visualize categorical data. We will delve into numeric data in the next part.
Working with Numeric Data
Overview of Working with Numeric Data
As we've worked through the categorical data, it's time to steer our ship into the ocean of numeric data. In our dataset, the year and the salary columns are examples of numeric data. If we revisit our bookstore analogy, numeric data is akin to the page count or the publication year of a book.
Converting Strings to Numbers
The raw salary data also comes as a 'Salary' column stored as strings in Indian rupees, which makes it tough to perform calculations. Let's clean this up, convert the values to a numeric format, and express them in US dollars (assuming 1 USD = 74.85 INR as of the time of this tutorial). It's like converting the price of a book from one currency to another so that it's easier to compare prices across different countries.
# Convert salary to numeric and to USD
df['Salary_USD'] = df['Salary'].str.replace(',', '').str.replace('INR', '').astype(float) / 74.85
# Display the first 5 rows of the dataframe
df.head()
Adding Summary Statistics into a DataFrame
Pandas' groupby function is powerful when we want to calculate summary statistics. Let's say we want to calculate the average salary for each experience level, and also attach the standard deviation of salaries as a new column in our dataframe. Adding that column is like adding a 'Reader's Rating' section to a bookstore's book description.
# Calculate average salary by 'Experience Level'
df_exp_salary = df.groupby('Experience Level')['Salary_USD'].mean()
# Create a new column 'Salary_Stdev' containing the standard deviation of salaries within each experience level
df['Salary_Stdev'] = df.groupby('Experience Level')['Salary_USD'].transform('std')
# Display the average salary by experience level and the updated dataframe
print(df_exp_salary)
df.head()
Exploring Data with Multiple Columns
Let's print the combinations of values for experience level and job category using the value_counts method. It is like finding the most common combinations of book genre and author.
# Print combinations of 'Experience Level' and 'Job Category'
exp_job_combinations = df[['Experience Level', 'Job Category']].value_counts()
# Display the combinations
exp_job_combinations.head()
Also, we can add a column for the median salary based on company size. It's like estimating the average price of books published by different publishers.
# Calculate median salary by 'Company Size'
df['Median_Salary_By_Size'] = df.groupby('Company Size')['Salary_USD'].transform('median')
# Display the dataframe
df.head()
Through this part, we've turned the messy strings into numbers, created summary statistics, and explored data with multiple columns. Up next, we'll dive into handling outliers, which are equivalent to the unique, standout books in our bookstore.
Handling Outliers
Overview of Outliers
An outlier is a data point that is distant from other similar points. They could be due to variability in the data or errors. Outliers are like that one high-priced rare antique book in a store filled with moderately priced novels - they stand out from the crowd and can skew our analysis if not treated appropriately.
Identifying Outliers with Box Plots
A commonly used graphical method for outlier detection is the box plot. We will create a box plot for the 'Salary_USD' column. This is like looking at all book prices and noticing that most are under $30, but there's one priced at $2000.
# Import the plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Create a box plot of salaries
sns.boxplot(x=df['Salary_USD'])
# Display the plot
plt.show()
In this plot, the points displayed as individual dots beyond the whiskers are the outliers.
IQR Score
The interquartile range (IQR) is a measure of statistical dispersion, calculated as Q3 - Q1, the difference between the 75th and 25th percentiles. A common rule flags values more than 1.5 times the IQR below Q1 or above Q3 as outliers. In our context, an outlier would be a salary that is significantly higher or lower than the rest of the salaries.
# Calculate the first and third quartiles and the interquartile range
Q1 = df['Salary_USD'].quantile(0.25)
Q3 = df['Salary_USD'].quantile(0.75)
IQR = Q3 - Q1
# Define the limits
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
# Identify the outliers
outliers = df[(df['Salary_USD'] < lower_limit) | (df['Salary_USD'] > upper_limit)]
outliers
Handling Outliers
Once we've identified the outliers, we need to decide what to do with them. We can either delete these observations or we can cap them at the upper_limit or lower_limit. It's like deciding whether to remove that $2000 book from the store or change its price to match the other books in the store.
# Option 1: Removing Outliers
df_no_outliers = df[(df['Salary_USD'] > lower_limit) & (df['Salary_USD'] < upper_limit)]
# Option 2: Capping Outliers
df_capped = df.copy()
df_capped['Salary_USD'] = df_capped['Salary_USD'].apply(lambda x: upper_limit if x>upper_limit else lower_limit if x<lower_limit else x)
# Display the new dataframes
df_no_outliers.head()
df_capped.head()
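As a brief aside on the capping option, pandas' built-in clip method produces the same result more concisely; a short sketch, assuming lower_limit and upper_limit are defined as above:
# Equivalent capping using Series.clip, which bounds values to [lower_limit, upper_limit]
df_capped['Salary_USD'] = df['Salary_USD'].clip(lower=lower_limit, upper=upper_limit)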
Outliers can affect the outcome of an analysis significantly. However, it's not always the right choice to remove them - they could represent valuable information. Hence, handle them with care.
And there we have it - from understanding missing values and categorical data to handling numeric data and outliers, we have covered the main steps in data preprocessing. Being a good data scientist is like being a good bookstore owner who knows how to organize books, spot misprints, and treat special editions with care. Happy data analyzing!