
Advanced Data Cleaning Tutorial for Data Science


In this tutorial, we will take a deep dive into some of the complex yet essential aspects of data cleaning. We will explore topics such as uniformity, cross-field validation, and handling missing data. Let's get started!


Unit 1: Advanced Data Cleaning Challenges


1.1 Uniformity in data


Uniformity in data is crucial to avoid skewed or incorrect analytical results. Think of it like running a 100-meter race where some athletes are asked to run in yards and others in meters - it just wouldn't be a fair race. Similarly, data needs to be in a uniform format to allow accurate analysis.


# Example dataframe
import pandas as pd

data = {'Temperature': [23, 70, 21, 75, 24, 72]}
df = pd.DataFrame(data)
df

The output of the above code:


   Temperature
0           23
1           70
2           21
3           75
4           24
5           72

Here we have temperature values, but we don't know whether they are in Celsius or Fahrenheit.


1.2 Temperature data inconsistency example


To tackle this issue, let's assume that readings above 50 are in Fahrenheit and the rest are in Celsius, and convert the Fahrenheit readings to Celsius.


# Convert values above 50 (assumed Fahrenheit) to Celsius
df['Temperature'] = df['Temperature'].apply(lambda x: (x - 32) * 5 / 9 if x > 50 else x)
df

The output:


   Temperature
0    23.000000
1    21.111111
2    21.000000
3    23.888889
4    24.000000
5    22.222222

Now, all temperatures are in Celsius, providing a uniform unit of measure.
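A fixed threshold like 50 is a fragile heuristic, since plausible temperatures exist on both sides of it in both units. Where possible, record the unit alongside each reading and convert explicitly. Here is a minimal sketch of that approach, assuming a hypothetical 'Unit' column captured at collection time:


# Hypothetical data with an explicit unit recorded per reading
data = {'Temperature': [23, 70, 21, 75, 24, 72],
        'Unit': ['C', 'F', 'C', 'F', 'C', 'F']}
df_units = pd.DataFrame(data)

# Cast to float so the converted values fit the column's dtype
df_units['Temperature'] = df_units['Temperature'].astype(float)

# Convert only the rows marked Fahrenheit, then standardize the unit label
is_f = df_units['Unit'] == 'F'
df_units.loc[is_f, 'Temperature'] = (df_units.loc[is_f, 'Temperature'] - 32) * 5 / 9
df_units['Unit'] = 'C'
df_units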


1.3 Date data inconsistency example


Date data can often be formatted differently, causing inconsistencies. Imagine you're planning a global virtual meeting, but each attendee notes down the date in their own regional format: some write 'mm/dd/yyyy' while others write 'dd/mm/yyyy'. Confusion is inevitable, so the dates need to be made consistent.


# Example of inconsistent date data
data = {'Date': ['01/02/2023', '02/03/2023', '03/04/2023']}
df = pd.DataFrame(data)

# Converting to uniform datetime format
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df

Output:


        Date
0 2023-02-01
1 2023-03-02
2 2023-04-03

Now all the dates are in a uniform format, preventing any potential confusion.
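Here, dayfirst=True tells pandas to prefer a day-first interpretation when parsing. If you know exactly which format a source uses, it is safer to pass an explicit format string instead of relying on inference; a short sketch, assuming every date arrives as day/month/year:


# Re-parse the raw strings with an explicit format instead of inference
df = pd.DataFrame({'Date': ['01/02/2023', '02/03/2023', '03/04/2023']})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df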


Unit 2: Cross-Field Validation


2.1 Introduction to cross-field validation


Cross-field validation is like a detective verifying an alibi – cross-checking the statements of witnesses against known facts to find inconsistencies. In data science, it involves verifying the relationship between multiple data fields.


2.2 Flight data inconsistency example


Let's say you have a dataset of flights with the columns 'Distance', 'Time' (in hours), and 'Speed'. We know that distance = speed * time, so we can use this relationship to find any inconsistencies.


import numpy as np

data = {'Distance': [500, 1200, 900],
        'Time': [1, 2.5, 1.8],
        'Speed': [450, 520, 490]}
df = pd.DataFrame(data)

# Add a check column that calculates speed from distance and time
df['check_Speed'] = df['Distance'] / df['Time']

# Check whether 'Speed' is within 1% of 'check_Speed'
df['is_valid'] = np.isclose(df['Speed'], df['check_Speed'], rtol=1e-02)
df

Output:

 
   Distance  Time  Speed  check_Speed  is_valid
0       500   1.0    450   500.000000     False
1      1200   2.5    520   480.000000     False
2       900   1.8    490   500.000000     False

Here, we can see that none of our speed values are consistent with distance and time, pointing towards some inconsistencies.
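Once flagged, the inconsistent rows are easy to isolate for inspection or correction:


# Select only the rows that failed the cross-field check
invalid_rows = df[~df['is_valid']]
invalid_rows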


2.3 User data inconsistency example


Consider another case where we have 'Birth year' and 'Age' columns. We can validate this data by comparing the reported age against the age implied by the birth year and the current year.


data = {'Birth year': [1980, 1995, 2000],
        'Age': [40, 25, 20]}
df = pd.DataFrame(data)

# Assuming the current year is 2023
df['check_Age'] = 2023 - df['Birth year']

# Check if the 'Age' column is equal to 'check_Age'
df['is_valid'] = df['Age'] == df['check_Age']
df

Output:

 
   Birth year  Age  check_Age  is_valid
0        1980   40         43     False
1        1995   25         28     False
2        2000   20         23     False


From this, we can see that there are discrepancies between the reported age and the age calculated from the birth year, suggesting a data inconsistency issue.
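In practice, age checks usually need some tolerance: someone born in 1995 could legitimately be 27 or 28 during 2023, depending on whether their birthday has passed. A small sketch that allows one year of slack:


# Treat ages within one year of the implied age as valid
df['is_valid'] = (df['Age'] - df['check_Age']).abs() <= 1
df

All three rows still fail here, since each reported age is three years off the implied value.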


Unit 3: Dealing with Missing Data


3.1 Understanding missing data


Missing data is like missing pieces in a puzzle. When you have a puzzle with missing pieces, the picture is incomplete and can lead to misinterpretations. Similarly, missing data can distort the interpretation and analysis of datasets.


# Example dataframe with missing data
data = {'CO2': [0.03, None, 0.04, 0.03, None, 0.02],
        'Temperature': [22, 23, None, 24, 25, None]}
df = pd.DataFrame(data)
df

Output:


    CO2  Temperature
0  0.03         22.0
1   NaN         23.0
2  0.04          NaN
3  0.03         24.0
4   NaN         25.0
5  0.02          NaN

In this dataset, we can see that there are missing values in both the 'CO2' and 'Temperature' columns.


3.2 CO2 and temperature data example


To visualize this missing data, we can use the matplotlib and missingno libraries. Let's create a matrix plot of the missingness.


import matplotlib.pyplot as plt
import missingno as msno

msno.matrix(df)
plt.show()

This would produce a matrix where the white lines represent missing values.
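Alongside the visual overview, it is often useful to quantify the missingness directly with pandas:


# Count missing values per column, and as a fraction of all rows
print(df.isna().sum())
print(df.isna().mean())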


3.3 Types of missing data


Just as there are different reasons for why a student might miss school (illness, vacation, or truancy), there are different types of missing data:

  1. Missing Completely at Random (MCAR): This is like students missing school randomly due to varied personal reasons. The missingness has no relationship with any values, observed or missing.

  2. Missing at Random (MAR): This is like students from a particular class missing school because they went on a field trip. The missingness has a systematic relationship with other observed data, but not the missing data itself.

  3. Missing Not at Random (MNAR): This is like students who struggle academically being more likely to skip school. The missingness is related to the value of the missing data itself.
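These categories matter because they determine how much bias missingness introduces. The toy simulation below is purely illustrative (it is not part of the tutorial's dataset) and shows that MCAR leaves the observed mean roughly unchanged, while MNAR distorts it:


import numpy as np

rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(70, 10, size=1000))

# MCAR: every value has the same 10% chance of going missing
mcar = scores.mask(rng.random(1000) < 0.10)

# MNAR: low scores are far more likely to go missing
mnar = scores.mask((scores < 60) & (rng.random(1000) < 0.80))

# The MNAR mean is biased upward; the MCAR mean is not
print(scores.mean(), mcar.mean(), mnar.mean())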


3.4 Strategies to handle missing data


There are many ways to deal with missing data, much like there are different ways to handle a student's absence (sending homework, rescheduling exams, etc.). The strategy we choose depends on the type of missingness and the specific dataset and task.


# 1. Dropping rows that contain missing data
df_dropped = df.dropna()

# 2. Imputing missing values with the column mean
df_mean_imputed = df.fillna(df.mean())

# 3. Forward fill: propagate the last valid observation forward
df_forward_fill = df.ffill()

Each method has its pros and cons. Dropping the data is the simplest method, but you lose data. Imputing with the mean is also easy and doesn't lose data, but it can introduce bias. Forward filling propagates the last valid observation forward, which can work well with time-series data but may not be valid for other types of data.
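For numeric time-series data, interpolation is another common middle ground: it estimates each missing value from its neighbors rather than copying one of them. A brief sketch using pandas' built-in linear interpolation:


# 4. Linear interpolation between neighboring valid observations
df_interpolated = df.interpolate()

Like forward filling, this assumes values change smoothly over the index, which may not hold for every dataset.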


By understanding and addressing these advanced data cleaning challenges, you're now better equipped to prepare your data for high-quality, accurate analyses.


Conclusion


Data cleaning is an essential step in the data science process. It might seem tedious, but consider this: building a model on dirty data is like constructing a house on a faulty foundation - sooner or later, it's going to cause problems. Taking the time to ensure your data is uniform, valid, and complete can save you from skewed or inaccurate results down the line. Through understanding and applying these advanced techniques, you can ensure your data is ready for whatever analysis you wish to conduct, providing a strong foundation for your data science endeavors.
