Welcome! This tutorial takes an in-depth look at pandas, a powerful Python package used primarily for data manipulation and analysis. We'll learn how to work with DataFrames, pandas' core object, and perform operations such as sorting, subsetting, and creating new columns. Let's get started!
1. Introduction to Pandas
Pandas is built on top of NumPy, which provides efficient numerical operations on multi-dimensional arrays. For plotting, pandas relies on Matplotlib, a widely used data visualization library that is an optional dependency rather than part of its core. Together, the three packages form a powerful platform for data analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Understanding DataFrames
At the heart of pandas is the DataFrame object—a rectangular grid of data, where each row represents an observation and each column a variable. It's akin to an Excel spreadsheet or a SQL table.
For example, if we have a dataset of dogs, each dog is a row, and each attribute of the dog (such as breed, weight, or height) is a column.
# Creating a DataFrame
data = {'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
'Weight': [30, 15, 5, 20],
'Height': [60, 40, 25, 55]}
df = pd.DataFrame(data)
print(df)
Output:
       Breed  Weight  Height
0   Labrador      30      60
1     Beagle      15      40
2  Chihuahua       5      25
3      Husky      20      55
2.1 Exploring DataFrames
When we first get a DataFrame, we'd want to understand its structure and contents.
2.1.1 The .head() method
The head method returns the first rows of the DataFrame (five by default, or however many we ask for), which is great for a quick peek.
print(df.head(2))
Output:
      Breed  Weight  Height
0  Labrador      30      60
1    Beagle      15      40
2.1.2 The .info() method
info offers a brief overview of the DataFrame: column names, their data types, and the number of non-null entries. It prints the summary itself and returns None, so there is no need to wrap the call in print().
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Breed   4 non-null      object
 1   Weight  4 non-null      int64
 2   Height  4 non-null      int64
dtypes: int64(2), object(1)
memory usage: 224.0+ bytes
2.1.3 The .shape attribute
The shape attribute holds a tuple that specifies the DataFrame's dimensions—number of rows followed by columns.
print(df.shape)
Output:
(4, 3)
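Because shape is an ordinary tuple, it can be unpacked directly. The short sketch below rebuilds the dogs DataFrame from section 2 so that it runs on its own; the names n_rows and n_cols are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

# shape is a plain (rows, columns) tuple, so it unpacks like any other tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of rows
print(len(df))
```

Note that shape is an attribute, not a method, so it is written without parentheses.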
2.1.4 The .describe() method
The describe method computes summary statistics for numerical columns, providing a quick numerical overview of the data.
print(df.describe())
Output:
          Weight     Height
count   4.000000   4.000000
mean   17.500000  45.000000
std    10.408330  15.811388
min     5.000000  25.000000
25%    12.500000  36.250000
50%    17.500000  47.500000
75%    22.500000  56.250000
max    30.000000  60.000000
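To see where the std and percentile rows come from, the sketch below recomputes them by hand for the Weight column. It assumes the pandas defaults, namely a sample standard deviation (ddof=1) and linearly interpolated percentiles; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

weights = pd.Series([30, 15, 5, 20], name='Weight')

# Series.std() divides by n - 1 by default (sample standard deviation)
sample_std = weights.std()

# The same value computed by hand
manual_std = np.sqrt(((weights - weights.mean()) ** 2).sum() / (len(weights) - 1))

# Passing ddof=0 gives the population standard deviation instead
population_std = weights.std(ddof=0)

# Percentiles are linearly interpolated between neighbouring sorted values
first_quartile = weights.quantile(0.25)

print(sample_std, manual_std, population_std, first_quartile)
```

With only four observations, the sample and population standard deviations differ noticeably, so it helps to know which one a summary reports.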
2.2 Components of a DataFrame
A DataFrame consists of three key parts: data values, column labels, and row labels. These are accessible using the .values, .columns, and .index attributes respectively.
print('Values:\n', df.values)
print('Columns:\n', df.columns)
print('Index:\n', df.index)
Output:
Values:
[['Labrador' 30 60]
['Beagle' 15 40]
['Chihuahua' 5 25]
['Husky' 20 55]]
Columns:
Index(['Breed', 'Weight', 'Height'], dtype='object')
Index:
RangeIndex(start=0, stop=4, step=1)
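As a complement, here is a minimal sketch using to_numpy(), which the pandas documentation recommends over .values for getting the underlying array. It also shows that mixing text and numbers forces a generic object array, while a purely numeric selection keeps an integer dtype. The DataFrame is rebuilt so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

# Mixed text and numbers: the array falls back to the generic object dtype
all_values = df.to_numpy()
print(all_values.dtype)

# Numeric columns only: the array keeps an integer dtype
numeric_values = df[['Weight', 'Height']].to_numpy()
print(numeric_values.dtype)

# Column and row labels convert to plain lists when needed
print(list(df.columns))
print(list(df.index))
```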
3. Selecting and Manipulating Data in DataFrames
Once we understand the DataFrame's basic structure, we can perform various operations to extract or manipulate data.
3.1 Selecting Data
Pandas offers several ways to select data: .loc, .iloc, and [].
3.1.1 The .loc indexer
.loc stands for 'location'. It is a label-based indexer, used with square brackets rather than called like a method: we select data by the labels of the rows or columns.
# Selecting a single row
print(df.loc[2])
# Selecting multiple rows
print(df.loc[[1, 3]])
Output:
Breed     Chihuahua
Weight            5
Height           25
Name: 2, dtype: object
    Breed  Weight  Height
1  Beagle      15      40
3   Husky      20      55
3.1.2 The .iloc indexer
.iloc is short for 'integer location'. Unlike .loc, it selects by integer position (counting from 0) rather than by label.
# Selecting a single row
print(df.iloc[2])
# Selecting multiple rows
print(df.iloc[[1, 3]])
Output:
Breed     Chihuahua
Weight            5
Height           25
Name: 2, dtype: object
    Breed  Weight  Height
1  Beagle      15      40
3   Husky      20      55
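Notice that the output is the same as with .loc. That is only because the default index labels (0, 1, 2, 3) happen to coincide with the row positions. The two behave differently whenever the DataFrame's index is not this default sequence of integers.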
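To make the difference concrete, the sketch below moves the Breed column into the index, so labels and positions no longer coincide. The name by_breed is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

# Use the breed names as row labels instead of 0, 1, 2, 3
by_breed = df.set_index('Breed')

# .loc now needs a breed name, while .iloc still takes a position
row_by_label = by_breed.loc['Chihuahua']
row_by_position = by_breed.iloc[2]

# Slices differ too: .loc includes the end label, .iloc excludes the end position
label_slice = by_breed.loc['Beagle':'Husky']
position_slice = by_breed.iloc[1:3]
print(len(label_slice), len(position_slice))

# An integer is no longer a valid label for .loc
try:
    by_breed.loc[2]
except KeyError:
    print('2 is not a label in this index')
```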
3.1.3 The [] operator
In the examples above we used .loc and .iloc to select rows (both can also select columns when given a second argument). For plain column selection, the [] operator is the most direct route.
# Selecting a single column
print(df['Breed'])
# Selecting multiple columns
print(df[['Breed', 'Weight']])
Output:
0     Labrador
1       Beagle
2    Chihuahua
3        Husky
Name: Breed, dtype: object
       Breed  Weight
0   Labrador      30
1     Beagle      15
2  Chihuahua       5
3      Husky      20
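Rows and columns can also be selected in a single step by passing both to .loc or .iloc, separated by a comma. A minimal sketch on the same dogs data follows; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

# Rows 1 and 3, columns 'Breed' and 'Weight', selected by label
subset = df.loc[[1, 3], ['Breed', 'Weight']]
print(subset)

# A single row label and a single column label give one value
chihuahua_weight = df.loc[2, 'Weight']
print(chihuahua_weight)

# All rows, first two columns, selected by position
first_two_columns = df.iloc[:, :2]
print(first_two_columns)
```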
3.2 Filtering Data
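Often, we need to select rows that meet certain conditions. Let's say we want to find dogs that weigh more than 15 kg. The comparison df['Weight'] > 15 produces a Series of True/False values, and indexing the DataFrame with it keeps only the rows marked True.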
filtered_df = df[df['Weight'] > 15]
print(filtered_df)
Output:
      Breed  Weight  Height
0  Labrador      30      60
3     Husky      20      55
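Conditions can be combined as well. The sketch below uses & (and), | (or), and isin() on the same data; each condition is wrapped in parentheses because & and | bind more tightly than comparisons in Python. The thresholds are arbitrary values picked for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

# Both conditions must hold
heavy_and_tall = df[(df['Weight'] > 15) & (df['Height'] > 55)]

# Either condition may hold
small_or_large = df[(df['Weight'] < 10) | (df['Weight'] > 25)]

# Membership in a list of allowed values
chosen_breeds = df[df['Breed'].isin(['Beagle', 'Husky'])]

print(heavy_and_tall)
print(small_or_large)
print(chosen_breeds)
```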
3.3 Creating New Columns
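We can create new columns from existing ones. Let's add a 'BMI' column, computed here simply as Weight / Height. This ratio is a simplification for illustration; the standard body mass index divides weight by height squared.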
df['BMI'] = df['Weight'] / df['Height']
print(df)
Output:
       Breed  Weight  Height       BMI
0   Labrador      30      60  0.500000
1     Beagle      15      40  0.375000
2  Chihuahua       5      25  0.200000
3      Husky      20      55  0.363636
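Assigning with df['BMI'] = ... modifies the DataFrame in place. When the original should stay untouched, assign() is an alternative that returns a new DataFrame with the extra column. A minimal sketch, with with_ratio as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

# assign() returns a copy with the new column; df itself is unchanged
with_ratio = df.assign(BMI=lambda d: d['Weight'] / d['Height'])

print(list(df.columns))
print(list(with_ratio.columns))
```

The lambda receives the DataFrame being extended, which makes assign() convenient inside chains of operations.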
4. Applying Functions to Columns
Often, we need to apply some function to a column or a row of a DataFrame. Pandas provides several methods for this.
4.1 The map() Function
map() applies a function to each element of a Series (a DataFrame column, for instance).
Let's create a function that categorizes dogs into small, medium, and large based on their weight.
def size_category(weight):
    if weight < 10:
        return 'Small'
    elif weight < 20:
        return 'Medium'
    else:
        return 'Large'
Now, let's apply this function to the 'Weight' column and store the result in a new column, 'Size'.
df['Size'] = df['Weight'].map(size_category)
print(df)
Output:
       Breed  Weight  Height       BMI    Size
0   Labrador      30      60  0.500000   Large
1     Beagle      15      40  0.375000  Medium
2  Chihuahua       5      25  0.200000   Small
3      Husky      20      55  0.363636   Large
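For binning numbers into categories, pandas also provides pd.cut, a vectorized alternative to a hand-written function. The sketch below assumes the same thresholds as size_category and checks that both approaches agree. With right=False each bin includes its lower edge, so a weight of exactly 20 counts as 'Large', just as in the function.

```python
import pandas as pd

weights = pd.Series([30, 15, 5, 20], name='Weight')

def size_category(weight):
    if weight < 10:
        return 'Small'
    elif weight < 20:
        return 'Medium'
    else:
        return 'Large'

# Element-wise approach from the text
via_map = weights.map(size_category)

# Bins are [0, 10), [10, 20) and [20, inf) because right=False
via_cut = pd.cut(
    weights,
    bins=[0, 10, 20, float('inf')],
    labels=['Small', 'Medium', 'Large'],
    right=False,
)

print(via_map.tolist())
print(via_cut.astype(str).tolist())
```

One difference to keep in mind is that pd.cut returns a categorical column, whereas map() with this function returns plain strings.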
4.2 The apply() Function
While map() is used for element-wise operations on a Series, apply() is used for applying a function along an axis of the DataFrame (columns or rows).
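Let's create a function that calculates the range of a Series and apply it to each numeric column. The text columns ('Breed' and 'Size') are left out because subtracting one string from another raises a TypeError.
def range_of_series(column):
    return column.max() - column.min()
# Apply the function to each numeric column (axis=0)
print(df[['Weight', 'Height', 'BMI']].apply(range_of_series, axis=0))
Output:
Weight    25.0
Height    35.0
BMI        0.3
dtype: float64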
This tells us the range of weights, heights, and BMIs in our DataFrame.
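Setting axis=1 applies the function to each row instead of each column. The sketch below is a small illustration on the numeric columns only; weight_to_height is a hypothetical helper written for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

def weight_to_height(row):
    # Each row arrives as a Series indexed by the column names
    return row['Weight'] / row['Height']

# axis=1: call the function once per row
ratio_per_row = df.apply(weight_to_height, axis=1)

# axis=0 (the default): call the function once per column
range_per_column = df.apply(lambda column: column.max() - column.min())

print(ratio_per_row)
print(range_per_column)
```

Row-wise apply() calls a Python function for every row, so for simple arithmetic like this a direct column expression such as df['Weight'] / df['Height'] is usually faster.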
4.3 The applymap() Function
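Lastly, applymap() is used to apply a function to each element of the DataFrame. Note that applymap() was deprecated in pandas 2.1 in favor of DataFrame.map(), which is called the same way, so on recent versions prefer map().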
Let's multiply each numeric element in the DataFrame by 2.
df[['Weight', 'Height', 'BMI']] = df[['Weight', 'Height', 'BMI']].applymap(lambda x: x * 2)
print(df)
Output:
       Breed  Weight  Height       BMI    Size
0   Labrador      60     120  1.000000   Large
1     Beagle      30      80  0.750000  Medium
2  Chihuahua      10      50  0.400000   Small
3      Husky      40     110  0.727273   Large
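For plain arithmetic, an element-wise helper is not strictly needed, because operators on a DataFrame already work element by element. The sketch below compares the two approaches. It assumes DataFrame.map is available from pandas 2.1 onward and falls back to applymap on older versions; the name numeric is illustrative.

```python
import pandas as pd

numeric = pd.DataFrame({
    'Weight': [30, 15, 5, 20],
    'Height': [60, 40, 25, 55],
})

# Arithmetic on a DataFrame is already element-wise
doubled = numeric * 2

# For functions without a vectorized form, use an element-wise call.
# DataFrame.map exists from pandas 2.1; older versions only have applymap.
elementwise = numeric.map if hasattr(numeric, 'map') else numeric.applymap
doubled_elementwise = elementwise(lambda x: x * 2)

print(doubled)
print(doubled_elementwise)
```

The vectorized form is typically faster, so the element-wise call is best kept for logic that cannot be expressed with operators or built-in methods.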
This concludes the section on applying functions to columns. In the next section, we will delve into aggregating data in Pandas.