
Mastering Pandas for Data Manipulation and Visualization



Welcome! This tutorial provides an in-depth dive into pandas, a powerful Python package used primarily for data manipulation and visualization. Here, we'll learn how to handle DataFrames, pandas' core object, and perform various operations including sorting, subsetting, and creating new columns. Let's get started!


1. Introduction to Pandas


Pandas is built on top of NumPy, which provides efficient numerical operations on multi-dimensional arrays, and it relies on Matplotlib for its built-in plotting methods. Together, these two packages give pandas a solid foundation for data analysis and visualization.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


2. Understanding DataFrames


At the heart of pandas is the DataFrame object—a rectangular grid of data, where each row represents an observation and each column a variable. It's akin to an Excel spreadsheet or a SQL table.


For example, if we have a dataset of dogs, each dog is a row, and each attribute of the dog (such as breed, weight, or height) is a column.


# Creating a DataFrame
data = {'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
        'Weight': [30, 15, 5, 20],
        'Height': [60, 40, 25, 55]}
df = pd.DataFrame(data)
print(df)


Output:


       Breed  Weight  Height
0   Labrador      30      60
1     Beagle      15      40
2  Chihuahua       5      25
3      Husky      20      55
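The same table can also be written row by row. As a minimal sketch (the variable names records and df_from_records are chosen here for illustration), a list with one dictionary per dog gives an identical DataFrame:

```python
import pandas as pd

# Column-oriented input, as in the example above: one list per column.
data = {'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
        'Weight': [30, 15, 5, 20],
        'Height': [60, 40, 25, 55]}
df = pd.DataFrame(data)

# Row-oriented input: one dictionary per observation (per dog).
records = [
    {'Breed': 'Labrador', 'Weight': 30, 'Height': 60},
    {'Breed': 'Beagle', 'Weight': 15, 'Height': 40},
    {'Breed': 'Chihuahua', 'Weight': 5, 'Height': 25},
    {'Breed': 'Husky', 'Weight': 20, 'Height': 55},
]
df_from_records = pd.DataFrame(records)

# Both constructions hold the same values in the same columns.
print(df_from_records.equals(df))
```

Which form is more convenient usually depends on how the raw data arrives: column lists suit data that is already grouped by variable, while records suit data that is read one observation at a time.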


2.1 Exploring DataFrames


When we first get a DataFrame, we want to understand its structure and contents.


2.1.1 The .head() method


The head method displays the first rows of the DataFrame (five by default, or as many as we pass as an argument), which is handy for a quick peek.


print(df.head(2))


Output:


      Breed  Weight  Height
0  Labrador      30      60
1    Beagle      15      40
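For comparison, here is a small sketch of head without an argument, together with its counterpart tail, which returns rows from the end. The DataFrame is rebuilt so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# With no argument, head() returns up to five rows.
# Our DataFrame has only four, so all of them come back.
first_rows = df.head()
print(first_rows)

# tail(n) returns the last n rows instead of the first n.
last_row = df.tail(1)
print(last_row)
```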


2.1.2 The .info() method


info offers a brief overview of the DataFrame—displaying column names, their data types, and the number of non-null entries.


# info() prints its report itself and returns None,
# so it does not need to be wrapped in print().
df.info()

Output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Breed   4 non-null      object
 1   Weight  4 non-null      int64
 2   Height  4 non-null      int64
dtypes: int64(2), object(1)
memory usage: 224.0+ bytes
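When we only need one part of that report, the same facts are available as regular pandas objects. A minimal sketch, assuming the dogs DataFrame from above (rebuilt here so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# Data type of each column, returned as a Series indexed by column name.
column_types = df.dtypes
print(column_types)

# Number of non-null entries per column.
non_null_counts = df.notna().sum()
print(non_null_counts)
```

Unlike the printed report from info, these results can be stored in variables and used in later code.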


2.1.3 The .shape attribute


The shape attribute holds a tuple that specifies the DataFrame's dimensions—number of rows followed by columns.


print(df.shape)

Output:


(4, 3)
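Because shape is an ordinary tuple, it can be unpacked into separate variables. The following sketch (variable names n_rows and n_cols are illustrative) also shows two related shortcuts:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of rows.
print(len(df))

# size is the total number of cells: rows times columns.
print(df.size)
```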


2.1.4 The .describe() method


The describe method computes summary statistics for numerical columns, providing a quick numerical overview of the data.


print(df.describe())


Output:


         Weight     Height
count   4.00000   4.000000
mean   17.50000  45.000000
std    10.40833  15.811388
min     5.00000  25.000000
25%    12.50000  36.250000
50%    17.50000  47.500000
75%    22.50000  56.250000
max    30.00000  60.000000
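Each row of that summary corresponds to a method we can also call on a single column. A short sketch for the Weight column, assuming the same dogs data:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

weight = df['Weight']

# Arithmetic mean of the four weights.
weight_mean = weight.mean()

# Sample standard deviation (ddof=1), the same convention describe() uses.
weight_std = weight.std()

# First quartile, using pandas' default linear interpolation.
weight_q1 = weight.quantile(0.25)

print(weight_mean, weight_std, weight_q1)

# describe() returns a DataFrame, so single statistics can be looked up by label.
summary = df.describe()
print(summary.loc['mean', 'Height'])
```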


2.2 Components of a DataFrame


A DataFrame consists of three key parts: data values, column labels, and row labels. These are accessible using the .values, .columns, and .index attributes respectively.


print('Values:\n', df.values)
print('Columns:\n', df.columns)
print('Index:\n', df.index)

Output:


Values:
 [['Labrador' 30 60]
 ['Beagle' 15 40]
 ['Chihuahua' 5 25]
 ['Husky' 20 55]]
Columns:
 Index(['Breed', 'Weight', 'Height'], dtype='object')
Index:
 RangeIndex(start=0, stop=4, step=1)
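These three components can also be converted to plain NumPy and Python objects. The sketch below uses to_numpy(), which the pandas documentation recommends over .values, and tolist() for the labels:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# Data values as a two-dimensional NumPy array.
values = df.to_numpy()
print(values.shape)

# Column labels and row labels as plain Python lists.
column_names = df.columns.tolist()
row_labels = df.index.tolist()
print(column_names)
print(row_labels)
```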


3. Selecting and Manipulating Data in DataFrames


Once we understand the DataFrame's basic structure, we can perform various operations to extract or manipulate data.


3.1 Selecting Data


Pandas offers several ways to select data: .loc, .iloc, and [].


3.1.1 The .loc indexer


.loc is short for 'location'. It performs label-based indexing: we pass the labels of the rows (and, optionally, of the columns) we want to select. Note that it is used with square brackets, not called like a method.


# Selecting a single row
print(df.loc[2])

# Selecting multiple rows
print(df.loc[[1, 3]])

Output:


Breed     Chihuahua
Weight            5
Height           25
Name: 2, dtype: object

    Breed  Weight  Height
1  Beagle      15      40
3   Husky      20      55


3.1.2 The .iloc indexer


.iloc is short for 'integer location'. Unlike .loc, it selects rows by their integer position (starting at 0), regardless of their labels.


# Selecting a single row
print(df.iloc[2])

# Selecting multiple rows
print(df.iloc[[1, 3]])

Output:


Breed     Chihuahua
Weight            5
Height           25
Name: 2, dtype: object

    Breed  Weight  Height
1  Beagle      15      40
3   Husky      20      55


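Notice that the output is the same as with .loc in this case, because the default row labels 0 to 3 happen to match the row positions. The two behave differently whenever the index labels differ from the positions, for example when the index holds strings or when the rows have been reordered.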
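To see the difference, here is a minimal sketch in which the breed names become the row labels. The name df_by_breed is introduced only for this illustration:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# Use the breed names as row labels instead of 0, 1, 2, 3.
df_by_breed = df.set_index('Breed')

# .loc now expects a breed name, because those are the labels.
by_label = df_by_breed.loc['Chihuahua']
print(by_label)

# .iloc still expects a position; the Chihuahua is the third row (position 2).
by_position = df_by_breed.iloc[2]
print(by_position)

# Passing a position to .loc fails, since 2 is not one of the labels.
try:
    df_by_breed.loc[2]
except KeyError:
    print('KeyError: 2 is not a row label')
```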


3.1.3 The [] operator


So far we have used .loc and .iloc to select rows, although both can also select columns when given a second argument. To select columns directly by name, we use the [] operator.


# Selecting a single column
print(df['Breed'])

# Selecting multiple columns
print(df[['Breed', 'Weight']])

Output:


0     Labrador
1       Beagle
2    Chihuahua
3        Husky
Name: Breed, dtype: object

       Breed  Weight
0   Labrador      30
1     Beagle      15
2  Chihuahua       5
3      Husky      20
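Two details are worth a quick sketch: single brackets return a Series while double brackets return a DataFrame, and .loc and .iloc accept a second argument for columns, so rows and columns can be selected in one step:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# A single column name returns a one-dimensional Series.
breed_series = df['Breed']
print(type(breed_series))

# A list of column names returns a DataFrame, even with one name in the list.
breed_frame = df[['Breed']]
print(type(breed_frame))

# .loc with row labels and column labels together.
subset = df.loc[[1, 3], ['Breed', 'Weight']]
print(subset)

# .iloc with a row position and a column position: a single cell.
first_cell = df.iloc[0, 0]
print(first_cell)
```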


3.2 Filtering Data


Often, we need to select rows that meet certain conditions. Let's say we want to find dogs that weigh more than 15 kg.


filtered_df = df[df['Weight'] > 15]
print(filtered_df)

Output:


      Breed  Weight  Height
0  Labrador      30      60
3     Husky      20      55
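Conditions can be combined. The sketch below uses & for "and", | for "or", and isin for membership in a list; each condition needs its own parentheses because & and | bind more tightly than the comparison operators:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# Both conditions must hold: heavier than 15 AND taller than 55.
heavy_and_tall = df[(df['Weight'] > 15) & (df['Height'] > 55)]
print(heavy_and_tall)

# At least one condition must hold: lighter than 10 OR shorter than 45.
light_or_short = df[(df['Weight'] < 10) | (df['Height'] < 45)]
print(light_or_short)

# Keep rows whose breed appears in a given list.
selected = df[df['Breed'].isin(['Beagle', 'Husky'])]
print(selected)
```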


3.3 Creating New Columns


We can create new columns from existing ones. Let's create a 'BMI' column, computed here simply as Weight/Height (a simplified ratio for illustration, not the standard body mass index formula).


df['BMI'] = df['Weight'] / df['Height']
print(df)

Output:


       Breed  Weight  Height       BMI
0   Labrador      30      60  0.500000
1     Beagle      15      40  0.375000
2  Chihuahua       5      25  0.200000
3      Husky      20      55  0.363636
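Assigning with df['BMI'] = ... changes df in place. As a sketch of an alternative, the assign method returns a new DataFrame and leaves the original untouched. The column name WeightHeightRatio is a hypothetical name chosen for this illustration:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

# assign() returns a copy with the extra column; df itself is not modified.
df_with_ratio = df.assign(WeightHeightRatio=df['Weight'] / df['Height'])
print(df_with_ratio)

# The original DataFrame still has only its three columns.
print(df.columns.tolist())
```

This style can be convenient when several transformations are chained, since no intermediate DataFrame is altered along the way.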


4. Applying Functions to Columns


Often, we need to apply some function to a column or a row of a DataFrame. Pandas provides several methods for this.


4.1 The map() Function


map() applies a function to each element of a Series (a DataFrame column, for instance).


Let's create a function that categorizes dogs into small, medium, and large based on their weight.


def size_category(weight):
    if weight < 10:
        return 'Small'
    elif weight < 20:
        return 'Medium'
    else:
        return 'Large'


Now, let's apply this function to the 'Weight' column and store the result in a new column, 'Size'.


df['Size'] = df['Weight'].map(size_category)
print(df)

Output:


       Breed  Weight  Height       BMI    Size
0   Labrador      30      60  0.500000   Large
1     Beagle      15      40  0.375000  Medium
2  Chihuahua       5      25  0.200000   Small
3      Husky      20      55  0.363636   Large
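For binning like this, pandas also offers pd.cut, which assigns each value to an interval. The sketch below is an alternative to the map approach, not a replacement the tutorial prescribes; the bin edges are chosen to mirror size_category, and right=False makes each bin include its left edge, so a weight of exactly 20 falls in the last bin:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

def size_category(weight):
    if weight < 10:
        return 'Small'
    elif weight < 20:
        return 'Medium'
    else:
        return 'Large'

# Element-wise approach from the tutorial.
sizes_mapped = df['Weight'].map(size_category)

# Interval-based approach: [0, 10), [10, 20), [20, infinity).
sizes_cut = pd.cut(df['Weight'],
                   bins=[0, 10, 20, float('inf')],
                   labels=['Small', 'Medium', 'Large'],
                   right=False)

# Both approaches assign the same category to every dog.
print(sizes_cut.astype(str).tolist() == sizes_mapped.tolist())
```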


4.2 The apply() Function


While map() is used for element-wise operations on a Series, apply() is used for applying a function along an axis of the DataFrame (columns or rows).


Let's create a function that calculates the range of a Series and apply it to each column.


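def range_of_series(column):
    return column.max() - column.min()

# Apply the function to each numeric column (axis=0).
# The text columns 'Breed' and 'Size' are left out because
# subtracting one string from another raises a TypeError.
print(df[['Weight', 'Height', 'BMI']].apply(range_of_series, axis=0))

Output:


Weight    25.0
Height    35.0
BMI        0.3
dtype: float64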


This tells us the range of weights, heights, and BMIs in our DataFrame.
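With axis=1 the function receives one row at a time instead of one column, which is useful when a result depends on several columns. A minimal sketch, where describe_dog is a hypothetical helper written for this example:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

def describe_dog(row):
    # 'row' is a Series whose labels are the column names.
    return f"{row['Breed']}: weight {row['Weight']}, height {row['Height']}"

# axis=1 passes each row to the function and collects the results in a Series.
labels = df.apply(describe_dog, axis=1)
print(labels.tolist())
```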


4.3 The applymap() Function


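Lastly, applymap() applies a function to each element of a DataFrame. In pandas 2.1 and later the same operation is available as DataFrame.map(), and applymap() is deprecated in its favor.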

Let's multiply each numeric element in the DataFrame by 2.


df[['Weight', 'Height', 'BMI']] = df[['Weight', 'Height', 'BMI']].applymap(lambda x: x * 2)
print(df)

Output:


       Breed  Weight  Height       BMI    Size
0   Labrador      60     120  1.000000   Large
1     Beagle      30      80  0.750000  Medium
2  Chihuahua      10      50  0.400000   Small
3      Husky      40     110  0.727273   Large
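For simple arithmetic such as doubling, an element-wise function is not strictly needed, because pandas applies arithmetic operators to whole columns at once. The sketch below compares that vectorized form with the element-wise form; it picks DataFrame.map when the installed pandas provides it and falls back to applymap otherwise. It starts from the original, undoubled data so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'Breed': ['Labrador', 'Beagle', 'Chihuahua', 'Husky'],
                   'Weight': [30, 15, 5, 20],
                   'Height': [60, 40, 25, 55]})

numeric = df[['Weight', 'Height']]

# Vectorized: the operator is applied to every element of the selection.
doubled_vectorized = numeric * 2

# Element-wise: use DataFrame.map where available, otherwise applymap.
elementwise = numeric.map if hasattr(numeric, 'map') else numeric.applymap
doubled_elementwise = elementwise(lambda x: x * 2)

# Both routes produce the same values.
print((doubled_vectorized == doubled_elementwise).all().all())
```

The vectorized form is generally the faster of the two, so the element-wise form is best kept for functions that cannot be expressed with column-level operations.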


This concludes the section on applying functions to columns. In the next section, we will delve into aggregating data in Pandas.
