
Dive into Summary Statistics, Counting Techniques, and Grouped Summary Statistics


Python is an incredibly versatile language, favored by data scientists worldwide due to its simplicity and powerful libraries that enable high-level data analysis and manipulation. This tutorial is dedicated to showcasing some of the Python techniques used in the field of data science, particularly focusing on Summary Statistics, Counting Techniques, and Grouped Summary Statistics using pandas, a powerful data manipulation library. Let's dive in!


I. Working with Summary Statistics


1. Basics of Summary Statistics


Summary statistics are an essential part of data analysis. Just as a book summary gives you a brief understanding of a story, summary statistics provide a compressed view of your dataset, telling you the vital details without you needing to examine each data point.


2. Numerical Data Summary


While diving into a new dataset, you might want to know the average (mean) of a numeric column. Imagine you're an athlete who has recorded their running times for several months and wants to know their average run time. In Python, this can be achieved with ease using pandas.


Let's start by importing pandas and creating a hypothetical DataFrame.


import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'Run Time': [23, 21, 22, 25, 24, 23, 21, 20, 24, 22]})

# Calculate mean
mean_run_time = df['Run Time'].mean()
print(f"Average Run Time: {mean_run_time}")

This will return:


Average Run Time: 22.5


Similar to the mean, you can calculate other summary statistics such as the median, mode, minimum, maximum, variance, standard deviation, sum, and quantiles using the median(), mode(), min(), max(), var(), std(), sum(), and quantile() methods respectively on a DataFrame column.
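As a quick sketch, here are a few of these methods applied to the same Run Time data from above:

```python
import pandas as pd

df = pd.DataFrame({'Run Time': [23, 21, 22, 25, 24, 23, 21, 20, 24, 22]})

# Other built-in summary statistics on the same column
print(df['Run Time'].median())       # middle value of the sorted run times
print(df['Run Time'].min(), df['Run Time'].max())
print(df['Run Time'].std())          # sample standard deviation
print(df['Run Time'].quantile(0.5))  # the 0.5 quantile equals the median
```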


3. Summary Statistics for Dates


Summary statistics are not just limited to numerical data; they can also be applied to date columns. For instance, if you want to know the earliest and latest dates in your dataset, you can use the min() and max() methods.


Assuming we have a DataFrame containing birth dates:


# Create a sample DataFrame with dates
df = pd.DataFrame({
    'Birth Date': pd.to_datetime(['2000-02-10', '1995-04-15', '1998-01-06', '2001-07-23', '1997-12-05'])
})

# Calculate minimum (earliest) and maximum (latest) date
min_date = df['Birth Date'].min()
max_date = df['Birth Date'].max()

print(f"Earliest Birth Date: {min_date}")
print(f"Latest Birth Date: {max_date}")

This will output:


Earliest Birth Date: 1995-04-15 00:00:00
Latest Birth Date: 2001-07-23 00:00:00
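A handy consequence of datetime support is that dates can be subtracted directly, yielding a Timedelta. A small illustration on the same birth-date data:

```python
import pandas as pd

df = pd.DataFrame({
    'Birth Date': pd.to_datetime(['2000-02-10', '1995-04-15', '1998-01-06',
                                  '2001-07-23', '1997-12-05'])
})

# Subtracting two datetimes yields a Timedelta
span = df['Birth Date'].max() - df['Birth Date'].min()
print(f"Days between earliest and latest birth date: {span.days}")
```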


4. The .agg() Method


The agg() method is a more advanced feature in pandas, allowing us to apply custom functions for summary statistics. Imagine you want to find the 30th percentile of running times. This isn't a standard function, but we can create our own and apply it using agg().


import numpy as np

# Define a function for 30th percentile
def pct30(column):
    return np.percentile(column, 30)

# Apply the function
thirtieth_percentile = df['Run Time'].agg(pct30)
print(f"30th Percentile of Run Time: {thirtieth_percentile}")

This will return:


30th Percentile of Run Time: 21.7


5. Applying Summaries on Multiple Columns


The agg() method also works on multiple columns. Imagine our DataFrame has an additional column 'Heart Rate', and we want the 30th percentile for both 'Run Time' and 'Heart Rate'. We can do so by selecting the columns and applying agg().


# Add a Heart Rate column
df['Heart Rate'] = [150, 152, 154, 158, 156, 150, 151, 152, 155, 153]

# Apply the function on multiple columns
multi_column_percentile = df[['Run Time', 'Heart Rate']].agg(pct30)
print(multi_column_percentile)

This will return:


Run Time     21.7
Heart Rate  151.7
dtype: float64
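You can even apply a different statistic to each column by passing agg() a dictionary mapping column names to aggregations. A quick sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Run Time': [23, 21, 22, 25, 24, 23, 21, 20, 24, 22],
    'Heart Rate': [150, 152, 154, 158, 156, 150, 151, 152, 155, 153],
})

# A dictionary maps each column to its own aggregation
per_column = df.agg({'Run Time': 'mean', 'Heart Rate': 'max'})
print(per_column)
```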


6. Calculating Multiple Summaries


You can use agg() to calculate multiple summary statistics simultaneously. Let's say you want to calculate the 30th and 40th percentiles. You can pass a list of functions into agg() to achieve this.


# Define a function for 40th percentile
def pct40(column):
    return np.percentile(column, 40)

# Apply multiple functions
multi_function_percentile = df['Run Time'].agg([pct30, pct40])
print(multi_function_percentile)

This will return:


pct30    21.7
pct40    22.0
Name: Run Time, dtype: float64


7. Understanding Cumulative Sum


Cumulative statistics are another useful type of summary statistic. A cumulative sum, for instance, is a running total: each element is the sum of all values up to and including that point.


# Calculate cumulative sum of Run Time
cumulative_sum = df['Run Time'].cumsum()
print(cumulative_sum)


This returns a Series of numbers where each number is the sum of the 'Run Time' up to that point.


8. Other Cumulative Statistics


Similarly, pandas provides methods for other cumulative statistics like cummax(), cummin(), and cumprod() for cumulative maximum, minimum, and product respectively. You can experiment with these methods on your DataFrame.
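As a brief sketch on the Run Time data, two of these running statistics look like this:

```python
import pandas as pd

df = pd.DataFrame({'Run Time': [23, 21, 22, 25, 24, 23, 21, 20, 24, 22]})

# Running maximum: the highest value seen so far at each position
print(df['Run Time'].cummax())

# Running minimum: the lowest value seen so far at each position
print(df['Run Time'].cummin())
```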

This completes our first section on summary statistics. Let's move on to the next section: Counting Techniques.


II. Counting Techniques


Counting techniques are extremely useful when working with categorical data. Just like a headcount provides an overview of the people in a room, these techniques provide a summary of categorical data.


1. Counting Unique Values


In data analysis, you often need to count how many times each distinct value appears in a column. Imagine you're a botanist recording the types of plants observed in a garden. The pandas method value_counts() provides exactly this.


# Create a sample DataFrame with plant types
df = pd.DataFrame({'Plant Type': ['Rose', 'Lily', 'Rose', 'Daisy', 'Tulip', 'Rose', 'Lily', 'Daisy']})

# Count unique values
unique_plant_counts = df['Plant Type'].value_counts()
print(unique_plant_counts)

This will return a Series with the counts of each unique plant type:


Rose     3
Daisy    2
Lily     2
Tulip    1
Name: Plant Type, dtype: int64
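value_counts() also takes a normalize parameter, which converts raw counts into proportions of the total. A sketch on the same plant data:

```python
import pandas as pd

df = pd.DataFrame({'Plant Type': ['Rose', 'Lily', 'Rose', 'Daisy',
                                  'Tulip', 'Rose', 'Lily', 'Daisy']})

# normalize=True converts counts into proportions of the total
proportions = df['Plant Type'].value_counts(normalize=True)
print(proportions)
```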

2. Counting Missing Values


Data often comes with missing values, and knowing the count of these missing values is crucial in data cleaning and preprocessing. We can use the isna() method in conjunction with sum() to achieve this.


# Add some missing values to our DataFrame
df = pd.DataFrame({
    'Plant Type': ['Rose', 'Lily', None, 'Daisy', 'Tulip', 'Rose', 'Lily', None],
    'Height': [20, 30, 35, None, 25, 22, None, 28]
})

# Count missing values
missing_plant_types = df['Plant Type'].isna().sum()
missing_heights = df['Height'].isna().sum()

print(f"Missing Plant Types: {missing_plant_types}")
print(f"Missing Heights: {missing_heights}")

This will print:


Missing Plant Types: 2
Missing Heights: 2
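Instead of checking each column separately, you can call isna().sum() on the whole DataFrame to get the missing count for every column at once:

```python
import pandas as pd

df = pd.DataFrame({
    'Plant Type': ['Rose', 'Lily', None, 'Daisy', 'Tulip', 'Rose', 'Lily', None],
    'Height': [20, 30, 35, None, 25, 22, None, 28]
})

# isna().sum() on the whole DataFrame counts missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```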


3. Grouped Counts


Often, we want to know the count of values but grouped by another column. Let's say you want to count the plant types but grouped by another column 'Garden'. We can do this using groupby() and value_counts().


# Add a Garden column
df['Garden'] = ['A', 'A', 'B', 'A', 'B', 'B', 'A', 'B']

# Grouped count
grouped_counts = df.groupby('Garden')['Plant Type'].value_counts()
print(grouped_counts)

This will give us counts of each plant type grouped by garden. Note that rows where 'Plant Type' is missing are excluded from the counts:


Garden  Plant Type
A       Lily          2
        Daisy         1
        Rose          1
B       Rose          1
        Tulip         1
Name: Plant Type, dtype: int64


4. Cross-Tabulation


A cross-tabulation (or crosstab) is a two-dimensional (or higher) table that records the frequency of observations falling into each combination of categories given by the row and column labels. A crosstab can be viewed as a type of grouped count.


# Create a crosstab
crosstab = pd.crosstab(df['Garden'], df['Plant Type'])
print(crosstab)

This will return a DataFrame that shows the count of each plant type for each garden:


Plant Type  Daisy  Lily  Rose  Tulip
Garden
A               1     2     1      0
B               0     0     1      1
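crosstab() also accepts a margins parameter, which appends row and column totals. A sketch on the same garden data:

```python
import pandas as pd

df = pd.DataFrame({
    'Plant Type': ['Rose', 'Lily', None, 'Daisy', 'Tulip', 'Rose', 'Lily', None],
    'Garden': ['A', 'A', 'B', 'A', 'B', 'B', 'A', 'B']
})

# margins=True appends row and column totals labelled 'All';
# rows with a missing plant type are excluded
crosstab_totals = pd.crosstab(df['Garden'], df['Plant Type'], margins=True)
print(crosstab_totals)
```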


III. Grouped Summary Statistics


Grouped summary statistics involve using a grouping variable to calculate statistics for each group separately. This can be incredibly valuable when you want to compare different subsets of your data.


1. Average per Group


In this example, let's compute the average height per plant type. We'll use the groupby() method followed by mean().


# Replace the Height column with complete values
df['Height'] = [20, 30, 35, 25, 25, 22, 28, 30]

# Compute average height per plant type
average_height = df.groupby('Plant Type')['Height'].mean()
print(average_height)

This will output the average height for each type of plant (rows with a missing plant type are excluded):


Plant Type
Daisy    25.0
Lily     29.0
Rose     21.0
Tulip    25.0
Name: Height, dtype: float64


2. Multiple Summary Statistics


What if we want multiple summary statistics for each group? pandas makes this simple with the agg() method. Let's say we want the minimum, maximum, and average height for each type of plant.


# Compute multiple summary statistics
summary_statistics = df.groupby('Plant Type')['Height'].agg(['min', 'max', 'mean'])
print(summary_statistics)

This will return a DataFrame with the minimum, maximum, and average height for each type of plant:


           min  max  mean
Plant Type
Daisy       25   25  25.0
Lily        28   30  29.0
Rose        20   22  21.0
Tulip       25   25  25.0
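If you want more readable column names in the result, newer versions of pandas (0.25+) also support named aggregation, where each output column gets an explicit name. A sketch on a hypothetical DataFrame with no missing values:

```python
import pandas as pd

# Hypothetical data with no missing values, for illustration
df = pd.DataFrame({
    'Plant Type': ['Rose', 'Lily', 'Rose', 'Daisy', 'Tulip', 'Rose', 'Lily', 'Daisy'],
    'Height': [20, 30, 35, 25, 25, 22, 28, 30],
})

# Each keyword names an output column; the tuple gives (column, aggregation)
summary = df.groupby('Plant Type').agg(
    min_height=('Height', 'min'),
    max_height=('Height', 'max'),
    mean_height=('Height', 'mean'),
)
print(summary)
```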


3. Summary Statistics for Multiple Columns


Similarly, you can calculate summary statistics for several columns at once by selecting a list of columns and passing the list of functions to agg(). Let's calculate the minimum and maximum height and width for each plant type.


# Add Width column with some values
df['Width'] = [5, 6, 8, 5, 6, 7, 8, 6]

# Compute summary statistics for multiple columns
summary_statistics = df.groupby('Plant Type')[['Height', 'Width']].agg(['min', 'max'])
print(summary_statistics)

This will output:


           Height     Width
              min max   min max
Plant Type
Daisy          25  25     5   5
Lily           28  30     6   8
Rose           20  22     5   7
Tulip          25  25     6   6


IV. Comprehensive Aggregation


In this part, we'll learn how to leverage the power of pandas' groupby() method with multiple columns for both the grouping and the aggregating. The goal is to calculate different summary statistics for multiple columns, grouped by multiple categories.


# Add 'Season' column with some values
df['Season'] = ['Spring', 'Spring', 'Summer', 'Summer', 'Autumn', 'Autumn', 'Winter', 'Winter']

# Compute summary statistics for multiple columns and groups
comprehensive_summary = df.groupby(['Plant Type', 'Season'])[['Height', 'Width']].agg(['min', 'max', 'mean'])
print(comprehensive_summary)

This will output a table with the minimum, maximum, and average height and width for each type of plant in each season (again, rows with a missing plant type are excluded). Since each plant type appears at most once per season in this small dataset, min, max, and mean coincide within each group:


                   Height             Width
                      min max  mean    min max mean
Plant Type Season
Daisy      Summer      25  25  25.0      5   5  5.0
Lily       Spring      30  30  30.0      6   6  6.0
           Winter      28  28  28.0      8   8  8.0
Rose       Autumn      22  22  22.0      7   7  7.0
           Spring      20  20  20.0      5   5  5.0
Tulip      Autumn      25  25  25.0      6   6  6.0


This table shows the min, max, and average height and width for each type of plant during each season.


As you can see, the groupby() method in conjunction with agg() offers powerful flexibility when calculating summary statistics over different groups and categories in your dataset.
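One practical note: the result of a multi-column, multi-function agg() has a MultiIndex on both the rows and the columns. If you prefer a flat table for further processing, you can join the column levels and reset the row index. A sketch on a small hypothetical DataFrame:

```python
import pandas as pd

# A small hypothetical DataFrame for illustration
df = pd.DataFrame({
    'Plant Type': ['Rose', 'Lily', 'Daisy', 'Tulip'],
    'Season': ['Spring', 'Spring', 'Summer', 'Summer'],
    'Height': [20, 30, 25, 25],
})

grouped = df.groupby(['Plant Type', 'Season'])[['Height']].agg(['min', 'max'])

# Join the two column levels ('Height', 'min') -> 'Height_min',
# then turn the row MultiIndex back into ordinary columns
grouped.columns = ['_'.join(col) for col in grouped.columns]
flat = grouped.reset_index()
print(flat)
```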


Conclusion


Understanding and summarizing your dataset is a crucial step in data analysis. Using Python's pandas library, we can easily calculate summary statistics for entire columns, subsets of data, or different groups. With the help of groupby(), agg(), and statistical methods like mean(), min(), and max(), you can quickly derive insights from your data, no matter how complex or large it may be. Keep practicing and exploring these functions; mastering them will significantly boost your data wrangling skills!
