
Statistics in Data Science with Python: A Comprehensive Tutorial



I. Introduction to Statistics


Statistics is a discipline that involves the collection, analysis, interpretation, presentation, and organization of data. It is like a toolset for data scientists, economists, researchers, market analysts, and so many more. With the right statistical knowledge, these professionals can pull out insights, identify trends, and make data-driven decisions.


Imagine being an archaeologist, and data is the ancient ruin site you're excavating. Just as an archaeologist uses different tools to discover artifacts, you will use statistical tools to uncover hidden treasures within your data. However, like any tool, it's essential to use statistics correctly to avoid misinterpretations or misleading results.


II. Branches of Statistics


A. Descriptive Statistics


Descriptive statistics summarize and organize characteristics of a data set. Think of it as a photo capturing a moment in time. It describes what is present in the data, just like a photo depicts what was present at the moment it was taken.

Let's illustrate this with a simple example using Python:

import pandas as pd

# Let's imagine we have data about the heights of individuals in a community.
height_data = [160, 165, 170, 155, 180, 175, 172, 169, 167, 173]

# We convert the list to a pandas DataFrame.
df = pd.DataFrame(height_data, columns=["Heights"])

# Now we can use pandas' built-in function to get the descriptive statistics.
print(df.describe())

Output:

          Heights
count   10.000000
mean   168.600000
std      7.290786
min    155.000000
25%    165.500000
50%    169.500000
75%    172.750000
max    180.000000

This output provides a snapshot of the data: the count (number of observations), mean (average), std (standard deviation), minimum and maximum height, as well as the 25th, 50th (median), and 75th percentiles.
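Each row of that summary can also be computed on its own. The following is a minimal sketch, reusing the same illustrative height values, that maps every label in the describe() output to the individual pandas method that produces it. The dictionary name summary is just a convenience for this example.

```python
import pandas as pd

# Same illustrative height values as in the example above.
heights = pd.Series([160, 165, 170, 155, 180, 175, 172, 169, 167, 173], name="Heights")

# One entry per row of the describe() output.
summary = {
    "count": heights.count(),
    "mean": heights.mean(),
    "std": heights.std(),            # sample standard deviation (divides by n - 1)
    "min": heights.min(),
    "25%": heights.quantile(0.25),
    "50%": heights.median(),         # the 50th percentile is the median
    "75%": heights.quantile(0.75),
    "max": heights.max(),
}

for label, value in summary.items():
    print(f"{label:>5}: {float(value):.2f}")
```

Computing the values one at a time is handy when you only need one or two of them, or when you want to store them in variables for later use.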


B. Inferential Statistics


Inferential statistics, on the other hand, make generalizations about a population based on a sample. It's akin to watching a trailer to predict the full movie. The trailer (sample) doesn't show everything, but it gives enough information to make an educated guess about the whole movie (population).
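To make the trailer analogy concrete, here is a rough sketch of one common inferential step: estimating a population mean from a sample. The population is simulated (it is not real data), and the sample size of 200 and the normal critical value of 1.96 for an approximate 95% confidence interval are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: 100,000 simulated heights in cm (not real data).
population = rng.normal(loc=170, scale=8, size=100_000)

# The "trailer": a random sample of 200 individuals, drawn without replacement.
sample = rng.choice(population, size=200, replace=False)

sample_mean = sample.mean()
# Standard error of the mean: sample standard deviation divided by sqrt(n).
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

# Approximate 95% confidence interval using the normal critical value 1.96.
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"Population mean (usually unknown): {population.mean():.2f}")
print(f"Sample mean:                       {sample_mean:.2f}")
print(f"Approximate 95% interval:          ({ci_low:.2f}, {ci_high:.2f})")
```

In practice the population mean is not available for comparison. It is printed here only because the population was simulated.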


III. Data Types in Statistics


A. Numeric (Quantitative) Data


Numeric or quantitative data represent measurable quantities. Examples include age, salary, temperature, etc.


1. Continuous numeric data can take any value within a given range. For instance, the weight of a person could be 70.42 kg, 70.43 kg, 70.432 kg, and so forth.


2. Discrete numeric data can only take particular values. For instance, the number of laptops in a shop can only be a whole number like 1, 2, 3, etc. You cannot have 2.5 laptops.


B. Categorical (Qualitative) Data


Categorical data represent characteristics such as a person's gender, marital status, hometown, etc.


1. Nominal categorical data have no order or priority. Examples include the blood type of individuals within a group (A, B, AB, O) or the choice of browser (Chrome, Safari, Firefox, etc.).


2. Ordinal categorical data have a clear ordering. For instance, ratings on a survey (poor, fair, good, very good, excellent).

Understanding these data types is essential because the type of data you have typically dictates the type of statistical methods that are applicable.

In Python, we use the pandas library to work with different data types:

# We'll create a DataFrame with different types of data.
data = {
    "Age": [25, 30, 35, 40],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Salary": [50000, 60000, 70000, 80000],
    "Satisfaction": ["Poor", "Fair", "Good", "Very Good"]
}

df = pd.DataFrame(data)

# We can use the dtypes attribute to check the data types of each column.
print(df.dtypes)

Output:

Age              int64
Gender          object
Salary           int64
Satisfaction    object
dtype: object

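This output tells us that "Age" and "Salary" are stored as 64-bit integers, while "Gender" and "Satisfaction" are stored as generic objects, which in this case are strings. Note that pandas reports storage types, not statistical types: nothing in the output says that "Gender" is nominal or that "Satisfaction" is ordinal, so that distinction is still ours to make.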
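If you want pandas to know about the nominal versus ordinal distinction, one option is its categorical dtype. The sketch below is only an illustration: the ordering in levels is an assumption taken from the survey example earlier (poor through excellent), and the small DataFrame repeats the two text columns from above.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Satisfaction": ["Poor", "Fair", "Good", "Very Good"],
})

# Nominal data: categories with no inherent order.
df["Gender"] = df["Gender"].astype("category")

# Ordinal data: categories with an explicit order, lowest to highest.
levels = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
df["Satisfaction"] = pd.Categorical(df["Satisfaction"], categories=levels, ordered=True)

print(df.dtypes)

# Ordered categories support comparisons and min/max.
at_least_good = df["Satisfaction"] >= "Good"
print(at_least_good.tolist())
print(df["Satisfaction"].max())
```

Declaring the order once means later comparisons and sorts follow the survey scale instead of alphabetical order.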

There is a lot more we can do with data types, but this is a good place to start. Understanding the different types of data you'll encounter will provide a solid foundation for your data analysis journey.


IV. Measures of Center


One of the essential characteristics of a data set is its "center." This gives us an indication of the typical value we might expect from the data. We generally quantify this using the mean, median, or mode.

Let's visualize these concepts with a case of sleep data in mammals.


A. Use case for sleep data in mammals


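Imagine we are biologists studying the sleep patterns of mammals. We collect a data set where each row represents a different species of mammal and the amount of time it typically sleeps in a day. The columns shown below match the msleep data set that ships with R's ggplot2 package. seaborn has no built-in data set named 'sleep', so we assume here that the data has been saved to a local CSV file (the file name msleep.csv is a placeholder).

import pandas as pd

# Load the mammal sleep data from a local CSV export (placeholder file name)
sleep_data = pd.read_csv('msleep.csv')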

# Let's have a look at the first five rows of our DataFrame
print(sleep_data.head())

Output:

                         name       genus         vore         order  conservation  sleep_total  sleep_rem  sleep_cycle  awake  brainwt  bodywt
0                     Cheetah    Acinonyx    carnivore     Carnivora            lc         12.1        NaN          NaN   11.9      NaN    50.0
1                  Owl monkey       Aotus     omnivore      Primates           NaN         17.0        1.8          NaN    7.0  0.01550    0.48
2             Mountain beaver  Aplodontia    herbivore      Rodentia            nt         14.4        2.4          NaN    9.6      NaN    1.35
3  Greater short-tailed shrew     Blarina  insectivore  Soricomorpha            lc         14.9        2.3     0.133333    9.1  0.00029   0.019
4                         Cow         Bos    herbivore  Artiodactyla  domesticated          4.0        0.7     0.666667   20.0  0.42300   600.0

From this dataset, we're interested in the 'sleep_total' column which represents the total amount of sleep in hours that each mammal gets in a day.


B. Definition and calculation of mean


The mean is simply the arithmetic average of a set of numbers. It's the sum of all values divided by the number of values. In our sleep data, it would represent the average amount of sleep that mammals get.

# Calculate and print the mean sleep time
mean_sleep = sleep_data['sleep_total'].mean()
print(f"The mean sleep time for mammals is {mean_sleep:.1f} hours.")

Output:

The mean sleep time for mammals is 10.4 hours.

This means that, on average, a mammal sleeps about 10.4 hours per day.


C. Definition and calculation of median


The median is the middle value in a dataset when the data are arranged in order. If the dataset contains an even number of observations, the median is the average of the two middle numbers.

# Calculate and print the median sleep time
median_sleep = sleep_data['sleep_total'].median()
print(f"The median sleep time for mammals is {median_sleep} hours.")

Output:

The median sleep time for mammals is 10.1 hours.

So, the median sleep time for mammals is around 10.1 hours.
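The odd and even cases in the definition can be spelled out by hand. This is a small sketch, not how pandas implements median() internally. The function name median is our own, and the sample values are the sleep_total figures from the five rows shown earlier.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[middle]
    # Even count: the average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_sample = [12.1, 17.0, 14.4, 14.9, 4.0]   # five values
even_sample = [12.1, 17.0, 14.4, 14.9]       # four values

print(median(odd_sample))
print(median(even_sample))
```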


D. Definition and calculation of mode


The mode is the value that appears most frequently in a data set. A data set may have one mode, more than one mode, or no mode at all.

# Calculate and print the mode of sleep time
mode_sleep = sleep_data['sleep_total'].mode()
print(f"The mode of sleep time for mammals is {mode_sleep.iloc[0]} hours.")

Output:

The mode of sleep time for mammals is 10.1 hours.

This implies that the most common sleep duration among the mammals in our dataset is 10.1 hours.
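Because a data set can have several modes, pandas returns a Series rather than a single number, which is why the code above picks the first entry with .iloc[0]. The sketch below uses three made-up Series to show the cases. Note that when every value occurs equally often, pandas returns all of them instead of reporting "no mode".

```python
import pandas as pd

one_mode = pd.Series([8, 10, 10, 12])        # 10 appears twice
two_modes = pd.Series([8, 8, 10, 10, 12])    # 8 and 10 are tied
all_unique = pd.Series([8, 10, 12])          # every value appears once

print(one_mode.mode().tolist())
print(two_modes.mode().tolist())
print(all_unique.mode().tolist())
```

It is worth checking the length of the result before relying on .iloc[0], since with ties that call silently keeps only the smallest mode.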


E. The effect of outliers on mean and median


An important point to consider is that the mean is highly susceptible to outliers or extreme values. A single large or small value can significantly affect the mean. The median, however, is resistant to outliers since it depends on the middle values. If an extreme value is added to the data, the middle value may shift slightly, but not as drastically as the mean.

Imagine we found an alien mammal that sleeps 24 hours a day. Let's see how adding this data affects our mean and median.

# Adding a mammal that sleeps 24 hours to our data
# (DataFrame.append() was removed in pandas 2.0, so we use pd.concat instead)
alien = pd.DataFrame([{'sleep_total': 24}])
new_data = pd.concat([sleep_data, alien], ignore_index=True)

# Recalculate mean and median
new_mean = new_data['sleep_total'].mean()
new_median = new_data['sleep_total'].median()

print(f"The new mean sleep time is {new_mean:.1f} hours.")
print(f"The new median sleep time is {new_median} hours.")

Output:

The new mean sleep time is 10.5 hours.
The new median sleep time is 10.1 hours.

Even though the increase in mean is not drastic in this case, this effect becomes more prominent as the size of the outlier or the number of outliers increases.
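To see how the effect grows with the size of the outlier, here is a small sketch using Python's standard statistics module. The five base values are hypothetical sleep times, and the outlier values are chosen only to exaggerate the effect.

```python
import statistics

base = [8, 9, 10, 11, 12]  # hypothetical sleep times in hours

results = []
for outlier in (24, 100, 1000):
    data = base + [outlier]
    results.append((outlier, statistics.mean(data), statistics.median(data)))

for outlier, mean, median in results:
    print(f"outlier={outlier:>4}  mean={mean:7.2f}  median={median:.1f}")
```

The mean climbs with every larger outlier, while the median stays put because the two middle values of the sorted data never change.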


F. Use of mean and median in symmetrical and skewed data


The choice of measure of center depends on the shape of the data. If the data is symmetrical (for example, a bell-shaped distribution), the mean and median will be approximately equal. However, if the data is skewed (i.e., it has a longer tail on one side), the mean is pulled toward that tail, so the median is often a better representation of the center.

In our sleep data, we can plot a histogram to visualize the distribution of the data:

import matplotlib.pyplot as plt

# Plot a histogram
plt.hist(sleep_data['sleep_total'], bins=20, edgecolor='black')
plt.title('Histogram of Total Sleep Time in Mammals')
plt.xlabel('Total Sleep Time (hours)')
plt.ylabel('Frequency')
plt.show()

This should generate a histogram which will give us an idea of the data's distribution.
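The relationship between skew and the two measures of center can also be checked numerically. The sketch below uses simulated data rather than the sleep data: a normal distribution stands in for symmetrical data and an exponential distribution for right-skewed data. Both choices, and their parameters, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated data (not the sleep data): one symmetrical, one right-skewed.
symmetric = rng.normal(loc=10, scale=2, size=10_000)
right_skewed = rng.exponential(scale=10, size=10_000)

for label, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{label:>12}: mean={np.mean(data):.2f}, median={np.median(data):.2f}")
```

For the symmetrical sample the two numbers should nearly coincide, while for the right-skewed sample the long tail pulls the mean well above the median.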


V. Measures of Spread


The measures of center like mean, median, or mode can give us a typical value for our data, but they don't tell us how spread out our data is. For this, we use measures of spread such as variance, standard deviation, mean absolute deviation, quantiles, and interquartile range.


A. Definition of Spread


Spread is the extent to which a dataset is stretched or squeezed. It can also be referred to as the statistical dispersion. Let's think of it this way - if you spread out a rubber band, it gets wider. In the same way, if a dataset has a high spread, that means the data points are widely spread out from the mean.


B. Variance


Variance measures how far the values in a data set are spread out from their mean. It is calculated as the average of the squared differences from the mean.

  1. Definition and calculation

In Python, we can use the pandas var() function to calculate variance. By default it returns the sample variance, which divides the sum of squared differences by n - 1 rather than n.

# Calculate and print the variance of sleep time
variance_sleep = sleep_data['sleep_total'].var()
print(f"The variance of sleep time for mammals is {variance_sleep:.2f} hours squared.")

Output:

The variance of sleep time for mammals is 19.63 hours squared.

This value can be a bit hard to interpret since it's in squared units (hours squared in our case). This is why we often use the standard deviation, which is in the same units as the original data.
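The calculation behind var() can be written out step by step. This is a sketch on five values only (the sleep_total figures from the five rows shown earlier), so the results will not match the variance of the full data set. It also shows the difference between dividing by n and by n - 1.

```python
import pandas as pd

# First five sleep_total values from the table earlier in this tutorial.
values = pd.Series([12.1, 17.0, 14.4, 14.9, 4.0])

deviations = values - values.mean()      # distance of each value from the mean
squared_sum = (deviations ** 2).sum()    # sum of the squared distances

population_variance = squared_sum / len(values)       # divide by n
sample_variance = squared_sum / (len(values) - 1)     # divide by n - 1

print(f"Population variance (n):  {population_variance:.3f}")
print(f"Sample variance (n - 1):  {sample_variance:.3f}")
print(f"pandas var() default:     {values.var():.3f}")
```

pandas matches the n - 1 version by default. Passing ddof=0 to var() gives the version that divides by n.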


C. Standard Deviation


The standard deviation is a measure of the amount of variation or dispersion of a set of values. It is simply the square root of the variance.

  1. Definition and calculation

In Python, we can use the pandas std() function to calculate the standard deviation.

# Calculate and print the standard deviation of sleep time
std_dev_sleep = sleep_data['sleep_total'].std()
print(f"The standard deviation of sleep time for mammals is {std_dev_sleep:.2f} hours.")

Output:

The standard deviation of sleep time for mammals is 4.43 hours.

So, roughly speaking, the sleep times of mammals typically differ from the mean by about 4.43 hours.
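As a quick check of the square-root relationship, the sketch below compares the two on the five sleep_total values from the rows shown earlier. It also counts how many of those values fall within one standard deviation of the mean, which is one informal way to read the number.

```python
import numpy as np
import pandas as pd

# First five sleep_total values from the table earlier in this tutorial.
values = pd.Series([12.1, 17.0, 14.4, 14.9, 4.0])

std_from_variance = np.sqrt(values.var())   # square root of the sample variance
std_direct = values.std()                   # pandas computes the same quantity

# Share of values lying within one standard deviation of the mean.
within_one_std = ((values - values.mean()).abs() <= std_direct).mean()

print(f"sqrt(variance): {std_from_variance:.4f} hours")
print(f"std():          {std_direct:.4f} hours")
print(f"Share within one standard deviation: {within_one_std:.0%}")
```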


D. Mean Absolute Deviation


While variance and standard deviation give us a measure of spread, they square the deviations from the mean, giving more weight to extreme values. Mean absolute deviation, on the other hand, averages the absolute difference between each data point and the mean, providing a measure of spread that is less sensitive to extreme values.

  1. Definition and comparison with standard deviation

Older pandas versions offered a mad() method for this, but it was removed in pandas 2.0, so we compute it directly.

# Calculate and print the Mean Absolute Deviation of sleep time
sleep_total = sleep_data['sleep_total']
mad_sleep = (sleep_total - sleep_total.mean()).abs().mean()
print(f"The Mean Absolute Deviation of sleep time for mammals is {mad_sleep:.2f} hours.")

Output:

The Mean Absolute Deviation of sleep time for mammals is 3.50 hours.

Note that the MAD is less than the standard deviation, which reflects that it gives less weight to extreme values.
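To check the claim that the mean absolute deviation reacts less strongly to extreme values than the standard deviation, here is a small sketch on made-up numbers. The helper name mean_abs_dev is our own, and the data values are hypothetical sleep times with one extreme value added.

```python
import pandas as pd


def mean_abs_dev(series):
    """Mean absolute deviation around the mean."""
    return (series - series.mean()).abs().mean()


# Hypothetical sleep times in hours, without and with one extreme value.
clean = pd.Series([8, 9, 10, 11, 12], dtype=float)
with_outlier = pd.Series([8, 9, 10, 11, 12, 40], dtype=float)

# How many times larger each measure becomes once the outlier is added.
mad_growth = mean_abs_dev(with_outlier) / mean_abs_dev(clean)
std_growth = with_outlier.std() / clean.std()

print(f"MAD: {mean_abs_dev(clean):.2f} -> {mean_abs_dev(with_outlier):.2f} (x{mad_growth:.2f})")
print(f"Std: {clean.std():.2f} -> {with_outlier.std():.2f} (x{std_growth:.2f})")
```

Both measures grow when the outlier is added, but the standard deviation grows by a larger factor because the outlier's deviation is squared.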


E. Quantiles


Quantiles are points in a distribution that relate to the rank order of values in that distribution. For a dataset, you can think of quantiles as cut points that divide the ordered data into intervals containing equal shares of the observations. When we talk about quantiles, we usually refer to the quartiles (which divide the data into four equal parts) and the percentiles (which divide the data into hundredths).

  1. Definition and calculation

Let's calculate the quartiles for our sleep data.

# Calculate and print the quartiles of sleep time
quartiles_sleep = sleep_data['sleep_total'].quantile([0.25, 0.5, 0.75])
print(f"The quartiles of sleep time for mammals are {quartiles_sleep.tolist()} hours.")

Output:

The quartiles of sleep time for mammals are [8.05, 10.1, 13.2] hours.

So, about 25% of the mammals in our data sleep less than 8.05 hours, 50% sleep less than 10.1 hours (this is also the median), and 75% sleep less than 13.2 hours.
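The same quantile() call handles other splits as well. As a sketch, the example below uses the integers 1 to 100 as stand-in data (not the sleep data) and computes quartiles, deciles, and a single percentile. np.linspace is used here only to generate the evenly spaced probabilities.

```python
import numpy as np
import pandas as pd

values = pd.Series(range(1, 101))  # stand-in data: the integers 1 to 100

# Quartiles: cut points at 25%, 50%, and 75%.
quartiles = values.quantile([0.25, 0.5, 0.75]).tolist()

# Deciles: nine cut points at 10%, 20%, ..., 90%.
deciles = values.quantile(np.linspace(0.1, 0.9, 9)).tolist()

# A single percentile: the value below which about 95% of the data falls.
p95 = values.quantile(0.95)

print(f"Quartiles: {quartiles}")
print(f"Deciles:   {[round(d, 1) for d in deciles]}")
print(f"95th percentile: {p95:.2f}")
```

The cut points are not whole numbers because pandas interpolates linearly between neighboring values by default.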


F. Interquartile Range (IQR)


The interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between the upper and lower quartiles.

  1. Definition and calculation

The IQR can be found using the quantile() function again.

# Calculate and print the Interquartile Range of sleep time
iqr_sleep = sleep_data['sleep_total'].quantile(0.75) - sleep_data['sleep_total'].quantile(0.25)
print(f"The Interquartile Range of sleep time for mammals is {iqr_sleep:.2f} hours.")

Output:

The Interquartile Range of sleep time for mammals is 5.15 hours.

The IQR is robust to outliers and gives us the range of the middle 50% of our data.
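To illustrate what robustness to outliers means here, the following sketch compares the IQR with the full range (maximum minus minimum) before and after adding one extreme value. The numbers are hypothetical, and the helper name iqr is our own.

```python
import pandas as pd


def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)


# Hypothetical sleep times in hours, without and with one extreme value.
clean = pd.Series([6, 8, 9, 10, 11, 12, 14], dtype=float)
with_outlier = pd.Series([6, 8, 9, 10, 11, 12, 14, 100], dtype=float)

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    full_range = data.max() - data.min()
    print(f"{label:>12}: IQR={iqr(data):.2f}, range={full_range:.2f}")
```

The range jumps because it depends entirely on the two most extreme values, while the IQR moves only slightly because it looks at the middle half of the data.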


We've reached the end of our exploration of basic statistics for data science in Python. Before we wrap up, let's recap what we've covered:

We began with an introduction to statistics, where we highlighted the power and limitations of statistics in various contexts. We proceeded to delve into the two main branches of statistics: descriptive and inferential, emphasizing the importance of each. We then covered the various types of data that we deal with in statistics, from numeric data (both continuous and discrete) to categorical data (nominal and ordinal).


Next, we transitioned to measures of center, where we used a real-world example involving sleep data from different mammals. We learned about the mean, median, and mode, and how to calculate each using Python. We also discussed the impact of outliers on these measures and when to appropriately use each measure depending on the data distribution.

Afterward, we moved onto measures of spread, where we discussed the concepts of variance, standard deviation, mean absolute deviation, quantiles, and interquartile range. We demonstrated how to calculate each of these measures in Python and discussed their relevance.


As we conclude, it's crucial to remember that statistics is a vast field with numerous other concepts. This tutorial merely scratches the surface but hopefully provides a solid foundation upon which you can build more advanced knowledge.

Understanding statistics is fundamental in data science, as it provides the tools for data exploration, analysis, and interpretation. Python, with its various libraries like pandas, NumPy, and Matplotlib, offers a powerful platform for performing these statistical analyses.


The journey to mastering data science is a marathon, not a sprint. Keep exploring, keep learning, and keep coding!
