
Understanding Probability, Distributions, and Implementing them in Python



I. Understanding the Concept of Chance


We interact with the concept of chance every day, sometimes without even realizing it. Consider deciding whether to carry an umbrella. You might glance at a weather forecast predicting a 30% chance of rain and make your decision based on that. In this instance, you're using the concept of probability. Probability is a mathematical framework that allows us to quantify the uncertainty inherent in such situations.


A great way to illustrate probability is by considering a simple coin flip. A coin has two possible outcomes: heads or tails. When you flip a fair coin, the probability of getting heads (or tails) is 1/2, or 0.5 when expressed as a decimal. This is calculated by dividing the number of favorable outcomes (1, in this case, as there's only one 'heads' side on a coin) by the total number of outcomes (2, because a coin can land on either heads or tails).


II. Applying Probability in Real-Life Scenarios


Let's consider a slightly more complex scenario. Suppose you're a sales manager, and you need to decide which of your five salespeople should meet with a big potential client. The choice is random because all your salespeople are equally talented. What's the chance that a particular salesperson, say Salesperson A, gets selected?


Here, the total number of outcomes is 5 (because there are 5 salespeople), and the number of favorable outcomes is 1 (because we're interested in selecting Salesperson A). Thus, the probability of selecting Salesperson A is 1/5, or 0.2 in decimal form.
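One way to sanity-check the 1/5 figure is to repeat the random selection many times and count how often Salesperson A comes up. The sketch below is only illustrative: the helper name estimate_selection_probability, the seed, and the number of trials are assumptions chosen for this example.

```python
import random

def estimate_selection_probability(pool, target, trials, seed=0):
    """Estimate how often `target` is picked when one item is chosen at random."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    hits = sum(1 for _ in range(trials) if rng.choice(pool) == target)
    return hits / trials

salespeople = ['A', 'B', 'C', 'D', 'E']
estimate = estimate_selection_probability(salespeople, 'A', trials=100_000)
print("Estimated P(A):", estimate)  # should land close to 0.2
```

The estimate will not equal 0.2 exactly, but it should get closer as the number of trials grows.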


III. Implementing Probability Selection in Python


Python's standard library includes the random module, whose sample() function lets us simulate such random selection scenarios. Here's how you can use it to pick a salesperson randomly:

import random

salespeople = ['A', 'B', 'C', 'D', 'E']
selected_salesperson = random.sample(salespeople, 1)

print("Selected Salesperson: ", selected_salesperson[0])

This script will print out the name of the randomly selected salesperson each time you run it. Try running it a few times and observe the results.

When working with randomness in computing, it's essential to understand the concept of a random seed. A random seed is a starting point in generating random numbers, and by setting a specific seed, you can reproduce the same sequence of random numbers. This is useful when testing code or debugging issues related to randomness.


You can set the seed using the random.seed() function, as shown:

random.seed(1)
selected_salesperson = random.sample(salespeople, 1)

print("Selected Salesperson: ", selected_salesperson[0])

Now, no matter how many times you run the script, it will always select the same salesperson.
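To see the effect of the seed more directly, the sketch below seeds the generator twice with the same value and compares the two sequences of selections. The helper name draw_sequence and the choice of five draws are assumptions made for illustration.

```python
import random

salespeople = ['A', 'B', 'C', 'D', 'E']

def draw_sequence(seed, draws=5):
    """Seed the generator, then make several single-person selections."""
    random.seed(seed)
    return [random.sample(salespeople, 1)[0] for _ in range(draws)]

first_run = draw_sequence(seed=1)
second_run = draw_sequence(seed=1)

print(first_run)
print(second_run)
print("Identical:", first_run == second_run)
```

Reproducibility of this kind is generally only expected on the same Python version, since the underlying selection algorithms can change between releases.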


IV. Diversifying Scenarios


While our sales scenario is an example of random selection, it is also an example of what is known as sampling without replacement. This is because once a salesperson is selected, they are not placed back into the pool of candidates.

Let's modify the scenario a bit. Suppose you want to select three salespeople for a series of meetings. After each meeting, the selected salesperson is placed back into the pool, and a new random selection occurs. This is known as sampling with replacement. Because the pool is restored before every draw, the selections are independent events: the outcome of one draw does not affect the outcome of the next, so each salesperson has a 1/5 chance of being picked at every meeting, and the same person can be selected more than once. By contrast, when several people are drawn without replacement, the draws are dependent, because each selection removes a candidate and changes the probabilities for the draws that follow.


V. Implementing Diverse Sampling in Python


We can use the random.choices() function in Python to perform sampling with replacement. Let's adjust our code to select three salespeople with replacement.

random.seed(1)
selected_salespeople = random.choices(salespeople, k=3)

print("Selected Salespeople: ", selected_salespeople)

This script will output a list of three salespeople, selected randomly from the pool with replacement. You might even see the same salesperson selected more than once!
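Because every draw is made from the full pool of five, the chance that a given salesperson appears at least once in three draws can be worked out as 1 - (4/5)^3. The sketch below computes that value exactly and compares it with a simulation; the seed and trial count are arbitrary assumptions.

```python
import random
from fractions import Fraction

salespeople = ['A', 'B', 'C', 'D', 'E']
draws = 3

# Exact: 1 minus the chance of missing A on every one of the three draws
p_miss_once = Fraction(len(salespeople) - 1, len(salespeople))
p_at_least_once = 1 - p_miss_once ** draws

# Simulated: repeat the three-draw selection many times
rng = random.Random(42)
trials = 100_000
hits = sum('A' in rng.choices(salespeople, k=draws) for _ in range(trials))
simulated = hits / trials

print("Exact:", p_at_least_once, "=", float(p_at_least_once))
print("Simulated:", simulated)
```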


VI. Introduction to Probability Distributions


We've discussed individual events and their probabilities, but often, we're interested in the distribution of probabilities across all outcomes. This brings us to the concept of probability distributions.

A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can take. For simplicity, let's start with a discrete distribution, which means our random variable can only take specific (countable) values.


Consider a six-sided die. When you roll the die, the outcome can be any integer from 1 to 6. Each outcome has a probability of 1/6, or approximately 0.1667. The distribution of probabilities across all outcomes forms a discrete uniform distribution, as each outcome is equally likely.

We can represent this distribution as a list of probabilities:

Outcome:     [1, 2, 3, 4, 5, 6]
Probability: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

In addition to the probability distribution, another important concept is the expected value (or mean) of a distribution, which is the average outcome we expect if we repeat the experiment many times. For our die, the expected value is (1+2+3+4+5+6)/6 = 3.5.
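The expected value can also be written as a weighted sum: each outcome multiplied by its probability. The sketch below uses a hypothetical helper, expected_value, and exact fractions so that rounding does not obscure the result.

```python
from fractions import Fraction

def expected_value(outcomes, probabilities):
    """Sum of each outcome weighted by its probability."""
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))

outcomes = [1, 2, 3, 4, 5, 6]
fair_probabilities = [Fraction(1, 6)] * 6

fair_mean = expected_value(outcomes, fair_probabilities)
print(fair_mean, "=", float(fair_mean))
```

For equally likely outcomes this weighted sum reduces to the plain average used above.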


VII. Visualizing Probability Distributions


A good way to understand distributions is through visualizations. For this purpose, we can use Python's matplotlib library. Let's plot the distribution of our die roll.

import matplotlib.pyplot as plt

outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]

plt.bar(outcomes, probabilities)
plt.xlabel('Outcomes')
plt.ylabel('Probability')
plt.title('Probability Distribution of a Die Roll')
plt.show()

The output will be a bar plot with outcomes on the x-axis and their respective probabilities on the y-axis. All bars have the same height, representing the equal likelihood of each outcome.


But what if we adjust our die, making some outcomes more likely than others? Suppose our die is loaded to favor the number 6. The new distribution might look like this:


Outcome:     [1, 2, 3, 4, 5, 6]
Probability: [1/8, 1/8, 1/8, 1/8, 1/8, 3/8]

The effect on the expected value is notable: it's now (1*1/8 + 2*1/8 + 3*1/8 + 4*1/8 + 5*1/8 + 6*3/8) = 33/8 = 4.125, up from 3.5 for the fair die. The change in the distribution and expected value reflects the fact that 6 has become more likely.


VIII. Sampling from Discrete Distributions


In Python, we can use the random.choices() function to sample from a distribution. This function allows us to specify the probabilities for each outcome. Let's simulate 1000 rolls of our loaded die and visualize the results.

outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/8, 1/8, 1/8, 1/8, 1/8, 3/8]
rolls = random.choices(outcomes, weights=probabilities, k=1000)

plt.hist(rolls, bins=range(1, 8), align='left', rwidth=0.8)
plt.xlabel('Outcomes')
plt.ylabel('Frequency')
plt.title('Outcome Distribution of 1000 Rolls of a Loaded Die')
plt.show()

You should see that the number 6 appears more frequently than the others, reflecting the loaded probabilities.

This leads us to the law of large numbers, which states that as the size of a sample gets larger, the mean of the sample gets closer to the expected value of the population. In our case, if we increase the number of die rolls (the sample size), the mean of the outcomes will approach the expected value of 4.125.
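A rough way to watch this happen is to compute the expected value directly from the two lists and compare it with sample means for increasingly large samples. The sample sizes and the seed below are arbitrary assumptions for illustration.

```python
import random

outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1/8, 1/8, 1/8, 1/8, 1/8, 3/8]

# Expected value computed directly from the distribution
expected = sum(x * p for x, p in zip(outcomes, probabilities))

rng = random.Random(7)
sample_means = {}
for n in (10, 1_000, 100_000):
    rolls = rng.choices(outcomes, weights=probabilities, k=n)
    sample_means[n] = sum(rolls) / n

print("Expected value:", expected)
for n, mean in sample_means.items():
    print(f"n={n:>7}: sample mean = {mean:.3f}")
```

Small samples can land well away from the expected value, while the largest sample should sit very close to it.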


IX. Continuous Distributions


So far, we've focused on discrete distributions, where the outcomes can only take specific values. However, in many real-world scenarios, we deal with continuous distributions, where outcomes can take any value within a certain range.

Consider waiting for a city bus. If the bus comes every 15 minutes, the waiting time can be any real number between 0 and 15, not just specific discrete times. In such a case, we have a continuous uniform distribution.


A continuous uniform distribution is defined by its lower limit (a) and its upper limit (b). In the case of the bus, a = 0 minutes and b = 15 minutes. All waiting times in this interval are equally likely.


X. Calculating Probabilities for Continuous Distributions


With continuous distributions, we calculate the probability that the outcome lies in a certain interval, rather than the probability of a single outcome. For example, we might ask: "What is the probability that I will wait more than 10 minutes for the bus?"


In Python, we can use the scipy.stats library to work with continuous uniform distributions.

from scipy.stats import uniform

# parameters for the distribution
a = 0  # lower limit
b = 15  # upper limit

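# create the distribution; SciPy's uniform takes loc (the lower limit)
# and scale (the width of the interval, b - a), not the upper limit itself
bus_wait_time = uniform(loc=a, scale=b - a)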

# calculate the probability of waiting more than 10 minutes
prob_greater_than_10 = 1 - bus_wait_time.cdf(10)
print(prob_greater_than_10)

The cdf() function calculates the cumulative distribution function, which is the probability that a random variable is less than or equal to a certain value. To find the probability that the waiting time is greater than 10 minutes, we subtract the cdf(10) from 1.
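For a uniform distribution the CDF has a simple closed form, (x - a) / (b - a) inside the interval, so the result can be checked by hand without SciPy. The function uniform_cdf below is a hypothetical helper written only for this illustration.

```python
def uniform_cdf(x, a, b):
    """P(X <= x) for a continuous uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 0, 15
prob_more_than_10 = 1 - uniform_cdf(10, a, b)
print(prob_more_than_10)  # one third of the 15-minute window
```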


Let's say you want to find the probability of waiting between 5 and 10 minutes. You would calculate this as the difference of two cdf() values:

prob_5_to_10 = bus_wait_time.cdf(10) - bus_wait_time.cdf(5)
print(prob_5_to_10)

In this part of the tutorial, we have introduced the concept of continuous distributions and learned how to calculate probabilities for specific intervals. These tools will be invaluable when working with real-world data that follows a continuous distribution.


XI. Visualizing Continuous Distributions


Now that we've learned how to calculate probabilities for continuous distributions, let's visualize these distributions. A probability density function (PDF) plot is useful for this. The area under the PDF curve between two points corresponds to the probability that the random variable falls within that range.


In Python, we can plot the PDF using the matplotlib library together with scipy.stats.

import matplotlib.pyplot as plt
import numpy as np

# define the x range for which we want to plot the PDF
x = np.linspace(a-1, b+1, 100)  # we add/subtract 1 to see the tails of the distribution

# plot the PDF
plt.figure(figsize=(8, 6))
plt.plot(x, bus_wait_time.pdf(x))
plt.title('Probability Density Function of Bus Waiting Time')
plt.xlabel('Waiting Time (minutes)')
plt.ylabel('Probability Density')
plt.grid(True)
plt.show()

This will generate a plot showing a flat line at a density of 1/15 (about 0.067) between 0 and 15 minutes, indicating that every waiting time in that range is equally likely, and a density of 0 outside this range.


XII. Summary Statistics for Continuous Distributions


As with discrete distributions, we can calculate summary statistics for continuous distributions. The mean() and var() methods return the mean and variance, respectively.

mean_wait_time = bus_wait_time.mean()
var_wait_time = bus_wait_time.var()
print("Mean waiting time:", mean_wait_time, "minutes")
print("Variance of waiting time:", var_wait_time, "minutes squared")

The mean waiting time should be 7.5 minutes (halfway between 0 and 15), and the variance can be calculated as (b-a)^2 / 12, which gives us 18.75 for our case.
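These two figures can also be checked by simulation. The sketch below draws a large number of waiting times with the standard library and compares the sample mean and variance with the formulas; the sample size and seed are assumptions.

```python
import random
import statistics

a, b = 0, 15
rng = random.Random(0)
waits = [rng.uniform(a, b) for _ in range(100_000)]

sample_mean = statistics.fmean(waits)
sample_var = statistics.pvariance(waits)

print("Simulated mean:", round(sample_mean, 2))
print("Simulated variance:", round(sample_var, 2))
print("Formula mean:", (a + b) / 2)
print("Formula variance:", (b - a) ** 2 / 12)
```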

In this part of our tutorial, we learned to visualize continuous probability distributions and calculate their summary statistics. In the next part, we'll learn about the normal distribution, a particularly important continuous distribution in statistics and data science.


XIII. Introduction to Normal Distribution


Next, we're going to learn about a special type of continuous distribution: the normal distribution. Also known as the Gaussian distribution, it is symmetric about its mean, and values near the mean occur more frequently than values far from it. In graph form, the normal distribution appears as a bell curve.

Let's imagine you're running a factory that produces light bulbs. Your light bulbs have an average lifespan of 1000 days, with a standard deviation of 100 days. We could model this with a normal distribution.

from scipy.stats import norm

# Create a normally distributed random variable
light_bulb_lifespan = norm(loc=1000, scale=100)

# Generate random numbers
samples = light_bulb_lifespan.rvs(10000)

# Verify the mean and standard deviation
print("Sample mean:", np.mean(samples))
print("Sample standard deviation:", np.std(samples))

The loc parameter defines the mean of the distribution, and the scale parameter defines the standard deviation.


XIV. Visualizing Normal Distribution


We can also visualize the normal distribution using the probability density function (PDF). Just like we did for the uniform distribution, we'll use matplotlib to plot our normal distribution.

# Define the x range for which we want to plot the PDF
x = np.linspace(700, 1300, 100)

# Plot the PDF
plt.figure(figsize=(8, 6))
plt.plot(x, light_bulb_lifespan.pdf(x))
plt.title('Probability Density Function of Light Bulb Lifespan')
plt.xlabel('Lifespan (days)')
plt.ylabel('Probability Density')
plt.grid(True)
plt.show()

The plot will show a bell-shaped curve centered at 1000 (our mean). About 68% of the lifespans fall between 900 and 1100 days (within one standard deviation of the mean), and about 95% fall between 800 and 1200 days (within two standard deviations).


XV. Calculating Probabilities for Normal Distributions


The real power of the normal distribution comes from how we can calculate the probabilities. For example, we can ask: What is the probability that a randomly chosen light bulb will last at least 1100 days?

prob = 1 - light_bulb_lifespan.cdf(1100)
print("Probability that a light bulb lasts at least 1100 days:", prob)

We subtract the cumulative distribution function (CDF) from 1 because the CDF gives us the probability that a variable is less than or equal to a certain value.
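As a cross-check that does not depend on SciPy, the same kind of calculation can be sketched with NormalDist from Python's statistics module (available from Python 3.8). The variable names below are assumptions for this example.

```python
from statistics import NormalDist

lifespan = NormalDist(mu=1000, sigma=100)

# P(X >= 1100): one standard deviation above the mean
p_at_least_1100 = 1 - lifespan.cdf(1100)

# Probability of landing within one and two standard deviations of the mean
p_within_1sd = lifespan.cdf(1100) - lifespan.cdf(900)
p_within_2sd = lifespan.cdf(1200) - lifespan.cdf(800)

print(round(p_at_least_1100, 4))
print(round(p_within_1sd, 4))
print(round(p_within_2sd, 4))
```

The interval probabilities correspond to the familiar rule of thumb that roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two.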

In this part of the tutorial, we got introduced to the Normal Distribution, visualized it, and calculated probabilities for it. Stay tuned for the next part where we'll go deeper into advanced probability distributions and explore the Central Limit Theorem.


XVI. The Central Limit Theorem


The Central Limit Theorem (CLT) is one of the most powerful and useful ideas in all of statistics. The CLT states that for independent random samples drawn from any population with a finite standard deviation, the distribution of the sum of the observations (or equivalently, of the sample mean) approaches a normal distribution as the sample size becomes large.


To illustrate this, let's consider the rolling of a six-sided die. We know that this follows a discrete uniform distribution. But what would the distribution of the average of multiple dice rolls look like? Let's find out.

def dice_roll_simulation(num_rolls, num_simulations):
    averages = []

    for _ in range(num_simulations):
        rolls = np.random.choice(range(1, 7), size=num_rolls)
        averages.append(np.mean(rolls))

    return averages

# Run the simulation
num_rolls = 10
num_simulations = 10000
averages = dice_roll_simulation(num_rolls, num_simulations)

# Plot the distribution
plt.hist(averages, bins=11, density=True, edgecolor='k')
plt.title(f'Distribution of Average of {num_rolls} Dice Rolls Over {num_simulations} Simulations')
plt.xlabel('Average')
plt.ylabel('Density')
plt.grid(True)
plt.show()


This code will output a histogram that shows the distribution of the average of num_rolls dice rolls, based on num_simulations simulations. Even though we started with a uniform distribution, you will notice that the resulting distribution of averages is approximately normal. This is the essence of the Central Limit Theorem!
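The histogram can be complemented with a numerical check. For a fair die, a single roll has mean 3.5 and variance 35/12, so the average of n independent rolls should have mean 3.5 and standard deviation sqrt(35/12/n). The sketch below uses only the standard library; the seed is an arbitrary assumption.

```python
import random
import statistics

rng = random.Random(123)
num_rolls = 10
num_simulations = 10_000

# Average of `num_rolls` fair die rolls, repeated many times
averages = [
    statistics.fmean(rng.randint(1, 6) for _ in range(num_rolls))
    for _ in range(num_simulations)
]

# Theory: a single roll has mean 3.5 and variance 35/12
predicted_std = (35 / 12 / num_rolls) ** 0.5

mean_of_averages = statistics.fmean(averages)
std_of_averages = statistics.pstdev(averages)

print("Mean of averages:", round(mean_of_averages, 3))
print("Std of averages:", round(std_of_averages, 3))
print("Predicted std:", round(predicted_std, 3))
```

Increasing num_rolls should shrink the spread of the averages in line with the predicted standard deviation.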


XVII. The Law of Large Numbers


A concept closely related to the CLT is the Law of Large Numbers. It states that as a sample size becomes larger, the sample mean will get closer to the population mean. To illustrate this, let's modify our dice simulation code to observe how the average changes as we increase the number of rolls.

def dice_roll_large_numbers(num_rolls):
    # Simulate one long sequence of fair die rolls
    rolls = np.random.choice(range(1, 7), size=num_rolls)
    # Running average after 1, 2, ..., num_rolls rolls
    return np.cumsum(rolls) / np.arange(1, num_rolls + 1)

# Run the simulation
num_rolls = 1000
averages = dice_roll_large_numbers(num_rolls)

# Plot the running average
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_rolls+1), averages)
plt.title('Convergence of Dice Roll Average to Expected Value')
plt.xlabel('Number of Rolls')
plt.ylabel('Average')
plt.grid(True)
plt.show()

This script will produce a line plot showing how the average dice roll value converges towards the expected value of 3.5 as the number of rolls increases.


Conclusion


Understanding and using probability distributions is an essential part of data science and statistics. In this tutorial, we've covered basic probability concepts, discrete and continuous distributions, and advanced concepts like the Central Limit Theorem and the Law of Large Numbers. We've learned how to calculate and visualize these concepts using Python, making use of libraries like NumPy, Matplotlib, and SciPy. Now, you're ready to apply these concepts to your data science projects! Don't hesitate to revisit this tutorial if you need a refresher. Happy data analyzing!
