Normal Distribution: The Bell of the Ball
Consider a party where guests represent data points. The most popular person is found at the center of the room, surrounded by most of the attendees - much like the mean of a normal distribution. Guests cluster densely around this person and thin out toward the edges of the room, which is precisely how data points behave in a normal distribution.
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
# histogram of the samples, normalized to match the density scale
count, bins, ignored = plt.hist(s, 30, density=True)
# overlay the theoretical probability density function
plt.plot(bins, 1 / (sigma * np.sqrt(2 * np.pi)) *
         np.exp(-(bins - mu)**2 / (2 * sigma**2)),
         linewidth=2, color='r')
plt.show()
The output of the above code is a histogram of 1,000 random values drawn from a normal distribution, overlaid with a line plot of the probability density function. You can see the bell shape, right?
Different Normal Distributions: Exploring Diversity
Just like different species of birds have different shapes and sizes but are still categorized as 'birds', different normal distributions can have different means and standard deviations but still share the same symmetric, bell-shaped characteristics.
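To make this concrete, here is a minimal sketch that overlays two normal curves; the particular means and standard deviations are just illustrative choices.
from scipy.stats import norm
x = np.linspace(-10, 10, 200)
# same bell shape, different center (mean) and spread (standard deviation)
plt.plot(x, norm.pdf(x, loc=0, scale=1), label='mu=0, sigma=1')
plt.plot(x, norm.pdf(x, loc=2, scale=3), label='mu=2, sigma=3')
plt.legend()
plt.show()
Both curves are symmetric and bell-shaped; the mean shifts the center left or right, while the standard deviation stretches or squeezes the bell.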
The Importance of Area in Normal Distribution
Think of it like a massive pizza party. The whole pizza represents the entire population. The slices of pizza represent various groups within the population. The size of each slice is analogous to the area under the curve for the normal distribution.
from scipy.stats import norm
# Calculate the area under the curve within one standard deviation,
# reusing mu and sigma from the earlier example
one_std_right = mu + (1 * sigma)
one_std_left = mu - (1 * sigma)
area = norm.cdf(one_std_right, loc=mu, scale=sigma) - norm.cdf(one_std_left, loc=mu, scale=sigma)
print("Area under the curve within one standard deviation: ", area)
Running the above Python code will return the area under the curve within one standard deviation, which should be approximately 0.6827 or 68.27%.
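Extending the same calculation to two and three standard deviations recovers the well-known 68-95-99.7 rule; the loop below is a small sketch reusing mu, sigma, and norm from above.
# areas within 1, 2, and 3 standard deviations (the 68-95-99.7 rule)
for k in [1, 2, 3]:
    area = norm.cdf(mu + k * sigma, loc=mu, scale=sigma) - norm.cdf(mu - k * sigma, loc=mu, scale=sigma)
    print(f"Within {k} standard deviation(s): {area:.4f}")
The printed areas should be approximately 0.6827, 0.9545, and 0.9973.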
A Look into Real-World Applications
It's time to see the normal distribution in action, much like seeing a car put through its paces on the road after learning about its engineering concepts.
import pandas as pd
# 'data.csv' and 'Variable' are placeholders; substitute your own file and column name
data = pd.read_csv('data.csv')
plt.hist(data['Variable'], bins=30, density=True, alpha=0.6, color='g')
plt.show()
The output of this code, given an appropriate dataset, will be a histogram; a bell-curve-like shape suggests the data is approximately normally distributed.
Computing Percentages and Quantiles
Imagine you're a teacher who wants to give A grades to the top 10% of students. By knowing the mean and standard deviation of the scores, you can use the normal distribution to determine the cut-off score for the top 10%.
# Suppose we have a mean score of 60 and a standard deviation of 10
mu = 60
sigma = 10
# We can find the cut-off for the top 10% as follows:
top_10_percent = norm.ppf(0.9, loc=mu, scale=sigma)
print("Cut-off score for top 10%: ", top_10_percent)
This code calculates the score below which 90% of the distribution lies, which is the cut-off for the top 10% of the scores.
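As a quick sanity check, feeding the cut-off back into the CDF should return 0.9, since ppf and cdf are inverse functions.
# ppf and cdf are inverses: the CDF evaluated at the cut-off gives back 0.9
print(norm.cdf(top_10_percent, loc=mu, scale=sigma))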
Generating Random Numbers: Like Drawing Raffle Tickets
To understand how random numbers are generated from a normal distribution, imagine drawing raffle tickets from a hat in which numbers near the mean appear on many more tickets than numbers far from it. The chance of drawing any particular value is governed by the shape of the distribution.
random_numbers = np.random.normal(mu, sigma, 1000)
plt.hist(random_numbers, bins=30, density=True, alpha=0.6, color='g')
plt.show()
This code will generate 1000 random numbers from the normal distribution defined by mu and sigma, and plot a histogram of the results. The plot should roughly follow the bell shape of a normal distribution.
The Central Limit Theorem: The Rule of the Crowd
The Central Limit Theorem (CLT) is a cornerstone of statistics. Just as many individual voices blend into the steady hum of a crowd, the CLT tells us that when we average enough independent data points, the distribution of those averages settles into the familiar bell shape of a normal distribution, no matter what the original data looked like.
The Story of Dice: Demonstrating the Central Limit Theorem
Imagine we are rolling a fair six-sided die. The outcomes are equally likely, and each roll of the die doesn't affect the next one.
import random
def roll_dice():
    return random.randint(1, 6)
roll_dice()
If you run the above function roll_dice(), you will get a random number between 1 and 6.
Now, if we roll the die multiple times and calculate the mean of the results, the CLT begins to emerge.
def roll_n_dice(n):
    results = [roll_dice() for _ in range(n)]
    return sum(results) / n
mean_of_rolls = roll_n_dice(1000)
print(mean_of_rolls)
The result is the mean of rolling a die 1000 times. If you run this function repeatedly, you'll notice the means land close to 3.5, the expected value of a single roll. That convergence is the law of large numbers; the CLT describes how those means are distributed, which we explore next.
Sampling Distribution: Populations in a Nutshell
A sampling distribution can be thought of as the "footprint" left by a population. It summarizes the possible outcomes of a statistic from many random samples.
def repeat_experiment(num_repeats):
    means = [roll_n_dice(1000) for _ in range(num_repeats)]
    plt.hist(means, bins=30, alpha=0.6, color='g')
    plt.show()
repeat_experiment(1000)
The result is a histogram that should show a bell curve-like shape, revealing that the means of our dice rolls follow a normal distribution. This is the essence of the Central Limit Theorem.
Increasing Sample Size: The More the Merrier
As with most parties, the more guests, the better the party. The same applies to data sampling. The larger the sample size, the closer our sample mean gets to the population mean.
sample_sizes = [10, 100, 1000, 5000]
for size in sample_sizes:
    # vary the number of rolls behind each mean; keep 1000 repeats per histogram
    means = [roll_n_dice(size) for _ in range(1000)]
    plt.hist(means, bins=30, alpha=0.6, color='g')
    plt.show()
By running this code and comparing the resulting histograms, you'll see that as the sample size increases, the spread of the sample means decreases. This is another important aspect of the CLT - as we collect more data, our estimate of the mean gets more precise.
The Central Limit Theorem in Depth
The CLT is like the law of large numbers on steroids. Not only does it tell us that the sample mean approaches the population mean as the sample size increases, it also tells us how fast: the sample means are approximately normally distributed with a spread of sigma / sqrt(n), so halving the spread requires quadrupling the sample size.
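We can watch this happen. The sketch below, reusing the roll_n_dice helper from earlier, compares the empirical spread of the sample means against the theoretical prediction sigma / sqrt(n), where sigma is the standard deviation of a single roll (about 1.71 for a fair die).
die_sigma = np.std([1, 2, 3, 4, 5, 6])  # standard deviation of a single roll, ~1.71
for n in [10, 100, 1000]:
    means = [roll_n_dice(n) for _ in range(2000)]
    print(f"n={n}: empirical spread {np.std(means):.3f}, theory {die_sigma / np.sqrt(n):.3f}")
The two columns should agree closely, confirming that the spread of the sample means shrinks at the rate sigma / sqrt(n).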
Other Statistics and The Central Limit Theorem
The CLT extends to other statistics as well. For example, under mild conditions it applies to proportions and standard deviations, just as it applies to means. Imagine you're examining the heights of sunflowers in a field. The mean height, the proportion of flowers above a certain height, the standard deviation of the heights - as long as you measure enough flowers, each of these statistics will be approximately normally distributed across repeated samples.
# Generating 1000 sunflower heights
heights = np.random.normal(100, 10, 1000)
# Proportion of sunflowers above 110 cm
proportion = np.mean(heights > 110)
print(proportion)
This code generates 1000 sunflower heights and calculates the proportion of sunflowers above 110 cm. If we repeat the experiment enough times, the distribution of the calculated proportions will be approximately normal.
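Here is a small sketch that does exactly that: it repeats the sunflower experiment 1000 times, computing one proportion per simulated field, and plots the resulting distribution.
# one proportion per simulated field of 1000 sunflowers, repeated 1000 times
proportions = [np.mean(np.random.normal(100, 10, 1000) > 110) for _ in range(1000)]
plt.hist(proportions, bins=30, alpha=0.6, color='g')
plt.show()
The histogram of proportions should look roughly bell-shaped, centered near 0.16 (the probability of a height more than one standard deviation above the mean).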
The Poisson Distribution: Counting the Unusual
The Poisson distribution is like a lookout tower for rare events. With this tool, we can quantify the probability of a rare event's occurrence within a specific timeframe.
Introduction to the Poisson Distribution
Think of the Poisson distribution as a tally counter for rare events: it models how many times an event occurs within a fixed interval of time or space, given a known average rate. It helps us understand the pattern of those occurrences - how often the counter clicks, and how likely any particular count is.
from scipy.stats import poisson
# defining the parameter for the Poisson distribution
mu = 2
# creating the Poisson distribution
poisson_dist = poisson(mu)
# plotting the probability mass function
x = np.arange(0, 10)
plt.plot(x, poisson_dist.pmf(x), 'bo', ms=8, label='poisson pmf')
plt.vlines(x, 0, poisson_dist.pmf(x), colors='b', lw=5, alpha=0.5)
plt.legend()
plt.show()
This code generates a Poisson distribution with a mean (mu) of 2. The distribution's shape reflects the probability of different numbers of events occurring.
Lambda (λ) in the Poisson Distribution
In the Poisson distribution, the Greek letter λ (lambda) is like a dial we can turn to adjust the average number of events. The higher the λ, the higher the rate of events.
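To see the dial in action, the sketch below overlays the PMF for a few illustrative λ values; note that for a Poisson distribution the mean and the variance are both equal to λ.
x = np.arange(0, 20)
for lam in [1, 4, 10]:
    plt.plot(x, poisson(lam).pmf(x), 'o-', label=f'lambda={lam}')
plt.legend()
plt.show()
As λ grows, the distribution shifts to the right and spreads out.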
Calculating Probabilities with the Poisson Distribution
Imagine a call center where the average number of calls per hour is 10. Using the Poisson distribution, we can answer questions like "What is the probability of receiving exactly 15 calls in the next hour?" or "What is the probability of receiving more than 20 calls?"
# the average number of calls per hour
lambda_ = 10
# creating the Poisson distribution
dist = poisson(lambda_)
# probability of receiving exactly 15 calls
prob_15 = dist.pmf(15)
print(f"Probability of 15 calls: {prob_15}")
# probability of receiving more than 20 calls
prob_20 = 1 - dist.cdf(20)
print(f"Probability of more than 20 calls: {prob_20}")
This code calculates the probability of receiving exactly 15 calls and the probability of receiving more than 20 calls in an hour.
Sampling from a Poisson Distribution
We can also generate random variables that follow a Poisson distribution using Python's numpy package.
# generate 1000 random variables
random_variables = np.random.poisson(lambda_, 1000)
# plot the histogram
plt.hist(random_variables, bins=30)
plt.show()
This will display a histogram of 1000 random variables that follow a Poisson distribution with λ=10.
The Central Limit Theorem and the Poisson Distribution
The beauty of the Central Limit Theorem (CLT) is its universal applicability, extending even to the Poisson Distribution. As the sample size increases, the distribution of sample means obtained from a Poisson distribution will approach a normal distribution.
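A quick simulation makes this tangible. The sketch below, using λ=2 as an arbitrary choice, draws many Poisson samples and plots the histogram of their means.
# 1000 sample means, each computed from 500 draws of a Poisson(2) distribution
sample_means = [np.mean(np.random.poisson(2, 500)) for _ in range(1000)]
plt.hist(sample_means, bins=30, alpha=0.6, color='g')
plt.show()
Even though the Poisson distribution itself is discrete and skewed, the histogram of the sample means should look like a bell curve.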
Discovering More Probability Distributions
Just as there are many types of species in the animal kingdom, there are many types of probability distributions in statistics, each with its unique features and characteristics. We will briefly introduce some other common probability distributions you might encounter.
Binomial Distribution
The binomial distribution helps us model scenarios where we have a fixed number of 'Bernoulli trials' - independent trials that have only two possible outcomes (like a coin toss).
from scipy.stats import binom
# defining parameters for the binomial distribution
n = 10  # number of trials
p = 0.5  # probability of success
# creating the binomial distribution
binom_dist = binom(n, p)
# plotting the binomial distribution
x = np.arange(0, 11)
plt.plot(x, binom_dist.pmf(x), 'bo', ms=8, label='binom pmf')
plt.vlines(x, 0, binom_dist.pmf(x), colors='b', lw=5, alpha=0.5)
plt.legend()
plt.show()
This code generates a binomial distribution where we have 10 trials (like flipping a coin 10 times), and each trial has a 50% chance of success.
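The same distribution object also answers concrete questions; for example, the probability of exactly 7 heads in those 10 flips, a query chosen purely for illustration.
# probability of exactly 7 successes in 10 trials (~0.117)
print(binom_dist.pmf(7))
# probability of at most 7 successes (~0.945)
print(binom_dist.cdf(7))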
Exponential Distribution
The exponential distribution is used to model the time between events in a Poisson process - a process where events occur continuously and independently at a constant average rate.
from scipy.stats import expon
# defining the parameter for the exponential distribution
lambda_ = 1/10  # rate parameter: 0.1 events per unit time, i.e. a mean wait of 10 units
# creating the exponential distribution (scipy's scale is the mean wait, 1/lambda)
expon_dist = expon(scale=1/lambda_)
# plotting the exponential distribution
x = np.linspace(0, 50, 100)
plt.plot(x, expon_dist.pdf(x), 'r-', lw=5, alpha=0.6, label='expon pdf')
plt.legend()
plt.show()
This code creates an exponential distribution where the average time between events is 10 units.
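In practice we often want tail probabilities, such as the chance that the wait between events exceeds some threshold; the 15-unit threshold below is an arbitrary example.
# probability that the wait exceeds 15 units; sf is the survival function, 1 - cdf
print(expon_dist.sf(15))  # ~0.223 for a mean wait of 10 units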
Gamma Distribution
The Gamma distribution is a two-parameter family of continuous probability distributions. It has a shape parameter (k) and a scale parameter (θ, theta).
from scipy.stats import gamma
# defining parameters for the gamma distribution
k = 2  # shape parameter
theta = 2  # scale parameter
# creating the gamma distribution
gamma_dist = gamma(k, scale=theta)
# plotting the gamma distribution
x = np.linspace(0, 10, 100)
plt.plot(x, gamma_dist.pdf(x), 'g-', lw=5, alpha=0.6, label='gamma pdf')
plt.legend()
plt.show()
This code creates a gamma distribution with a shape parameter of 2 and a scale parameter of 2.
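One helpful piece of intuition: with an integer shape parameter k, a gamma distribution describes the total waiting time for k events in a Poisson process, i.e. the sum of k independent exponential waits. The sketch below checks this by simulation against the PDF plotted above.
# each Gamma(k=2, theta=2) sample is the sum of two exponential waits with mean 2
samples = np.random.exponential(scale=2, size=(10000, 2)).sum(axis=1)
plt.hist(samples, bins=50, density=True, alpha=0.6, color='g')
plt.plot(x, gamma_dist.pdf(x), 'g-', lw=2)
plt.show()
The simulated histogram should track the gamma PDF closely.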
Each distribution is tailored for certain types of events and phenomena. Selecting the right distribution can significantly enhance the accuracy of your data analysis and predictive models.
It's important to note that the real power of statistics comes not from mastering a single probability distribution, but from understanding how different distributions can be used in concert to model complex real-world phenomena.
We hope you enjoyed this journey through the normal distribution, the Central Limit Theorem, the Poisson distribution, and a few more probability distributions. Remember, this is just the beginning of your statistical journey; there's much more to explore and understand. Happy data science!