
Understanding Chi-Square Tests: Independence and Goodness of Fit



In the ever-growing field of data science, statistical analysis is a key player. One such analytical method is the chi-square test, used to draw insights from categorical data. This tutorial will journey through chi-square tests, focusing specifically on the test of independence and the goodness of fit test. We'll dive deep into these concepts, illustrating them with code snippets, visuals, and example analogies. Let's embark on this statistical exploration!


I. Chi-Square Test of Independence


A. Introduction to Chi-Square Test


The chi-square test of independence is a statistical test that helps us understand the relationship between two categorical variables. It's like comparing different flavors of ice cream to see which ones are favored by children and adults. The comparison extends beyond just two flavors, making it more informative.

  1. Relationship with t-tests and ANOVA: Just like an ice cream vendor extends the variety from one flavor to many, chi-square tests extend proportion tests to more than two groups, analogous to how ANOVA extends t-tests to more than two groups.

  2. Extending proportion tests to more than two groups: Imagine you want to find out if the favorite color of a car is dependent on the gender of the buyer. Chi-square tests allow us to do just that, extending our view beyond two categories.


B. Revisiting the Proportion Test

Before diving into chi-square tests, let's revisit a simpler concept: the proportion test.

  1. Understanding the z-score and its value in proportion tests: The z-score is like a measuring tape, telling us how far a value is from the mean in standard deviation units. A z-score of -4.22, for instance, tells us the value is 4.22 standard deviations below the mean. In a proportion test, the z-statistic plays the same role: it measures how far the observed proportion falls from the hypothesized proportion, in standard-error units.

from scipy.stats import zscore
import numpy as np

# Standardize a small sample: (value - mean) / standard deviation
data = np.array([4, 5, 6, 4, 5])
z_scores = zscore(data)
print(z_scores)

The output:

[-1.06904497  0.26726124  1.60356745 -1.06904497  0.26726124]
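
To connect this back to proportion tests, here is a minimal sketch of a two-sample proportion z-test using the statsmodels library; the counts are made up for illustration, echoing the ice cream example above:

from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: 45 of 200 adults vs. 70 of 200 children favor a flavor
counts = [45, 70]
nobs = [200, 200]

# Two-sample z-test for equality of proportions
z_stat, p_val = proportions_ztest(counts, nobs)
print("z:", z_stat, "p-value:", p_val)

For two groups, the Pearson chi-square statistic is exactly the square of this z-statistic, which is why chi-square tests can be viewed as the extension of proportion tests to more than two groups.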


C. Independence of Variables


When we talk about independence in statistics, we mean that the occurrence of one event doesn't influence the occurrence of another. It's like tossing two coins; one coin's outcome doesn't affect the other.

  1. Explanation of statistical independence: If we have two categorical variables, like hobbyist status and age category, we say they are statistically independent if the proportion of hobbyists is the same for each age category (see the sketch after this list).

  2. Association between variables like hobbyists and age categories: If there's a small p-value, this suggests evidence that the variables have an association. This might mean that more people of a certain age group tend to have a particular hobby.
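
Under independence, the expected count for each cell of a contingency table is (row total x column total) / grand total. Here is a small sketch of that calculation with numpy and pandas; the counts are invented for illustration:

import numpy as np
import pandas as pd

# Hypothetical observed counts: hobbyist status by age category
observed = pd.DataFrame(
    [[30, 20],
     [45, 30]],
    index=['Young', 'Old'],
    columns=['Hobbyist', 'Not hobbyist']
)

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
grand_total = observed.to_numpy().sum()
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)

In this particular table the expected counts come out equal to the observed ones, because the hobbyist proportion (60%) is identical in both age groups; the data are perfectly consistent with independence.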


D. Test for Independence of Variables


Testing for independence requires specific tools and understanding. We will be using the "pingouin" package for this.

  1. Using the "pingouin" package: Let's install the package first.

pip install pingouin

  2. chi2_independence method parameters: This method lets us test if two categorical variables are independent.

import pandas as pd
import pingouin as pg

# chi2_independence expects a DataFrame, so build one from the raw data
# (a toy sample this small is for illustration only and will trigger
# low-count warnings in practice)
data = pd.DataFrame({
    'hobbyist': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'age_cat': ['Young', 'Young', 'Old', 'Old', 'Old']
})

# Apply the chi-square test of independence; the method returns three objects
expected, observed, stats = pg.chi2_independence(data, x='hobbyist', y='age_cat')
print(stats)

This code returns the expected frequencies, the observed frequencies, and a table of statistics related to the test.

  3. Observations and statistics related to the test: The result of the chi2_independence method provides various details, including the chi-square value and the p-value. These help us understand the relationship between variables.

  4. Chi2 value and p-value explanation: A chi-square statistic that is large relative to the chi-square distribution (for the relevant degrees of freedom) suggests the variables are associated. If the p-value is less than our chosen significance level (e.g., 0.05), we reject the null hypothesis that the variables are independent, as sketched below.
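
As a rough sketch of this decision rule with scipy (the statistic and degrees of freedom below are made-up values):

from scipy.stats import chi2

chi2_stat = 7.88   # hypothetical test statistic
dof = 2            # hypothetical degrees of freedom
alpha = 0.05

# p-value: probability of a statistic at least this large under independence
p_value = chi2.sf(chi2_stat, dof)

# Equivalent decision via the critical value
critical_value = chi2.ppf(1 - alpha, dof)

print("p-value:", p_value)
print("reject H0:", chi2_stat > critical_value)

Both routes agree: rejecting when the p-value falls below alpha is the same as rejecting when the statistic exceeds the critical value.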


E. Job Satisfaction and Age Category


Let's take a real-world example to comprehend the application of chi-square tests better.

  1. Example of age category and job satisfaction variables: Suppose we want to analyze if there's a relationship between age categories (Young, Middle, Old) and job satisfaction levels (Low, Medium, High). We can use the chi-square test of independence for this.


F. Declaring Hypotheses


Just like a detective builds a case, we need to declare our hypotheses before investigating.

  1. Hypotheses testing for independence of variables:

    • Null Hypothesis (H0): Age Category and Job Satisfaction are independent.

    • Alternative Hypothesis (Ha): Age Category and Job Satisfaction are not independent.


  2. Significance level and chi-square test statistic: We usually choose a significance level like 0.05 and compare the p-value from the chi-square test to this level. For a table with r rows and c columns, the test statistic is compared against a chi-square distribution with (r - 1)(c - 1) degrees of freedom; here, (3 - 1)(3 - 1) = 4.


G. Exploratory Visualization: Proportional Stacked Bar Plot


Visualizing the data can often make abstract statistics more concrete.

  1. Calculating proportions: We need to calculate the proportion of each job satisfaction level within each age category.

import pandas as pd

data = pd.DataFrame({
    'age_cat': ['Young', 'Young', 'Middle', 'Old', 'Old'],
    'job_satisfaction': ['High', 'Low', 'High', 'Medium', 'Low']
})

# Calculate proportions
proportions = data.groupby('age_cat')['job_satisfaction'].value_counts(normalize=True)
proportions = proportions.unstack().fillna(0)
print(proportions)

Output:

job_satisfaction   High  Low  Medium
age_cat
Middle             1.0  0.0     0.0
Old                0.0  0.5     0.5
Young              0.5  0.5     0.0

  2. Using the plot method to create a proportional stacked bar plot:

import matplotlib.pyplot as plt

proportions.plot(kind='bar', stacked=True)
plt.ylabel('Proportion')
plt.title('Job Satisfaction by Age Category')
plt.show()

This code snippet will create a bar plot showing the proportional distribution of job satisfaction within age categories.


H. Chi-Square Independence Test


Now, let's perform the chi-square independence test to analyze the relationship between age categories and job satisfaction levels.

  1. Performing chi-square independence test:

# Chi-square test of independence on the job-satisfaction DataFrame;
# unpack the expected counts, observed counts, and test statistics
expected, observed, stats = pg.chi2_independence(data, x='age_cat', y='job_satisfaction')
print(stats)

  2. P-value and conclusion on independence: The p-value from the result will help us conclude whether age categories and job satisfaction levels are independent. If the p-value < 0.05, we reject the null hypothesis.


I. Swapping Variables and Testing


Understanding the effects of swapping variables can provide deeper insight into our analysis.

  1. Effect of swapping variables: Swapping the two variables in a chi-square test may change how we phrase the question, but it won't affect the statistic or the p-value (see the sketch after this list).

  2. Phrasing questions regarding independence: Depending on how the question is phrased, the role of each variable might be interpreted differently.

  3. Direction and tails in the test: A chi-square test of independence is always a right-tailed test. It only looks for deviation from independence, not the direction of dependence.
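
To verify the swap claim on the job-satisfaction data, here is a quick sketch; it assumes the DataFrame data and the pingouin import from the earlier sections are still in scope, and that pingouin's stats table exposes its usual 'test' and 'pval' columns:

# Run the test both ways; the Pearson statistic and p-value match
_, _, stats_xy = pg.chi2_independence(data, x='age_cat', y='job_satisfaction')
_, _, stats_yx = pg.chi2_independence(data, x='job_satisfaction', y='age_cat')

print(stats_xy.loc[stats_xy['test'] == 'pearson', 'pval'].iloc[0])
print(stats_yx.loc[stats_yx['test'] == 'pearson', 'pval'].iloc[0])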


J. Niche Uses of Chi-Square Tests


Apart from the common use-cases, chi-square tests can be applied in some specialized scenarios.

  1. Left-tailed chi-square tests: Though rare, left-tailed chi-square tests can be used to test if variances are less than a specific value.

  2. Chi-square tests of variance: Variance testing using chi-square allows comparing the variance of a sample to a theoretical variance, as sketched below.
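
Here is a minimal sketch of a left-tailed chi-square variance test with scipy; the sample and the hypothesized variance are made up. Under the null hypothesis, the statistic T = (n - 1) * s^2 / sigma0^2 follows a chi-square distribution with n - 1 degrees of freedom:

import numpy as np
from scipy.stats import chi2

# Hypothetical sample and hypothesized variance
sample = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 9.9, 10.1])
sigma0_sq = 0.5   # H0: the population variance equals 0.5

n = len(sample)
sample_var = sample.var(ddof=1)   # unbiased sample variance

# Test statistic
t_stat = (n - 1) * sample_var / sigma0_sq

# Left-tailed p-value: P(chi2(n - 1) <= T); a small value supports
# Ha: the population variance is less than 0.5
p_value = chi2.cdf(t_stat, df=n - 1)
print("T:", t_stat, "p-value:", p_value)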


II. Chi-Square Goodness of Fit Tests


A. Introduction to Chi-Square Goodness of Fit

Goodness of fit tests help in assessing how well our data fit a hypothesized distribution.

  1. Comparison of proportions in a single categorical variable: This test enables us to determine whether a single categorical variable follows a specified distribution, by comparing the observed counts in each category to the counts expected under that distribution.


B. Example using a Specific Survey Question

Let's take an example to illustrate the chi-square goodness of fit test.

  1. Understanding user feelings about a coding problem: Suppose we have data on how users feel about a coding problem (Easy, Medium, Hard), and we want to test if the feelings are distributed evenly.


C. Declaring Hypotheses

  1. Hypothesized distribution and significance level: We hypothesize that feelings about the coding problem are equally distributed across the three categories (H0), against the alternative that at least one category deviates from this distribution (Ha), and we test at a significance level of 0.05.


D. Hypothesized Counts by Category

  1. Calculation of the hypothesized counts:

import scipy.stats as stats

# Observed counts for Easy, Medium, Hard (n = 100)
observed_counts = [30, 40, 30]

# scipy's chisquare expects expected *counts* that sum to the same total
# as the observed counts, not proportions: 100 * 1/3 per category
expected_counts = [100 / 3, 100 / 3, 100 / 3]

chi2_stat, p_val = stats.chisquare(observed_counts, f_exp=expected_counts)

print("Chi2 Stat:", chi2_stat)
print("P-value:", p_val)

Output:

Chi2 Stat: 2.0
P-value: 0.36787944117144233
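
As a sanity check, the statistic can be reproduced by hand from the goodness of fit formula: chi2 is the sum of (observed - expected)^2 / expected over the categories.

# Reproduce the statistic manually
manual_chi2 = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected_counts))
print(manual_chi2)  # 2.0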


E. Visualizing Counts

  1. Bar plot of observed and hypothesized counts:

# Overlapping bars compare the observed counts with the hypothesized counts
plt.bar(['Easy', 'Medium', 'Hard'], observed_counts, alpha=0.5, label='Observed')
plt.bar(['Easy', 'Medium', 'Hard'], expected_counts, alpha=0.5, label='Expected')
plt.legend()
plt.ylabel('Counts')
plt.title('Comparison of Observed and Expected Counts')
plt.show()

This plot showcases how the observed counts compare to what was expected.


F. Chi-Square Goodness of Fit Test

  1. Running the goodness of fit test with the "scipy" library:

# Chi-square goodness of fit test
result = stats.chisquare(observed_counts, f_exp=expected_counts)
print(result)

  2. P-value and conclusion on the sample distribution: If the p-value < 0.05, we reject the null hypothesis that the observed distribution matches the expected distribution. Here the p-value is about 0.37, well above 0.05, so we fail to reject: the data are consistent with an even distribution of feelings.


III. Using Chi-Square Tests for Multiple Proportions


A. Introduction to Multiple Proportions Testing


Testing multiple proportions allows us to compare the proportions of more than two groups, and a chi-square test is often used for this purpose.


B. Example of Multiple Proportions Testing


1. Choosing the Categories for Analysis


Suppose we have data on customer satisfaction (Satisfied, Neutral, Unsatisfied) from different regions.


2. Data Organization


Our data might look like this:

  • Region A: [45, 30, 25]

  • Region B: [30, 40, 30]

  • Region C: [50, 35, 15]


3. Performing Chi-Square Test for Multiple Proportions

import scipy.stats as stats

# Observed counts
observed_counts = [
    [45, 30, 25],
    [30, 40, 30],
    [50, 35, 15]
]

chi2_stat, p_val, dof, expected_counts = stats.chi2_contingency(observed_counts)

print("Chi2 Stat:", chi2_stat)
print("P-value:", p_val)

Output (values rounded):

Chi2 Stat: 11.6286
P-value: 0.0203


C. Interpreting the Result


The p-value indicates whether customer satisfaction differs significantly between the regions. A p-value less than 0.05 suggests a significant difference in satisfaction across regions; here, p is roughly 0.02, so we reject the hypothesis that the satisfaction distribution is the same in every region.
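
To see which cells drive a significant result, a common follow-up is to inspect the Pearson residuals, (observed - expected) / sqrt(expected): cells with large absolute residuals deviate most from independence. A minimal sketch, reusing the arrays from the test above:

import numpy as np

# Pearson residuals: cells far from 0 contribute most to the chi-square statistic
residuals = (np.array(observed_counts) - expected_counts) / np.sqrt(expected_counts)
print(np.round(residuals, 2))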


D. Visualization of the Proportions


Visualizing the data can provide insights into the patterns:

import matplotlib.pyplot as plt

labels = ['Region A', 'Region B', 'Region C']
satisfied = [45, 30, 50]
neutral = [30, 40, 35]
unsatisfied = [25, 30, 15]

plt.bar(labels, satisfied, label='Satisfied')
plt.bar(labels, neutral, bottom=satisfied, label='Neutral')
plt.bar(labels, unsatisfied, bottom=[i+j for i,j in zip(satisfied, neutral)], label='Unsatisfied')

plt.ylabel('Counts')
plt.title('Customer Satisfaction by Region')
plt.legend()
plt.show()

This stacked bar plot helps to visualize the satisfaction levels across different regions.


Conclusion


In this comprehensive tutorial, we've explored the various facets of chi-square tests. We began with the chi-square test of independence, delving into the relationship between variables and understanding how to declare hypotheses. We then explored the chi-square goodness of fit tests, allowing us to assess how well our data fits a distribution.


Lastly, we looked at the use of chi-square tests for multiple proportions. Practical examples, code snippets, and visualizations were provided throughout to facilitate understanding.


Chi-square tests are versatile and essential tools in statistical analysis, applicable to various scenarios in data science. By mastering these concepts, you can deepen your analytical skills and make more informed decisions based on data.

Whether testing for independence, goodness of fit, or multiple proportions, chi-square tests remain an invaluable method for understanding categorical data and uncovering insights within datasets.
