
Comprehensive Guide to Supervised Learning, Generalization Error, and Ensemble Learning


Generalization Error


Supervised Learning - Under the Hood


Supervised learning is a foundational approach in machine learning where we train a model to map features to corresponding labels. It resembles a teacher and a student: the teacher supervises the learning by providing correct answers, and the student learns to predict outcomes from the given inputs.


Mapping between features and labels


Consider a simple example where you want to predict the price of a house based on its features such as square footage, number of bedrooms, etc. Here, features would be the characteristics of the house, and the label would be the price.

# Example code snippet
from sklearn.linear_model import LinearRegression

features = [[1500, 3], [2000, 4], [1000, 2]] # Square footage, number of bedrooms
labels = [300000, 400000, 200000] # Price of the house

model = LinearRegression()
model.fit(features, labels)

Noise and randomness in data generation


In real life, data often contains noise or random variations. Using the house price prediction analogy, there might be other unknown factors affecting the price, like the neighborhood's popularity. Noise can make the learning process more challenging, as shown below:

# Simulated noise in the data
import numpy as np

noise = np.random.normal(0, 10000, len(labels))
noisy_labels = np.array(labels) + noise # Convert the list first so the addition is explicitly element-wise

model.fit(features, noisy_labels)


Goals of Supervised Learning


The main goal of supervised learning is to create models that can generalize well from the training data to unseen data. This process consists of two key aspects:


Model approximation


This involves finding a function that approximates the true underlying relationship between features and labels. In our housing example, this would be finding a mathematical formula that can predict house prices accurately based on the given features.


Predictive error on unseen datasets


The model's ability to make correct predictions on new, unseen data is a crucial measure of its success. Using cross-validation or a separate test dataset helps in evaluating this ability.

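# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(features, noisy_labels, test_size=0.2)

model.fit(X_train, y_train)
prediction = model.predict(X_test)
test_mse = mean_squared_error(y_test, prediction) # Error on data the model has not seen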


Difficulties in Approximating Function


Overfitting and Underfitting Explained


Description and illustration of overfitting and underfitting


Understanding overfitting and underfitting is vital in creating effective models. Let's explore these terms with an analogy:

  • Overfitting: Imagine fitting a flexible rubber band around the shape of your hand. It takes the exact shape but will not fit well on other hands. Overfitting in machine learning is like this rubber band; the model learns the training data too well, including its noise and outliers, and performs poorly on unseen data.

  • Underfitting: Now imagine holding a straight, rigid stick against your hand. It doesn't follow the shape at all. In machine learning, underfitting is when the model is too simple to capture the underlying trend in the data.

# Code snippet to demonstrate overfitting and underfitting
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Generate some sample data
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + np.random.normal(0, 0.1, 100)

# Underfitting model (Linear)
model_underfit = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
model_underfit.fit(x.reshape(-1, 1), y)

# Overfitting model (High-degree polynomial)
model_overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model_overfit.fit(x.reshape(-1, 1), y)

# Plotting
plt.scatter(x, y, color='gray')
plt.plot(x, model_underfit.predict(x.reshape(-1, 1)), label='Underfitting (Linear)')
plt.plot(x, model_overfit.predict(x.reshape(-1, 1)), label='Overfitting (Polynomial)')
plt.legend()
plt.show()

The above plot illustrates overfitting and underfitting. The overfitting model tries to capture every point, including the noise, while the underfitting model is too rigid to capture the trend.
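The visual impression can also be checked numerically. The sketch below is a minimal illustration, not part of the original example: it assumes a fixed random seed and a 70/30 split (both arbitrary choices) and compares training and test error for the linear and degree-15 models on the same kind of noisy sine data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data, as in the plot above (seed chosen arbitrarily for repeatability)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 100)

# Hold out 30% of the points to measure error on unseen data
X_tr, X_te, y_tr, y_te = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0
)

errors = {}  # degree -> (train MSE, test MSE)
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )
    print(f"degree={degree:2d}  train MSE={errors[degree][0]:.4f}  "
          f"test MSE={errors[degree][1]:.4f}")
```

A high error on both sets points to underfitting, while a training error clearly below the test error points to overfitting. On a small, smooth dataset like this one the gap for the degree-15 model may be modest.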


Generalization Error


Decomposition into bias, variance, and irreducible error


The generalization error can be decomposed into three components:

  1. Bias: The error introduced by approximating a real-world problem by a simplified model. High bias usually leads to underfitting.

  2. Variance: The error due to the model's sensitivity to small fluctuations in the training data. High variance usually leads to overfitting.

  3. Irreducible Error: The noise inherent in any real-world data that can’t be removed.
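For squared-error loss this decomposition can be written out explicitly. The notation below is an assumption introduced for illustration: the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and $\hat{f}_D$ is the model fitted on a random training set $D$. Note that the bias enters as its square.

```latex
\mathbb{E}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```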

The relationship among these components is visualized in the Bias-Variance tradeoff, which we'll cover next.


Bias and Variance


High bias leading to underfitting


A model with high bias oversimplifies the problem, ignoring relevant relations between features and outputs, leading to underfitting.


High variance leading to overfitting


On the other hand, a model with high variance pays too much attention to the training data, including the noise, leading to overfitting.
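Bias and variance can be estimated empirically when the true function is known. The following is a rough sketch under stated assumptions: the true function is a sine curve, the training inputs are fixed and only the noise is redrawn on each repetition, and the degrees (1, 4, 12), the noise level, and the number of repetitions are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def true_fn(x):
    """Assumed ground-truth function for this simulation."""
    return np.sin(2 * np.pi * x)

# Fixed training inputs; only the label noise changes between repetitions
x_train = np.linspace(0, 1, 30).reshape(-1, 1)
f_train = true_fn(x_train).ravel()
n_repeats, noise_sd = 200, 0.1

results = {}  # degree -> (estimated bias^2, estimated variance)
for degree in (1, 4, 12):
    preds = np.empty((n_repeats, len(x_train)))
    for i in range(n_repeats):
        y_noisy = f_train + rng.normal(0, noise_sd, len(f_train))
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(x_train, y_noisy).predict(x_train)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_train) ** 2)  # squared gap to the truth
    variance = np.mean(preds.var(axis=0))          # spread across refits
    results[degree] = (bias_sq, variance)
    print(f"degree={degree:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Under these assumptions the linear model should show the largest squared bias, and the degree-12 model the largest variance, which matches the two descriptions above.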


Model Complexity


Flexibility to approximate the true function


Model complexity is a critical factor in finding the right balance between overfitting and underfitting. A more complex model may fit the training data almost perfectly but fail on new data. A simpler model may fit the training data less closely yet generalize better.


Bias-Variance Tradeoff


Finding the balance between bias and variance


The goal is to find a model with the right level of complexity that balances bias and variance, minimizing the total error.


Visual explanation of bias-variance tradeoff


Here's a code snippet to visually demonstrate the bias-variance tradeoff:

# Code snippet to plot bias-variance tradeoff
from sklearn.metrics import mean_squared_error

# Split the sine data (x, y) from the overfitting example; the three-row
# housing split cannot support high-degree polynomials.
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0)

degrees = range(1, 15)
train_errors = []
test_errors = []

for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

plt.plot(degrees, train_errors, label='Training Error')
plt.plot(degrees, test_errors, label='Test Error')
plt.xlabel('Model Complexity (Degree of Polynomial)')
plt.ylabel('Mean Squared Error')
plt.legend()
plt.show()

Diagnosing Bias and Variance Problems


Estimating the Generalization Error


Challenges in estimating generalization error


Accurately estimating how well a model will perform on unseen data is a challenging task. An error in the estimation may lead to selecting an incorrect model or over-optimizing for the training data.


Splitting data into training and test set for evaluation


A common practice is to split the data into a training set to fit the model, and a separate test set to evaluate how well the model generalizes.

# Code snippet to split data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)


Better Model Evaluation with Cross-Validation


Use of cross-validation, K-Fold-CV, and hold-out-CV for model evaluation


Cross-validation is a powerful method for assessing a model's generalization performance. Two common variants are K-Fold-CV, where the data is split into K roughly equal parts and each part serves as the test set in turn, and hold-out-CV, where a single portion of the data is held back for validation.


K-Fold CV


Process and formula for K-fold cross-validation


K-fold cross-validation divides the dataset into K equal folds. The model is trained K times, each time leaving out one fold for testing, and the average error across all K trials is computed.
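The averaging step can be stated as a formula. The symbols are assumptions introduced here for illustration: $F_k$ is the set of indices in fold $k$, $\hat{f}^{(-k)}$ is the model trained on all folds except $k$, and squared error is used as the per-fold error.

```latex
E_k = \frac{1}{|F_k|} \sum_{i \in F_k} \big(y_i - \hat{f}^{(-k)}(x_i)\big)^2,
\qquad
\mathrm{CV}_K = \frac{1}{K} \sum_{k=1}^{K} E_k
```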

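# Code snippet for K-fold cross-validation
from sklearn.model_selection import KFold, cross_val_score

# cv=5 needs at least five samples, so this uses the 100-point sine data
# (x, y) from the overfitting example instead of the three-row housing data.
model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0) # Shuffle because x is sorted
scores = cross_val_score(model, x.reshape(-1, 1), y, cv=cv,
                         scoring='neg_mean_squared_error') # 5-fold CV

average_mse = -np.mean(scores) # Scores are negated MSE, so flip the sign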
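To make the process explicit, the loop that cross_val_score runs internally can be written by hand. This is a minimal sketch on synthetic linear data (the data, seed, and noise level are assumptions made for illustration), using scikit-learn's KFold splitter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data: y = 3x + noise (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.1, 50)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, evaluate on the one that was left out
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse = mean_squared_error(y[test_idx], model.predict(X[test_idx]))
    fold_errors.append(fold_mse)

cv_error = np.mean(fold_errors)  # average error across the K folds
print(f"per-fold MSE: {np.round(fold_errors, 4)}  mean: {cv_error:.4f}")
```

Every sample is used for testing exactly once, which is why the averaged estimate is usually more stable than a single hold-out split.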


Diagnosing Variance and Bias Problems


Techniques to diagnose and remedy high variance and high bias

  • High Bias (Underfitting): If the model performs poorly on both the training and validation sets, it's an indication of high bias. Remedies include increasing model complexity or adding more features.

  • High Variance (Overfitting): If the model performs well on the training set but poorly on the validation set, it's an indication of high variance. Remedies include gathering more data or using regularization techniques.
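The two diagnoses above can be read off by comparing training and validation error. The sketch below is illustrative only: the synthetic sine data, the polynomial degrees, and the Ridge penalty `alpha=1e-3` are assumptions, and Ridge stands in for "regularization techniques" in general.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small noisy dataset (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 40)

candidates = {
    "linear": make_pipeline(PolynomialFeatures(1), LinearRegression()),
    "poly15": make_pipeline(PolynomialFeatures(15), LinearRegression()),
    "poly15_ridge": make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3)),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
summary = {}  # name -> (mean training MSE, mean validation MSE)
for name, model in candidates.items():
    res = cross_validate(model, X, y, cv=cv,
                         scoring="neg_mean_squared_error",
                         return_train_score=True)
    summary[name] = (-res["train_score"].mean(), -res["test_score"].mean())
    print(f"{name:13s} train MSE={summary[name][0]:.4f}  "
          f"validation MSE={summary[name][1]:.4f}")
```

A model whose training and validation errors are both high fits the high-bias pattern, while a low training error paired with a much higher validation error fits the high-variance pattern. Regularization often narrows that gap, though the right penalty strength depends on the data.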


Example with DecisionTreeRegressor


Here's a practical example using DecisionTreeRegressor, which can suffer from high variance if not carefully tuned:

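from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Use the 100-point sine data (x, y) from the overfitting example; the
# three-row housing data is too small to show a train/test gap.
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0)

# High Variance Model
tree_high_variance = DecisionTreeRegressor(max_depth=20, random_state=0)
tree_high_variance.fit(X_train, y_train)
train_error = mean_squared_error(y_train, tree_high_variance.predict(X_train))
test_error = mean_squared_error(y_test, tree_high_variance.predict(X_test))

# Remedy: Limiting the tree's depth (reducing max_depth, a form of pre-pruning)
tree_remedy = DecisionTreeRegressor(max_depth=5, random_state=0)
tree_remedy.fit(X_train, y_train)
remedy_train_error = mean_squared_error(y_train, tree_remedy.predict(X_train))
remedy_test_error = mean_squared_error(y_test, tree_remedy.predict(X_test))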


K-Fold CV Implementation in Scikit-Learn


Practical demonstration of K-fold cross-validation using the Auto Dataset


Utilizing the K-fold cross-validation technique in Scikit-Learn can be done with ease. Here's an example using the popular "Auto" dataset:

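from sklearn.datasets import fetch_openml

# Fetch Auto dataset (assumes it is published on OpenML under the name "autoMpg")
auto_data = fetch_openml(name="autoMpg", version=1, as_frame=True)
X, y = auto_data.data, auto_data.target

# Keep numeric columns and drop rows with missing values, since
# LinearRegression accepts neither categorical columns nor NaNs
X = X.select_dtypes(include="number").dropna()
y = y.loc[X.index]

# Apply K-fold cross-validation
model = LinearRegression()
k_fold_scores = cross_val_score(model, X, y, cv=5) # R^2 per fold by default

average_k_fold_score = np.mean(k_fold_scores)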


Ensemble Learning


Advantages and Limitations of CARTs


Classification and Regression Trees (CARTs): Advantages and limitations


CARTs, or Decision Trees, are simple and interpretable models, but they can suffer from overfitting, especially with deep trees. Let's briefly look at the main advantages and limitations:

  • Advantages:

    • Easy to interpret and visualize.

    • Capable of handling both numerical and categorical data.


  • Limitations:

    • Sensitive to the training data (high variance).

    • Prone to overfitting if the tree is too deep.



Ensemble Learning Overview


Aggregating predictions of individual models


Ensemble Learning leverages the power of combining multiple individual models to create a more robust and accurate prediction. Think of it as a team of experts coming together to make a decision.


The final prediction is more robust and less prone to errors


By aggregating predictions from different models, the ensemble can often significantly reduce the risk of an erroneous prediction by an individual model.


Ensemble Learning: A Visual Explanation


Diagrammatic representation of ensemble learning for classification


Imagine an election where three politicians vote for one of two parties. Each politician might be biased or err in their judgment. But when their votes are combined, the chance of an overall wrong decision is reduced, provided each voter is right more often than not and their mistakes are not strongly correlated. Ensemble learning works similarly by "voting" on the predictions of multiple models.
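A short calculation shows why combining votes helps. It rests on a strong assumption that real classifiers rarely satisfy exactly: each voter is correct independently with the same probability. The function name and the value 0.7 are hypothetical choices for illustration.

```python
from math import comb

def majority_correct_probability(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Assumes an odd number of voters, each correct with probability p_correct.
    """
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

p3 = majority_correct_probability(3, 0.7)
print(f"single voter: 0.700, majority of three: {p3:.3f}")  # 0.784
```

With three voters who are each right 70% of the time, the majority is right about 78.4% of the time. When the voters' errors are correlated, the gain shrinks.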


Ensemble Learning in Practice: Voting Classifier


Hard voting and its explanation


Hard voting is a simple yet powerful technique in ensemble learning where the final prediction is the class that gets the most votes from individual classifiers.
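The mechanics of hard voting fit in a few lines. The predictions below are made-up labels for illustration, and `hard_vote` is a hypothetical helper rather than a scikit-learn function.

```python
import numpy as np

# Hypothetical predictions from three classifiers (rows) on five samples (columns)
predictions = np.array([
    [0, 1, 1, 0, 1],  # classifier A
    [0, 1, 0, 0, 1],  # classifier B
    [1, 1, 1, 0, 0],  # classifier C
])

def hard_vote(preds):
    """Return the most frequent non-negative integer label per sample."""
    # np.bincount counts occurrences of each label; argmax picks the winner
    # and resolves ties in favour of the smallest label.
    return np.array([np.bincount(column).argmax() for column in preds.T])

majority = hard_vote(predictions)
print(majority)  # [0 1 1 0 1]
```

With an odd number of voters and two classes there are no ties. In other settings the tie-breaking rule matters, and here it simply favours the lowest label.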


Example with three trained classifiers


Let's create an ensemble of three classifiers (Logistic Regression, Decision Tree, and K-Nearest Neighbors) and apply hard voting:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

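# Individual classifiers
log_clf = LogisticRegression(max_iter=10000) # Higher iteration cap so the solver can converge on unscaled features
tree_clf = DecisionTreeClassifier()
knn_clf = KNeighborsClassifier()

# Ensemble model
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('tree', tree_clf), ('knn', knn_clf)],
    voting='hard'
)

# Fitting and scoring the ensemble
# X_train/y_train must hold class labels here (e.g. the breast cancer split
# in the next section); the regression targets used earlier will not work.
voting_clf.fit(X_train, y_train)
score = voting_clf.score(X_test, y_test)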


Voting Classifier in Scikit-Learn (Breast-Cancer Dataset)


Training a voting classifier using LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier


Let's demonstrate the application of the voting classifier on the popular Breast-Cancer dataset:

from sklearn.datasets import load_breast_cancer

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train and score the voting classifier defined in the previous section
voting_clf.fit(X_train, y_train)
voting_score = voting_clf.score(X_test, y_test)


Evaluation and comparison of classifiers


You can also evaluate and compare the performance of individual classifiers within the ensemble against the ensemble itself. This often reveals how the collective decision making of the ensemble leads to more accurate predictions.
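One way to run that comparison is sketched below. The fixed `random_state`, the stratified split, and `max_iter=10000` are choices made here for repeatability and convergence, not settings taken from the examples above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

classifiers = [
    ("lr", LogisticRegression(max_iter=10000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
]
# VotingClassifier fits its own clones, so the individual models stay independent
voting_clf = VotingClassifier(estimators=classifiers, voting="hard")

accuracies = {}
for name, clf in classifiers + [("voting", voting_clf)]:
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name:7s} accuracy = {accuracies[name]:.3f}")
```

The ensemble is not guaranteed to beat its best member on every split. Repeating the comparison over several splits, or with cross-validation, gives a fairer picture.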


Conclusion


Machine learning is a dynamic and multifaceted field that requires a deep understanding of various concepts and techniques. In this tutorial, we have traversed essential topics, ranging from the intricate mapping between features and labels in supervised learning to the robust capabilities of ensemble learning methods.


We started with the underpinnings of supervised learning and delved into the complexities of approximating functions, explaining the phenomena of overfitting and underfitting. We also dissected the components of generalization error, including bias, variance, and irreducible error, and explored the delicate balance between bias and variance.


Practical approaches to diagnosing bias and variance problems were presented, and the utility of cross-validation was emphasized, with hands-on examples using popular Python libraries like Scikit-Learn.

Finally, we explored the powerful concept of ensemble learning, demonstrating how combining multiple models can enhance predictive accuracy. We illustrated this with the Voting Classifier, applied to real-world datasets.


The concepts, examples, and code snippets provided in this tutorial are aimed at fostering a comprehensive understanding of key machine learning techniques. By carefully studying and applying these principles, aspiring data scientists and seasoned professionals alike can build more accurate and robust models, paving the way for impactful insights and decision-making in various domains.

