top of page

Hierarchical Clustering in Python: A Comprehensive Guide


Introduction to Hierarchical Clustering


Clustering is a method of unsupervised learning, and it's about grouping similar items together. Think of it as arranging a set of books by genre; books on the same genre are grouped together, and each group represents a cluster. Hierarchical clustering is a specific type of clustering that builds nested clusters by merging or splitting them successively.

This guide will lead you through the underlying mathematics and the application

of hierarchical clustering using Python's SciPy library.


Overview of Clustering Algorithms


Clustering algorithms can be broadly divided into two categories:

  1. Partitional Clustering: This is like a one-time division of a city into neighborhoods. K-Means is a famous example.

  2. Hierarchical Clustering: This is more like organizing those neighborhoods into a tree structure, where larger neighborhoods are subdivided into smaller ones.


Focus on Hierarchical Clustering


Hierarchical clustering works by grouping data over a variety of scales by creating a cluster tree or dendrogram. You can think of it like a tree's branches; the leaves are individual clusters, and the trunk represents one large cluster encompassing all the data.


Examination of Various Parameters of Hierarchical Clustering Algorithm


Just as a gardener examines the soil, sunlight, and water before planting a tree, various parameters need to be considered when performing hierarchical clustering, such as distance metrics and linkage criteria.


Usage of SciPy for Implementation


SciPy's hierarchical clustering module provides all the necessary tools to explore and implement hierarchical clustering. It's like a Swiss Army knife for data scientists.


Creating a Distance Matrix Using Linkage


Computing the Distance Matrix at Each Stage


The distance matrix contains the distances between all pairs of points in the dataset. It's like measuring the distance between all pairs of houses in a town. You can compute it using the linkage function in SciPy.

from scipy.cluster.hierarchy import linkage
import numpy as np

# Example data
data = np.array([[1,2], [5,5], [3,4]])

# Computing the distance matrix using Euclidean distance
linked = linkage(data, 'single')


Using the Linkage Method in SciPy


The linkage method defines how the distance between clusters is measured. Here's a simple analogy: it's like deciding whether to measure the distance between two cities by the shortest road, the fastest road, or some other criteria.


Explanation of Four Parameters: Observations, Method, Metric, Optimal_Ordering


These four parameters are vital for hierarchical clustering.

  • Observations: Your data points.

  • Method: The way to calculate the distance between clusters (e.g., 'single', 'complete').

  • Metric: The metric to calculate distance (e.g., Euclidean, Manhattan).

  • Optimal_Ordering: A Boolean value, if set to True, it reorders the linkage matrix for better visualization.

linked = linkage(data, method='single', metric='euclidean', optimal_ordering=False)


Discussion on Euclidean Distance


Euclidean distance is the "straight-line" distance between two points. If you think of it like a bird flying from one house to another, that's the path it would take.


Selecting the Right Method


When clustering, the choice of method is akin to choosing the right tool for a job. Each method measures distances between clusters differently, impacting the final result.


Different Methods to Decide Cluster Separation


There are several methods to decide cluster separation, each with its unique characteristics. Imagine these as different ways to measure the distance between two towns - by air, by road, by rail. Let's explore them:


Explanation of Single, Complete, Average, Centroid, Median, and Ward Methods

  • Single Linkage: Measures the shortest distance between two clusters, like the closest neighboring houses in two towns. single_linked = linkage(data, 'single')

  • Complete Linkage: Measures the longest distance between two clusters, akin to the furthest neighboring houses in two towns. complete_linked = linkage(data, 'complete')

  • Average Linkage: Takes the average distance between all pairs of objects in the clusters, much like an average of all the roads between two towns. average_linked = linkage(data, 'average')

  • Centroid Linkage: Measures the distance between the centroids of two clusters, similar to the distance between the town centers. centroid_linked = linkage(data, 'centroid')

  • Median Linkage: Uses the median of the distances, similar to finding the middle road among all the roads between two towns. median_linked = linkage(data, 'median')

  • Ward Method: Minimizes the total within-cluster variance, like optimizing the locations of town centers to minimize the average travel distance for all residents. ward_linked = linkage(data, 'ward')


Creating Cluster Labels with fcluster


Creation of Distance Matrix


After determining the method to use, the distance matrix is created using the linkage matrix, much like mapping the distances between the towns using our selected measurement method.

from scipy.cluster.hierarchy import fcluster

# Using the linkage matrix from 'ward' method
clusters = fcluster(ward_linked, 2, criterion='maxclust')


Use of fcluster Method


The fcluster method assigns the observations to clusters, a bit like assigning houses to neighborhoods.


Explanation of Three Arguments: Distance Matrix, Number of

Clusters, and Criteria

  • Distance Matrix: The linkage matrix created earlier.

  • Number of Clusters: The desired number of clusters, like deciding how many neighborhoods you want.

  • Criteria: This defines how the clusters are formed, such as maximizing or minimizing a certain attribute.

# Creating clusters using the 'ward' linkage matrix, forming 2 clusters
clusters = fcluster(ward_linked, 2, criterion='maxclust')

The above code snippets provide insights into the different linkage methods and how to create clusters using the fcluster method.


Hierarchical Clustering Methods


The process of hierarchical clustering can be likened to building a family tree, where each leaf connects to a branch representing a cluster. Various methods interpret these connections differently.


Explanation and Differences of Various Methods


Ward Method


The Ward method minimizes the total within-cluster variance. Imagine it as grouping family members who look alike.

from scipy.cluster.hierarchy import dendrogram

ward_link = linkage(data, method='ward')
dendrogram(ward_link)
plt.title('Ward Method')
plt.show()

Output:

Ward Method Dendrogram (Visual here)


Single Method


The Single method links clusters based on the shortest distance. Think of it as connecting the closest relatives.

single_link = linkage(data, method='single')
dendrogram(single_link)
plt.title('Single Method')
plt.show()

Output:

Single Method Dendrogram (Visual here)


Complete Method


The Complete method links clusters based on the longest distance. This is like connecting the most distant relatives in the family tree.

complete_link = linkage(data, method='complete')
dendrogram(complete_link)
plt.title('Complete Method')
plt.show()

Output:

Complete Method Dendrogram (Visual here)


Comparison of Results Between Different Methods


By visually comparing the dendrograms, you can observe how different methods

group the data differently, much like different interpretations of a family tree.


Visualization of Clusters


Importance of Visualizing Clusters


Visualizing clusters is akin to drawing a geographical map. It provides insights into the landscape of your data, allowing you to see how observations group together.


Explanation of Seaborn and Matplotlib


Introduction to Seaborn as a Visualization Library


Seaborn is a powerful visualization library that provides aesthetically pleasing and informative statistical graphics. Think of it as the artist's brush for painting your data.

import seaborn as sns

sns.clustermap(data)
plt.show()

Output:

Cluster Map (Visual here)


Comparison of Matplotlib and Seaborn for Cluster Visualization


While Matplotlib provides a foundational framework (akin to pencil sketches), Seaborn adds advanced aesthetics (like adding color and depth to a painting).


Dendrograms: Deciding How Many Clusters


Introduction to Dendrograms


A dendrogram is like a family tree for your data. It shows how different

observations are grouped together into clusters.


Creation of a Dendrogram in SciPy


Creating a dendrogram is a straightforward process using the dendrogram function.

dendrogram(ward_link)
plt.title('Dendrogram')
plt.show()

Output:

Dendrogram (Visual here)


Demonstration and Interpretation of Dendrograms


Interpreting a dendrogram is like reading a family tree. You can observe how closely related the observations are and decide how many clusters best represent your data.


Limitations of Hierarchical Clustering


Hierarchical clustering is a powerful technique, but like all methods, it has its constraints and drawbacks.


Challenges in Performing Hierarchical Clustering


Imagine trying to build a complete family tree for all the people in a large city; the process would be immensely complex. Similarly, hierarchical clustering faces challenges when dealing with large datasets.


Measuring Speed in Hierarchical Clustering


Design of a Task to Check the Runtime


The time it takes to run hierarchical clustering can be a concern, especially with large data sets. It's like timing how long it takes to complete a massive jigsaw puzzle.

import time

start_time = time.time()
linkage(data, method='ward')
end_time = time.time()

print("Time taken: ", end_time - start_time, "seconds")

Output:

Time taken: x seconds


Usage of the Timeit Module to Test Different Number of Data Points


Using the timeit module, you can experiment with different data sizes to see how they impact the runtime, similar to comparing different puzzle sizes.

import timeit

def clustering_time():
    linkage(data, method='ward')

time_taken = timeit.timeit(clustering_time, number=10)
print("Average time taken for 10 runs: ", time_taken/10, "seconds")

Output:

Average time taken for 10 runs: x seconds


Conclusion


Hierarchical clustering is a robust and versatile clustering technique that allows us to understand the underlying structures within our data. By exploring different methods, visualizing clusters, and understanding its limitations, we've embarked on a journey akin to mapping uncharted lands or unraveling complex family histories. The code snippets and visual aids have hopefully provided a practical and engaging way to delve into this topic.


Like any tool or method, understanding when and how to apply it is crucial. Hierarchical clustering offers unique insights and challenges that, when navigated wisely, can lead to profound discoveries about the data you are working with.


The beauty of hierarchical clustering lies in its ability to provide a structured, tree-like representation of data, giving a comprehensive view of how elements are interconnected. By working through this tutorial, you have gained not only technical knowledge but a deeper appreciation of the connections and relationships that lie within your data, ready to be explored.

bottom of page