
Mastering Clustering Techniques in Data Analysis



Analyzing Dominant Colors in Images


1. Introduction to Image Analysis


Understanding the colors within an image is like deciphering the visual DNA of a picture. In this section, we will dive deep into the individual components of an image: the pixels and their primary constituents – Red, Green, and Blue (RGB) values.


Understanding Pixels and their Components (Red, Green, Blue)


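Pixels are the smallest elements of an image. Each pixel stores three intensity values, one each for red, green, and blue, and different combinations of these values produce the full range of colors. In a typical 8-bit image, each value ranges from 0 to 255.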


Example Analogy: Think of pixels as tiny mosaic tiles that come together to form a grand artwork. Each tile (or pixel) can be a blend of red, green, and blue, and together they create the full spectrum of the image.


Reading an Image and Extracting RGB Values

from PIL import Image

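# Read an image and force three channels (drops alpha, expands grayscale)
image = Image.open('image.jpg').convert('RGB')

# Get the RGB values as a flat list of (R, G, B) tuples
pixels = list(image.getdata())
print(pixels[0])  # Outputs the RGB value of the first pixel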


Explanation of k-means clustering with RGB values

k-means clustering is an unsupervised machine learning algorithm that can identify dominant colors by clustering similar RGB values together.


Example Analogy: Imagine a room filled with different colored balloons. k-means clustering is like gathering balloons of similar colors together to identify the main color groups.
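
To make the balloon analogy concrete, here is a minimal sketch of the two steps k-means alternates between: assigning each point to its nearest center, then moving each center to the mean of its points. The function name `kmeans_sketch` and the toy pixel values are assumptions made for illustration; in practice a library implementation such as scikit-learn's `KMeans` handles initialization and convergence more carefully.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Minimal k-means loop: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Start from k distinct points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Hypothetical toy data: three reddish and three bluish pixels
toy_pixels = [(250, 10, 10), (240, 20, 15), (245, 5, 20),
              (10, 10, 250), (20, 15, 240), (5, 20, 245)]
centers, labels = kmeans_sketch(toy_pixels, k=2)
print(centers.round())
print(labels)
```

The reddish pixels should end up sharing one label and the bluish pixels the other; which group is labeled 0 depends on the random starting centers.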


Implementing k-means Clustering on RGB values

from sklearn.cluster import KMeans
import numpy as np

# Convert the pixels into an array
pixel_array = np.array(pixels)

# Use k-means clustering (5 clusters is a starting guess; random_state makes runs repeatable)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pixel_array)


2. Using Clustering to Identify Surface Features in Satellite Images


How k-means clustering segments satellite images

Segmentation of satellite images using k-means clustering helps in identifying different terrains like forests, oceans, and urban areas.


Example Analogy: This is akin to slicing a pie into different sections, each representing a unique terrain feature in the earth's landscape.


Segmenting Satellite Image using k-means Clustering

# Replace each pixel with the center of its cluster
segmented_image = kmeans.cluster_centers_[kmeans.labels_]

# Reshape to the original dimensions (a PIL image has no .shape attribute)
segmented_image = segmented_image.reshape(image.height, image.width, -1)

# Convert to an image
segmented_image = Image.fromarray(np.uint8(segmented_image))
segmented_image.show()


Output of the code


The above code will display the segmented image, highlighting different terrain features based on the dominant colors.
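
The key step in segmentation is replacing every pixel's label with the color of its cluster center. A small sketch of that indexing trick, using a made-up 2x3 image and hand-written labels and centers (all assumptions, not output from a real satellite image):

```python
import numpy as np

# Hypothetical 2x3 image, flattened to 6 pixels, already clustered into 2 groups
height, width = 2, 3
labels = np.array([0, 0, 1, 1, 0, 1])        # one cluster label per pixel
centers = np.array([[30.0, 90.0, 200.0],     # cluster 0: a blue tone
                    [40.0, 160.0, 60.0]])    # cluster 1: a green tone

# Indexing the centers with the labels yields one color per pixel: shape (6, 3)
flat_segmented = centers[labels]

# Restore the image layout: rows, columns, channels
segmented = flat_segmented.reshape(height, width, 3).astype(np.uint8)
print(segmented.shape)
```

Because labels are stored in the same row-major order as the flattened pixels, reshaping puts every color back at its original position.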


3. Tools for Analyzing Dominant Colors


This section covers methods to convert an image into an RGB matrix and to visualize cluster centers.


Methods to Convert an Image into an RGB Matrix

# Convert the image into an RGB matrix
rgb_matrix = np.array(image)
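
An RGB matrix has three axes (rows, columns, channels), while clustering expects one row per pixel. The sketch below shows the reshape between the two, using a small synthetic array as a stand-in for `np.array(image)`; the array contents are an assumption for illustration.

```python
import numpy as np

# Hypothetical stand-in for np.array(image): 4 rows, 5 columns, 3 channels
rgb_matrix = np.zeros((4, 5, 3), dtype=np.uint8)
rgb_matrix[..., 2] = 255  # make every pixel pure blue

# Flatten the two spatial axes; -1 lets NumPy infer rows * columns
pixel_array = rgb_matrix.reshape(-1, 3)
print(pixel_array.shape)

# Split the channels into separate 1-D arrays if needed
red, green, blue = pixel_array.T
```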

Using Visualization Techniques to Display Cluster Centers

import matplotlib.pyplot as plt

# Get the cluster centers
colors = kmeans.cluster_centers_

# Plot the dominant colors (imshow expects floats in [0, 1], so rescale from 0-255)
plt.imshow([colors / 255])
plt.axis('off')
plt.show()


Output of the code


The above code will display a horizontal bar showcasing the dominant colors found in the image.


4. Case Study: Analyzing an Image of the Sea


Now, we will apply everything we have learned to analyze an image of the sea.


Step-by-step Process of Converting the Image to Pixels


Reading the Sea Image and Extracting Pixels

# Force three channels so every pixel is an (R, G, B) tuple
sea_image = Image.open('sea.jpg').convert('RGB')
sea_pixels = list(sea_image.getdata())


Creating a DataFrame with RGB values


Here, we will organize the RGB values in a tabular form for easier analysis.


Creating a DataFrame with RGB values

import pandas as pd

# Convert pixels to DataFrame
sea_pixels_df = pd.DataFrame(sea_pixels, columns=['Red', 'Green', 'Blue'])


Utilizing an Elbow Plot to Determine Clusters


Determining the right number of clusters can be done using an elbow plot.


Elbow Plot

inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(sea_pixels_df)
    inertia.append(kmeans.inertia_)

plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Plot')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()


Output of the code


This plot will help you identify the optimal number of clusters by finding the "elbow" point where the inertia starts to decrease at a slower rate.
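
Inertia is the sum of squared distances from each point to the center of its assigned cluster. The following sketch computes it by hand on a tiny made-up dataset to show why it drops sharply once the number of clusters matches the structure of the data; the helper name `inertia` and the data are assumptions for illustration.

```python
import numpy as np

def inertia(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = points - centers[labels]
    return float((diffs ** 2).sum())

# Hypothetical data: two tight groups, one near 0 and one near 10
points = np.array([[0.0], [1.0], [10.0], [11.0]])

# One cluster: a single center at the overall mean
one = inertia(points, [[5.5]], np.array([0, 0, 0, 0]))
# Two clusters: one center per group
two = inertia(points, [[0.5], [10.5]], np.array([0, 0, 1, 1]))
print(one, two)
```

Going from one cluster to two removes almost all of the inertia here; adding a third cluster could only remove the small remainder, which is the flattening that shows up as the elbow.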


Displaying Dominant Colors in the Image


Displaying Dominant Colors

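# Use the number of clusters read off the elbow plot
optimal_clusters = 3  # placeholder value; replace with the elbow from your plot
kmeans = KMeans(n_clusters=optimal_clusters).fit(sea_pixels_df)
dominant_colors = kmeans.cluster_centers_

# imshow expects floats in [0, 1], so rescale from 0-255
plt.imshow([dominant_colors / 255])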
plt.axis('off')
plt.show()


Output of the code


This code will display a horizontal bar containing the dominant colors in the sea image.
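
The color bar shows which colors were found but not how much of the image each one covers. One possible way to rank them, sketched here with hand-written centers and labels (assumed values, not results from the sea image), is to count the pixels assigned to each cluster:

```python
import numpy as np

# Hypothetical clustering output: 3 centers and a label for each of 10 pixels
centers = np.array([[20.0, 80.0, 160.0],
                    [240.0, 240.0, 230.0],
                    [10.0, 40.0, 90.0]])
labels = np.array([0, 0, 2, 0, 1, 0, 2, 0, 2, 0])

# Count pixels per cluster; minlength keeps a slot for clusters with no pixels
counts = np.bincount(labels, minlength=len(centers))
shares = counts / counts.sum()

# Sort clusters from largest to smallest share
order = np.argsort(counts)[::-1]
for idx in order:
    r, g, b = centers[idx].round().astype(int)
    print(f"rgb({r}, {g}, {b}): {shares[idx]:.0%}")
```

With a fitted model, `kmeans.labels_` and `kmeans.cluster_centers_` would take the place of the hand-written arrays.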


Document Clustering


1. Introduction to Document Clustering


Document clustering is the process of grouping similar documents together based on their content. By employing unsupervised learning techniques, we can automatically categorize news articles, reviews, or any text data into meaningful clusters.


2. Concepts and Techniques in Document Clustering


Building on the clustering concepts explored earlier, let's dive into the methods specifically tailored for text data.


Cleaning and Tokenizing Data Using Natural Language Processing (NLP)


Cleaning and tokenizing are the first steps in processing text data. Tokenizing divides text into individual words or tokens, and cleaning removes unwanted characters.


Example Analogy: Think of tokenizing like chopping vegetables for a stew; each word is a different ingredient, and we want them separate and cleaned for the perfect blend.
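
As a rough illustration of what cleaning and tokenizing do to a sentence, the sketch below lowercases the text and keeps only runs of letters, digits, and apostrophes. The function name `clean_and_tokenize` and the regular expression are assumptions chosen for this example; a vectorizer such as `CountVectorizer` applies its own lowercasing and token pattern internally.

```python
import re

def clean_and_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = clean_and_tokenize("The room was CLEAN, and the staff weren't rude!")
print(tokens)
```

Punctuation and capitalization disappear, so "CLEAN," and "clean" are treated as the same token.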


Cleaning and Tokenizing Text Data

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = ["Text of document one.", "Text of document two."]

# Lowercase, tokenize, and drop common English stop words
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)


Understanding Document Term Matrices and Sparse Matrices


A Document Term Matrix (DTM) is a matrix that describes the frequency of terms in a collection of documents. It's often a sparse matrix since most terms don't appear in most documents.


Creating a Document Term Matrix

# Display the Document Term Matrix
print(X.toarray())
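
To see where the sparsity comes from, here is a small hand-built Document Term Matrix using only the standard library. The mini-corpus is made up for illustration, and the construction is a sketch of the idea rather than how `CountVectorizer` is implemented.

```python
from collections import Counter

# Hypothetical mini-corpus, already tokenized
docs = [["clean", "room", "friendly", "staff"],
        ["noisy", "room"],
        ["friendly", "staff", "great", "breakfast"]]

# Vocabulary: every distinct term, in sorted order, becomes a column
vocabulary = sorted({term for doc in docs for term in doc})

# One row per document, one count per vocabulary term
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocabulary])

# Fraction of cells that are zero
zeros = sum(row.count(0) for row in dtm)
sparsity = zeros / (len(dtm) * len(vocabulary))
print(vocabulary)
print(dtm)
print(f"{sparsity:.0%} of entries are zero")
```

Even with three short documents, about half of the cells are zero. With thousands of documents and a large vocabulary the share of zeros grows, which is why these matrices are stored in a sparse format.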


Calculating Term Frequency - Inverse Document Frequency (TF-IDF)


TF-IDF weighs the importance of each term in a document relative to a collection of documents. It's like a score that tells how significant a word is to a document in a corpus.


Calculating TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Compute TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_X = tfidf_vectorizer.fit_transform(documents)
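
The idea behind the score can be worked through by hand. The sketch below uses the textbook form, term frequency multiplied by the logarithm of (number of documents / number of documents containing the term), on a made-up corpus. Note that `TfidfVectorizer` by default uses a smoothed variant of the inverse document frequency and normalizes each row, so its numbers will differ from these.

```python
import math

# Hypothetical tokenized corpus
docs = [["clean", "room", "clean", "sheets"],
        ["noisy", "room"],
        ["great", "room", "service"]]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF; assumes the term occurs in at least one document."""
    tf = doc.count(term) / len(doc)                # share of the document
    df = sum(1 for d in corpus if term in d)       # documents containing the term
    idf = math.log(len(corpus) / df)               # rarer terms score higher
    return tf * idf

clean_score = tf_idf("clean", docs[0], docs)
room_score = tf_idf("room", docs[0], docs)
print(clean_score, room_score)
```

"room" appears in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing, while "clean" is frequent in the first document and absent elsewhere, so it scores highest there.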


Performing Clustering Using k-means


Building on the k-means clustering learned earlier, we apply it to text data.


k-means Clustering on Text Data

# Apply k-means clustering
kmeans = KMeans(n_clusters=2).fit(tfidf_X)


3. Exploring Top Terms in Clusters


Understanding the content of each cluster requires analyzing the most significant terms.


Identifying the Top Terms in Different Clusters Using TF-IDF Weights


Finding Top Terms in Clusters

# Get top terms per cluster
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()

for i in range(kmeans.n_clusters):  # one pass per cluster
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print(f"Cluster {i}: {', '.join(top_terms)}")


Analyzing Hotel Reviews to Demonstrate Clustering


As a real-world example, let's apply clustering to hotel reviews.


Clustering Hotel Reviews

# Assuming reviews is a list of hotel reviews
tfidf_reviews = tfidf_vectorizer.fit_transform(reviews)
kmeans_reviews = KMeans(n_clusters=5).fit(tfidf_reviews)


4. Advanced Considerations in Document Clustering


Suggestions for Additional Data Preprocessing


More advanced preprocessing like stemming and lemmatization can further refine the clusters.


Stemming and Lemmatization

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

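# Lemmatization (requires NLTK data, downloaded once with nltk.download:
# 'wordnet' for the lemmatizer and 'punkt', or 'punkt_tab' on newer NLTK releases, for word_tokenize)
lemmatizer = WordNetLemmatizer()
lemmatized_documents = [
    " ".join(lemmatizer.lemmatize(w) for w in word_tokenize(doc))
    for doc in documents
]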


Handling Large Datasets and Considering Different Implementations


Large datasets may require specialized techniques like dimensionality reduction or optimized implementations of algorithms.
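
One possible combination, sketched below under assumptions, is to reduce the feature count with `TruncatedSVD` (which also accepts sparse TF-IDF matrices) and then cluster with `MiniBatchKMeans`, a scikit-learn variant of k-means that updates centers from small random batches. The random matrix `X_large` is a stand-in for real data, and the component count, batch size, and cluster count are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a large feature matrix: 2,000 rows, 50 features
rng = np.random.default_rng(0)
X_large = rng.normal(size=(2000, 50))

# Reduce 50 features to 10 components before clustering
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_large)

# Cluster using small random batches instead of the full dataset per update
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3, random_state=0)
labels = mbk.fit_predict(X_reduced)
print(X_reduced.shape, np.bincount(labels, minlength=5))
```

The trade-off is speed and memory against a small loss in cluster quality, so it is worth comparing against regular k-means on a sample of the data.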


Clustering with Multiple Features


1. Introduction to Multivariate Clustering


Multivariate clustering involves grouping items based on more than two variables or features. It's like forming football teams where players are not only judged by their skills but also their stamina, teamwork, and strategic thinking.


2. Validating and Interpreting Clustering Results


When working with multiple features, validation and interpretation become more challenging yet essential.


Performing Basic Checks


Basic checks help ensure the clustering process has functioned properly.


Evaluating Cluster Centers and Sizes

import numpy as np
from sklearn.cluster import KMeans

# Assume X_multivariate contains multiple features
kmeans = KMeans(n_clusters=3).fit(X_multivariate)

# Display cluster centers
print(kmeans.cluster_centers_)

# Display cluster sizes
print(np.bincount(kmeans.labels_))


Analyzing Clusters with Similar Centers


Sometimes clusters may have similar centers, leading to overlapping clusters. In such cases, further analysis and perhaps a different number of clusters might be needed.
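
A quick way to spot clusters with similar centers is to compute the distance between every pair of centers and look at the smallest one. The sketch below does this with hand-written centers (assumed values); with a fitted model, `kmeans.cluster_centers_` would take their place.

```python
import numpy as np

# Hypothetical cluster centers: rows 0 and 1 are nearly identical
centers = np.array([[1.0, 2.0, 3.0],
                    [1.1, 2.0, 3.1],
                    [8.0, 9.0, 7.0]])

# Pairwise Euclidean distances via broadcasting: result is 3 x 3
diffs = centers[:, None, :] - centers[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))

# Ignore the zero diagonal and find the closest pair of centers
np.fill_diagonal(distances, np.inf)
i, j = np.unravel_index(distances.argmin(), distances.shape)
print(f"Closest centers: {i} and {j}, distance {distances[i, j]:.3f}")
```

If the closest pair is much nearer to each other than to the remaining centers, that is a hint to try one fewer cluster and compare the results.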


3. Visualization Techniques for Clustering


Visualizing multi-dimensional data requires different types of plots.


Understanding and Utilizing Different Types of Plots


Visualizing Variables Across Clusters Using Bar Charts

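import numpy as np
import matplotlib.pyplot as plt

# Mean of every feature within each cluster
clusters_means = [X_multivariate[kmeans.labels_ == i].mean(axis=0) for i in range(3)]

# Grouped bar chart: offset each cluster so its bars do not hide the others
fig, ax = plt.subplots()
n_features = X_multivariate.shape[1]
width = 0.8 / len(clusters_means)
for i, cluster_mean in enumerate(clusters_means):
    ax.bar(np.arange(n_features) + i * width, cluster_mean, width, label=f'Cluster {i}')
ax.legend()
plt.show()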


4. Identifying Top Items in Clusters


Examining specific examples within clusters can provide insights.


Examining Top Players in Different Clusters


Identifying Top Players in Clusters

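# Assuming X_multivariate is a DataFrame indexed by player name,
# and 'scores' is the feature of interest
for i in range(3):  # number of clusters
    top_players = X_multivariate[kmeans.labels_ == i]['scores'].nlargest(5).index
    # Convert index labels to strings so join also works for numeric indexes
    print(f"Cluster {i}: Top players {', '.join(map(str, top_players))}")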


5. Feature Reduction Techniques


When dealing with many features, reducing dimensionality may be beneficial.


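Factor Analysis and Multidimensional Scaling


Factor analysis and multidimensional scaling are other ways to reduce the number of features; the example that follows uses principal component analysis (PCA).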


Applying Principal Component Analysis (PCA) for Feature Reduction

from sklearn.decomposition import PCA

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_multivariate)

# Clustering on reduced features
kmeans_pca = KMeans(n_clusters=3).fit(X_pca)
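
Before clustering on reduced features, it helps to check how much of the original variation the retained components keep. The sketch below builds synthetic data (an assumption for illustration) in which most variation lies along two directions, then reads the `explained_variance_ratio_` attribute of the fitted PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 rows, 5 features, driven mostly by 2 hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_demo_pca = pca.fit_transform(X_demo)

# Fraction of total variance captured by each retained component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the summed ratio is low on real data, two components discard much of the signal, and clusters found in the reduced space may not reflect the full feature set.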


Conclusion


Multivariate clustering adds a layer of complexity to the clustering process, accommodating multiple features and allowing for richer analysis. Whether identifying patterns across different variables or visualizing multi-dimensional data, this section provides a comprehensive toolkit. The techniques discussed here, coupled with your understanding of clustering from earlier sections, equip you with a versatile skill set applicable across various domains and datasets.
