
Building Intelligent Text Analysis Systems with TF-IDF and Cosine Similarity



1. Building TF-IDF Document Vectors


A. Understanding N-Gram Modeling


An n-gram is a contiguous sequence of n items (typically words) drawn from a text. N-grams are the building blocks for representing text as numerical vectors, which makes computational analysis possible. Here's how we can create n-grams.


Example where each vector dimension corresponds to an n-gram's frequency in the document:

from sklearn.feature_extraction.text import CountVectorizer

text = ["I love data science and data analysis."]
# ngram_range=(1, 2) extracts both unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
# Output (note the default tokenizer drops one-character tokens such as 'I'):
# ['analysis', 'and', 'and data', 'data', 'data analysis', 'data science',
#  'love', 'love data', 'science', 'science and']


B. Motivation for Weighting Words


Weighting words helps us discern their importance in a document. Common words like 'the', 'is', and 'and' can be problematic: they occur frequently yet carry little meaning, so raw counts overstate their importance.


Example of weighting based on word frequency:

from sklearn.feature_extraction.text import TfidfVectorizer

text = ["I love data science and data analysis."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text)
print(X.toarray())
# With a single document every IDF equals 1, so 'data' (which appears twice)
# simply receives the largest weight in the L2-normalized vector.


C. Applications of Word Weighting


Word weighting can help detect stopwords automatically (see the sketch below), and it underpins search ranking, recommender systems, and predictive modeling.
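As a small illustration, terms that appear in almost every document are natural stopword candidates. A minimal sketch using the CountVectorizer introduced above and a toy corpus:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the dog",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Document frequency: how many documents contain each term?
df = np.asarray((X > 0).sum(axis=0)).ravel()
order = np.argsort(df)[::-1]
# 'the' appears in every document and ranks first.
print(vectorizer.get_feature_names_out()[order][:3])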


D. Introducing Term Frequency-Inverse Document Frequency (TF-IDF)


TF-IDF is a statistical measure of how important a word is to a particular document within a corpus: it rewards terms that are frequent in the document but rare across the corpus.
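In its classic form, the score multiplies a term's frequency within the document by a penalty for how common the term is across documents:

\(\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}\)

where \(N\) is the number of documents in the corpus and \(\text{df}(t)\) is the number of documents containing term \(t\). Scikit-learn implements a smoothed variant of this formula, so its exact values differ slightly.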


Example over a small corpus (IDF only becomes informative with multiple documents):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the bird flew"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.idf_)  # rarer terms receive higher IDF weights


E. Implementation with Scikit-Learn


Scikit-learn's TfidfVectorizer, used above, performs tokenization, counting, and IDF weighting in a single step.
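To inspect the weights themselves, we can pair each vocabulary term with its TF-IDF score for a given document. A minimal sketch on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Weights for the first document, sorted highest first: 'the' leads
# because it occurs twice, while 'cat' and 'mat' outrank the shared
# terms 'sat' and 'on' thanks to their higher IDF.
weights = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {w:.3f}")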


2. Cosine Similarity in Text Analysis


A. Introduction to Cosine Similarity


Cosine similarity measures the cosine of the angle between two vectors. Because it depends only on direction, not magnitude, it is well suited to comparing documents of different lengths.


B. Mathematical Explanation


Given two vectors A and B, their cosine similarity is given by:

\(\text{cosine similarity} = \frac{A \cdot B}{||A|| \times ||B||}\)
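We can check the formula directly with NumPy:

import numpy as np

A = np.array([1, 2, 3])
B = np.array([1, 1, 1])
# Dot product divided by the product of the vector norms.
cos = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cos)  # 0.9258200997725514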


The same computation with scikit-learn:

from sklearn.metrics.pairwise import cosine_similarity

A = [[1, 2, 3]]
B = [[1, 1, 1]]
result = cosine_similarity(A, B)
print(result)
# Output: [[0.9258201]]


C. Cosine Score Characteristics


In general, the cosine score ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (identical direction). TF-IDF vectors have no negative components, so for documents the score falls between 0 and 1.
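A quick demonstration of the boundary cases with hand-picked vectors:

from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity([[1, 0]], [[2, 0]]))   # [[1.]]  same direction
print(cosine_similarity([[1, 0]], [[0, 1]]))   # [[0.]]  orthogonal
print(cosine_similarity([[1, 0]], [[-1, 0]]))  # [[-1.]] opposite directions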


D. Implementation using Scikit-Learn


The example above demonstrates scikit-learn's cosine_similarity function, which takes 2-D arrays and returns the matrix of pairwise similarity scores.


3. Building a Recommender System Using TF-IDF and Cosine Scores


A. Introduction to Movie Recommender System


Recommender systems are tools that predict user preferences. Here we build a content-based recommender that suggests movies whose plot descriptions are similar.


Code snippet for movie data processing:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

movies = pd.read_csv('movies.csv')
# TfidfVectorizer cannot handle missing values, so fill NaN descriptions.
movies['description'] = movies['description'].fillna('')
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(movies['description'])


B. Building the Recommender System


The recommender function can be built by generating TF-IDF vectors and creating a cosine similarity matrix.


Code snippet for the similarity matrix. Because TfidfVectorizer L2-normalizes its output by default, linear_kernel (a plain dot product) gives the same result as cosine_similarity, only faster:

from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Code snippet for the recommender function:

# Map each movie title to its row index so titles can be looked up by name.
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()

def movie_recommender(title, cosine_sim=cosine_sim):
    idx = indices[title]
    # Pair every movie with its similarity score to the query title.
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (the movie itself) and keep the ten closest matches.
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movies['title'].iloc[movie_indices]
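A quick usage check. The title below is a placeholder; substitute any title that actually appears in your movies.csv:

# 'Toy Story' is hypothetical; use a title present in your dataset.
print(movie_recommender('Toy Story'))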


4. Beyond N-Grams: Word Embeddings


A. Limitations of BoW and TF-IDF


A key limitation of bag-of-words and TF-IDF representations is that they cannot capture semantic similarity between words: 'car' and 'automobile' occupy unrelated dimensions, so documents that use different vocabulary for the same idea look dissimilar. Word embeddings, such as Word2Vec, overcome this by mapping words to dense vectors in which related words lie close together.
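A minimal sketch of training a Word2Vec model, assuming gensim 4.x is installed (the toy corpus is far too small to learn meaningful vectors; it only illustrates the API):

from gensim.models import Word2Vec

# Each sentence is a list of tokens; a real corpus would contain thousands.
sentences = [
    ["i", "love", "data", "science"],
    ["data", "analysis", "is", "fun"],
    ["i", "love", "movies"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=42)

# Related words end up with similar dense vectors.
print(model.wv.similarity("data", "science"))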


Conclusion


In this tutorial, we explored how to build intelligent text analysis systems using TF-IDF and cosine similarity. We started with the creation of TF-IDF document vectors, then used cosine similarity for text analysis, and finally built a movie recommender system. We also looked at the limitations of these methods and hinted at the direction of word embeddings as a way to capture deeper semantic relationships. The integration of these techniques offers powerful ways to analyze and derive insights from text data. Feel free to experiment and expand on these concepts to create more advanced and tailored solutions.
