Understanding semantic search

Introduction

During my college days, one of the electives I chose was algorithms for information retrieval. It was a pretty fun course: I learnt how webpages are ranked during retrieval, what the cold start problem is, and how companies like Netflix tackle it. It was the only course where I actually read the prescribed textbook (if only I had done that for the other ones). Anyway, I have recently gone back to my ML craze days and want to explore stuff, and today I want to see what semantic search is all about.

Semantic search

Traditionally, retrieving webpages relied on keyword matching, which does not fully leverage the query because it misses the context. With semantic search we want to take the context into account as well. For instance, when we type "apple", it can mean the fruit or the company, and which one is meant depends on the context. Semantic search is expected to handle that: using NLP techniques, we try to understand the query and retrieve results that are relevant to what it means.
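To make the limitation concrete, here is a minimal sketch of plain keyword matching. The `keyword_score` function, the three documents and the query are all made up by me for illustration; real keyword engines use smarter scoring (TF-IDF, BM25 and so on), but the blind spot is the same.

```python
def keyword_score(query, document):
    """Count how many distinct query words appear verbatim in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)

# Hypothetical mini corpus: two documents about the company, one about the fruit.
docs = [
    "Apple unveiled a new phone at its annual event",
    "An apple a day keeps the doctor away",
    "The iPhone maker reported record quarterly revenue",
]

query = "apple company earnings"
scores = [keyword_score(query, d) for d in docs]

for doc, score in zip(docs, scores):
    print(score, doc)
```

The fruit document scores the same as the first company document, and the third document, which is probably the most relevant one, scores zero because it shares no words with the query. That gap is what semantic search tries to close.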

How does it work?

Natural Language Processing (NLP) and entity recognition

Semantic search uses NLP to analyse the structure of the language. We parse the sentence and try to understand its structure, identifying the entities in it and the relationships between them.
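As a toy illustration of entity identification, here is a dictionary-lookup tagger. This is only a sketch: the `GAZETTEER` table and `tag_entities` function are hypothetical names of mine, and real systems use trained models rather than a hand-written list. Notice that a lookup table like this always labels "apple" the same way, which is exactly the ambiguity a proper model has to resolve from context.

```python
# Hypothetical, hand-written lookup table: phrase -> entity type.
GAZETTEER = {
    "apple": "ORG",
    "tim cook": "PERSON",
    "cupertino": "PLACE",
}

def tag_entities(text):
    """Return (phrase, entity_type) pairs for known phrases, in order of appearance."""
    lowered = text.lower()
    found = []
    for phrase, label in GAZETTEER.items():
        # Plain substring search, so "pineapple" would also match "apple";
        # good enough for a toy, not for real use.
        position = lowered.find(phrase)
        if position != -1:
            found.append((position, phrase, label))
    return [(phrase, label) for _, phrase, label in sorted(found)]

entities = tag_entities("Tim Cook runs Apple from Cupertino")
print(entities)
```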

Contextual Understanding

Semantic search engines take into account the context of words and phrases. They also have to handle synonyms, related concepts and sometimes even spelling mistakes, like we sometimes see in Google search.
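Here is a rough sketch of how spelling mistakes and synonyms could be handled before the query is matched. It uses `difflib` from the Python standard library for fuzzy matching; the `VOCABULARY` and `SYNONYMS` tables and the `normalise_query` function are assumptions of mine, not how any particular search engine does it.

```python
import difflib

# Hypothetical vocabulary and synonym table, just for illustration.
VOCABULARY = ["baseball", "football", "hockey", "cricket"]
SYNONYMS = {"football": ["soccer"]}

def normalise_query(query):
    """Fix near-miss spellings against the vocabulary, then add known synonyms."""
    terms = []
    for word in query.lower().split():
        # Take the closest vocabulary word if it is similar enough, else keep the word.
        match = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
        corrected = match[0] if match else word
        terms.append(corrected)
        terms.extend(SYNONYMS.get(corrected, []))
    return terms

expanded = normalise_query("footbal match")
print(expanded)
```

Embedding-based search gets a lot of this for free, since synonyms tend to end up near each other in the vector space, but an explicit step like this shows what is being compensated for.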

Vector Space Models and Embeddings

Words, phrases, and entire documents are transformed into high-dimensional vectors called embeddings. Models such as Word2Vec, GloVe, and BERT capture the semantic relationships between words by placing similar words closer together in the vector space (Word2Vec and GloVe give each word one fixed vector, while BERT produces vectors that depend on the surrounding context). This allows the search engine to find documents that are contextually similar to the query, even if they don’t contain the exact keywords.
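"Closer together" is usually measured with cosine similarity, which is the cosine of the angle between two vectors. Below is a minimal sketch in plain Python. The three-dimensional vectors are numbers I made up to show the idea; real embeddings have hundreds of dimensions and come out of a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy "embeddings", NOT produced by any real model.
toy_vectors = {
    "apple_fruit": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "apple_company": [0.1, 0.9, 0.3],
}

fruit_vs_banana = cosine(toy_vectors["apple_fruit"], toy_vectors["banana"])
fruit_vs_company = cosine(toy_vectors["apple_fruit"], toy_vectors["apple_company"])

print(f"fruit vs banana:  {fruit_vs_banana:.4f}")
print(f"fruit vs company: {fruit_vs_company:.4f}")
```

With these toy numbers the fruit sense of "apple" lands much closer to "banana" than to the company sense, which is the behaviour we hope a real embedding model gives us.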

Coding timeeeeee

Install Required Libraries

First, ensure you have the necessary libraries installed. The leading ! is for running the command in a notebook cell; drop it if you are in a terminal.

!pip install sentence-transformers scikit-learn

Download and Preprocess Dataset

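We'll use the fetch_20newsgroups function from scikit-learn to download the 20 Newsgroups dataset, stripping headers, footers and quoted replies so that only the body of each post is left.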

from sklearn.datasets import fetch_20newsgroups

# Download the dataset
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Print the dataset information
print(f"Number of documents: {len(newsgroups_data.data)}")
print(f"Number of categories: {len(newsgroups_data.target_names)}")

Compute Embeddings

We use the sentence-transformers library to load a pre-trained model and compute embeddings for each document.

from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Compute embeddings for the documents
doc_embeddings = model.encode(newsgroups_data.data, show_progress_bar=True)

Perform Semantic Search

We define a function to perform semantic search by computing the embedding of a query and finding the most similar documents using cosine similarity.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def semantic_search(query, documents, doc_embeddings, top_n=3):
    # Compute embedding for the query (uses the global `model` loaded earlier)
    query_embedding = model.encode([query])

    # Compute cosine similarity between query and documents
    similarities = cosine_similarity(query_embedding, doc_embeddings).flatten()

    # Get the top_n most similar documents
    top_indices = similarities.argsort()[-top_n:][::-1]

    # Print the top_n most similar documents
    print(f"Query: {query}\n")
    for idx in top_indices:
        print(f"Document: {documents[idx][:200]}...")  # Print only the first 200 characters for brevity
        print(f"Similarity: {similarities[idx]:.4f}\n")

# Example query
query = "Baseball is fun"
semantic_search(query, newsgroups_data.data, doc_embeddings, top_n=3)
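The ranking step inside semantic_search is easy to look at in isolation. This is a sketch in plain Python of the same idea as argsort()[-top_n:][::-1]: sort the indices by score and keep the best few. The `top_n_indices` helper and the toy scores are mine, not part of the code above, and ties may come out in a different order than with NumPy.

```python
def top_n_indices(scores, top_n=3):
    """Return the indices of the top_n highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]

# Made-up similarity scores for five documents.
toy_similarities = [0.12, 0.87, 0.45, 0.91, 0.30]
best = top_n_indices(toy_similarities, top_n=3)

for idx in best:
    print(f"Document {idx}: similarity {toy_similarities[idx]:.2f}")
```

Sorting everything is fine for a dataset this size. For millions of documents, real systems typically switch to approximate nearest neighbour indexes instead of comparing the query against every document.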

Results

As we can see, the results are okayish at best, but they are still kinda relevant, and I would say the last one is quite good. To be fair, this is not a big enough dataset to contain all the knowledge in the world and give a perfect answer, but it does get the context right. Bigger datasets and better transformer models should give better results, and that is in fact what is happening in the real world currently: search engines, e-commerce sites, content recommendation platforms and even chatbots integrate semantic search to serve their customers better.