Sentence Embedding and Similarity with Langformers
When working with semantic search or designing sophisticated NLP pipelines such as RAG (Retrieval-augmented Generation), converting text into numerical vectors—a process called embedding—is often one of the very first steps. Embeddings capture the meaning and context of sentences in a way that machines can understand, enabling powerful applications like search engines, recommendation systems, and conversational AI.
Langformers makes sentence embeddings incredibly simple. This guide will walk you through how to embed sentences and calculate textual similarity.
Why Sentence Embeddings?
Sentence embeddings represent textual data as high-dimensional vectors, allowing algorithms to compare, search, cluster, or classify text based on its semantic meaning. Rather than treating words individually, embeddings understand the context of an entire sentence or paragraph.
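To make this concrete, here is a minimal sketch of how vector representations enable semantic comparison. The 3-dimensional vectors below are invented for illustration only (real embedding models produce hundreds of dimensions); the point is that sentences with related meanings map to vectors that point in similar directions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model)
vec_hungry = [0.9, 0.1, 0.2]  # "I am hungry."
vec_eat    = [0.8, 0.2, 0.3]  # "I want to eat something."
vec_rain   = [0.1, 0.9, 0.1]  # "It is raining outside."

# Semantically related sentences score higher than unrelated ones
assert cosine_similarity(vec_hungry, vec_eat) > cosine_similarity(vec_hungry, vec_rain)
```

This is exactly the comparison that search, clustering, and classification algorithms perform over real embeddings.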
Encoder-only models are mostly used for this task due to their bi-directional attention mechanism.
Setting Up
First, make sure you have Langformers installed in your environment. If not, install it using pip:
pip install -U langformers
Sentence Embedding with Langformers
Using Langformers for embedding involves just two easy steps:
- Create an embedder using create_embedder().
- Embed your sentences using the embed() method.
Let’s see it in action!
# Import langformers
from langformers import tasks
# Create an embedder
embedder = tasks.create_embedder(provider="huggingface", model_name="sentence-transformers/all-MiniLM-L6-v2")
# Get your sentence embeddings
embeddings = embedder.embed(["I am hungry.", "I want to eat something."])
That's it! You now have high-quality vector representations of your sentences ready to use in your applications.
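A common next step is a small semantic search: rank candidate sentences by cosine similarity to a query. The sketch below assumes each embedding is a plain list of floats; the toy vectors are invented stand-ins, and in practice they would come from embedder.embed().

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical embeddings; in practice these would come from embedder.embed(...)
corpus = {
    "I want to eat something.": [0.8, 0.2, 0.3],
    "It is raining outside.":   [0.1, 0.9, 0.1],
}
query_vector = [0.9, 0.1, 0.2]  # stand-in embedding for "I am hungry."

# Rank corpus sentences by similarity to the query, most similar first
ranked = sorted(corpus, key=lambda s: cosine_similarity(query_vector, corpus[s]),
                reverse=True)
print(ranked[0])  # the most semantically similar sentence
```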
Choosing the Right Model
Langformers currently supports Hugging Face models. You can pick the model that best suits your needs depending on the trade-off between speed, accuracy, and size.
Tip: Browse available sentence-transformers models on the Hugging Face Hub.
Some popular models include:
- sentence-transformers/all-MiniLM-L6-v2 (lightweight and fast)
- sentence-transformers/all-mpnet-base-v2 (higher accuracy)
- sentence-transformers/paraphrase-MiniLM-L3-v2 (optimized for paraphrase detection)
The model you choose is passed as the model_name argument in create_embedder().
Note: sentence-transformers/all-mpnet-base-v2 is a fine-tuned version of microsoft/mpnet-base.
Textual Similarity
Beyond embeddings, Langformers makes it easy to measure how similar two sentences are.
To compute cosine similarity between two sentences, use the similarity() method.
Example:
# Get cosine similarity
similarity_score = embedder.similarity(["I am hungry.", "I am starving."])
print(f"Similarity Score: {similarity_score}")
How It Works
The similarity() method:
- Takes a list of exactly two text strings.
- Returns a cosine similarity score between -1.0 and 1.0:
  - 1.0 → identical meaning
  - 0.0 → no relationship
  - -1.0 → completely opposite meaning
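These endpoints follow directly from the cosine formula, as the small self-contained check below illustrates. The vectors here are arbitrary examples, not model outputs:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

v = [0.6, 0.8]
assert math.isclose(cosine_similarity(v, v), 1.0)              # same direction -> 1.0
assert math.isclose(cosine_similarity([1, 0], [0, 1]), 0.0)    # orthogonal -> 0.0
assert math.isclose(cosine_similarity(v, [-0.6, -0.8]), -1.0)  # opposite direction -> -1.0
```

In practice, scores from real sentence pairs fall between these extremes, and thresholds (e.g. "above 0.7 counts as similar") are chosen per application.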
View official documentation here: https://langformers.com/embed-sentences.html