Sentence Embedding and Similarity with Langformers
Sentence-level embedding extraction: Langformers generates sentence embeddings using attention-mask-aware mean pooling over token embeddings, similar to the approach used by Sentence Transformers.

When working on semantic search or designing sophisticated NLP pipelines such as RAG (Retrieval-Augmented Generation), converting text into numerical vectors (a process called embedding) is often one of the very first steps. Embeddings capture the meaning and context of sentences in a way that machines can work with, enabling powerful applications like search engines, recommendation systems, and conversational AI.

Langformers makes sentence embeddings incredibly simple. This guide will walk you through how to embed sentences and calculate textual similarity.

Why Sentence Embeddings?

Sentence embeddings represent textual data as high-dimensional vectors, allowing algorithms to compare, search, cluster, or classify text based on its semantic meaning. Rather than treating words individually, embeddings capture the context of an entire sentence or paragraph.

Encoder-only models are typically used for this task because their bidirectional attention lets every token attend to the entire input. A sentence embedding is then obtained by pooling the resulting token embeddings, as sketched below.
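
To make the pooling step concrete, here is a minimal sketch of attention-mask-aware mean pooling using the Hugging Face transformers library directly. It illustrates the general technique described above; Langformers' internal implementation may differ in its details.

# A minimal sketch of attention-mask-aware mean pooling (illustrative only)
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["I am hungry.", "I want to eat something."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, dim)

# Expand the attention mask so padded positions contribute nothing to the mean
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

print(sentence_embeddings.shape)  # torch.Size([2, 384])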

Setting Up

First, make sure you have Langformers installed in your environment. If not, install it using pip:

pip install -U langformers

Sentence Embedding with Langformers

Using Langformers for embedding involves just two easy steps:

  1. Create an Embedder using create_embedder().
  2. Embed Your Sentences using the embed() method.

Let’s see it in action!

# Import langformers
from langformers import tasks

# Create an embedder
embedder = tasks.create_embedder(provider="huggingface", model_name="sentence-transformers/all-MiniLM-L6-v2")

# Get your sentence embeddings
embeddings = embedder.embed(["I am hungry.", "I want to eat something."])

That's it! You now have high-quality vector representations of your sentences ready to use in your applications.
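
One immediate application is a small semantic search: embed a corpus once, embed the query, and rank by cosine similarity. The sketch below assumes embed() returns one array-like vector per input sentence, as in the example above; it is an illustration, not a search API provided by Langformers.

# Minimal semantic search over a tiny corpus, reusing the embedder above.
# Assumes embed() returns one array-like vector per input sentence.
import numpy as np

corpus = [
    "I am hungry.",
    "The stock market rallied today.",
    "Let's grab some lunch.",
]
corpus_vecs = np.asarray(embedder.embed(corpus), dtype=float)
query_vec = np.asarray(embedder.embed(["I want to eat something."]), dtype=float)[0]

# Rank corpus sentences by cosine similarity to the query
scores = corpus_vecs @ query_vec / (
    np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
)
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")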

Choosing the Right Model

Langformers currently supports Hugging Face models. You can pick the model that best suits your needs depending on the trade-off between speed, accuracy, and size.

Tip: Browse available models on the Hugging Face Hub (https://huggingface.co/models).

Some popular models include:

  • sentence-transformers/all-MiniLM-L6-v2 (lightweight and fast)
  • sentence-transformers/all-mpnet-base-v2 (higher accuracy)
  • sentence-transformers/paraphrase-MiniLM-L3-v2 (optimized for paraphrase detection)

The model you choose is passed as the model_name argument in create_embedder().
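
These models also differ in embedding dimensionality, which affects storage and index size downstream: all-MiniLM-L6-v2 produces 384-dimensional vectors, while all-mpnet-base-v2 produces 768-dimensional ones. A quick check, assuming embed() returns one vector per sentence as in the earlier example:

# Compare embedding sizes across models (MiniLM-L6-v2: 384-d, mpnet-base-v2: 768-d)
from langformers import tasks

for name in [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
]:
    embedder = tasks.create_embedder(provider="huggingface", model_name=name)
    vector = embedder.embed(["Hello world."])[0]
    print(f"{name}: {len(vector)} dimensions")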

💡 Encoder-only models such as BERT and RoBERTa, although capable of producing excellent results after fine-tuning (with a classification head) on downstream classification tasks, do not produce semantically meaningful sentence embeddings out of the box. Their embedding space must first be restructured through additional training. For instance, sentence-transformers/all-mpnet-base-v2 is a fine-tuned version of microsoft/mpnet-base.

Textual Similarity

Beyond embeddings, Langformers makes it easy to measure how similar two sentences are.

To compute cosine similarity between two sentences, use the similarity() method.

Example:

# Get cosine similarity
similarity_score = embedder.similarity(["I am hungry.", "I am starving."])

print(f"Similarity Score: {similarity_score}")

How It Works

The similarity() method:

  • Takes a list of exactly two text strings.
  • Returns a cosine similarity score, where:
    • 1.0 means identical meaning
    • 0.0 means no relationship
    • -1.0 means completely opposite meaning
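
In practice, you typically compare the score against a task-specific threshold, for example to flag paraphrases. The sketch below assumes similarity() returns a float, as the print statement above suggests; the 0.8 cutoff is purely illustrative, not a Langformers recommendation, so tune it on your own data.

# Flag sentence pairs as paraphrases above a chosen cutoff (0.8 is illustrative)
pairs = [
    ["I am hungry.", "I am starving."],
    ["I am hungry.", "The sky is blue."],
]

THRESHOLD = 0.8
for pair in pairs:
    score = embedder.similarity(pair)
    label = "paraphrase" if score > THRESHOLD else "not paraphrase"
    print(f"{pair} -> {score:.3f} ({label})")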

View official documentation here: https://langformers.com/embed-sentences.html