Fixed-size, Semantic and Recursive Chunking Strategies for LLMs
In modern Retrieval-Augmented Generation (RAG) systems, handling large documents efficiently is critical. Embedding models have token limits that, when exceeded, can lead to incomplete processing or errors. This is where chunking comes in. By breaking large documents into smaller, manageable pieces, chunking keeps information accessible, relevant, and optimized for retrieval and processing.
In this blog post, we dive into chunking with Langformers and explore three strategies (fixed-size, semantic, and recursive), each with a practical example.
What is Chunking?
Chunking is the process of splitting a large document into smaller units called chunks. Each chunk is small enough to fit within the token limits of the chosen embedding model, yet large enough to retain meaningful information.
The ultimate goal of chunking is to improve:
- Retrieval accuracy: relevant chunks are retrieved instead of entire documents.
- Processing efficiency: smaller inputs lead to faster model inference. LLMs also have limited context windows, so feeding in an entire document, such as a book, in one pass is often not feasible.
Across all chunking strategies in Langformers, tokenization plays a central role. The chunk size is defined in terms of tokens (not words), and the number of tokens can vary based on the tokenizer used.
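To illustrate why token counts depend on the tokenizer, here is a minimal sketch using two toy tokenizers. Both functions are hypothetical stand-ins written for this post; real embedding models use subword tokenizers, which will produce different counts again.

```python
import re

def whitespace_tokens(text):
    """Toy tokenizer 1 (illustrative only): split on whitespace."""
    return text.split()

def word_punct_tokens(text):
    """Toy tokenizer 2 (illustrative only): words and punctuation are separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Chunking isn't hard, is it?"

# The same sentence yields different token counts under different tokenizers.
print(len(whitespace_tokens(text)))  # 5
print(len(word_punct_tokens(text)))  # 9
```

A chunk size of 8 therefore covers a different amount of text depending on which tokenizer the chunker is created with.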
Chunking Strategies in Langformers
At the time of writing, Langformers provides three main chunking strategies:
- Fixed-size Chunking (with optional overlap)
- Semantic Chunking
- Recursive Chunking
Let's explore each in detail.
Fixed-size Chunking
Fixed-size chunking is the simplest method: documents are split into chunks of a specified number of tokens. It’s fast, predictable, and works well for content where structure is less important.
Key Features
- Divides text based purely on token count.
- Optional overlapping between chunks for better context preservation.
Let's see how we can implement this chunking strategy with Langformers.
First, make sure you have Langformers installed in your environment. If not, install it using pip:
pip install -U langformers
Fixed-size Chunking Example
# Import Langformers
from langformers import tasks
# Create a fixed-size chunker
chunker = tasks.create_chunker(strategy="fixed_size", tokenizer="sentence-transformers/all-mpnet-base-v2")
# Chunk a document
chunks = chunker.chunk(
    document="This is a test document. It contains several sentences. We will chunk it into smaller pieces.",
    chunk_size=8
)
In this example:
- We use the sentence-transformers/all-mpnet-base-v2 tokenizer.
- Each chunk contains approximately 8 tokens.
- Overlapping chunks can be created by specifying the overlap parameter.
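The effect of overlap can be sketched without any library. The function below is a hypothetical illustration, not Langformers' implementation: it assumes tokens are plain whitespace-separated words and that the overlap is smaller than the chunk size.

```python
def fixed_size_chunks(tokens, chunk_size, overlap=0):
    """Slide a window of chunk_size tokens, moving chunk_size - overlap tokens each step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Assumption: whitespace-separated words stand in for real tokens.
tokens = "this is a test document it contains several sentences".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With an overlap of 1, the last token of each chunk reappears as the first token of the next, so context that straddles a chunk boundary is not lost entirely.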
Semantic Chunking
Semantic chunking goes a step further by considering the meaning of the content. It first creates small initial chunks, then merges them based on semantic similarity. This leads to more contextually meaningful chunks.
Semantic chunking is ideal for:
- Technical documents
- Research papers
- Legal texts
- Any domain where preserving semantic integrity is crucial
How Semantic Chunking Works
- Initially, the document is split into small chunks based on a token limit.
- The chunks are then grouped together based on their semantic similarity, controlled by a similarity threshold.
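The two steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Langformers' actual algorithm: a bag-of-words count vector stands in for a real embedding model, only adjacent pieces are considered for merging, and all function names are hypothetical.

```python
import math
from collections import Counter

def embed(tokens):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(t.lower() for t in tokens)

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(tokens, initial_chunk_size, max_chunk_size, similarity_threshold):
    # Step 1: split into small initial pieces by token count.
    pieces = [tokens[i:i + initial_chunk_size]
              for i in range(0, len(tokens), initial_chunk_size)]
    if not pieces:
        return []

    # Step 2: merge each piece into the previous chunk when it is similar
    # enough and the merged chunk stays within max_chunk_size.
    merged = [pieces[0]]
    for piece in pieces[1:]:
        current = merged[-1]
        fits = len(current) + len(piece) <= max_chunk_size
        similar = cosine(embed(current), embed(piece)) > similarity_threshold
        if fits and similar:
            merged[-1] = current + piece
        else:
            merged.append(piece)
    return merged

tokens = "cats are awesome dogs are awesome python is amazing".split()
chunks = semantic_chunks(tokens, initial_chunk_size=3, max_chunk_size=10,
                         similarity_threshold=0.3)
print(chunks)
```

In this toy run, the two "awesome" pieces share words and are merged, while the Python piece shares none and stays separate. A real embedding model would capture similarity in meaning rather than word overlap.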
Semantic Chunking Example
# Import Langformers
from langformers import tasks
# Create a semantic chunker
chunker = tasks.create_chunker(strategy="semantic", model_name="sentence-transformers/all-mpnet-base-v2")
# Chunk a document
chunks = chunker.chunk(
    document="Cats are awesome. Dogs are awesome. Python is amazing.",
    initial_chunk_size=4,
    max_chunk_size=10,
    similarity_threshold=0.3
)
In this example:
- The document is initially split into very small chunks (4 tokens).
- Similar chunks are merged until a maximum size of 10 tokens is reached.
- Only chunks with similarity greater than 0.3 are grouped together.
Recursive Chunking
With this chunking strategy, the document is split hierarchically based on the provided separators.
Generally, a document can first be divided by sections, then by paragraphs, and further down to the token level as needed. Langformers adopts this strategy by first splitting text at double newlines (\n\n) to identify sections, then at single newlines (\n) for paragraphs, and eventually down to individual tokens. If any chunk exceeds the chunk size limit, it is recursively broken down into smaller chunks until all fit within the allowed size.
We also have the option to specify custom separators for splitting the document.
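The recursion itself can be sketched in a few lines. This is a simplified illustration rather than Langformers' implementation: it counts whitespace-separated words instead of real tokens, and the function name and its fallback behavior are assumptions made for the example.

```python
def recursive_chunks(text, chunk_size, separators=("\n\n", "\n")):
    """Split on the first separator; recurse with the remaining ones on oversized parts."""
    def n_tokens(s):
        return len(s.split())  # toy token count: whitespace-separated words

    # Base case: the text already fits (empty parts are dropped).
    if n_tokens(text) <= chunk_size:
        return [text] if text.strip() else []

    # No separators left: fall back to fixed-size windows of tokens.
    if not separators:
        words = text.split()
        return [" ".join(words[i:i + chunk_size])
                for i in range(0, len(words), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(head):
        chunks.extend(recursive_chunks(part, chunk_size, rest))
    return chunks

doc = "Cats are awesome.\n\nDogs are awesome.\nPython is amazing."
chunks = recursive_chunks(doc, chunk_size=3)
print(chunks)
```

Here the double newline separates the first sentence, and the single newline splits the remaining two. This sketch never merges small neighboring parts back together, which a production chunker may do to avoid very short chunks.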
Recursive Chunking Example
# Import Langformers
from langformers import tasks
# Create a chunker
chunker = tasks.create_chunker(strategy="recursive", tokenizer="sentence-transformers/all-mpnet-base-v2")
# Chunk a document
chunks = chunker.chunk(
    document="Cats are awesome.\n\nDogs are awesome.\nPython is amazing.",
    separators=["\n\n", "\n"],
    chunk_size=5
)
If chunk_size is not provided, the tokenizer's maximum length is used.
That's all. It's that easy to split documents into chunks with Langformers.
Choosing the right chunking strategy can make a significant difference in the performance of your RAG pipeline or search system. Langformers will continue to add chunking strategies in the future. Stay tuned!
View official documentation here: https://langformers.com/chunking-for-llms.html