Data Labelling Using LLMs with Langformers
When most people think of Large Language Models (LLMs), they think of conversations, content generation, or summarization. But LLMs are also remarkably effective at data labelling, and with Langformers you can easily put that power to work on your text labelling tasks.
Whether you're preparing training data, building a classifier, or just need quick annotations (say, for weak supervision), Langformers offers a simple way to define labels and let an LLM do the heavy lifting.
How It Works
Langformers provides a high-level API to turn any supported LLM into a data labeller in just a couple of lines of code. All you need to do is:
- Load an LLM with Langformers.
- Define labels and conditions you care about.
- Label texts.
It’s that simple.
Getting Started with Langformers
First, a few quick notes:
- Hugging Face Models: Langformers supports chat-tuned models (those with a `chat_template` in their `tokenizer_config.json`) that are compatible with the Transformers library and your hardware.
  - Example: `meta-llama/Llama-3.2-1B-Instruct` (make sure you have access to it on Hugging Face). You can check whether a model is chat-tuned with the Transformers tokenizer, as sketched after this list.
- Ollama Models: Ensure you have Ollama installed and the model pulled.
  - Install Ollama: Download Ollama
  - Pull a model (example): `ollama pull llama3.1:8b`
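If you're unsure whether a Hugging Face model is chat-tuned, here is a minimal sketch of a check using the Transformers tokenizer (this is not part of Langformers, and it assumes you have `transformers` installed and access to the gated Llama model):

```python
# Optional sanity check: confirm a Hugging Face model is chat-tuned by
# looking for a chat template on its tokenizer (None if absent).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
print(tokenizer.chat_template is not None)  # True for chat-tuned models
```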
Install Langformers
First, install Langformers using pip:
pip install -U langformers
Best practice: create a virtual environment rather than installing Python packages globally. Check out the official Langformers installation guide if you need help setting one up.
Langformers + LLMs for Data Labelling
Here's a quick example of how you can load an LLM, define labels and conditions, and label a text (for a single label task).
# Import langformers
from langformers import tasks
# Load an LLM as a data labeller
labeller = tasks.create_labeller(
    provider="huggingface",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    multi_label=False
)
# Provide labels and conditions
conditions = {
    "Positive": "The text expresses a positive sentiment.",
    "Negative": "The text expresses a negative sentiment.",
    "Neutral": "The text does not express any emotions."
}
# Label a text
text = "No doubt, The Shawshank Redemption is a cinematic masterpiece."
labeller.label(text, conditions)
If your use case involves labelling a complete dataset, put `labeller.label()` inside a loop.
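For instance, here is a minimal sketch that reuses the `labeller` and `conditions` defined above. The example texts and the way results are collected are illustrative only; depending on the Langformers version, `label()` may return the label or print it.

```python
# Label a small dataset one text at a time, reusing `labeller` and
# `conditions` from the example above. The texts below are illustrative.
texts = [
    "No doubt, The Shawshank Redemption is a cinematic masterpiece.",
    "The plot was predictable and the acting felt flat.",
    "The film runs for two hours and twenty-two minutes.",
]

results = []
for text in texts:
    label = labeller.label(text, conditions)  # one call per text
    results.append({"text": text, "label": label})

print(results)
```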
You could also pass multiple texts to the LLM at once, but LLMs tend to produce less reliable labels as they work their way down a long list. It is therefore best to label one text at a time, if compute is not a constraint.
If you set `multi_label=True` when creating the labeller, the LLM is allowed to select more than one label per text.
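Here is a minimal sketch of a multi-label setup using the Ollama model pulled earlier. The aspect labels and conditions are made up for illustration, and the provider string "ollama" is assumed to mirror the Hugging Face example above.

```python
from langformers import tasks

# Create a multi-label labeller; provider/model follow the Ollama setup above.
labeller = tasks.create_labeller(
    provider="ollama",
    model_name="llama3.1:8b",
    multi_label=True  # allow the LLM to pick more than one label
)

# Illustrative aspect labels for a restaurant-review style task
conditions = {
    "Food": "The text mentions food or drinks.",
    "Service": "The text comments on staff or service.",
    "Price": "The text mentions cost, value, or pricing."
}

labeller.label("The pasta was delicious, but a bit overpriced.", conditions)
```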
View official documentation here: https://langformers.com/data-labelling-llms.html