Build Your Own LLM-Powered Apps with Langformers: LLM Inference API
Today, we're excited to showcase another powerful feature of Langformers — LLM Inference via a REST API. Whether you’re building a chatbot, automating workflows, or adding AI magic to your product, Langformers gives you everything you need to bring large language models (LLMs) into your application stack effortlessly.
Let’s dive in!
What is LLM Inference?
LLM inference means sending a request to a language model and getting a generated response back.
With Langformers, this is as easy as sending a POST request to /api/generate. In just a few lines of code, you can set up a server that listens for prompts, processes them with a powerful LLM, and streams the generated tokens back to your application.
And the best part? Langformers supports both Hugging Face and Ollama models right out of the box!
Getting Started with Langformers
First, a few quick notes:
- Hugging Face Models: Langformers supports chat-tuned models (those with a chat_template in their tokenizer_config.json) that are compatible with the Transformers library and your hardware.
  - Example: meta-llama/Llama-3.2-1B-Instruct (make sure you have access via Hugging Face).
- Ollama Models: Ensure you have Ollama installed and the model pulled.
  - Install Ollama: Download Ollama
  - Pull a model (example): ollama pull llama3.1:8b
Install Langformers
First, install Langformers using pip:
pip install -U langformers
Best practice: Create a virtual environment rather than installing Python packages globally. Check out the official Langformers installation guide if you need help setting one up.
Running Langformers in LLM Inference Mode
Create a simple Python script to start serving your model:
# Import Langformers
from langformers import tasks
# Create a generator
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b")
# Run the generator
generator.run(host="0.0.0.0", port=8000)
That's it! Your LLM is now live at http://0.0.0.0:8000/api/generate.
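Prefer a Hugging Face model instead of Ollama? The setup looks almost identical. The sketch below is illustrative only: it assumes the provider string "huggingface" and that your account has access to the example model (check the Langformers docs for the exact provider names).
# Import Langformers
from langformers import tasks
# Create a generator backed by a chat-tuned Hugging Face model
# (provider name "huggingface" is assumed here; verify against the docs)
generator = tasks.create_generator(provider="huggingface", model_name="meta-llama/Llama-3.2-1B-Instruct")
# Run the generator
generator.run(host="0.0.0.0", port=8000)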
API Request Format
To interact with the API, send a POST request to /api/generate with a JSON payload.
Payload Example:
{
"system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
"memory_k": 10,
"temperature": 0.5,
"top_p": 1,
"max_length": 5000,
"prompt": "Hi"
}
- prompt is required.
- Other fields are optional but give you fine-grained control.
Example: Making a Request with Python
# Imports
import requests
import json
# Endpoint URL
url = "http://0.0.0.0:8000/api/generate"
# Define payload
payload = json.dumps({
"prompt": "Hi"
})
# Headers
headers = {
"Content-Type": "application/json",
}
# Send request
response = requests.post(url, headers=headers, data=payload)
# Print response
print(response.text)
Real-time Streaming
Langformers supports server-sent event (SSE) streams, so your app can start processing tokens immediately without waiting for the full response.
Each token chunk looks like:
data: {"chunk": "Hello "}
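If you want a feel for what consuming this stream involves without any helper, here is a minimal hand-rolled sketch using the requests library; it simply assumes each line follows the data: {"chunk": ...} format shown above (Langformers' own StreamProcessor, covered next, handles this for you).
# Imports
import json
import requests
# Send a streaming request and parse the SSE lines by hand (illustrative sketch)
with requests.post(
    "http://0.0.0.0:8000/api/generate",
    headers={"Content-Type": "application/json"},
    json={"prompt": "Hi"},
    stream=True,
) as response:
    for line in response.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            data = json.loads(line[len("data: "):])
            print(data.get("chunk", ""), end="", flush=True)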
Parsing SSE Streams with StreamProcessor
Langformers provides a native client to handle SSE streams easily.
from langformers.generators import StreamProcessor
# Create a StreamProcessor client
client = StreamProcessor(headers={"Content-Type": "application/json"})
# Payload
payload = {
"system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
"prompt": "Hi, how are you today?",
}
# Stream the response
response = client.process(endpoint_url="http://0.0.0.0:8000/api/generate", payload=payload)
# Print tokens as they arrive
for chunk in response:
    print(chunk, end="", flush=True)
Fine-tuning Your API Responses
Here’s what the API parameters control:
- system_prompt: How the LLM should behave across all prompts; essentially a system-level instruction for the LLM.
- memory_k: Number of previous chat messages to remember.
- temperature: Controls randomness (higher = more creative).
- top_p: Controls diversity (lower = more focused).
- max_length: Maximum number of tokens, counting both the input and the generated tokens.
- prompt: The actual user prompt.
Important: Changing the system_prompt resets conversation memory.
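To see the memory in action, you can send two prompts back to back; assuming the server keeps conversation memory across requests as described above (and memory_k is large enough), the second reply can draw on the first exchange. The prompts and URL below are only examples.
# Imports
import requests
url = "http://0.0.0.0:8000/api/generate"
headers = {"Content-Type": "application/json"}
# First turn: introduce some context
requests.post(url, headers=headers, json={"prompt": "My name is Sam."})
# Second turn: the model can use the remembered messages (up to memory_k)
response = requests.post(url, headers=headers, json={"prompt": "What is my name?"})
print(response.text)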
Securing the API: Authentication
Need to protect your endpoint? Langformers has you covered with authentication dependencies. Simply pass a dependency function when creating the generator. If the dependency function returns a value, access is granted; if it raises an HTTPException, access is blocked.
Simple Auth Example
# Imports
from langformers import tasks
from fastapi import Request, HTTPException
# Define a set of valid API keys
API_KEYS = {"12345", "67890"}
async def auth_dependency(request: Request):
    """
    Extracts the Bearer token and verifies it against a list of valid API keys.
    """
    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Invalid authorization header format.")
    token = auth_header.split("Bearer ")[1]
    if token not in API_KEYS:
        raise HTTPException(status_code=401, detail="Unauthorized.")
    return True  # Allow access
# Create a generator with authentication
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b", dependency=auth_dependency)
# Run the generator
generator.run(host="0.0.0.0", port=8000)
When calling your API, simply include:
headers = {
'Authorization': 'Bearer 12345',
'Content-Type': 'application/json'
}
For production-grade security, consider using OAuth2 and JWT (JSON Web Tokens) instead of simple API keys.
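As a rough illustration only (not a built-in Langformers feature), a JWT-based dependency could look like the sketch below. It assumes the PyJWT package (pip install pyjwt) and a shared HS256 secret; in practice you would load the secret from a secure store and add issuer/audience checks.
# Hypothetical JWT-verifying dependency (requires PyJWT)
import jwt
from fastapi import Request, HTTPException
SECRET_KEY = "change-me"  # example value only
async def jwt_dependency(request: Request):
    auth_header = request.headers.get("Authorization", "")
    if not auth_header.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Invalid authorization header format.")
    token = auth_header.split("Bearer ")[1]
    try:
        # Verifies the signature and standard claims such as expiry
        jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token.")
    return True
Pass it as dependency=jwt_dependency when creating the generator, just like the API-key example above.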
View the official documentation here: https://langformers.com/llm-inference.html