Ollama and Local LLMs: Step-by-step Guide
Everyone loves ChatGPT, right? And hey, shoutout to DeepSeek too. But what about the unsung heroes of the AI world—those open-source large language models (LLMs) quietly powering innovation behind the scenes?
While models like Meta’s LLaMA might ring a bell for some, there’s a whole family of open-source LLMs you may not have heard of yet—like Mistral, Gemma, Qwen, and even open versions of DeepSeek. Sure, most non-CS folks won’t download massive multi-gigabyte models just to have a chat. After all, why bother when services like ChatGPT or Gemini are just a URL away?
But what if running open-source LLMs locally was as easy as installing an app and typing in your terminal?
🚪 Enter Ollama
In this blog, we’ll walk step-by-step through setting up Ollama, a tool that makes running powerful open-source models on your own machine a breeze. We’ll also integrate Langformers with Ollama models.
For this guide, I'm using a MacBook Pro with an M1 Pro, but this works seamlessly on systems with other Apple Silicon (MPS) chips or NVIDIA GPUs.
What Is Ollama?
Ollama is a lightweight, developer-friendly framework that lets you run LLMs locally without needing a resource-heavy computing environment.
It supports a wide range of models, including:
- LLaMA
- Mistral
- DeepSeek
- Gemma
- Qwen
Ollama Components
- Ollama CLI (or desktop app) – Manages local models and the runtime.
- Ollama Python Library – Lets you interact with models programmatically.
Install Ollama
Windows: Download and install Ollama from the Ollama website.
macOS: Download and install Ollama from the Ollama website.
Linux: Open your terminal and run the following installation script:
curl -fsSL https://ollama.com/install.sh | sh
This script installs Ollama on your system. Once it’s done, you’ll be ready to download and run models locally.
If you want to use Ollama from your Python scripts or applications, you’ll also need the Python package. However, in this guide, we’ll use Langformers to interact with Ollama models through a beautiful chat interface. Langformers also allows you to run LLM inference over a REST API, making it easy to integrate models into your applications.
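For reference, here's roughly what the Python route looks like once a model is on your machine. This is a minimal sketch using the ollama package (installable with pip install ollama); the model name assumes the llama3.1:8b model we download in the next step:

import ollama  # pip install ollama

# Ask the locally served model a question and print its reply
response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain open-source LLMs in one sentence."}],
)
print(response["message"]["content"])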
Download a Model
Let’s download a model. We’ll use LLaMA 3.1 8B, which is a good starting point.
Open your terminal and run:
ollama pull llama3.1:8b
Wait a few minutes—it’s about 4.7 GB. Once it’s downloaded, confirm with:
ollama list
You should see something like:
NAME            ID              SIZE      MODIFIED
llama3.1:8b     42182419e950    4.7 GB    ...
Other Models You Can Try
Ollama’s library includes plenty of other open-source models (Mistral, Gemma, Qwen, DeepSeek, and more), and they all follow the same workflow. To chat with any model directly in your terminal:
ollama run <model-name>
In our case, we don’t need the interactive chat; we just need the Ollama server running in the background. The desktop app starts it automatically, but if you’re not using the desktop app, start the server with:
ollama serve
Langformers + Ollama
Now let’s integrate Langformers with the model we just downloaded. If you haven’t already, install Langformers with:
pip install -U langformers
Now, create a Python file with the following code:
from langformers import tasks

# Create a generator backed by our local Ollama model and serve the chat UI + REST API
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b")
generator.run(host="0.0.0.0", port=8000)
Open http://0.0.0.0:8000 in your browser. You’ll see a slick chat interface where you can talk to LLaMA 3.1 — all local, no cloud required!
Play around with the LLM settings, and start chatting.
Expose Your Local Model as an API
Langformers supports REST API-based interaction. Just send a POST request to http://0.0.0.0:8000/api/generate with this JSON payload:
{
  "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
  "memory_k": 10,
  "temperature": 0.5,
  "top_p": 1,
  "max_length": 5000,
  "prompt": "Hi"
}
Langformers streams the LLM response using SSE (Server-Sent Events). You’ll need to parse the stream, but don’t worry: Langformers has you covered.
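If you're curious what the raw stream looks like before reaching for the helper, here's a minimal sketch using Python's requests library (an assumption for illustration only, not part of Langformers); it posts the payload above and prints each SSE line as it arrives:

import requests

payload = {
    "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
    "memory_k": 10,
    "temperature": 0.5,
    "top_p": 1,
    "max_length": 5000,
    "prompt": "Hi"
}

# stream=True lets us read SSE lines as the model generates them
with requests.post("http://0.0.0.0:8000/api/generate", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:  # skip SSE keep-alive blank lines
            print(line)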
Parse SSE Streams
Langformers provides the StreamProcessor class to handle SSE streams natively.
from langformers.generators import StreamProcessor

headers = {
    "Content-Type": "application/json",
}

client = StreamProcessor(headers=headers)

payload = {
    "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
    "memory_k": 10,
    "temperature": 0.5,
    "top_p": 1,
    "max_length": 5000,
    "prompt": "Hi, how are you today?",
}

# Stream the response and print each chunk as it arrives
response = client.process(endpoint_url="http://0.0.0.0:8000/api/generate", payload=payload)

for chunk in response:
    print(chunk, end="", flush=True)
Add Authentication (Optional)
You can secure your API by providing an authentication dependency. Example:
async def auth_dependency():
    # You can add API key checks or user auth here
    return True

generator = tasks.create_generator(
    provider="ollama",
    model_name="llama3.1:8b",
    dependency=auth_dependency
)
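As a rough illustration, if the dependency follows FastAPI-style dependency injection (an assumption here, not something confirmed above), an API-key check could look like the sketch below. The X-API-Key header name and the LANGFORMERS_API_KEY environment variable are placeholders of my own:

import os
from fastapi import Header, HTTPException

async def auth_dependency(x_api_key: str = Header(default=None)):
    # Hypothetical check: compare the X-API-Key header against an environment variable
    if x_api_key != os.environ.get("LANGFORMERS_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return True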
For more details on what you can achieve with Langformers and Ollama models, refer to the official Langformers documentation.
Happy LLMing! See you in the next blog.