Ollama and Local LLMs: Step-by-step Guide
Everyone loves ChatGPT, right? And hey, shoutout to DeepSeek too. But what about the unsung heroes of the AI world—those open-source large language models (LLMs) quietly powering innovation behind the scenes?
While models like Meta’s LLaMA might ring a bell for some, there’s a whole family of open-source LLMs you may not have heard of yet—like Mistral, Gemma, Qwen, and even open versions of DeepSeek. Sure, most non-CS folks won’t download massive multi-gigabyte models just to have a chat. After all, why bother when services like ChatGPT or Gemini are just a URL away?
But what if running open-source LLMs locally was as easy as installing an app and typing in your terminal?
🚪 Enter Ollama
In this blog, we’ll walk step-by-step through setting up Ollama, a tool that makes running powerful open-source models on your own machine a breeze. We’ll also integrate Langformers with Ollama models.
For this guide, I'm using a MacBook Pro with an M1 Pro, but this works seamlessly on systems with other Apple Silicon (MPS) chips or NVIDIA GPUs.
What Is Ollama?
Ollama is a lightweight, developer-friendly framework that lets you run LLMs locally without needing a resource-heavy computing environment.
It supports a wide range of models, including:
- LLaMA
- Mistral
- DeepSeek
- Gemma
- Qwen
Ollama Components
- Ollama CLI (or desktop app) – Manages local models and the runtime.
- Ollama Python Library – Lets you interact with models programmatically.
Install Ollama
Windows: Download and install Ollama from the Ollama website.
macOS: Download and install Ollama from the Ollama website.
Linux: Open your terminal and run the following installation script:
curl -fsSL https://ollama.com/install.sh | sh
This script installs Ollama on your system. Once it’s done, you’ll be ready to download and run models locally.
If you want to use Ollama from your Python scripts or applications, you’ll also need the Python package. However, in this guide, we’ll use Langformers to interact with Ollama models through a beautiful chat interface. Langformers also allows you to run LLM inference over a REST API, making it easy to integrate models into your applications.
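For reference, here's roughly what the Python route looks like once a model is on your machine. This is a minimal sketch using the ollama package (installable with pip install ollama); the model name assumes the llama3.1:8b model we download in the next step:

import ollama  # pip install ollama

# Ask the locally served model a question and print its reply
response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain open-source LLMs in one sentence."}],
)
print(response["message"]["content"])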
Download a Model
Let’s download a model. We’ll use LLaMA 3.1 8B, which is a good starting point.
Open your terminal and run:
ollama pull llama3.1:8b
Wait a few minutes—it’s about 4.7 GB. Once it’s downloaded, confirm with:
ollama list
You should see something like:
NAME            ID              SIZE      MODIFIED
llama3.1:8b     42182419e950    4.7 GB    ...
Other Models You Can Try
Ollama’s library includes plenty of other open-source models (Mistral, Gemma, Qwen, DeepSeek, and more), and they all follow the same workflow. To chat with any model directly in your terminal:
ollama run <model-name>
In our case, we don’t need the interactive chat; we just need the Ollama server running in the background. The desktop app starts it automatically, but if you’re not using the desktop app, start the server with:
ollama serve
Langformers + Ollama
Now let’s integrate Langformers with the model we just downloaded. If you haven’t already, install Langformers with:
pip install -U langformers
Now, create a Python file with the following code:
from langformers import tasks

# Create a generator backed by our local Ollama model and serve the chat UI + REST API
generator = tasks.create_generator(provider="ollama", model_name="llama3.1:8b")
generator.run(host="0.0.0.0", port=8000)
Open http://0.0.0.0:8000 in your browser. You’ll see a slick chat interface where you can talk to LLaMA 3.1 — all local, no cloud required!
Play around with the LLM settings, and start chatting.
Expose Your Local Model as an API
Langformers supports REST API-based interaction. Just send a POST request to http://0.0.0.0:8000/api/generate with this JSON payload:
{
  "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
  "memory_k": 10,
  "temperature": 0.5,
  "top_p": 1,
  "max_length": 5000,
  "prompt": "Hi"
}
Langformers streams the LLM response using SSE (Server-Sent Events). You’ll need to parse the stream, but don’t worry: Langformers has you covered.
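If you're curious what the raw stream looks like before reaching for the helper, here's a minimal sketch using Python's requests library (an assumption for illustration only, not part of Langformers); it posts the payload above and prints each SSE line as it arrives:

import requests

payload = {
    "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
    "memory_k": 10,
    "temperature": 0.5,
    "top_p": 1,
    "max_length": 5000,
    "prompt": "Hi"
}

# stream=True lets us read SSE lines as the model generates them
with requests.post("http://0.0.0.0:8000/api/generate", json=payload, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:  # skip SSE keep-alive blank lines
            print(line)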
Parse SSE Streams
Langformers provides the StreamProcessor class to handle SSE streams natively.
from langformers.generators import StreamProcessor

headers = {
    "Content-Type": "application/json",
}

client = StreamProcessor(headers=headers)

payload = {
    "system_prompt": "You are an Aussie AI assistant, reply in an Aussie way.",
    "memory_k": 10,
    "temperature": 0.5,
    "top_p": 1,
    "max_length": 5000,
    "prompt": "Hi, how are you today?",
}

# Stream the response and print each chunk as it arrives
response = client.process(endpoint_url="http://0.0.0.0:8000/api/generate", payload=payload)

for chunk in response:
    print(chunk, end="", flush=True)
Add Authentication (Optional)
You can secure your API by providing an authentication dependency. Example:
async def auth_dependency():
    # You can add API key checks or user auth here
    return True

generator = tasks.create_generator(
    provider="ollama",
    model_name="llama3.1:8b",
    dependency=auth_dependency
)
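As a rough illustration, if the dependency follows FastAPI-style dependency injection (an assumption here, not something confirmed above), an API-key check could look like the sketch below. The X-API-Key header name and the LANGFORMERS_API_KEY environment variable are placeholders of my own:

import os
from fastapi import Header, HTTPException

async def auth_dependency(x_api_key: str = Header(default=None)):
    # Hypothetical check: compare the X-API-Key header against an environment variable
    if x_api_key != os.environ.get("LANGFORMERS_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return True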
For more details on what you can achieve with Langformers and Ollama models, refer to the official Langformers documentation.
Happy LLMing! See you in the next blog.