How LLMs Work: A Beginner's Guide to Decoder-Only Transformers

A language model like GPT (which stands for Generative Pretrained Transformer) takes text, breaks it into tokens (words or subwords), converts those tokens into numbers, processes those numbers through layers of Transformer decoders, and finally outputs a probability distribution over all possible tokens in its vocabulary. It then selects the next token from that distribution (in the simplest case, the one with the highest probability). This process repeats until a full response is generated.

If you're new to the Transformer architecture, this might sound like a lot, but stick with me throughout this blog post. By the end, you'll have a clear picture you won't forget.

💡
Generative LLMs (like GPT, LLaMA, DeepSeek, etc.) are decoder-only models based on the Transformer architecture. These models are auto-regressive, i.e., they generate one token at a time, conditioned on the previously seen tokens.

While this post uses GPT as the main example, the explanations generally apply to other LLMs too. The core principles remain similar.

GPT and ChatGPT

Let's clear one thing before we get started: ChatGPT is an application built on top of large language models (LLMs) like GPT-3.5 and GPT-4. While GPT is the underlying model that learns to predict and generate text, ChatGPT adds several key layers:

  • A conversational interface – the chat-style interaction we’re all familiar with.
  • Instruction tuning – GPT is fine-tuned using datasets made of (instruction, input, output) examples. Without instruction tuning, if you prompt: "Summarize this paragraph," a raw GPT model might just continue the paragraph. An instruction-tuned model recognizes "summarize" as a command, generates a summary, and knows when to stop.
  • Safety and alignment layers – which include:
    • Reinforcement Learning from Human Feedback (RLHF): human evaluators guide the model's behavior, encouraging helpfulness and harmlessness.
    • System prompts & behavior shaping: internal instructions guide the model's tone, personality, and limits.

So, when you’re chatting with ChatGPT, you’re actually interacting with a powerful GPT model that’s been tuned to follow instructions and have natural conversations.

Now that we're familiar with this difference, let's get started step-by-step.

Input Text → Tokens

Think of GPT as doing smart autocomplete.

Input: Gautam Buddha was born in

GPT doesn't understand raw text. Instead, it uses a tokenizer that converts text into tokens, typically subwords. Each LLM has a vocabulary: a list of tokens and their corresponding ID numbers, called input IDs.

Since GPT is trained on multiple languages, imagine how massive the vocabulary would be if it had to include every word from every language! Instead, GPT breaks words into subwords or tokens.

For example, if GPT has to understand the name "Rabindra," it might split it into tokens like R, ab, ind, ra.

Code Example: Tokenization

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("Rabindra")
print(tokens)  
# Output: ['R', 'ab', 'ind', 'ra']

Some words like "Australia" are found whole in the vocabulary:

tokenizer.tokenize("Australia")
# Output: ['Australia']

Now let’s tokenize a sentence:

tokenizer.tokenize("Melbourne is an amazing city.")
# Output: ['Mel', 'bourne', 'Ġis', 'Ġan', 'Ġamazing', 'Ġcity', '.']

Notice the special character Ġ before some tokens; it indicates a space before the token. Subwords like 'Mel' and 'bourne' don’t have it, which means they are fragments of a larger word.

Tokens → Input IDs

Once tokenized, the tokens are mapped to numbers (input IDs).

tokens = tokenizer.encode("Melbourne is an amazing city.")
print(tokens)
# Output: [21102, 12544, 318, 281, 4998, 1748, 13]

Let’s try another sentence:

tokens = tokenizer.encode("Sydney is an amazing city.")
# Tokenized: ['S', 'yd', 'ney', 'Ġis', 'Ġan', 'Ġamazing', 'Ġcity', '.']
# Output: [50, 5173, 1681, 318, 281, 4998, 1748, 13]

As you can see, "Sydney" and "Melbourne" are broken into different tokens, but the rest of the sentence has the same input IDs.

Remember that each LLM has its own tokenizer and vocabulary.

Now, let's get back to our example: Gautam Buddha was born in

Input IDs → Embeddings

Each input ID is passed through an embedding layer and becomes a fixed-size vector, called a token embedding or token vector (e.g., of size 768 in GPT-2). If you're wondering what such vectors are, I've explained them in detail in this article.

Gautam Buddha was born in

↓

['G', 'aut', 'am', 'ĠBuddha', 'Ġwas', 'Ġborn', 'Ġin']

↓

[38, 2306, 321, 19154, 373, 4642, 287]  

↓

[
  E1 = [0.01, -0.2, ..., 0.003],   # vector for "G"
  E2 = [0.05, 0.07, ..., -0.01],   # "aut"
  ...
]

Positional Embeddings: LLMs care about word order. The following two sentences contain the same words but mean very different things:

  • "Gautam Buddha was born in Nepal."
  • "Nepal was born in Gautam Buddha."

To preserve word order, each token's embedding is added to a positional encoding. Think of positional encodings as embeddings for each position. For each token, the final embedding = token_embedding + position_embedding

So:

  • Token 0 → token_embedding("G") + position_embedding(0)
  • Token 1 → token_embedding("aut") + position_embedding(1)
  • ... and so on
💡
Both token and positional embeddings are trainable parameters in LLMs like GPT. They are optimized just like the rest of the model using backpropagation and gradient descent.
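
To make this concrete, here's a small sketch using Hugging Face's transformers library (assuming the standard gpt2 checkpoint). It shows that what enters the first decoder block is simply the sum of the token embeddings and the positional embeddings:

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

input_ids = tokenizer("Gautam Buddha was born in", return_tensors="pt").input_ids
positions = torch.arange(input_ids.shape[1])

token_embeddings = model.wte(input_ids)     # token embeddings, shape (1, 7, 768)
position_embeddings = model.wpe(positions)  # positional embeddings, shape (7, 768)

hidden_states = token_embeddings + position_embeddings  # input to the first decoder block
print(hidden_states.shape)  # torch.Size([1, 7, 768])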

Embeddings → Transformer Decoder Blocks

Now comes the magic: passing the embedded sequence into a stack of Transformer decoder blocks. This is where the model thinks. You can imagine each decoder block as a little reasoning unit that asks:

"Based on everything I’ve seen so far, what should I focus on to guess the next word?"

Each block has three jobs:

  • Masked Self-Attention: What matters the most right now?
  • Feedforward Neural Network: Making sense of the idea.
  • Residual Connections + LayerNorm: Keeping everything stable
Recommended

If you're curious about the detailed math alongside visuals behind how transformers work, especially masked self-attention, I highly recommend checking out Jay Alammar's "The Illustrated GPT-2". It's one of the best, most visual explanations out there.


Masked Self-Attention

Our input so far is: Gautam Buddha was born in

The model is now trying to figure out what word comes next.

But it doesn’t treat each word equally — instead, it looks back at the previous words and tries to decide which ones are most important to focus on. Maybe "born" is really important here. Maybe "Buddha" matters too.

This "looking back" process is called masked self-attention.

Here’s how to think about it:

  • Imagine each token is asking, "Of all the tokens before me in this sentence, whose input should I listen to?"
  • The model decides: "Let me pay 50% attention to born, 30% to Buddha, and so on."
  • It then blends the meanings of those words to help it guess the next one.

This is how the model starts to understand context. Not just what the last word was — but which earlier words matter for the next prediction.


Analogy + Maths

Each word (token) is a student

Imagine we’re in a classroom where every word is a student in a line. Let’s focus on the word in — this is our student who’s about to speak and trying to decide what to say next.

To prepare for this, every student is given three things:

  • A Query vector (Q) – like a question they're asking: "Whose input should I listen to?"
  • A Key vector (K) – like a nametag that says what they know about.
  • A Value vector (V) – like their actual knowledge or opinion.

How are these vectors generated? The embedding of each token is multiplied by three different weight matrices to generate these vectors. The Q, K, V vectors will have smaller dimensions (e.g., 64) than the original token embedding (e.g., 768). The weight matrices are learnt during training.

Why are the Q, K, and V dimensions smaller than the original token embedding dimension? Well, it's a design choice. And to be honest, it's not just masked self-attention we're dealing with — it's actually multi-headed masked self-attention. The self-attention we're discussing here is happening in one head. But in practice, this attention mechanism is repeated across, say, 12 different heads. So, if each head uses Q, K, and V vectors with a dimension of 64, then after processing through all 12 heads in parallel, the outputs are concatenated: 64 × 12 = 768 dimensions. Sound familiar? That’s the same as the original token embedding dimension!

Of course, different large language models (LLMs) can have different Q, K, V vector sizes and a different number of heads. Each head will have a different set of weight matrices for generating Q, K, and V vectors.

Let's continue with our student and class analogy.

Attention Scores: Asking the class

Our student in takes its Query vector and goes down the row, comparing it with each earlier student’s Key vector.

Mathematically, this is a dot product (generalized for all tokens):

\[ \text{Score}_{i,j} = Q_i \cdot K_j \]

Analogy: The student in is asking: "Hey Buddha, does what you said help me figure out what comes next? How about you, born?"

The dot product tells us how aligned their topics are — high scores mean better alignment.

Scaling and Softmax: Normalizing opinions

Now that in has scores for each previous student, we do two things:

  • Divide the scores by \( \sqrt{d_k} \), where \( d_k \) is the dimension of the Key vectors (this keeps gradients stable).
  • Softmax the scores so they turn into probabilities (generalized for all tokens):

\[ \text{AttentionWeight}_{i,j} = \text{softmax}_j\left( \frac{Q_i \cdot K_j}{\sqrt{d_k}} \right) \]

(The softmax is taken over all positions \( j \) that token \( i \) is allowed to see, i.e., \( j \le i \).)

What does a softmax function do?

  • It squashes all the scores into numbers between 0 and 1.
  • And most importantly, they all add up to 1, like a probability distribution.
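
For example, raw scores of [2.0, 1.0, 0.1] become roughly [0.66, 0.24, 0.10] after softmax: the biggest score gets the largest share of attention, and the three weights add up to 1.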

Analogy: The student in now decides: "Okay, I’ll listen 50% to born, 30% to Buddha, 7% to was…"

Weighted Sum: Combining the voices

Now the student gathers the Value vectors of the earlier students and combines them using the attention weights (generalized for all tokens):

\[ \text{Output}_i = \sum_j \text{AttentionWeight}_{i,j} \cdot V_j \]

This gives us a contextualized representation of every token. \( \text{Output}_7 \) now carries not just the meaning of the word in itself, but also the important bits of the earlier words it decided to attend to.

This process of computing self-attention scores and generating a contextualized representation happens multiple times in parallel; this is known as multi-headed masked self-attention. The outputs for token in from all the heads are concatenated, multiplied by an additional weight matrix (a linear projection) that combines their information, and finally passed through a feedforward neural network.
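
Here's a minimal sketch of a single head of masked self-attention in PyTorch. The weight matrices are random stand-ins rather than GPT-2's learned parameters, so the numbers are only illustrative; the structure (projections, scaled dot products, causal mask, softmax, weighted sum) is the point:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 7, 768, 64          # 7 tokens, embedding size 768, head size 64

x = torch.randn(seq_len, d_model)           # stand-in for token + positional embeddings
W_q = torch.randn(d_model, d_k)             # random stand-ins for the learned weight matrices
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v         # Query, Key, Value vectors: (seq_len, d_k) each

scores = (Q @ K.T) / d_k ** 0.5             # scaled dot-product scores
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))   # hide future tokens (the "masked" part)

weights = F.softmax(scores, dim=-1)         # each row sums to 1
contextualized = weights @ V                # weighted sum of Value vectors, (seq_len, d_k)

print(weights[-1])                          # how the last token ("in") attends to earlier tokens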

💡
Tip: Models like BERT, however, can see the entire input sequence — they use a bidirectional self-attention mechanism. This makes them well-suited for tasks that require full context, such as text classification (e.g., fine-tuning BERT or RoBERTa), sentence similarity (e.g., using Sentence Transformers), and more.

Feedforward Neural Network

The contextualized representation of the word in is passed through a feedforward neural network (in GPT-2, two linear layers with a non-linearity in between). Think of this as a step that refines the representation and adds a bit more processing power.

It’s as if the model is saying: "Okay, I’ve figured out what matters, now let me make sense of it."

Residual Connection + Layer Normalization

The model uses a residual (skip) connection to keep the original information flowing forward, even after applying complex transformations. This helps it learn better and avoid forgetting useful details. Layer normalization then rescales the values so they don't get too extreme during training, which keeps the learning process stable and reliable.
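
To tie the three jobs together, here's a simplified sketch of one decoder block in PyTorch. GPT-2 applies LayerNorm before each sub-layer; this is not the exact GPT-2 implementation, just the overall structure:

import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, causal_mask=None):
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)  # masked self-attention
        x = x + attn_out                        # residual connection around attention
        x = x + self.mlp(self.ln_2(x))          # residual connection around the feedforward network
        return x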

And That’s Just One Block! Transformer models don’t just have one decoder block; they stack lots of them. Each one improves on the understanding from the last. Each layer will have a different set of weight matrices for generating Q, K, and V vectors.

By the time the model gets through all the blocks, it has a really rich understanding of the input it has access to.

Decoder Output → Vocabulary Scores

At the final layer, each token's vector is a rich, contextualized representation. The last decoder block gives us a vector for the last token, i.e., in. Now, the model wants to guess the next token. So it takes that 768-length vector and passes it through a Linear layer (basically a big matrix) that maps it to a list of scores, one for every token in the vocabulary.

For example, GPT-2 has 50,257 tokens in its vocabulary. So now we have 50,257 scores, one per token.

Right now, those are just raw scores (logits); they don't mean much by themselves. So, GPT applies a softmax function to turn them into probabilities.

Now GPT can say:

  • Car → 0.00003
  • Nepal → 0.91
  • tofu → 0.00001
  • India → 0.005
  • China → 0.002
  • Australia → 0.001
  • etc.

These numbers are the model's best guess of what the next word should be, based on everything it has seen so far.

So in our case, it chooses Nepal, because the model is 91% confident that it's the right next word.
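
Here's a small sketch of this last step with the actual GPT-2 model. The probabilities above are illustrative; the real numbers you get will differ:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("Gautam Buddha was born in", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits        # shape (1, 7, 50257): one list of scores per position

next_token_logits = logits[0, -1]           # raw scores for the token after "in"
probs = torch.softmax(next_token_logits, dim=-1)

top = torch.topk(probs, 5)                  # the 5 most probable next tokens
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p.item():.4f}")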

Auto-regression: Predict, Add, Repeat

Once Nepal is chosen, the model adds it to the input sequence:

"Gautam Buddha was born in Nepal"

Now it runs everything again to guess the next token (maybe .), and repeats this loop until it decides the output is complete.
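
Here's a minimal greedy-decoding loop that mirrors this predict-add-repeat cycle (a sketch; in practice you'd call model.generate(), which also supports sampling, temperature, and stopping criteria):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("Gautam Buddha was born in", return_tensors="pt").input_ids
for _ in range(5):                                    # generate 5 more tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax()                  # greedily pick the most probable token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))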

Summary

Here's a complete summary of how an LLM auto-regressively does text generation.

  • Text → Tokens → Input IDs → Token Embeddings
  • Add Positional Embeddings
  • Pass through Transformer Decoder Blocks
    • Masked Self-Attention
    • Feedforward Layer
    • Residual + LayerNorm
  • Output goes through Linear + Softmax → Next Token (based on a probability)
  • Repeat (auto-regression)

That's it! I hope this article helped you understand the intuition behind how generative LLMs generate text.

I'll see you in the next one.

Reading Resources: