How LLMs Work: A Beginner's Guide to Decoder-Only Transformers
A language model like GPT (which stands for Generative Pre-trained Transformer) takes text, breaks it into tokens (words or subwords), converts those tokens into numbers, processes those numbers through layers of Transformer decoders, and finally outputs a probability distribution over all possible tokens in its vocabulary. It then picks the next token from that distribution, for example by choosing the most probable token or by sampling. This process repeats until a full response is generated.
If you're new to the Transformer architecture, this might sound like a lot, but stick with me throughout this blog post. By the end, you'll have a clear picture you won't forget.
While this post uses GPT as the main example, the explanations generally apply to other LLMs too. The core principles remain similar.
GPT and ChatGPT
Let's clear up one thing before we get started: ChatGPT is an application built on top of large language models (LLMs) like GPT-3.5 and GPT-4. While GPT is the underlying model that learns to predict and generate text, ChatGPT adds several key layers:
- A conversational interface – the chat-style interaction we’re all familiar with.
- Instruction tuning – GPT is fine-tuned using datasets made of (instruction, input, output) examples. Without instruction tuning, if you prompt: "Summarize this paragraph," a raw GPT model might just continue the paragraph. An instruction-tuned model recognizes "summarize" as a command, generates a summary, and knows when to stop.
- Safety and alignment layers – which include:
- Reinforcement Learning from Human Feedback (RLHF): human evaluators guide the model’s behavior, encouraging helpfulness and harmlessness.
- System prompts & behavior shaping: internal instructions guide the model's tone, personality, and limits.
So, when you’re chatting with ChatGPT, you’re actually interacting with a powerful GPT model that’s been tuned to follow instructions and have natural conversations.
Now that we're familiar with this difference, let's get started step-by-step.
Input Text → Tokens
Think of GPT as doing smart autocomplete.
Input: Gautam Buddha was born in
GPT doesn't understand raw text. Instead, it uses a tokenizer that converts text into tokens, typically subwords. Each LLM has a vocabulary: a list of tokens and their corresponding ID numbers, called input IDs.
Since GPT is trained on multiple languages, imagine how massive the vocabulary would be if it had to include every word from every language! Instead, GPT breaks words into subwords or tokens.
For example, if GPT has to understand the name "Rabindra," it might split it into tokens like "R", "ab", "ind", and "ra".
Code Example: Tokenization
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Rabindra")
print(tokens)
# Output: ['R', 'ab', 'ind', 'ra']
Some words like "Australia" are found whole in the vocabulary:
tokenizer.tokenize("Australia")
# Output: ['Australia']
Now let’s tokenize a sentence:
tokenizer.tokenize("Melbourne is an amazing city.")
# Output: ['Mel', 'bourne', 'Ġis', 'Ġan', 'Ġamazing', 'Ġcity', '.']
Notice the special character Ġ before some tokens; it indicates a space before the token. Subwords like 'Mel' and 'bourne' don't have it, which means they are fragments of a larger word.
Tokens → Input IDs
Once tokenized, the tokens are mapped to numbers (input IDs).
input_ids = tokenizer.encode("Melbourne is an amazing city.")
print(input_ids)
# Output: [21102, 12544, 318, 281, 4998, 1748, 13]
Let’s try another sentence:
input_ids = tokenizer.encode("Sydney is an amazing city.")
# Tokenized: ['S', 'yd', 'ney', 'Ġis', 'Ġan', 'Ġamazing', 'Ġcity', '.']
# Output: [50, 5173, 1681, 318, 281, 4998, 1748, 13]
As you can see, "Sydney" and "Melbourne" are broken into different tokens, but the rest of the sentence has the same input IDs.
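As a quick sanity check, the same tokenizer can map input IDs back to tokens and text:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("Melbourne is an amazing city.")
print(tokenizer.convert_ids_to_tokens(ids))  # back to tokens: ['Mel', 'bourne', 'Ġis', ...]
print(tokenizer.decode(ids))                 # back to text: "Melbourne is an amazing city."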
Remember that each LLM has its own tokenizer and vocabulary.
Now, let's get back to our example: Gautam Buddha was born in
Input IDs → Embeddings
Each input ID is passed through an embedding layer to become a fixed-size vector (token embedding/token vector) (e.g., size 768). If you're wondering what such vectors are, I've explained them in detail in this article.
Gautam Buddha was born in
↓
['G', 'aut', 'am', 'ĠBuddha', 'Ġwas', 'Ġborn', 'Ġin']
↓
[38, 2306, 321, 19154, 373, 4642, 287]
↓
[
E1 = [0.01, -0.2, ..., 0.003], # vector for "G"
E2 = [0.05, 0.07, ..., -0.01], # "aut"
...
]
Positional Embeddings: LLMs care about word order. The following sentences are different to LLMs.
- "Gautam Buddha was born in Nepal."
- "Nepal was born in Gautam Buddha."
To preserve word order, each token's embedding is added to a positional encoding. Think of positional encodings as embeddings for each position. For each token, the final embedding = token_embedding + position_embedding
So:
- Token 0 → token_embedding("G") + position_embedding(0)
- Token 1 → token_embedding("aut") + position_embedding(1)
- ... and so on
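To make this concrete, here's a small sketch that pulls GPT-2's own embedding layers out of the Hugging Face model (in this implementation, wte holds the token embeddings and wpe the positional ones) and adds them for our example sentence:
import torch
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
input_ids = tokenizer("Gautam Buddha was born in", return_tensors="pt")["input_ids"]
token_embeddings = model.wte(input_ids)                 # look up each input ID: shape (1, 7, 768)
positions = torch.arange(input_ids.shape[1])            # positions 0, 1, ..., 6
position_embeddings = model.wpe(positions)              # one vector per position: shape (7, 768)
hidden_states = token_embeddings + position_embeddings  # what the first decoder block receives
print(hidden_states.shape)                              # torch.Size([1, 7, 768])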
Embeddings → Transformer Decoder Blocks
Now comes the magic: passing the embedded sequence into a stack of Transformer decoder blocks. This is where the model thinks. You can imagine each decoder block as a little reasoning unit that asks:
"Based on everything I’ve seen so far, what should I focus on to guess the next word?"
Each block has three jobs:
- Masked Self-Attention: What matters the most right now?
- Feedforward Neural Network: Making sense of the idea.
- Residual Connections + LayerNorm: Keeping everything stable.
If you're curious about the detailed math alongside visuals behind how transformers work, especially masked self-attention, I highly recommend checking out Jay Alammar's "The Illustrated GPT-2". It's one of the best, most visual explanations out there.
Masked Self-Attention
Our input so far is: Gautam Buddha was born in
The model is now trying to figure out what word comes next.
But it doesn’t treat each word equally — instead, it looks back at the previous words and tries to decide which ones are most important to focus on. Maybe "born" is really important here. Maybe "Buddha" matters too.
This "looking back" process is called masked self-attention.
Here’s how to think about it:
- Imagine each token is asking, "Who among the tokens before me should I listen to?"
- The model decides: "Let me pay 50% attention to born, 30% to Buddha, and so on."
- It then blends the meanings of those words to help it guess the next one.
This is how the model starts to understand context. Not just what the last word was — but which earlier words matter for the next prediction.
Analogy + Maths
Each word (token) is a student
Imagine we're in a classroom where every word is a student in a line. Let's focus on the word "in": this is our student who's about to speak and is trying to decide what to say next.
To prepare for this, every student is given three things:
- A Query vector (Q) – like a question they're asking: "Whose input should I listen to?"
- A Key vector (K) – like a nametag that says what they know about.
- A Value vector (V) – like their actual knowledge or opinion.
How are these vectors generated? The embedding of each token is multiplied by three different weight matrices to generate these vectors. The Q, K, V vectors will have smaller dimensions (e.g., 64) than the original token embedding (e.g., 768). The weight matrices are learnt during training.
Why are the Q, K, and V dimensions smaller than the original token embedding dimension? Well, it's a design choice. And to be honest, it's not just masked self-attention we're dealing with — it's actually multi-headed masked self-attention. The self-attention we're discussing here is happening in one head. But in practice, this attention mechanism is repeated across, say, 12 different heads. So, if each head uses Q, K, and V vectors with a dimension of 64, then after processing through all 12 heads in parallel, the outputs are concatenated: 64 × 12 = 768 dimensions. Sound familiar? That’s the same as the original token embedding dimension!
Of course, different large language models (LLMs) can have different Q, K, V vector sizes and a different number of heads. Each head will have a different set of weight matrices for generating Q, K, and V vectors.
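To see how those projections might look in code, here's a minimal NumPy sketch for a single head. The weight matrices are random stand-ins for the learned ones, and the token embeddings are made up too; only the shapes match GPT-2 small.
import numpy as np
d_model, d_head = 768, 64                  # embedding size and per-head size (GPT-2 small)
rng = np.random.default_rng(0)
# Random stand-ins for the learned weight matrices of ONE attention head
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
x = rng.normal(size=(7, d_model))          # made-up embeddings for our 7 tokens
Q = x @ W_q                                # (7, 64): one Query vector per token
K = x @ W_k                                # (7, 64): one Key vector per token
V = x @ W_v                                # (7, 64): one Value vector per token
With 12 such heads, concatenating the 64-dimensional outputs brings us back to 768 dimensions, exactly as described above.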
Let's continue with our student and class analogy.
Attention Scores: Asking the class
Our student "in" takes its Query vector and goes down the row, comparing it with each earlier student's Key vector.
Mathematically, this is a dot product (generalized for all tokens):
\[ \text{Score}_{i,j} = Q_i \cdot K_j \]
Analogy: The student "in" is asking: "Hey Buddha, does what you said help me figure out what comes next? How about you, born?"
The dot product tells us how aligned their topics are — high scores mean better alignment.
Scaling and Softmax: Normalizing opinions
Now that "in" has scores for each previous student, we do two things:
- Divide the scores by \( \sqrt{d_k} \), where \( d_k \) is the dimension of the Key vectors (to keep gradients stable).
- Softmax the scores so they turn into probabilities (generalized for all tokens):
\[ \text{AttentionWeight}_{i,j} = \text{softmax}\left( \frac{Q_i \cdot K_j}{\sqrt{d_k}} \right) \]
What does a softmax function do?
- It squashes all the scores into numbers between 0 and 1.
- And most importantly, they all add up to 1, like a probability distribution.
Analogy: The student "in" now decides: "Okay, I'll listen 50% to born, 30% to Buddha, 7% to was…"
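If you'd like to see softmax in action, here's a tiny NumPy sketch with made-up scores:
import numpy as np
def softmax(scores):
    exp = np.exp(scores - scores.max())    # subtract the max for numerical stability
    return exp / exp.sum()
scores = np.array([2.0, 1.5, -0.3, 0.1])   # made-up attention scores
weights = softmax(scores)
print(weights)                             # roughly [0.54, 0.33, 0.05, 0.08], all between 0 and 1
print(weights.sum())                       # 1.0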
Weighted Sum: Combining the voices
Now the student gathers the Value vectors of the earlier students and combines them using the attention weights (generalized for all tokens):
\[ \text{Output}_i = \sum_j \text{AttentionWeight}_{i,j} \cdot V_j \]
This gives us a contextualized representation of every token. \( \text{Output}_7 \) now carries not just the word "in" itself, but also the important bits of the earlier words it decided to attend to.
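Putting the scoring, masking, scaling, softmax, and weighted sum together, here's a small single-head sketch in NumPy; the Q, K, V matrices are random stand-ins for the projected vectors from earlier:
import numpy as np
def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # Score[i, j] = Q_i . K_j, scaled
    mask = np.triu(np.ones_like(scores), k=1) == 1     # True above the diagonal (future tokens)
    scores = np.where(mask, -np.inf, scores)           # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                 # weighted sum of the Value vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(7, 64)) for _ in range(3))
output = masked_self_attention(Q, K, V)
print(output.shape)                                    # (7, 64): one contextualized vector per token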
This process of computing self-attention scores and generating a contextualized representation happens multiple times in parallel; this is known as multi-headed masked self-attention. The outputs for the token "in" from all the heads are concatenated, multiplied by an additional weight matrix (a linear projection) that combines their information, and finally passed through a feedforward neural network.
Feedforward Neural Network
The contextualized representation of the word "in" is passed through a neural network. Think of this like a filter that adds a bit more complexity.
It’s as if the model is saying: "Okay, I’ve figured out what matters, now let me make sense of it."
Residual Connection + Layer Normalization
The model uses a residual (skip) connection to keep the original information around, even after applying complex transformations. This helps it learn better and avoid forgetting useful details. Layer normalization keeps the values from getting too extreme during training, which makes the learning process stable and reliable.
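As a rough sketch, each sub-layer (attention or feedforward) is wrapped like this. It's simplified: the learned scale and shift of LayerNorm are omitted, and the stand-in sub-layer is just for illustration (GPT-2 applies the normalization before the sub-layer and then adds the original input back).
import numpy as np
def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)        # learned scale/shift parameters omitted
def apply_sublayer(x, sublayer):
    # Residual connection: add the original input back to the sub-layer's output
    return x + sublayer(layer_norm(x))     # pre-norm, as in GPT-2
x = np.random.default_rng(0).normal(size=(7, 768))
out = apply_sublayer(x, sublayer=lambda h: h * 0.5)    # a toy sub-layer, just to show the wiring
print(out.shape)                                       # (7, 768)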
And That’s Just One Block! Transformer models don’t just have one decoder block; they stack lots of them. Each one improves on the understanding from the last. Each layer will have a different set of weight matrices for generating Q, K, and V vectors.
By the time the model gets through all the blocks, it has a really rich understanding of the input it has access to.
Decoder Output → Vocabulary Scores
At the final layer, each token vector contains a rich contextualized representation. The last decoder block gives us a vector for the last token, i.e., "in". Now, the model wants to guess the next token. So it takes that 768-length vector and passes it through a Linear layer (basically a big matrix) that maps it to a list of scores, one for every token in the vocabulary.
For example, GPT-2 has 50,257 tokens in its vocabulary. So now we have 50,257 scores, one per token.
Right now, those are just raw scores (often called logits); they don't mean much by themselves. So GPT applies a softmax function to turn them into probabilities.
Now GPT can say:
- Car → 0.00003
- Nepal → 0.91
- tofu → 0.00001
- India → 0.005
- China → 0.002
- Australia → 0.001
- etc.
These numbers are the model's best guess of what the next word should be, based on everything it has seen so far.
So in our case, it chooses Nepal, because the model is 91% confident that it's the right next word.
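If you want to inspect this distribution yourself, here's a short sketch using the Hugging Face GPT-2 small model. Keep in mind the percentages above are illustrative; the actual numbers (and even the top token) from GPT-2 small may differ.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer("Gautam Buddha was born in", return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = model(input_ids).logits                   # shape: (1, 7, 50257)
next_token_logits = logits[0, -1]                      # raw scores for the token after "in"
probs = torch.softmax(next_token_logits, dim=-1)       # 50,257 probabilities summing to 1
top = torch.topk(probs, k=5)                           # the five most likely next tokens
for p, idx in zip(top.values, top.indices):
    print(repr(tokenizer.decode(int(idx))), round(p.item(), 4))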
Auto-regression: Predict, Add, Repeat
Once Nepal is chosen, the model adds it to the input sequence:
"Gautam Buddha was born in Nepal"
Now it runs everything again to guess the next token (maybe "."), and repeats this loop until it decides the output is complete.
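Here's a minimal sketch of that loop using greedy decoding (always picking the most probable token); real systems add sampling strategies and stopping criteria, but the predict-append-repeat structure is the same:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
input_ids = tokenizer("Gautam Buddha was born in", return_tensors="pt")["input_ids"]
for _ in range(5):                                     # generate 5 more tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_id = logits[0, -1].argmax()                   # greedy: take the most probable token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # append it and run again
print(tokenizer.decode(input_ids[0]))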
Summary
Here's a complete summary of how an LLM generates text auto-regressively.
- Text → Tokens → Input IDs → Token Embeddings
- Add Positional Embeddings
- Pass through Transformer Decoder Blocks
- Masked Self-Attention
- Feedforward Layer
- Residual + LayerNorm
- Output goes through Linear + Softmax → Next Token (based on a probability)
- Repeat (auto-regression)
That's it! I hope this article helped you understand the intuition behind how generative LLMs generate text.
I'll see you in the next one.
Reading Resources:
- Attention is All You Need (The paper that introduced the Transformer architecture)
- Better language models and their implications (GPT-2 introduction)
- GPT-2 on Hugging Face (GPT-2 weights)
- The Illustrated Transformer
- The Illustrated GPT-2 (Visualizing Transformer Language Models)