The Transformer Architecture
The Transformer is arguably the most impactful architecture in the history of artificial intelligence. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, it effectively rendered Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) obsolete for most sequence modeling tasks.
This document serves as a "single source of truth" for understanding the Transformer. We will deconstruct it from the mathematical foundations up to the modern implementation details used in state-of-the-art LLMs like GPT-4, Claude, and Llama 3.
2. Historical Genesis: Why it Happened
2014: Seq2Seq with RNNs (Sutskever et al.)
2015: Attention Mechanism for NMT (Bahdanau et al.)
2016: Google Neural Machine Translation (GNMT)
2017: The Transformer (Vaswani et al.)
The Recurrent Bottleneck
Before Transformers, sequence modeling was dominated by RNNs. An RNN processes data sequentially: token \(t\) depends on the hidden state of token \(t-1\).
This sequential nature introduced two fatal flaws preventing scale:
- No Parallelism: You cannot compute the hidden state at step 100 without computing steps 1 through 99. This makes GPU training incredibly inefficient, as GPUs excel at massive parallel matrix operations, not serial accumulation.
- Information Bottleneck: The entire history of a sentence, no matter how long, had to be compressed into a single fixed-size vector \(h_t\). As the sequence length grew, gradients flowing back through the long chain of hidden states would vanish (or explode), making it very hard to learn dependencies between distant tokens.
Attention is Born (2015)
Bahdanau et al. (2015) introduced "Attention" as a patch for RNNs. Instead of relying on the final hidden state, the decoder could "look back" at all encoder hidden states and calculate a weighted average based on relevance.
The Transformer team asked a radical question: If Attention gives us access to the whole sequence at once, why do we need the RNN recurrence at all?
They removed the recurrence. They removed the convolutions. They kept only the Attention. Hence: "Attention Is All You Need".
3. Architecture Overview
The original Transformer was an Encoder-Decoder model designed for machine translation (English to German).
4. Input Processing: Embeddings & Position
Tokenization (Briefly)
Raw text is converted into integers via a tokenizer (BPE, SentencePiece).
"Hello world" → [15496, 995].
Input Embeddings
Each token ID is looked up in a learnable embedding matrix \(W_e \in \mathbb{R}^{V \times d_{model}}\). If \(d_{model} = 4096\) (as in Llama 2 7B), every token becomes a vector of 4096 floating-point numbers.
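As a minimal sketch, the lookup is just an index into the embedding matrix. The vocabulary size below is an illustrative assumption; the token IDs are the example from the tokenization section:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 4096            # illustrative V and d_model
embedding = nn.Embedding(vocab_size, d_model)  # the learnable matrix W_e

token_ids = torch.tensor([15496, 995])         # e.g. "Hello world"
x = embedding(token_ids)                       # each ID -> a d_model vector
print(x.shape)                                 # torch.Size([2, 4096])
```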
Positional Encodings (PE)
Crucial Concept: The Transformer has no inherent sense of order. If you shuffle the words in a sentence, the Self-Attention mechanism produces the exact same output (permutation invariance). To fix this, we must inject position information.
Original Sinusoidal PE (Vaswani et al.)
Instead of learning position vectors, they chose fixed frequencies so the model could potentially extrapolate to longer lengths.
We add this directly to the input embedding: \(x = x_{embed} + x_{pos}\).
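The sinusoidal table can be sketched as follows. The formula in the docstring is the one from Vaswani et al.; `max_len` and `d_model` here are arbitrary illustrative values:

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1)        # (max_len, 1)
    dim = torch.arange(0, d_model, 2)               # even dimension indices 2i
    freq = torch.pow(10000.0, -dim / d_model)       # 1 / 10000^(2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)             # even dims: sine
    pe[:, 1::2] = torch.cos(pos * freq)             # odd dims: cosine
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
# x = x_embed + pe[:seq_len]   # added directly to the input embeddings
```

Each dimension oscillates at a different fixed frequency, so every position gets a unique fingerprint without any learned parameters.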
Rotary Positional Embeddings (RoPE)
Modern LLMs (Llama, Mistral, PaLM) use RoPE. Instead of adding position information, RoPE rotates the Query and Key vectors.
Imagine the 2D plane. Rotating a vector by angle \(\theta \cdot m\) (where \(m\) is position) encodes relative distance naturally. If you dot product two vectors at position \(m\) and \(n\), the result depends only on \(m-n\) (relative distance), which is a highly desirable property for language.
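That rotation can be sketched as below. This is a toy reference, not a production RoPE kernel: the function name and the pairing of adjacent dimensions are illustrative choices (real implementations, e.g. Llama's, pair dimensions differently and fuse the computation):

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each 2-D pair of dims of x (seq, d) by a position-dependent angle."""
    seq, d = x.shape
    theta = base ** (-torch.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos.unsqueeze(1) * theta              # (seq, d/2): m * theta
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # split into 2-D pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(5, 8), torch.randn(5, 8)
pos = torch.arange(5, dtype=torch.float32)
q_rot, k_rot = rope_rotate(q, pos), rope_rotate(k, pos)
```

The relative-position property is easy to check: shifting all positions by a constant leaves every dot product \(q_m \cdot k_n\) unchanged, since it depends only on \(m-n\).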
5. The Attention Mechanism
This is the engine of the Transformer.
Query, Key, and Value
For every token vector \(x\), we project it into three vectors using learnable matrices \(W_Q, W_K, W_V\):
- Query (\(Q\)): What am I looking for?
- Key (\(K\)): What do I contain?
- Value (\(V\)): What information should I pass along?
Scaled Dot-Product Attention
Step-by-Step Execution:
- Dot Product (\(QK^T\)): Calculate similarity between every Query and every Key. Result is an \(N \times N\) matrix of scores.
- Scaling (\(\frac{1}{\sqrt{d_k}}\)): Divide by the square root of the head dimension (e.g., \(\sqrt{64} = 8\)). Without scaling, large dot products push the softmax into saturation, where its gradients vanish.
- Masking (Crucial for Decoders): Set the upper triangular part of the matrix to \(-\infty\). This ensures token \(t\) cannot "see" tokens \(t+1, t+2, \dots\); it can only attend to the past.
- Softmax: Convert scores to probabilities (sum to 1).
- Weighted Sum: Multiply probabilities by Values (\(V\)).
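The five steps above can be sketched as a single function. This is a naive reference implementation, not the fused Flash Attention kernel used in practice:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal: bool = True):
    """q, k, v: (seq, d_k) tensors. Returns the (seq, d_k) attention output."""
    d_k = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)   # 1. dot product, 2. scale
    if causal:                                            # 3. mask the future
        seq = scores.size(-1)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)                   # 4. rows sum to 1
    return weights @ v                                    # 5. weighted sum of values

q = torch.randn(4, 64)
k = torch.randn(4, 64)
v = torch.randn(4, 64)
out = scaled_dot_product_attention(q, k, v)
```

With the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.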
Multi-Head Attention (MHA)
One attention head might focus on syntax (subject-verb). Another might focus on context (pronoun resolution). To allow this, we split the embedding dimension into \(h\) "heads".
```python
# Conceptual PyTorch implementation (the full version appears in Section 9)
class MultiHeadAttention(nn.Module):
    def forward(self, x):
        B, T, C = x.shape                         # Batch, Seq, Dim
        nh, hs = self.n_head, C // self.n_head    # num_heads, head_size
        # Split into heads: (B, T, C) -> (B, nh, T, hs)
        k = self.key(x).view(B, T, nh, hs).transpose(1, 2)
        q = self.query(x).view(B, T, nh, hs).transpose(1, 2)
        v = self.value(x).view(B, T, nh, hs).transpose(1, 2)
        # Attention (Flash Attention usually replaces this manual implementation)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v                               # (B, nh, T, hs)
        # Re-assemble heads back into (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```
6. Feed-Forward Networks (FFN)
While Attention mixes information between tokens, the FFN processes information within each token independently. It is often described as the "Key-Value Memory" of the model where factual knowledge is stored.
SwiGLU Activation
Original Transformers used ReLU. Modern LLMs (PaLM, Llama) use SwiGLU, a gated variant of SiLU (Swish). It requires three matrix multiplications instead of two but offers better convergence.
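A sketch of a SwiGLU FFN follows. The layer names (`w_gate`, `w_up`, `w_down`) and the hidden size are illustrative choices, loosely following Llama's shapes; note the three matrix multiplications mentioned above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: down( silu(gate(x)) * up(x) ) -- three matmuls, not two."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # up projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLU(d_model=512, d_hidden=1376)   # hidden often ~ (8/3) * d_model
y = ffn(torch.randn(2, 16, 512))           # (batch, seq, d_model)
```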
7. Normalization & Residuals
Residual Connections
\(x \leftarrow x + \text{Layer}(x)\).
This "highway" for gradients allows deep networks (100+ layers) to train without gradient vanishing.
LayerNorm vs. RMSNorm
- LayerNorm (Original): Centers and scales the vector to mean 0, variance 1. Requires calculating mean and variance.
- RMSNorm (Modern): Root Mean Square Normalization. Only scales (no centering). Computationally cheaper and empirically equal or better performance.
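The difference is visible in a minimal RMSNorm sketch: there is no mean subtraction, only a divide-by-RMS and a learnable gain (the `eps`-inside-the-root placement is one common convention):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """y = g * x / rms(x): scale only, no centering (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

norm = RMSNorm(512)
y = norm(torch.randn(2, 16, 512))  # output vectors have RMS ~ 1
```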
Post-Norm vs. Pre-Norm
- Post-Norm (original paper): \(\text{Norm}(x + \text{Sublayer}(x))\). Hard to train; required a learning-rate warmup.
- Pre-Norm (modern LLMs): \(x + \text{Sublayer}(\text{Norm}(x))\). Much more stable for deep stacks.
8. The Decoder-Only Shift (GPT Style)
Why did we abandon the Encoder?
The Encoder is bidirectional (each token attends to future tokens). This is great for understanding (BERT) but unsuitable for generation: to complete text, we must predict the next token from the past alone.
The GPT (Generative Pre-trained Transformer) family proved that if you scale up a Decoder-only model and train it on enough data, it learns to "understand" just as well as an Encoder, while retaining the ability to generate.
9. Full Implementation (NanoGPT Style)
Here is a complete, minimal, clean implementation of a modern Decoder-only Transformer block in PyTorch.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Key, Query, Value projections combined
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # Causal mask (ensure we don't look into the future)
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                          .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # Batch, Time, Channels
        # Calculate query, key, values
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape for multi-head attention
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Attention scores
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        # Weighted sum
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
        self.act = nn.GELU()  # Or SwiGLU

    def forward(self, x):
        x = self.c_fc(x)
        x = self.act(x)
        x = self.c_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)  # Or RMSNorm
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # Pre-Norm architecture (Norm -> Sublayer -> Add)
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```
10. Complexity Analysis
The cost of the Transformer is dominated by the Attention mechanism.
| Operation | Complexity | Explanation |
|---|---|---|
| Attention (Time) | $$O(N^2 \cdot d)$$ | Quadratic with sequence length \(N\). This is why context windows were historically small (2k, 4k). |
| FFN (Time) | $$O(N \cdot d^2)$$ | Linear with sequence length, but quadratic with model dimension. |
| Memory (Training) | $$O(N^2)$$ | Storing the attention matrix for backpropagation. |
| Inference (KV Cache) | $$O(N \cdot d)$$ | We must store Key/Value vectors for all previous tokens to avoid recomputing them. |
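As a back-of-the-envelope instance of the KV-cache row, assuming an illustrative 7B-class configuration (all numbers are assumptions, not measurements):

```python
# Rough KV-cache size for one sequence, assumed 7B-class config:
n_layers, n_heads, head_dim = 32, 32, 128   # d_model = n_heads * head_dim = 4096
seq_len, bytes_per_val = 4096, 2            # fp16 values

# 2 tensors (K and V) per layer, each storing n_heads * head_dim values per token
cache_bytes = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_val
print(f"{cache_bytes / 2**30:.1f} GiB")     # 2.0 GiB for this configuration
```

The linear growth in \(N\) is exactly why long-context inference is memory-bound, and why techniques like grouped-query attention shrink the number of cached heads.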
11. References
- Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS.
- Bahdanau, D., et al. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate". ICLR.
- Su, J., et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.
- Shazeer, N. (2020). "GLU Variants Improve Transformer". arXiv:2002.05202.