The Transformer Architecture
The Transformer is arguably the most impactful architecture in the history of artificial intelligence. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, it effectively rendered Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) obsolete for most sequence modeling tasks.
This document serves as a "single source of truth" for understanding the Transformer. We will deconstruct it from the mathematical foundations up to the modern implementation details used in state-of-the-art LLMs like GPT-4, Claude, and Llama 3.
2. Historical Genesis: Why it Happened
2014: Seq2Seq with RNNs (Sutskever et al.)
2015: Attention Mechanism for NMT (Bahdanau et al.)
2016: Google Neural Machine Translation (GNMT)
2017: The Transformer (Vaswani et al.)
The Recurrent Bottleneck
Before Transformers, sequence modeling was dominated by RNNs. An RNN processes data sequentially: token \(t\) depends on the hidden state of token \(t-1\).
This sequential nature introduced two fatal flaws preventing scale:
- No Parallelism: You cannot compute the hidden state at step 100 without computing steps 1 through 99. This makes GPU training incredibly inefficient, as GPUs excel at massive parallel matrix operations, not serial accumulation.
- Information Bottleneck: The entire history of a sentence, no matter how long, had to be compressed into a single fixed-size vector \(h_t\). As the sequence length grew, gradients flowing back through the long chain of hidden states would vanish (or explode), making it very hard to learn dependencies between distant tokens.
Attention is Born (2015)
Bahdanau et al. (2015) introduced "Attention" as a patch for RNNs. Instead of relying on the final hidden state, the decoder could "look back" at all encoder hidden states and calculate a weighted average based on relevance.
The Transformer team asked a radical question: If Attention gives us access to the whole sequence at once, why do we need the RNN recurrence at all?
They removed the recurrence. They removed the convolutions. They kept only the Attention. Hence: "Attention Is All You Need".
3. Architecture Overview
The original Transformer was an Encoder-Decoder model designed for machine translation (English to German).
4. Input Processing: Embeddings & Position
Tokenization (Briefly)
Raw text is converted into integers via a tokenizer (BPE, SentencePiece).
"Hello world" → [15496, 995].
Input Embeddings
Each token ID is looked up in a learnable embedding matrix \(W_e \in \mathbb{R}^{V \times d_{model}}\). If \(d_{model} = 4096\) (as in Llama 2 7B), every token becomes a vector of 4096 floating-point numbers.
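As a minimal sketch, the lookup is just an index into the embedding matrix. The vocabulary size below is an illustrative assumption; the token IDs are the example from the tokenization section:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 4096            # illustrative V and d_model
embedding = nn.Embedding(vocab_size, d_model)  # the learnable matrix W_e

token_ids = torch.tensor([15496, 995])         # e.g. "Hello world"
x = embedding(token_ids)                       # each ID -> a d_model vector
print(x.shape)                                 # torch.Size([2, 4096])
```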
Positional Encodings (PE)
Crucial Concept: The Transformer has no inherent sense of order. If you shuffle the words in a sentence, the Self-Attention mechanism produces the exact same output (permutation invariance). To fix this, we must inject position information.
Original Sinusoidal PE (Vaswani et al.)
Instead of learning position vectors, they chose fixed frequencies so the model could potentially extrapolate to longer lengths.
We add this directly to the input embedding: \(x = x_{embed} + x_{pos}\).
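The sinusoidal table can be sketched as follows. The formula in the docstring is the one from Vaswani et al.; `max_len` and `d_model` here are arbitrary illustrative values:

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1)        # (max_len, 1)
    dim = torch.arange(0, d_model, 2)               # even dimension indices 2i
    freq = torch.pow(10000.0, -dim / d_model)       # 1 / 10000^(2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)             # even dims: sine
    pe[:, 1::2] = torch.cos(pos * freq)             # odd dims: cosine
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
# x = x_embed + pe[:seq_len]   # added directly to the input embeddings
```

Each dimension oscillates at a different fixed frequency, so every position gets a unique fingerprint without any learned parameters.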
Rotary Positional Embeddings (RoPE)
Modern LLMs (Llama, Mistral, PaLM) use RoPE. Instead of adding position information, RoPE rotates the Query and Key vectors.
Imagine the 2D plane. Rotating a vector by angle \(\theta \cdot m\) (where \(m\) is position) encodes relative distance naturally. If you dot product two vectors at position \(m\) and \(n\), the result depends only on \(m-n\) (relative distance), which is a highly desirable property for language.
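That rotation can be sketched as below. This is a toy reference, not a production RoPE kernel: the function name and the pairing of adjacent dimensions are illustrative choices (real implementations, e.g. Llama's, pair dimensions differently and fuse the computation):

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each 2-D pair of dims of x (seq, d) by a position-dependent angle."""
    seq, d = x.shape
    theta = base ** (-torch.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos.unsqueeze(1) * theta              # (seq, d/2): m * theta
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # split into 2-D pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(5, 8), torch.randn(5, 8)
pos = torch.arange(5, dtype=torch.float32)
q_rot, k_rot = rope_rotate(q, pos), rope_rotate(k, pos)
```

The relative-position property is easy to check: shifting all positions by a constant leaves every dot product \(q_m \cdot k_n\) unchanged, since it depends only on \(m-n\).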
5. The Attention Mechanism
This is the engine of the Transformer.
Query, Key, and Value
For every token vector \(x\), we project it into three vectors using learnable matrices \(W_Q, W_K, W_V\):
- Query (\(Q\)): What am I looking for?
- Key (\(K\)): What do I contain?
- Value (\(V\)): What information should I pass along?
Scaled Dot-Product Attention
Step-by-Step Execution:
- Dot Product (\(QK^T\)): Calculate similarity between every Query and every Key. Result is an \(N \times N\) matrix of scores.
- Scaling (\(\frac{1}{\sqrt{d_k}}\)): Divide by the square root of the head dimension (e.g., \(\sqrt{64} = 8\)). Without scaling, large dot products push the softmax into saturation, where its gradients vanish.
- Masking (Crucial for Decoders): Set the upper triangular part of the matrix to \(-\infty\). This ensures token \(t\) cannot "see" tokens \(t+1, t+2, \dots\); it can only attend to the past.
- Softmax: Convert scores to probabilities (sum to 1).
- Weighted Sum: Multiply probabilities by Values (\(V\)).
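The five steps above can be sketched as a single function. This is a naive reference implementation, not the fused Flash Attention kernel used in practice:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal: bool = True):
    """q, k, v: (seq, d_k) tensors. Returns the (seq, d_k) attention output."""
    d_k = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)   # 1. dot product, 2. scale
    if causal:                                            # 3. mask the future
        seq = scores.size(-1)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)                   # 4. rows sum to 1
    return weights @ v                                    # 5. weighted sum of values

q = torch.randn(4, 64)
k = torch.randn(4, 64)
v = torch.randn(4, 64)
out = scaled_dot_product_attention(q, k, v)
```

With the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.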
Multi-Head Attention (MHA)
One attention head might focus on syntax (subject-verb). Another might focus on context (pronoun resolution). To allow this, we split the embedding dimension into \(h\) "heads".
```python
# Conceptual PyTorch implementation (the full version appears in Section 9)
class MultiHeadAttention(nn.Module):
    def forward(self, x):
        B, T, C = x.shape                         # Batch, Seq, Dim
        nh, hs = self.n_head, C // self.n_head    # num_heads, head_size
        # Split into heads: (B, T, C) -> (B, nh, T, hs)
        k = self.key(x).view(B, T, nh, hs).transpose(1, 2)
        q = self.query(x).view(B, T, nh, hs).transpose(1, 2)
        v = self.value(x).view(B, T, nh, hs).transpose(1, 2)
        # Attention (Flash Attention usually replaces this manual implementation)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v                               # (B, nh, T, hs)
        # Re-assemble heads back into (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```
6. Feed-Forward Networks (FFN)
While Attention mixes information between tokens, the FFN processes information within each token independently. It is often described as the "Key-Value Memory" of the model where factual knowledge is stored.
SwiGLU Activation
Original Transformers used ReLU. Modern LLMs (PaLM, Llama) use SwiGLU, a gated variant of SiLU (Swish). It requires three matrix multiplications instead of two but offers better convergence.
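A sketch of a SwiGLU FFN follows. The layer names (`w_gate`, `w_up`, `w_down`) and the hidden size are illustrative choices, loosely following Llama's shapes; note the three matrix multiplications mentioned above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: down( silu(gate(x)) * up(x) ) -- three matmuls, not two."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # up projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLU(d_model=512, d_hidden=1376)   # hidden often ~ (8/3) * d_model
y = ffn(torch.randn(2, 16, 512))           # (batch, seq, d_model)
```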
7. Normalization & Residuals
Residual Connections
\(x \leftarrow x + \text{Layer}(x)\).
This "highway" for gradients allows deep networks (100+ layers) to train without gradient vanishing.
LayerNorm vs. RMSNorm
- LayerNorm (Original): Centers and scales the vector to mean 0, variance 1. Requires calculating mean and variance.
- RMSNorm (Modern): Root Mean Square Normalization. Only scales (no centering). Computationally cheaper and empirically equal or better performance.
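The difference is visible in a minimal RMSNorm sketch: there is no mean subtraction, only a divide-by-RMS and a learnable gain (the `eps`-inside-the-root placement is one common convention):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """y = g * x / rms(x): scale only, no centering (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

norm = RMSNorm(512)
y = norm(torch.randn(2, 16, 512))  # output vectors have RMS ~ 1
```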
Post-Norm vs. Pre-Norm
- Post-Norm (original paper): \(\text{Norm}(x + \text{Sublayer}(x))\). Hard to train; required a learning-rate warmup.
- Pre-Norm (modern LLMs): \(x + \text{Sublayer}(\text{Norm}(x))\). Much more stable for deep stacks.
8. The Decoder-Only Shift (GPT Style)
Why did we abandon the Encoder?
The Encoder is bidirectional (each token attends to future tokens). This is great for understanding (BERT) but unsuitable for generation: to complete text, we must predict the next token from the past alone.
The GPT (Generative Pre-trained Transformer) family proved that if you scale up a Decoder-only model and train it on enough data, it learns to "understand" just as well as an Encoder, while retaining the ability to generate.
9. Full Implementation (NanoGPT Style)
Here is a complete, minimal, clean implementation of a modern Decoder-only Transformer block in PyTorch.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Key, Query, Value projections combined
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # Causal mask (ensure we don't look into the future)
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                          .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # Batch, Time, Channels
        # Calculate query, key, values
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape for multi-head attention
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Attention scores
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        # Weighted sum
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
        self.act = nn.GELU()  # Or SwiGLU

    def forward(self, x):
        x = self.c_fc(x)
        x = self.act(x)
        x = self.c_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)  # Or RMSNorm
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # Pre-Norm architecture (Norm -> Sublayer -> Add)
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```
10. Complexity Analysis
The cost of the Transformer is dominated by the Attention mechanism.
| Operation | Complexity | Explanation |
|---|---|---|
| Attention (Time) | $$O(N^2 \cdot d)$$ | Quadratic with sequence length \(N\). This is why context windows were historically small (2k, 4k). |
| FFN (Time) | $$O(N \cdot d^2)$$ | Linear with sequence length, but quadratic with model dimension. |
| Memory (Training) | $$O(N^2)$$ | Storing the attention matrix for backpropagation. |
| Inference (KV Cache) | $$O(N \cdot d)$$ | We must store Key/Value vectors for all previous tokens to avoid recomputing them. |
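As a back-of-the-envelope instance of the KV-cache row, assuming an illustrative 7B-class configuration (all numbers are assumptions, not measurements):

```python
# Rough KV-cache size for one sequence, assumed 7B-class config:
n_layers, n_heads, head_dim = 32, 32, 128   # d_model = n_heads * head_dim = 4096
seq_len, bytes_per_val = 4096, 2            # fp16 values

# 2 tensors (K and V) per layer, each storing n_heads * head_dim values per token
cache_bytes = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_val
print(f"{cache_bytes / 2**30:.1f} GiB")     # 2.0 GiB for this configuration
```

The linear growth in \(N\) is exactly why long-context inference is memory-bound, and why techniques like grouped-query attention shrink the number of cached heads.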
11. References
- Vaswani, A., et al. (2017). "Attention Is All You Need". NeurIPS.
- Bahdanau, D., et al. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate". ICLR.
- Su, J., et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864.
- Shazeer, N. (2020). "GLU Variants Improve Transformer". arXiv:2002.05202.