What Is a Transformer?
A Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrent neural networks (RNNs) as the dominant architecture for sequence processing tasks, and it now forms the backbone of virtually all modern large language models, including GPT, BERT, LLaMA, and Claude.
The Core Innovation: Self-Attention
The key breakthrough of transformers is the self-attention mechanism (also called scaled dot-product attention). Unlike RNNs, which process sequences one token at a time, self-attention allows every token in a sequence to attend to every other token simultaneously.
How Self-Attention Works
For each token in the input:
- Three vectors are computed: Query (Q), Key (K), and Value (V)
- Attention scores are calculated as the dot product of Q with all K vectors
- Scores are divided by the square root of the key dimension (the "scaled" in scaled dot-product attention) and passed through a softmax to get attention weights
- The output is the weighted sum of V vectors
This produces a context-aware representation where the meaning of each word is influenced by every other word in the sequence.
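The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the random Q, K, and V matrices stand in for the learned linear projections of the input embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights                            # weighted sum of value vectors

# Toy example: 3 tokens, d_k = 4 (random matrices stand in for learned projections)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1 and says how much the corresponding token attends to every token in the sequence; `out` is the resulting context-aware representation.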
Multi-Head Attention
Transformers use multiple attention heads in parallel, each learning different types of relationships (syntactic, semantic, positional). Their outputs are concatenated and linearly projected, providing a richer representation than single-head attention.
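A hedged sketch of the split-attend-concatenate pattern described above, again with random matrices in place of learned weights. Each head works on its own slice of the model dimension, and the concatenated head outputs are passed through a final projection `W_o`.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model into heads, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)           # this head's slice of d_model
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                 # per-head softmax
        head_outputs.append(w @ V[:, sl])                  # (seq_len, d_head)
    return np.concatenate(head_outputs, axis=-1) @ W_o     # concat heads, then project

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Because each head sees only a `d_model / num_heads` slice, the total cost is comparable to single-head attention while allowing the heads to specialize in different relationships.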
Transformer Architecture
The original transformer consists of two main components:
Encoder
- Processes the input sequence
- Produces contextual representations
- Used in models like BERT (bidirectional understanding)
Decoder
- Generates the output sequence token by token
- Uses masked self-attention (can only attend to previous tokens)
- Used in models like GPT (autoregressive generation)
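The decoder's masked self-attention can be illustrated by setting the scores for all future positions to negative infinity before the softmax, so they receive zero weight. A minimal sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so token i can only attend to tokens 0..i."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)             # -inf -> weight 0 after softmax
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# With uniform (zero) scores, the mask alone determines the pattern:
w = causal_attention_weights(np.zeros((4, 4)))
# Row 0 attends only to token 0; row 3 spreads evenly over all four tokens.
```

This triangular masking is what lets a decoder be trained on whole sequences in parallel while still generating autoregressively at inference time.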
Encoder-Decoder
- The full original architecture
- The encoder processes input, the decoder generates output using cross-attention to encoder representations
- Used in models like T5 and in the original machine translation setting
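Cross-attention differs from self-attention only in where Q, K, and V come from: queries are computed from the decoder's states, while keys and values come from the encoder's output. A sketch under that assumption, with random matrices standing in for learned projections:

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    """Decoder positions query the encoder's representations of the input."""
    Q = decoder_states @ W_q                   # queries from the decoder
    K = encoder_states @ W_k                   # keys from the encoder output
    V = encoder_states @ W_v                   # values from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (dec_len, enc_len)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                               # each decoder position summarizes the input

rng = np.random.default_rng(2)
d = 4
enc = rng.standard_normal((6, d))              # encoder output: 6 input tokens
dec = rng.standard_normal((3, d))              # decoder states: 3 tokens generated so far
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)
```

Note the attention matrix is rectangular here: one row per decoder position, one column per encoder position, so input and output sequences may have different lengths.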
Why Transformers Replaced RNNs
| Feature | RNNs/LSTMs | Transformers |
|---|---|---|
| Parallelism | Sequential processing | Fully parallel |
| Long-range dependencies | Struggle with distant tokens | Direct attention to any position |
| Training speed | Slow (sequential bottleneck) | Fast (GPU-optimized parallel ops) |
| Scalability | Diminishing returns at scale | Performance scales with model size |
| Memory | Fixed hidden state | Flexible context window |
Positional Encoding
Since transformers process all tokens simultaneously (no inherent notion of order), they use positional encoding — mathematical signals added to input embeddings that encode each token's position in the sequence. This allows the model to understand word order without sequential processing.
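The original paper's sinusoidal scheme is one common choice of positional encoding: each position gets a vector of sines and cosines at geometrically spaced frequencies. A minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same).
    Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    rates = 10000 ** (np.arange(0, d_model, 2) / d_model) # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / rates)               # even indices: sine
    pe[:, 1::2] = np.cos(positions / rates)               # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
# These vectors are added to the token embeddings before the first attention layer.
```

Because each dimension oscillates at a different wavelength, every position gets a distinct signature, and relative offsets correspond to predictable phase shifts. Many modern models learn positional embeddings or use relative schemes instead, but the idea is the same: inject order information into an otherwise order-agnostic architecture.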
Transformers Beyond NLP
While originally designed for language, transformers now dominate across domains:
- Computer Vision — Vision Transformers (ViT) process images as sequences of patches
- Audio — Whisper and other speech models use transformer architectures
- Protein Folding — AlphaFold uses transformer-based attention for structure prediction
- Robotics — Decision Transformers model control as sequence prediction
- Multimodal AI — Models like GPT-4V process both text and images