What Is the Attention Mechanism?
The attention mechanism is a neural network technique that allows models to dynamically focus on the most relevant parts of their input when producing each element of the output. Instead of compressing an entire input sequence into a single fixed-size representation, attention lets the model "look back" at all input positions and weigh their importance.
Why Attention Was Needed
Before attention, sequence-to-sequence models built from RNNs and LSTMs encoded the entire input sequence into a single fixed-length vector. This created a bottleneck: long sequences lost information as they were compressed. Attention solved this by letting the decoder access every encoder position directly.
How Attention Works
Scaled Dot-Product Attention
The most common form of attention (used in Transformers) works with three components:
- Query (Q) — What the model is looking for
- Key (K) — What each position offers
- Value (V) — The actual information at each position
The attention calculation:
- Compute similarity scores Q · K^T (the dot product of each query with all keys)
- Divide by √d_k (the square root of the key dimension) to keep the dot products from growing large and saturating the softmax
- Apply softmax to turn the scores into attention weights (probabilities that sum to 1)
- Multiply the weights by V to get the weighted output
In one line: Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
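The steps above can be sketched in a few lines of NumPy; the function name and toy shapes here are illustrative, not from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v).
    Returns the attended output (seq_q, d_v) and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity scores, scaled
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights

# Toy example: 3 query positions attending over 4 key/value positions
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)        # (3, 8)
print(w.sum(axis=-1))   # each row of weights sums to 1
```

Each row of `w` is one query's distribution over the key positions, which is exactly what gets visualized in attention heatmaps.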
Multi-Head Attention
Instead of performing a single attention operation, transformers run multiple attention heads in parallel, each learning to focus on different types of relationships:
- One head might capture syntactic relationships (subject-verb)
- Another captures semantic relationships (synonyms, antonyms)
- Another captures positional patterns (nearby words)
The outputs of all heads are concatenated and linearly projected.
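A minimal sketch of the split-attend-concatenate-project pattern, again in NumPy; the weight-matrix names (`Wq`, `Wk`, `Wv`, `Wo`) and dimensions are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq, d_model); each W*: (d_model, d_model)."""
    seq, d_model = X.shape
    d_head = d_model // num_heads

    def project_and_split(W):
        # Project, then split into heads: (num_heads, seq, d_head)
        return (X @ W).reshape(seq, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project_and_split(Wq), project_and_split(Wk), project_and_split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scores
    out = softmax(scores) @ V                             # (heads, seq, d_head)
    concat = out.transpose(1, 0, 2).reshape(seq, d_model) # concatenate heads
    return concat @ Wo                                    # final linear projection

rng = np.random.default_rng(1)
d_model, heads, seq = 16, 4, 5
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
Y = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=heads)
print(Y.shape)  # (5, 16)
```

Note that the heads share no parameters: each operates on its own d_model/num_heads slice, which is what lets them specialize in different relationship types.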
Types of Attention
| Type | Description | Used In |
|---|---|---|
| Self-Attention | Each position attends to all positions in the same sequence | Transformer encoder, GPT |
| Cross-Attention | Positions in one sequence attend to positions in another | Encoder-decoder models, T5 |
| Causal (Masked) Attention | Each position can only attend to previous positions | GPT, autoregressive models |
| Local Attention | Attention restricted to a window around each position | Longformer, efficient transformers |
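Causal attention from the table is implemented by masking the score matrix before the softmax. A small illustration (uniform scores are used only so the resulting weights are easy to read):

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores with a causal mask: position i attends only to j <= i."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)          # -inf -> zero weight after softmax
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores, for illustration
w = causal_attention_weights(scores)
print(np.round(w, 2))
# Row i spreads weight uniformly over the first i+1 positions;
# all entries above the diagonal are exactly 0.
```

Local attention works the same way, except the mask also zeroes out positions outside a fixed window around each query.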
Attention Beyond Language
- Computer Vision — Vision Transformers (ViT) apply attention to image patches
- Speech — Whisper uses attention for speech-to-text
- Protein Science — AlphaFold uses attention for structure prediction
- Music — Attention models compose and analyze musical sequences
Attention vs. Recurrence
| Feature | RNN/LSTM | Attention |
|---|---|---|
| Parallelism | Sequential (slow) | Fully parallel (fast) |
| Long-Range Dependencies | Difficult (vanishing gradients) | Direct access to any position |
| Computational Complexity | O(n) per step | O(n²) per layer |
| Interpretability | Hidden states are opaque | Attention weights are visualizable |
Limitations
- Quadratic Complexity — Attention scales as O(n²) with sequence length, making very long inputs expensive
- Memory Usage — Storing attention matrices for long sequences requires significant accelerator memory
- Approximations — Efficient attention variants (Flash Attention, linear attention) trade exactness for speed
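A back-of-envelope calculation makes the quadratic cost concrete; the sequence length and precision below are hypothetical, chosen only for illustration:

```python
# Memory for a single materialized attention matrix (one head, one layer).
seq_len = 32_000          # hypothetical long-context sequence length
bytes_per_entry = 2       # float16
matrix_bytes = seq_len ** 2 * bytes_per_entry
print(f"{matrix_bytes / 2**30:.1f} GiB per head")  # ~1.9 GiB
```

Multiply by the number of heads and layers and the motivation for approaches like Flash Attention, which avoids materializing the full matrix, becomes clear.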