What Is a Transformer?
A Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrent neural networks (RNNs) as the dominant architecture for sequence processing tasks, and it now forms the backbone of virtually all modern large language models, including GPT, BERT, LLaMA, and Claude.
The Core Innovation: Self-Attention
The key breakthrough of transformers is the self-attention mechanism (also called scaled dot-product attention). Unlike RNNs, which process sequences one token at a time, self-attention allows every token in a sequence to attend to every other token simultaneously.
How Self-Attention Works
For each token in the input:
- Three vectors are computed: Query (Q), Key (K), and Value (V)
- Attention scores are calculated as the dot product of Q with all K vectors
- Scores are divided by the square root of the key dimension (the "scaled" in scaled dot-product attention) and passed through a softmax to get attention weights
- The output is the weighted sum of V vectors
This produces a context-aware representation where the meaning of each word is influenced by every other word in the sequence.
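The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the random Q, K, and V matrices stand in for the learned linear projections of the input embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights                            # weighted sum of value vectors

# Toy example: 3 tokens, d_k = 4 (random matrices stand in for learned projections)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1 and says how much the corresponding token attends to every token in the sequence; `out` is the resulting context-aware representation.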
Multi-Head Attention
Transformers use multiple attention heads in parallel, each learning different types of relationships (syntactic, semantic, positional). Their outputs are concatenated and linearly projected, providing a richer representation than single-head attention.
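A hedged sketch of the split-attend-concatenate pattern described above, again with random matrices in place of learned weights. Each head works on its own slice of the model dimension, and the concatenated head outputs are passed through a final projection `W_o`.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model into heads, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)           # this head's slice of d_model
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                 # per-head softmax
        head_outputs.append(w @ V[:, sl])                  # (seq_len, d_head)
    return np.concatenate(head_outputs, axis=-1) @ W_o     # concat heads, then project

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Because each head sees only a `d_model / num_heads` slice, the total cost is comparable to single-head attention while allowing the heads to specialize in different relationships.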
Transformer Architecture
The original transformer consists of two main components:
Encoder
- Processes the input sequence
- Produces contextual representations
- Used in models like BERT (bidirectional understanding)
Decoder
- Generates the output sequence token by token
- Uses masked self-attention (can only attend to previous tokens)
- Used in models like GPT (autoregressive generation)
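The decoder's masked self-attention can be illustrated by setting the scores for all future positions to negative infinity before the softmax, so they receive zero weight. A minimal sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so token i can only attend to tokens 0..i."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)             # -inf -> weight 0 after softmax
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# With uniform (zero) scores, the mask alone determines the pattern:
w = causal_attention_weights(np.zeros((4, 4)))
# Row 0 attends only to token 0; row 3 spreads evenly over all four tokens.
```

This triangular masking is what lets a decoder be trained on whole sequences in parallel while still generating autoregressively at inference time.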
Encoder-Decoder
- The full original architecture
- The encoder processes input, the decoder generates output using cross-attention to encoder representations
- Used in models like T5 and in the original machine translation setting
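Cross-attention differs from self-attention only in where Q, K, and V come from: queries are computed from the decoder's states, while keys and values come from the encoder's output. A sketch under that assumption, with random matrices standing in for learned projections:

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    """Decoder positions query the encoder's representations of the input."""
    Q = decoder_states @ W_q                   # queries from the decoder
    K = encoder_states @ W_k                   # keys from the encoder output
    V = encoder_states @ W_v                   # values from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (dec_len, enc_len)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                               # each decoder position summarizes the input

rng = np.random.default_rng(2)
d = 4
enc = rng.standard_normal((6, d))              # encoder output: 6 input tokens
dec = rng.standard_normal((3, d))              # decoder states: 3 tokens generated so far
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)
```

Note the attention matrix is rectangular here: one row per decoder position, one column per encoder position, so input and output sequences may have different lengths.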
Why Transformers Replaced RNNs
| Feature | RNNs/LSTMs | Transformers |
|---|---|---|
| Parallelism | Sequential processing | Fully parallel |
| Long-range dependencies | Struggle with distant tokens | Direct attention to any position |
| Training speed | Slow (sequential bottleneck) | Fast (GPU-optimized parallel ops) |
| Scalability | Diminishing returns at scale | Performance scales with model size |
| Memory | Fixed hidden state | Flexible context window |
Positional Encoding
Since transformers process all tokens simultaneously (no inherent notion of order), they use positional encoding — mathematical signals added to input embeddings that encode each token's position in the sequence. This allows the model to understand word order without sequential processing.
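The original paper's sinusoidal scheme is one common choice of positional encoding: each position gets a vector of sines and cosines at geometrically spaced frequencies. A minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same).
    Assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    rates = 10000 ** (np.arange(0, d_model, 2) / d_model) # one frequency per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / rates)               # even indices: sine
    pe[:, 1::2] = np.cos(positions / rates)               # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
# These vectors are added to the token embeddings before the first attention layer.
```

Because each dimension oscillates at a different wavelength, every position gets a distinct signature, and relative offsets correspond to predictable phase shifts. Many modern models learn positional embeddings or use relative schemes instead, but the idea is the same: inject order information into an otherwise order-agnostic architecture.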
Transformers Beyond NLP
While originally designed for language, transformers now dominate across domains:
- Computer Vision — Vision Transformers (ViT) process images as sequences of patches
- Audio — Whisper and other speech models use transformer architectures
- Protein Folding — AlphaFold uses transformer-based attention for structure prediction
- Robotics — Decision Transformers model control as sequence prediction
- Multimodal AI — Models like GPT-4V process both text and images