

    What Is a Transformer?

    AsterMind Team

    A Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrent neural networks (RNNs) as the dominant architecture for sequence processing tasks, and it now forms the backbone of virtually all modern large language models, including GPT, BERT, LLaMA, and Claude.

    The Core Innovation: Self-Attention

    The key breakthrough of transformers is the self-attention mechanism (also called scaled dot-product attention). Unlike RNNs, which process sequences one token at a time, self-attention allows every token in a sequence to attend to every other token simultaneously.

    How Self-Attention Works

    For each token in the input:

    1. Three vectors are computed: Query (Q), Key (K), and Value (V)
    2. Attention scores are calculated as the dot product of Q with all K vectors
    3. Scores are scaled and passed through a softmax to get attention weights
    4. The output is the weighted sum of V vectors

    This produces a context-aware representation where the meaning of each word is influenced by every other word in the sequence.
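The four steps above can be sketched directly in NumPy. This is a minimal illustration, not the paper's implementation: the weight matrices are random and the dimensions (5 tokens, embedding size 16) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Step 1: project each token embedding into Query, Key, and Value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Steps 2-3: dot-product scores, scaled by sqrt(d_k), then softmax
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Step 4: each output row is a weighted sum of the Value vectors
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                        # 5 tokens, embedding dim 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                    # (5, 16): one context vector per token
```

Note that the output has the same shape as the input: attention replaces each token's vector with a context-aware mixture of all the Value vectors.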

    Multi-Head Attention

    Transformers use multiple attention heads in parallel, each learning different types of relationships (syntactic, semantic, positional). Their outputs are concatenated and linearly projected, providing a richer representation than single-head attention.
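A rough sketch of how multiple heads are run in parallel: the model dimension is split across heads, each head attends independently, and the results are concatenated and projected. Sizes and weights here are illustrative placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split the model dimension into n_heads independent heads
    def split(M):
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)   # (heads, seq, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                                    # (heads, seq, d_head)
    # Concatenate the head outputs and apply the final linear projection
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
d_model, n_heads = 16, 4
X = rng.normal(size=(6, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)                                                    # (6, 16)
```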

    Transformer Architecture

    The original transformer consists of two main components:

    Encoder

    • Processes the input sequence
    • Produces contextual representations
    • Used in models like BERT (bidirectional understanding)

    Decoder

    • Generates the output sequence token by token
    • Uses masked self-attention (can only attend to previous tokens)
    • Used in models like GPT (autoregressive generation)
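The masking in decoder self-attention amounts to setting the scores for future positions to negative infinity before the softmax, so their attention weights become zero. A small sketch with dummy (all-zero) scores:

```python
import numpy as np

seq = 5
# Causal mask: position i may attend only to positions j <= i
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)

scores = np.zeros((seq, seq))            # stand-in for real Q·K scores
scores[mask] = -np.inf                   # block attention to future tokens
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(weights[0])                        # first token attends only to itself: [1, 0, 0, 0, 0]
```

Because the mask is applied before the softmax, each row still sums to 1 over the positions the token is allowed to see.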

    Encoder-Decoder

    • The full original architecture
    • The encoder processes input, the decoder generates output using cross-attention to encoder representations
    • Used in models like T5 and original machine translation systems
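Cross-attention differs from self-attention only in where Q, K, and V come from: the Queries are computed from the decoder states, while the Keys and Values come from the encoder output. A minimal sketch with random placeholder weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_X, enc_out, Wq, Wk, Wv):
    # Queries come from the decoder; Keys and Values from the encoder output
    Q = dec_X @ Wq
    K, V = enc_out @ Wk, enc_out @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V

rng = np.random.default_rng(2)
d = 16
enc_out = rng.normal(size=(7, d))        # 7 source tokens (encoder output)
dec_X = rng.normal(size=(4, d))          # 4 target tokens generated so far
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(dec_X, enc_out, Wq, Wk, Wv)
print(out.shape)                         # (4, 16): one context vector per decoder position
```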

    Why Transformers Replaced RNNs

    Feature                 | RNNs/LSTMs                   | Transformers
    ------------------------|------------------------------|-----------------------------------
    Parallelism             | Sequential processing        | Fully parallel
    Long-range dependencies | Struggle with distant tokens | Direct attention to any position
    Training speed          | Slow (sequential bottleneck) | Fast (GPU-optimized parallel ops)
    Scalability             | Diminishing returns at scale | Performance scales with model size
    Memory                  | Fixed hidden state           | Flexible context window

    Positional Encoding

    Since transformers process all tokens simultaneously (no inherent notion of order), they use positional encoding — mathematical signals added to input embeddings that encode each token's position in the sequence. This allows the model to understand word order without sequential processing.
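The original paper's sinusoidal encoding can be computed in a few lines: even dimensions get a sine, odd dimensions a cosine, at wavelengths that grow geometrically with the dimension index.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(10, 16)
print(pe.shape)                  # (10, 16)
# The encoding is simply added to the token embeddings before the first layer
```

Because each position gets a unique pattern of phases, the model can recover both absolute position and relative distances between tokens.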

    Transformers Beyond NLP

    While originally designed for language, transformers now dominate across domains:

    • Computer Vision — Vision Transformers (ViT) process images as sequences of patches
    • Audio — Whisper and other speech models use transformer architectures
    • Protein Folding — AlphaFold uses transformer-based attention for structure prediction
    • Robotics — Decision Transformers model control as sequence prediction
    • Multimodal AI — Models like GPT-4V process both text and images

    Further Reading