

    What Is Model Distillation?

    AsterMind Team

    Model distillation (or knowledge distillation) is a technique for creating smaller, faster AI models by training a compact "student" model to replicate the behavior of a larger, more capable "teacher" model. The student learns not just the correct answers but also the teacher's confidence patterns across all possible outputs — capturing "dark knowledge" that isn't available from labels alone.

    How Distillation Works

    The Process

    1. Train a Teacher — Start with a large, high-performance model (e.g., GPT-4, a 70B parameter model)
    2. Generate Soft Labels — Run training data through the teacher to get probability distributions over all outputs (not just the top prediction)
    3. Train the Student — A much smaller model learns to match the teacher's output distributions
    4. Deploy the Student — The compact model serves predictions in production
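The training objective behind steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, not a production recipe: the function names are ours, and a real setup would use a deep-learning framework and backpropagate through the student. The core idea is that the student is penalized by the KL divergence between the teacher's temperature-softened output distribution and its own.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    Scaled by temperature**2, the usual convention so gradient magnitudes
    stay comparable as the temperature changes.
    """
    p = softmax(teacher_logits, temperature)  # teacher's "soft labels"
    q = softmax(student_logits, temperature)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student whose logits roughly track the teacher's incurs a small loss;
# a mismatched student incurs a much larger one.
teacher = [4.0, 1.5, 0.5, -1.0]
close_student = [3.8, 1.6, 0.4, -0.9]
far_student = [0.0, 3.0, 1.0, 2.0]
assert distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student)
```

In practice this soft-label term is usually combined with a standard cross-entropy loss on the hard labels, weighted by a mixing coefficient.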

    Why Soft Labels Matter

    A teacher classifying an image might output: "cat: 0.85, dog: 0.10, fox: 0.04, wolf: 0.01". The hard label is just "cat," but the soft distribution reveals that dogs look somewhat similar to this image, foxes less so. This relational knowledge helps the student learn richer representations than training on hard labels alone.
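A quick way to see why temperature matters: raising it flattens the teacher's distribution, so the near-miss classes carry proportionally more signal for the student. The logits below are hypothetical, chosen so that at temperature 1 they roughly reproduce the cat/dog/fox/wolf distribution above.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for [cat, dog, fox, wolf]
logits = [5.0, 2.9, 2.0, 0.6]

sharp = softmax(logits, temperature=1.0)  # ~[0.84, 0.10, 0.04, 0.01]
soft = softmax(logits, temperature=4.0)   # flatter: dog and fox gain weight

# The "dark knowledge" (dog looks similar, fox less so) is far more
# visible in the softened distribution than in the sharp one.
assert soft[1] > sharp[1] and soft[2] > sharp[2]
```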

    Distillation Approaches

    | Method | Description | Use Case |
    |---|---|---|
    | Response Distillation | Student mimics teacher's output probabilities | General purpose |
    | Feature Distillation | Student mimics teacher's internal representations | When internal features matter |
    | Relation Distillation | Student learns relationships between samples | Structured data |
    | Self-Distillation | Model distills knowledge from its own deeper layers | Single-model optimization |
    | LLM Distillation | Small LLM trained on large LLM's outputs | Creating specialized small LLMs |
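To make the feature-distillation row concrete, here is a toy sketch of its core loss. Everything here is illustrative: since the student's hidden layers are typically narrower than the teacher's, a learned linear projection lifts the student's features into the teacher's space before comparing them with mean squared error. Real implementations learn the projection jointly with the student.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def project(features, weights):
    """Lift a student feature vector into the teacher's (wider) feature
    space with a linear projection -- here just fixed example weights."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

# Hypothetical hidden activations: teacher layer is 4-dim, student is 2-dim.
teacher_features = [0.9, -0.2, 0.4, 0.1]
student_features = [0.8, 0.3]
projection = [[1.0, 0.1],   # 4x2 matrix mapping student dims to teacher dims
              [-0.2, 0.2],
              [0.5, 0.0],
              [0.0, 0.3]]

feature_loss = mse(teacher_features, project(student_features, projection))
```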

    Benefits

    | Aspect | Teacher Model | Distilled Student |
    |---|---|---|
    | Size | 70B+ parameters | 1-8B parameters |
    | Inference Speed | Slow | 5-50x faster |
    | Memory | 100+ GB | 2-16 GB |
    | Hardware | Multiple GPUs | Single GPU or CPU |
    | Cost per Query | High | 10-50x lower |
    | Edge Deployment | Impractical | Feasible |

    Real-World Examples

    • GPT-4 → GPT-4o-mini — Smaller model trained to approximate the larger model's capabilities
    • BERT → DistilBERT — 40% smaller, 60% faster, retaining 97% of performance
    • LLaMA 70B → LLaMA 8B — Smaller variants informed by larger model insights
    • Whisper Large → Whisper Small — Compact speech models for edge devices

    When to Use Distillation

    • Edge Deployment — Models must fit on devices with limited memory and compute
    • Cost Optimization — Reducing inference costs at scale
    • Latency Requirements — Applications needing real-time responses
    • Proprietary Models — Creating deployable models from API-only teachers
    • Specialized Tasks — When you need a focused model rather than a general one

    Further Reading