# What Is Model Distillation?
Model distillation (or knowledge distillation) is a technique for creating smaller, faster AI models by training a compact "student" model to replicate the behavior of a larger, more capable "teacher" model. The student learns not just the correct answers but also the teacher's confidence patterns across all possible outputs — capturing "dark knowledge" that isn't available from labels alone.
## How Distillation Works

### The Process
- Train a Teacher — Start with a large, high-performance model (e.g., GPT-4, a 70B parameter model)
- Generate Soft Labels — Run training data through the teacher to get probability distributions over all outputs (not just the top prediction)
- Train the Student — A much smaller model learns to match the teacher's output distributions
- Deploy the Student — The compact model serves predictions in production
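The steps above can be sketched as a single training objective: the student minimizes a weighted sum of a soft-label term (matching the teacher's temperature-softened distribution) and a standard cross-entropy term on the hard label. This is a minimal, framework-free illustration with made-up logits; a real pipeline would use a library such as PyTorch and batched data.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-label KL term and hard-label cross-entropy."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    # KL(teacher || student); scaled by T^2 so gradient magnitudes
    # stay comparable across temperatures
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
    soft_term = kl * temperature ** 2
    # Standard cross-entropy against the hard label at T = 1
    hard_term = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for a 4-class problem (cat, dog, fox, wolf)
teacher = [4.0, 1.5, 0.5, -1.0]
student = [3.0, 1.0, 0.8, -0.5]
loss = distillation_loss(teacher, student, hard_label=0)
print(round(loss, 4))
```

The `alpha` and `temperature` values here are illustrative; in practice both are tuned, and the hard-label term is sometimes dropped entirely when only teacher outputs are available.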
### Why Soft Labels Matter
A teacher classifying an image might output: "cat: 0.85, dog: 0.10, fox: 0.04, wolf: 0.01". The hard label is just "cat," but the soft distribution reveals that dogs look somewhat similar to this image, foxes less so. This relational knowledge helps the student learn richer representations than training on hard labels alone.
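The effect is easy to reproduce with a toy softmax: raising the temperature flattens the distribution, surfacing the relational "dark knowledge" that a hard label hides. The logits below are illustrative, not from any real classifier.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature -> flatter distribution -> relative similarities
    # between non-top classes become visible to the student
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.9, 2.0, 0.6]  # cat, dog, fox, wolf (made-up values)
for T in (1.0, 4.0):
    probs = softmax(logits, T)
    print(T, [round(p, 3) for p in probs])
```

At T = 1 the distribution is sharply peaked on "cat"; at T = 4 the gap narrows, so the student receives a much stronger signal about how dog-like and fox-like the input is.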
## Distillation Approaches
| Method | Description | Use Case |
|---|---|---|
| Response Distillation | Student mimics teacher's output probabilities | General purpose |
| Feature Distillation | Student mimics teacher's internal representations | When internal features matter |
| Relation Distillation | Student learns relationships between samples | Structured data |
| Self-Distillation | Model distills knowledge from its own deeper layers | Single-model optimization |
| LLM Distillation | Small LLM trained on large LLM's outputs | Creating specialized small LLMs |
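To make the second row concrete: feature distillation adds a loss that pulls the student's intermediate activations toward the teacher's. Since the student's hidden dimension is usually smaller, a learned projection maps it into the teacher's space first. Below is a minimal sketch with hypothetical vectors and a hand-written projection matrix; in practice the projection is a trained linear layer.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def project(student_hidden, weights):
    """Map the student's smaller hidden vector into the teacher's dimension."""
    return [sum(w * h for w, h in zip(row, student_hidden)) for row in weights]

teacher_hidden = [0.9, -0.2, 0.4, 0.1]   # teacher layer output (dim 4)
student_hidden = [0.7, 0.3]              # student layer output (dim 2)
# Illustrative 4x2 projection matrix (would be learned during training)
W = [[1.0, 0.1], [0.0, -0.5], [0.5, 0.2], [0.1, 0.0]]

feature_loss = mse(project(student_hidden, W), teacher_hidden)
print(round(feature_loss, 4))
```

This feature term is typically added to the response-distillation loss rather than used alone, and which layers to match is itself a design choice.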
## Benefits
| Aspect | Teacher Model | Distilled Student |
|---|---|---|
| Size | 70B+ parameters | 1-8B parameters |
| Inference Speed | Slow | 5-50x faster |
| Memory | 100+ GB | 2-16 GB |
| Hardware | Multiple GPUs | Single GPU or CPU |
| Cost per Query | High | 10-50x lower |
| Edge Deployment | Impractical | Feasible |
## Real-World Examples
- GPT-4 → GPT-4o-mini — Smaller model widely believed to be trained on outputs of larger GPT-4-class models
- BERT → DistilBERT — 40% smaller, 60% faster, retaining 97% of performance
- LLaMA 70B → LLaMA 8B — Smaller variants informed by larger model insights
- Whisper Large → Distil-Whisper — Compact speech model distilled from Whisper for faster, edge-friendly inference
## When to Use Distillation
- Edge Deployment — Models must fit on devices with limited memory and compute
- Cost Optimization — Reducing inference costs at scale
- Latency Requirements — Applications needing real-time responses
- Proprietary Models — Creating deployable models from API-only teachers
- Specialized Tasks — When you need a focused model rather than a general one