# What Is Model Distillation?
Model distillation (or knowledge distillation) is a technique for creating smaller, faster AI models by training a compact "student" model to replicate the behavior of a larger, more capable "teacher" model. The student learns not just the correct answers but also the teacher's confidence patterns across all possible outputs — capturing "dark knowledge" that isn't available from labels alone.
## How Distillation Works

### The Process
- Train a Teacher — Start with a large, high-performance model (e.g., GPT-4, a 70B parameter model)
- Generate Soft Labels — Run training data through the teacher to get probability distributions over all outputs (not just the top prediction)
- Train the Student — A much smaller model learns to match the teacher's output distributions
- Deploy the Student — The compact model serves predictions in production
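The steps above can be sketched as a single training objective: the student minimizes a weighted sum of a soft-label term (matching the teacher's temperature-softened distribution) and a standard cross-entropy term on the hard label. This is a minimal, framework-free illustration with made-up logits; a real pipeline would use a library such as PyTorch and batched data.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-label KL term and hard-label cross-entropy."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    # KL(teacher || student); scaled by T^2 so gradient magnitudes
    # stay comparable across temperatures
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
    soft_term = kl * temperature ** 2
    # Standard cross-entropy against the hard label at T = 1
    hard_term = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for a 4-class problem (cat, dog, fox, wolf)
teacher = [4.0, 1.5, 0.5, -1.0]
student = [3.0, 1.0, 0.8, -0.5]
loss = distillation_loss(teacher, student, hard_label=0)
print(round(loss, 4))
```

The `alpha` and `temperature` values here are illustrative; in practice both are tuned, and the hard-label term is sometimes dropped entirely when only teacher outputs are available.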
### Why Soft Labels Matter
A teacher classifying an image might output: "cat: 0.85, dog: 0.10, fox: 0.04, wolf: 0.01". The hard label is just "cat," but the soft distribution reveals that dogs look somewhat similar to this image, foxes less so. This relational knowledge helps the student learn richer representations than training on hard labels alone.
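The effect is easy to reproduce with a toy softmax: raising the temperature flattens the distribution, surfacing the relational "dark knowledge" that a hard label hides. The logits below are illustrative, not from any real classifier.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature -> flatter distribution -> relative similarities
    # between non-top classes become visible to the student
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.9, 2.0, 0.6]  # cat, dog, fox, wolf (made-up values)
for T in (1.0, 4.0):
    probs = softmax(logits, T)
    print(T, [round(p, 3) for p in probs])
```

At T = 1 the distribution is sharply peaked on "cat"; at T = 4 the gap narrows, so the student receives a much stronger signal about how dog-like and fox-like the input is.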
## Distillation Approaches
| Method | Description | Use Case |
|---|---|---|
| Response Distillation | Student mimics teacher's output probabilities | General purpose |
| Feature Distillation | Student mimics teacher's internal representations | When internal features matter |
| Relation Distillation | Student learns relationships between samples | Structured data |
| Self-Distillation | Model distills knowledge from its own deeper layers | Single-model optimization |
| LLM Distillation | Small LLM trained on large LLM's outputs | Creating specialized small LLMs |
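To make the second row concrete: feature distillation adds a loss that pulls the student's intermediate activations toward the teacher's. Since the student's hidden dimension is usually smaller, a learned projection maps it into the teacher's space first. Below is a minimal sketch with hypothetical vectors and a hand-written projection matrix; in practice the projection is a trained linear layer.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def project(student_hidden, weights):
    """Map the student's smaller hidden vector into the teacher's dimension."""
    return [sum(w * h for w, h in zip(row, student_hidden)) for row in weights]

teacher_hidden = [0.9, -0.2, 0.4, 0.1]   # teacher layer output (dim 4)
student_hidden = [0.7, 0.3]              # student layer output (dim 2)
# Illustrative 4x2 projection matrix (would be learned during training)
W = [[1.0, 0.1], [0.0, -0.5], [0.5, 0.2], [0.1, 0.0]]

feature_loss = mse(project(student_hidden, W), teacher_hidden)
print(round(feature_loss, 4))
```

This feature term is typically added to the response-distillation loss rather than used alone, and which layers to match is itself a design choice.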
## Benefits
| Aspect | Teacher Model | Distilled Student |
|---|---|---|
| Size | 70B+ parameters | 1-8B parameters |
| Inference Speed | Slow | 5-50x faster |
| Memory | 100+ GB | 2-16 GB |
| Hardware | Multiple GPUs | Single GPU or CPU |
| Cost per Query | High | 10-50x lower |
| Edge Deployment | Impractical | Feasible |
## Real-World Examples
- GPT-4 → GPT-4o-mini — Smaller model widely believed to be trained on outputs of larger GPT-4-class models
- BERT → DistilBERT — 40% smaller, 60% faster, retaining 97% of performance
- LLaMA 70B → LLaMA 8B — Smaller variants informed by larger model insights
- Whisper Large → Distil-Whisper — Compact speech model distilled from Whisper for faster, edge-friendly inference
## When to Use Distillation
- Edge Deployment — Models must fit on devices with limited memory and compute
- Cost Optimization — Reducing inference costs at scale
- Latency Requirements — Applications needing real-time responses
- Proprietary Models — Creating deployable models from API-only teachers
- Specialized Tasks — When you need a focused model rather than a general one