What Is Quantization?
Quantization is a model optimization technique that reduces the numerical precision of a model's weights and activations, typically from 32-bit floating point (FP32) to 16-bit (FP16), 8-bit integers (INT8), or even 4-bit (INT4). This dramatically reduces model size and speeds up inference, often with minimal impact on accuracy.
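The core idea can be sketched in a few lines of NumPy. This is a simplified illustration with a hypothetical weight tensor and a single symmetric scale; real quantizers add per-channel scales, zero points, clipping, and calibration:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in FP32 layer

# Symmetric INT8: map [-max|w|, +max|w|] onto the integer range [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)  # 1 byte per value

# At inference time, dequantize back to an approximation of the original values.
deq_weights = q_weights.astype(np.float32) * scale
print("max round-trip error:", np.abs(weights - deq_weights).max())
```

Each value now occupies one byte instead of four, and the worst-case rounding error is bounded by half the scale.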
Why Quantize?
| Precision | Memory per Parameter | Relative Speed | Model Quality |
|---|---|---|---|
| FP32 (full) | 4 bytes | 1x (baseline) | Best |
| FP16 (half) | 2 bytes | ~2x faster | Near-identical |
| INT8 | 1 byte | ~4x faster | Very close |
| INT4 | 0.5 bytes | ~8x faster | Slightly degraded |
| INT2 | 0.25 bytes | ~16x faster | Noticeably degraded |
A 70B parameter model at FP32 requires ~280 GB of memory. At INT4, it fits in ~35 GB — making it runnable on consumer hardware.
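The arithmetic behind those figures is simply parameters times bytes per parameter (using 1 GB = 10⁹ bytes for a back-of-envelope estimate):

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    # Weights only; the KV cache and activations need additional memory.
    return n_params * bytes_per_param / 1e9

print(model_memory_gb(70e9, 4.0))  # FP32: 280.0 GB
print(model_memory_gb(70e9, 0.5))  # INT4: 35.0 GB
```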
Quantization Methods
Post-Training Quantization (PTQ)
Quantize an already-trained model without additional training:
- Static — Calibrate quantization parameters using a small dataset
- Dynamic — Quantize weights ahead of time but compute activation quantization parameters on the fly at inference time
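The difference between the two approaches can be sketched in NumPy. This is illustrative only; real PTQ tools also handle per-channel scales, zero points, and operator fusion:

```python
import numpy as np

rng = np.random.default_rng(1)

def int8_scale(x: np.ndarray) -> float:
    return float(np.abs(x).max()) / 127.0

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Static: fix the activation scale once, using a small calibration dataset.
calibration = [rng.standard_normal((8, 16)).astype(np.float32) for _ in range(4)]
static_scale = max(int8_scale(batch) for batch in calibration)

# Dynamic: recompute the scale from each incoming activation tensor.
x = rng.standard_normal((8, 16)).astype(np.float32)
dynamic_scale = int8_scale(x)

q_static = quantize(x, static_scale)
q_dynamic = quantize(x, dynamic_scale)
```

Static quantization avoids per-input overhead but can clip inputs that exceed the calibrated range; dynamic quantization adapts to each input at some runtime cost.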
Quantization-Aware Training (QAT)
Simulate quantization during training, allowing the model to adapt its weights to the lower precision.
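A common way to simulate this is "fake quantization": the forward pass snaps weights onto the low-precision grid while storing them in FP32, and the backward pass treats the rounding as the identity function (the straight-through estimator). A minimal NumPy sketch of the forward step:

```python
import numpy as np

def fake_quantize(w: np.ndarray, num_bits: int = 8) -> np.ndarray:
    # Snap FP32 values onto the num_bits integer grid, but keep FP32 storage
    # so the optimizer can continue updating high-precision "shadow" weights.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return (np.round(w / scale) * scale).astype(np.float32)

rng = np.random.default_rng(2)
w = rng.standard_normal((3, 3)).astype(np.float32)
w_q = fake_quantize(w)  # used in the forward pass during training
# Backward pass (not shown): gradients flow through as if w_q == w.
```

Because the model trains against the rounded values, it learns weights that remain accurate after the real quantization is applied at deployment.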
Popular Quantization Formats
| Format | Description | Typical Use |
|---|---|---|
| GPTQ | GPU-optimized post-training quantization | Fast GPU inference |
| GGUF | CPU-optimized format (llama.cpp) | Local CPU inference |
| AWQ | Activation-aware weight quantization | High quality at 4-bit |
| bitsandbytes | On-the-fly 8-bit/4-bit quantization library | Fine-tuning quantized models (QLoRA) |
| ONNX INT8 | Cross-platform inference format | Production deployment |
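As one example of how these formats are used in practice, the bitsandbytes path is typically driven through Hugging Face transformers. The snippet below is an illustrative configuration sketch, not a runnable demo: the model ID is a placeholder, and actually loading it requires a GPU, the bitsandbytes package, and downloading the weights:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization with FP16 compute, as used in QLoRA-style setups.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",  # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```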
Impact on Model Quality
Quantization effects vary by model and task:
- FP16: Virtually no quality loss — standard for most deployments
- INT8: <1% accuracy drop in most benchmarks — excellent tradeoff
- INT4: 1-3% accuracy drop — acceptable for many use cases
- INT2: Significant quality loss — only for specific use cases
Applications
- Local LLM Deployment — Run large models on personal hardware
- Edge AI — Deploy models on IoT devices, phones, and embedded systems
- Cost Reduction — Smaller models require less GPU memory and compute
- Faster Inference — Lower precision means faster matrix operations
- Democratization — Makes state-of-the-art models accessible on consumer hardware
Quantization and ELMs
AsterMind's ELMs are inherently lightweight — a single hidden layer with fixed random weights doesn't require the massive parameter counts that make quantization necessary for deep models. ELMs achieve real-time inference on standard CPUs without any compression techniques.