Optimize the size of AI/ML models
Description
Large AI/ML models require significant storage and consume more resources at inference time than optimized equivalents. Modern AI models, particularly large language models (LLMs), range from billions to hundreds of billions of parameters, which makes deployment challenging. Model optimization techniques can often reduce model size by 50-90% while maintaining acceptable accuracy, substantially lowering storage, memory, and compute requirements.
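As a rough illustration of where these savings come from, weight storage scales with parameter count times bits per parameter. The sketch below assumes a hypothetical 7B-parameter model and ignores quantization metadata (scales, zero points) and any layers kept at higher precision, so real files will be somewhat larger.

```python
def model_size_bytes(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in bytes, ignoring scales and other metadata."""
    return num_params * bits_per_param / 8


def size_reduction(bits_before: float, bits_after: float) -> float:
    """Fractional reduction in weight storage when changing precision."""
    return 1 - bits_after / bits_before


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model (assumption)
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        gb = model_size_bytes(params, bits) / 1e9
        print(f"{name}: {gb:.1f} GB, {size_reduction(32, bits):.0%} smaller than FP32")
```

Under these assumptions, FP16 halves the FP32 footprint and INT8 cuts it by three quarters, which matches the figures quoted for quantization in the Solution section.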
Solution
Apply modern model optimization strategies to reduce model size and improve inference efficiency. These techniques can be used individually or combined for maximum effect:
1. Quantization (Table Stakes for 2025)
Quantization reduces the numerical precision of model weights and activations, decreasing model size and speeding up inference:
Post-Training Quantization:
- INT8 quantization: Converts 32-bit floating point to 8-bit integers, reducing model size by ~75%
- FP16 (Half-precision): Uses 16-bit floats, reducing size by 50% with minimal accuracy loss
- Easy to apply without retraining, supported by most frameworks
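A minimal sketch of the idea behind post-training INT8 quantization, assuming symmetric per-tensor scaling. This is one of several possible schemes; production frameworks typically add calibration data and per-channel scales. The function names are illustrative and do not correspond to any framework API.

```python
from __future__ import annotations


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.27]  # toy weights (assumption)
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(codes, scale, worst)
```

Each weight is stored as one byte plus a single shared scale, and the reconstruction error per weight is bounded by half the scale.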
Quantization-Aware Training (QAT):
- Simulates quantization during training for better accuracy preservation
- Results in models that perform well when quantized
- More accurate than post-training quantization but requires retraining
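The core mechanism of QAT can be sketched with "fake quantization": the forward pass uses the quantized weight, while the gradient is applied to a latent full-precision weight (the straight-through estimator). The example below is a toy one-weight linear model in plain Python, with hypothetical names and hyperparameters; real QAT runs inside a training framework.

```python
def fake_quantize(w, scale, qmin=-127, qmax=127):
    """Quantize then dequantize, so the forward pass sees the rounding error."""
    code = max(qmin, min(qmax, round(w / scale)))
    return code * scale


def train_qat(xs, ys, scale, lr=0.05, steps=200):
    """Fit y = w * x while simulating quantization of w in the forward pass."""
    w = 0.0  # latent full-precision weight
    for _ in range(steps):
        w_q = fake_quantize(w, scale)
        # Gradient of mean squared error with respect to the quantized weight.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: apply that gradient to the latent weight.
        w -= lr * grad
    return w, fake_quantize(w, scale)


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [0.5 * x for x in xs]  # toy data generated with a true weight of 0.5
    latent, quantized = train_qat(xs, ys, scale=0.25)
    print(latent, quantized)
```

Because training optimizes the loss as seen through the quantizer, the latent weight settles wherever its quantized value fits the data, which is the property that lets QAT models hold accuracy after quantization.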
Advanced Quantization for LLMs:
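- GPTQ: Post-training weight quantization for large language models that uses approximate second-order information to minimize quantization error layer by layer
- AWQ (Activation-aware Weight Quantization): Uses activation statistics to identify the most salient weights and scales them to protect them from quantization error
- GGUF (successor to GGML): Model file format with built-in quantized types, used by llama.cpp and well suited to CPU inference
- These methods can reach 3-4 bits per weight with acceptable quality loss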
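Low-bit LLM formats commonly store weights in small groups, each with its own scale. The sketch below shows only that storage idea, using plain round-to-nearest quantization; it is not an implementation of GPTQ or AWQ, which choose the quantized values more carefully. The group size, the signed range [-7, 7], and the 16-bit scale are assumptions for illustration.

```python
def quantize_4bit_groups(weights, group_size=4):
    """Round-to-nearest 4-bit quantization with one scale per group of weights."""
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        max_abs = max(abs(w) for w in chunk)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # signed codes in [-7, 7]
        codes = [max(-7, min(7, round(w / scale))) for w in chunk]
        groups.append((scale, codes))
    return groups


def dequantize_groups(groups):
    """Rebuild approximate float weights from (scale, codes) pairs."""
    return [code * scale for scale, codes in groups for code in codes]


def effective_bits(group_size, weight_bits=4, scale_bits=16):
    """Average bits per weight once the per-group scale is counted."""
    return weight_bits + scale_bits / group_size


if __name__ == "__main__":
    weights = [0.7, -0.1, 0.3, 0.0, 70.0, -10.0, 30.0, 0.0]  # toy values
    groups = quantize_4bit_groups(weights, group_size=4)
    print(groups)
    print(effective_bits(128))
```

Per-group scales let groups with very different magnitudes share the same small code range, at the cost of a little extra storage per weight.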
2. Knowledge Distillation
Train a smaller "student" model to mimic a larger "teacher" model's behavior:
Process:
- Train large teacher model to high accuracy
- Use teacher's predictions (soft targets) to train smaller student model
- Student learns to approximate teacher's decision boundaries, not just labels
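The soft-target objective described above can be sketched as a weighted sum of a temperature-softened KL term and the usual cross-entropy on the true label. This follows the common formulation with a temperature-squared factor on the soft term; the default `temperature` and `alpha` values are assumptions, not recommendations.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # toy logits (assumption)
    student = [1.5, 1.2, 0.3]
    print(distillation_loss(student, teacher, label=0))
```

A higher temperature flattens the teacher's distribution, so the student also learns how the teacher ranks the incorrect classes rather than only which class wins.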
Benefits:
- Student models range from roughly 40% smaller (DistilBERT) to many times smaller than their teachers, depending on the task
- Can often preserve 95% or more of the teacher's accuracy on the target task
- Enables deployment on resource-constrained devices
Modern Examples:
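- DistilBERT (66M params), distilled from BERT-base (110M params), retains about 97% of BERT's language-understanding performance
- Microsoft Phi-3-mini (3.8B params), trained largely on filtered web data and synthetic data, performs competitively with much larger models
- TinyLlama (1.1B params) reuses the Llama 2 architecture and tokenizer; it was pretrained from scratch rather than distilled, so it illustrates compact model design more than distillation itself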
3. Pruning
Remove unnecessary weights or structures from trained models:
Unstructured Pruning:
- Remove individual weights with smallest magnitude
- Can achieve 80-90% sparsity with minimal accuracy loss
- Requires sparse computation kernels for speedup
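Magnitude-based unstructured pruning can be sketched in a few lines: rank weights by absolute value and zero out the smallest fraction. This is a one-shot illustration on a flat list of toy weights; in practice pruning is usually applied gradually and followed by fine-tuning.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0 <= sparsity <= 1:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_drop = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:num_to_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.001, 0.6, 0.08]
    print(magnitude_prune(weights, sparsity=0.5))
```

The result keeps its original shape, which is why a dense kernel gains nothing from it: the zeros only save work when the storage format and compute kernels skip them.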
Structured Pruning:
- Remove entire channels, filters, or attention heads
- Directly reduces model dimensions and computation
- Works with standard hardware without special kernels
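By contrast, structured pruning removes whole units, so the remaining matrices are simply smaller and still dense. The sketch below drops hidden neurons from a two-layer network by ranking the rows of the first weight matrix by squared L2 norm; the ranking criterion and the list-of-lists layout are assumptions chosen for brevity.

```python
def prune_neurons(w1, w2, keep):
    """Keep the `keep` hidden neurons whose incoming weights have the largest norm.

    w1 has shape [hidden][inputs]; w2 has shape [outputs][hidden].
    """
    norms = [sum(v * v for v in row) for row in w1]  # squared L2 norm per neuron
    ranked = sorted(range(len(w1)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve original neuron order
    new_w1 = [w1[i] for i in kept]                     # drop rows of layer 1
    new_w2 = [[row[i] for i in kept] for row in w2]    # drop columns of layer 2
    return new_w1, new_w2, kept


if __name__ == "__main__":
    w1 = [[1.0, 1.0], [0.01, 0.0], [2.0, 0.0], [0.0, 0.02]]  # 4 hidden neurons
    w2 = [[1.0, 2.0, 3.0, 4.0]]                               # 1 output
    print(prune_neurons(w1, w2, keep=2))
```

Both matrices shrink together, so the pruned network runs faster on ordinary dense hardware without any sparse kernels.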
Dynamic Pruning:
- Adaptively remove computations based on input
- Early exit mechanisms for simple inputs
- Related (static) result, the Lottery Ticket Hypothesis: dense networks contain sparse subnetworks that, trained in isolation, reach similar accuracy
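The early-exit idea can be sketched as a loop over model stages that stops as soon as an intermediate prediction is confident enough. The stage interface (a callable returning a new state and class probabilities) and the 0.9 threshold are hypothetical, chosen only to illustrate the control flow.

```python
def early_exit_predict(x, stages, threshold=0.9):
    """Run stages in order and stop once the top class probability is high enough.

    Each stage is a callable: state -> (new_state, class_probabilities).
    Returns (predicted_class, number_of_stages_run).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    state = x
    for depth, stage in enumerate(stages, start=1):
        state, probs = stage(state)
        confidence = max(probs)
        if confidence >= threshold:
            break  # easy input: skip the remaining, more expensive stages
    return probs.index(confidence), depth


if __name__ == "__main__":
    def cheap_stage(x):   # toy stage: confident only for large inputs
        p = 0.95 if x > 1 else 0.6
        return x, [p, 1 - p]

    def costly_stage(x):  # toy stage standing in for the rest of the model
        return x, [0.2, 0.8]

    print(early_exit_predict(2.0, [cheap_stage, costly_stage]))
    print(early_exit_predict(0.5, [cheap_stage, costly_stage]))
```

Inputs that exit early never pay for the later stages, so average compute per request falls while hard inputs still get the full model.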
4. Architectural Efficiency
Choose or design architectures optimized for efficiency:
Efficient Model Families:
- MobileNetV4: Optimized for mobile and edge devices
- EfficientNet: Compound scaling for optimal size-accuracy tradeoff
- SqueezeNet: Achieves AlexNet-level accuracy with 50x fewer parameters
- Efficient Transformers: Linear and sparse attention mechanisms that reduce the quadratic cost of self-attention in sequence length
Small but Capable LLMs (2025):
- Microsoft Phi-3 (3.8B, 7B, 14B): Small models with strong performance
- Mistral 7B: Efficient 7B parameter model reported to outperform Llama 2 13B on the benchmarks in its release
- Gemma 2B/7B (Google): Efficient open models for diverse tasks
- LLaMA 3 variants: Multiple sizes optimized for different deployment scenarios
5. Low-Rank Factorization
Decompose weight matrices into products of smaller matrices:
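- LoRA (Low-Rank Adaptation): Freezes the base weights and trains small low-rank update matrices, often <1% of the original parameter count
- QLoRA: Applies LoRA on top of a 4-bit quantized base model to cut fine-tuning memory further
- Enables fine-tuning large models on consumer GPUs
- Note that LoRA and QLoRA reduce the cost of fine-tuning and the size of per-task adapters, not the size of the base model itself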
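The parameter savings from factorizing a weight matrix, and a LoRA-style forward pass that uses such a factorization as an additive update to a frozen weight, can be sketched as follows. The matrix sizes, rank, and `alpha` scaling value are assumptions for illustration.

```python
def full_params(rows, cols):
    """Parameters in a dense rows x cols matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameters in the factors B (rows x rank) and A (rank x cols)."""
    return rank * (rows + cols)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x), with W frozen and A, B trainable."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [h + (alpha / rank) * u for h, u in zip(base, update)]


if __name__ == "__main__":
    ratio = low_rank_params(4096, 4096, 8) / full_params(4096, 4096)
    print(f"trainable fraction at rank 8: {ratio:.4%}")

    w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy)
    a = [[1.0, 1.0]]               # rank-1 factor A, shape 1 x 2
    b = [[1.0], [2.0]]             # rank-1 factor B, shape 2 x 1
    print(lora_forward(w, a, b, [1.0, 2.0], alpha=1.0, rank=1))
```

Only the two small factors are trained and stored per task, while the base matrix stays shared and unchanged.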
SCI Impact
SCI = (E * I) + M per R
Software Carbon Intensity Spec
Optimizing AI/ML model size impacts SCI as follows:
E: Reduces energy consumption for inference through:
- Lower memory bandwidth requirements
- Fewer computational operations
- Reduced storage I/O
- Enables running on more energy-efficient hardware
M: Reduces embodied carbon by:
- Requiring less memory capacity
- Enabling deployment on smaller, less powerful devices
- Reducing data center infrastructure needs
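To make the effect on the score concrete, the SCI formula above can be evaluated directly. The sketch assumes E in kWh, I in gCO2eq per kWh, M in gCO2eq, and R as a count of functional units such as inference requests; all numbers are hypothetical and are not measurements.

```python
def sci(energy_kwh, intensity_g_per_kwh, embodied_g, functional_units):
    """SCI = ((E * I) + M) per R, in gCO2eq per functional unit."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    return (energy_kwh * intensity_g_per_kwh + embodied_g) / functional_units


if __name__ == "__main__":
    # Hypothetical figures for 1,000 inference requests (assumptions).
    baseline = sci(energy_kwh=10, intensity_g_per_kwh=400,
                   embodied_g=1000, functional_units=1000)
    optimized = sci(energy_kwh=5, intensity_g_per_kwh=400,
                    embodied_g=1000, functional_units=1000)
    print(baseline, optimized)
```

In this toy case, halving the energy per batch of requests lowers the score even with embodied carbon held constant; a smaller hardware footprint would reduce the M term as well.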
Assumptions
- Model optimization is applied with accuracy evaluation to ensure acceptable performance
- Target deployment environment is known (cloud, edge, mobile) to guide optimization choices
- Optimization tools and frameworks are available for the model type
Considerations
- Accuracy-Size Tradeoff: Always evaluate optimized models against accuracy requirements
- Technique Selection: Different techniques work better for different model types:
  - LLMs: GPTQ, AWQ quantization + LoRA fine-tuning
  - Vision models: Quantization + pruning
  - Small models for edge: Knowledge distillation + MobileNet architectures
- Hardware Compatibility: Ensure optimization format is supported by target hardware
- Maintenance: Document optimization settings for reproducibility
- Iterative Approach: Start with simple techniques (post-training quantization) before complex methods
- Latency vs. Throughput: Some optimizations improve one at the expense of the other
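The accuracy-size tradeoff and the iterative approach can be combined into a simple acceptance gate: evaluate each optimized variant, discard those that lose too much accuracy, and take the smallest that remains. The candidate names, sizes, and accuracies below are hypothetical placeholders for real evaluation results.

```python
def pick_smallest_acceptable(candidates, baseline_accuracy, max_drop):
    """Return the smallest candidate whose accuracy drop stays within budget."""
    acceptable = [c for c in candidates
                  if baseline_accuracy - c["accuracy"] <= max_drop]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: c["size_mb"])


if __name__ == "__main__":
    # Hypothetical evaluation results (assumptions, not measurements).
    candidates = [
        {"name": "fp32", "size_mb": 400, "accuracy": 0.920},
        {"name": "fp16", "size_mb": 200, "accuracy": 0.919},
        {"name": "int8", "size_mb": 100, "accuracy": 0.910},
        {"name": "int4", "size_mb": 50, "accuracy": 0.850},
    ]
    choice = pick_smallest_acceptable(candidates, baseline_accuracy=0.920,
                                      max_drop=0.02)
    print(choice)
```

Recording the chosen variant together with the threshold used also supports the reproducibility point above.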
References
- ONNX Runtime - Model Optimization
- TensorRT - NVIDIA GPU Inference Optimization
- vLLM - High-Throughput LLM Serving
- GPTQ - Accurate Quantization for LLMs
- AWQ - Activation-aware Weight Quantization
- Knowledge Distillation
- The Lottery Ticket Hypothesis
- LoRA - Low-Rank Adaptation
- Software Carbon Intensity Spec