Solution
To address these challenges, the firm partnered with LogTwo, which applied its OptiML framework to compress and optimize the LLaMA 3.2 8B model. OptiML specializes in compression during fine-tuning and quantization-aware fine-tuning, so that the accuracy gained during fine-tuning is preserved through optimization.
The optimization process included the following key steps:
1. 2:4 Sparsity using OptiML fine-tuning
- LogTwo applied 2:4 sparsity during fine-tuning, zeroing two of every four weights. This halved the number of active parameters from 8 billion to 4 billion, with almost no accuracy loss compared to the dense fine-tuned model.
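Concretely, 2:4 (semi-structured) sparsity keeps only the two largest-magnitude weights in every contiguous group of four. The following is a minimal pure-Python sketch of that pruning pattern, not OptiML's actual implementation, which learns the sparse weights during fine-tuning:

```python
def apply_2to4_sparsity(weights):
    """Zero the two smallest-magnitude weights in each group of four.

    Illustrative only: real 2:4 sparsity is applied per row of a weight
    tensor and maintained throughout fine-tuning, not as a one-shot mask.
    """
    pruned = list(weights)
    for i in range(0, len(pruned) - len(pruned) % 4, 4):
        group = pruned[i:i + 4]
        # Indices of the two smallest-magnitude entries in this group.
        drop = sorted(range(4), key=lambda j: abs(group[j]))[:2]
        for j in drop:
            pruned[i + j] = 0.0
    return pruned
```

Because exactly half the weights in every group of four are zero, NVIDIA sparse tensor cores can skip them in hardware, which is what makes this pattern attractive compared to unstructured sparsity.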
2. Pruning Attention Heads and Layers
- LogTwo further reduced the model size through attention head and layer pruning, cutting an additional 20% of parameters, resulting in a model with 3.2 billion effective parameters.
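Head pruning generally works by ranking attention heads with an importance score and discarding the weakest. The scoring method below (a plain list of per-head scores) is a hypothetical stand-in, since the case study does not describe how LogTwo measures head importance:

```python
def select_heads_to_keep(head_scores, keep_fraction=0.8):
    """Return indices of the highest-scoring attention heads.

    head_scores: one importance score per head. How the scores are
    computed (e.g., from gradients or activations) is assumed here,
    not documented by the source.
    """
    k = max(1, int(len(head_scores) * keep_fraction))
    ranked = sorted(range(len(head_scores)),
                    key=lambda i: head_scores[i], reverse=True)
    return sorted(ranked[:k])
```

The same idea extends to whole layers: score each layer's contribution, drop the lowest-scoring ones, then fine-tune to recover any lost accuracy.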
3. 8-Bit Quantization with OptiML
- Quantization-aware fine-tuning was applied to compress the model to 8-bit precision, reducing the memory footprint and compute load. This decreased the model's memory requirement from 32 GB (the original 8-billion-parameter model at 32-bit precision) to 2.98 GB, roughly a 10x reduction in memory usage while preserving accuracy.
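Quantization-aware fine-tuning simulates low precision in the forward pass ("fake quantization") so the weights adapt to rounding error during training. A simplified symmetric-quantization sketch, together with the memory arithmetic behind the 2.98 GB figure (3.2 billion parameters at one byte each):

```python
def fake_quantize(values, bits=8):
    """Quantize-then-dequantize, simulating int8 precision during training.

    Simplified sketch: one symmetric scale for the whole tensor; real
    schemes are typically per-channel or per-group.
    """
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = max((abs(v) for v in values), default=0.0) / qmax or 1.0
    return [round(v / scale) * scale for v in values]

# Memory check: 3.2 billion parameters x 1 byte (int8), expressed in GiB.
int8_gib = 3.2e9 / 2 ** 30
print(round(int8_gib, 2))  # 2.98
```

The rounding error introduced by `fake_quantize` flows into the training loss, so the model learns weights that remain accurate once stored in true int8.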
4. Inference Optimization
- FlashAttention: speeds up attention operations by optimizing memory access patterns during inference.
- TensorRT: provides dynamic batching, enabling multiple inference requests to be processed together more efficiently.
- vLLM: optimizes key-value cache management and provides token-level parallelism, reducing latency during autoregressive inference.
- CUDA kernel fusion: combines multiple GPU operations into a single kernel launch, improving overall throughput by reducing launch overhead and memory traffic.
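To make the key-value-cache idea concrete: during autoregressive decoding, the keys and values of past tokens are cached so each step only computes attention inputs for the newest token. A toy single-head cache, with no paging, scaling, or batching, purely to illustrate why caching removes redundant prefix computation:

```python
import math

class KVCache:
    """Toy KV cache: append one (key, value) per generated token and
    attend over everything cached so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        # One new entry per decode step; the prefix is never recomputed.
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query):
        # Softmax over dot-product scores against all cached keys.
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        exps = [math.exp(s - max(scores)) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values))
                for d in range(dim)]
```

vLLM's actual contribution (PagedAttention) goes further, managing this cache in fixed-size blocks to avoid memory fragmentation across many concurrent requests; the sketch above shows only the caching idea itself.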