Solution
To address these challenges, the firm partnered with LogTwo, which applied its OptiML framework to compress and optimize the LLaMA 3.2 8B model. OptiML specializes in compression during fine-tuning and quantization-aware fine-tuning, so that the accuracy gained during fine-tuning is preserved through optimization.
The optimization process included the following key steps:
1. 2:4 Sparsity using OptiML fine-tuning
- LogTwo applied 2:4 sparsity during fine-tuning, zeroing two of every four weights. This halved the number of active parameters from 8 billion to 4 billion, with almost no accuracy loss compared to the dense fine-tuned model.
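Concretely, 2:4 (semi-structured) sparsity keeps only the two largest-magnitude weights in every contiguous group of four. The following is a minimal pure-Python sketch of that pruning pattern, not OptiML's actual implementation, which learns the sparse weights during fine-tuning:

```python
def apply_2to4_sparsity(weights):
    """Zero the two smallest-magnitude weights in each group of four.

    Illustrative only: real 2:4 sparsity is applied per row of a weight
    tensor and maintained throughout fine-tuning, not as a one-shot mask.
    """
    pruned = list(weights)
    for i in range(0, len(pruned) - len(pruned) % 4, 4):
        group = pruned[i:i + 4]
        # Indices of the two smallest-magnitude entries in this group.
        drop = sorted(range(4), key=lambda j: abs(group[j]))[:2]
        for j in drop:
            pruned[i + j] = 0.0
    return pruned
```

Because exactly half the weights in every group of four are zero, NVIDIA sparse tensor cores can skip them in hardware, which is what makes this pattern attractive compared to unstructured sparsity.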
2. Pruning Attention Heads and Layers
- LogTwo further reduced the model size through attention head and layer pruning, cutting an additional 20% of parameters, resulting in a model with 3.2 billion effective parameters.
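Head pruning generally works by ranking attention heads with an importance score and discarding the weakest. The scoring method below (a plain list of per-head scores) is a hypothetical stand-in, since the case study does not describe how LogTwo measures head importance:

```python
def select_heads_to_keep(head_scores, keep_fraction=0.8):
    """Return indices of the highest-scoring attention heads.

    head_scores: one importance score per head. How the scores are
    computed (e.g., from gradients or activations) is assumed here,
    not documented by the source.
    """
    k = max(1, int(len(head_scores) * keep_fraction))
    ranked = sorted(range(len(head_scores)),
                    key=lambda i: head_scores[i], reverse=True)
    return sorted(ranked[:k])
```

The same idea extends to whole layers: score each layer's contribution, drop the lowest-scoring ones, then fine-tune to recover any lost accuracy.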
3. 8-Bit Quantization with OptiML
- Quantization-aware fine-tuning was applied to compress the model to 8-bit precision, reducing the memory footprint and compute load. This decreased the model's memory requirement from 32 GB (the original 8-billion-parameter model at 32-bit precision) to 2.98 GB, roughly a 10x reduction in memory usage while preserving accuracy.
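Quantization-aware fine-tuning simulates low precision in the forward pass ("fake quantization") so the weights adapt to rounding error during training. A simplified symmetric-quantization sketch, together with the memory arithmetic behind the 2.98 GB figure (3.2 billion parameters at one byte each):

```python
def fake_quantize(values, bits=8):
    """Quantize-then-dequantize, simulating int8 precision during training.

    Simplified sketch: one symmetric scale for the whole tensor; real
    schemes are typically per-channel or per-group.
    """
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = max((abs(v) for v in values), default=0.0) / qmax or 1.0
    return [round(v / scale) * scale for v in values]

# Memory check: 3.2 billion parameters x 1 byte (int8), expressed in GiB.
int8_gib = 3.2e9 / 2 ** 30
print(round(int8_gib, 2))  # 2.98
```

The rounding error introduced by `fake_quantize` flows into the training loss, so the model learns weights that remain accurate once stored in true int8.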
4. Inference Optimization
- FlashAttention: speeds up attention operations by optimizing memory access patterns during inference.
- TensorRT: provides dynamic batching, enabling multiple inference requests to be processed together more efficiently.
- vLLM: optimizes key-value cache management and provides token-level parallelism, reducing latency during autoregressive inference.
- CUDA kernel fusion: combines multiple GPU operations into a single kernel launch, improving overall throughput by reducing launch overhead and memory traffic.
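To make the key-value-cache idea concrete: during autoregressive decoding, the keys and values of past tokens are cached so each step only computes attention inputs for the newest token. A toy single-head cache, with no paging, scaling, or batching, purely to illustrate why caching removes redundant prefix computation:

```python
import math

class KVCache:
    """Toy KV cache: append one (key, value) per generated token and
    attend over everything cached so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        # One new entry per decode step; the prefix is never recomputed.
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query):
        # Softmax over dot-product scores against all cached keys.
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        exps = [math.exp(s - max(scores)) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values))
                for d in range(dim)]
```

vLLM's actual contribution (PagedAttention) goes further, managing this cache in fixed-size blocks to avoid memory fragmentation across many concurrent requests; the sketch above shows only the caching idea itself.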