Optimizing Mistral Large 2 with OptiML for Compliance Analysis

Powered by OptiML for model compression and fine-tuning, and FlashAttention and vLLM for inference efficiency, we help accelerate AI.

Introduction

A global professional services firm specializing in risk and regulatory compliance sought to enhance its compliance processes using large language models (LLMs). Initially, the firm considered GPT-4o, a widely used general-purpose model capable of handling long-context documents. However, GPT-4o was not optimized for the precision and efficiency required for complex, domain-specific regulatory tasks. To address this, the firm selected Mistral Large 2, a model designed for long-context document processing and high-throughput operations. Fine-tuning and optimization were critical to adapt the model for tasks such as regulatory gap identification and multilingual compliance analysis.

Objectives

Improve Domain-Specific Compliance Analysis
Fine-tune Mistral Large 2 to specialize in legal and regulatory language, enhancing its performance in regulatory gap identification and multilingual compliance analysis.
Reduce Operational Costs and Increase Efficiency
Optimize the model’s computational requirements to reduce processing costs and increase throughput for large-scale compliance operations without sacrificing accuracy.
Enhance Scalability for Global Operations
Enable the model to handle vast volumes of compliance data and provide real-time analysis across multiple jurisdictions while reducing the need for costly infrastructure upgrades.

Results at a Glance

Performance Improvements
  • Achieved a 20% increase in regulatory gap identification accuracy compared to GPT-4o.
  • OptiML optimizations provided a 25x improvement in throughput, enabling faster data processing and analysis.
Cost Efficiency
  • 10x reduction in token processing costs through compression and quantization, making large-scale compliance analysis significantly more affordable.
  • Reduced memory usage and computational resources by 50% with 2:4 sparsity compression, lowering energy consumption and cutting operational costs.
Scalability and Speed
  • Enabled 128k context utilization, allowing full policy and regulatory documents to be processed in one pass, ensuring complete data analysis.
  • Utilized dynamic batching and parallelism to process multiple compliance documents simultaneously, increasing scalability for multinational clients.
Accuracy Retained
  • Maintained the 20% improvement in gap identification accuracy post-compression and quantization, ensuring the model’s precision across global jurisdictions.


Problem

Although Mistral Large 2 had the architecture needed for long-context document processing, its out-of-the-box performance lacked the domain-specific expertise required for precise legal and regulatory text analysis. The firm faced multiple challenges:
  • Domain-specific customization: The model needed to specialize in legal and regulatory language to meet the firm's compliance requirements.
  • Processing cost and speed: The model's initial fine-tuning was resource-intensive, making large-scale operations costly and inefficient.
  • Scalability: Without optimization, the firm found it challenging to handle the vast volumes of data required for global compliance monitoring.
While fine-tuning improved Mistral Large 2’s accuracy and made it outperform GPT-4o, it wasn’t enough. The firm needed further optimization to reduce costs and improve operational efficiency at scale.

Solution

The firm implemented a two-phase approach to address both the performance and cost challenges:
1. Phase 1: Fine-tuning Mistral Large 2
  • The Mistral Large 2 model was first fine-tuned to specialize in regulatory text analysis. This adaptation allowed the model to better understand the intricacies of sector-specific language and regulatory frameworks, outperforming GPT-4o by 20% in regulatory gap identification accuracy (a hedged sketch of a comparable open-source setup follows this list).
2. Phase 2: Fine-tuning and compressing with OptiML
  • After achieving better performance than GPT-4o, the firm introduced OptiML to further reduce costs and improve efficiency without sacrificing quality.
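To make Phase 1 concrete, here is a minimal sketch of a domain-adaptive fine-tune on regulatory text. OptiML's own pipeline is proprietary and not shown in this case study, so the sketch assumes the open-source Hugging Face stack (transformers, peft, datasets); the model id and the two-line toy corpus are illustrative placeholders rather than details from the engagement.

```python
# A minimal sketch, assuming the open-source Hugging Face stack stands in for
# OptiML's proprietary pipeline. MODEL_ID and the toy corpus are placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "mistralai/Mistral-Large-Instruct-2407"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# LoRA trains small low-rank adapters while the base weights stay frozen,
# keeping domain adaptation cheap relative to a full fine-tune.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Two-line toy corpus standing in for the firm's regulatory training data.
corpus = Dataset.from_dict({"text": [
    "Article 12(1): the controller shall provide information in a concise form.",
    "Firms must retain transaction records for a minimum of five years.",
]})
tokenized = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True),
                       remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="regulatory-lora",
                           per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

LoRA is shown here only because it is a common, low-cost way to specialize a large model; the case study does not disclose which fine-tuning method was actually used.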
Several key techniques were used during this optimization process; illustrative sketches of each follow the list:
  • 2:4 Sparsity Compression: OptiML applied a structured sparsity pattern in which only two out of every four weights were retained. Cutting active parameters by 50% lowered memory usage and computational overhead, yielding faster inference times and reduced energy consumption.
  • Quantization-Aware Fine-Tuning: The model was fine-tuned with 8-bit quantization, reducing the precision of its computations and lowering the computational load by 10x while retaining accuracy. Quantization was especially beneficial for large-scale operations where memory and compute resources are critical.
  • Inference Optimization: OptiML further optimized inference through several advanced techniques:
      • FlashAttention: This technique enhanced memory efficiency by reducing the memory required to store intermediate results during attention calculations, leading to faster processing and lower latency for real-time compliance checks.
      • vLLM (Optimized Inference Engine): The vLLM engine accelerated inference, allowing quicker responses while reducing resource consumption, which was especially valuable where the firm needed real-time compliance analysis.
      • Dynamic Batching and Token-Level Parallelism: These optimizations allowed multiple compliance documents to be processed simultaneously, increasing throughput while maintaining high accuracy.
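The 2:4 pattern itself is easy to express. The sketch below, in plain PyTorch, keeps the two largest-magnitude weights in every contiguous group of four and zeroes the rest; OptiML's actual compression pass, including any recovery fine-tuning around it, is not public, so this illustrates only the masking rule.

```python
# A minimal sketch of 2:4 (semi-structured) sparsity in plain PyTorch: in every
# contiguous group of four weights, keep the two largest magnitudes and zero
# the other two. This is the pattern NVIDIA sparse tensor cores accelerate.
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every group of 4."""
    w = weight.reshape(-1, 4)                      # view as groups of four
    keep = w.abs().topk(k=2, dim=1).indices        # two largest |w| per group
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(1, keep, True)
    return (w * mask).reshape(weight.shape)

w = torch.randn(8, 8)
sparse_w = apply_2_4_sparsity(w)
assert (sparse_w.reshape(-1, 4) != 0).sum(dim=1).le(2).all()  # <= 2 per group
print(f"density: {(sparse_w != 0).float().mean():.2f}")       # ~0.50
```

On PyTorch 2.1+, `torch.sparse.to_sparse_semi_structured` can convert weights masked this way into a layout that sparse tensor cores execute at roughly double the math throughput, which is where the latency and energy savings come from.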
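Quantization-aware fine-tuning can likewise be illustrated with a toy "fake quant" training loop: the forward pass rounds weights to 8-bit levels while a straight-through estimator keeps gradients flowing, so the model learns weights that stay accurate under int8 inference. This is a generic sketch of the mechanism, not OptiML's recipe.

```python
# A minimal sketch of quantization-aware training with 8-bit "fake quant".
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize to int8 levels; straight-through gradient."""
    scale = w.abs().max() / 127.0 + 1e-12           # symmetric per-tensor scale
    q = (w / scale).round().clamp(-127, 127) * scale
    return w + (q - w).detach()                     # forward=q, backward=identity

layer = torch.nn.Linear(16, 16)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, target = torch.randn(4, 16), torch.randn(4, 16)

for _ in range(10):
    # Forward uses quantized weights, so training adapts to int8 precision.
    y = torch.nn.functional.linear(x, fake_quant_int8(layer.weight), layer.bias)
    loss = torch.nn.functional.mse_loss(y, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```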
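For the FlashAttention step, the call pattern looks like the following on a CUDA GPU. PyTorch's built-in `scaled_dot_product_attention` dispatches to a FlashAttention-style fused kernel on supported hardware; the deployment described above used the FlashAttention library inside its serving stack, so treat this as an illustration of the memory behavior rather than the deployed code.

```python
# A minimal sketch of fused, memory-efficient attention (requires a CUDA GPU).
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

# The fused kernel never materializes the full 4096 x 4096 score matrix in
# GPU memory, which is what keeps memory flat as context length grows.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```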
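Finally, a hedged sketch of serving with vLLM, whose continuous batching and paged attention provide the dynamic batching described above out of the box. The checkpoint path, context-length setting, and prompt format here are assumptions for illustration only.

```python
# A minimal sketch of serving the compressed model with vLLM. The checkpoint
# path is a placeholder; vLLM batches requests dynamically at the token level,
# letting new sequences join the running batch as others finish.
from vllm import LLM, SamplingParams

llm = LLM(model="./regulatory-mistral-compressed",  # placeholder checkpoint
          max_model_len=131072)                     # full 128k context window

params = SamplingParams(temperature=0.0, max_tokens=512)  # deterministic output

documents = ["Policy A full text ...", "Regulation B full text ..."]
prompts = [f"Identify regulatory gaps in the following document:\n{d}"
           for d in documents]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:200])
```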

Results

The combination of fine-tuning Mistral Large 2 and applying OptiML’s compression and optimization techniques led to remarkable improvements in performance and cost-efficiency:
1. Performance Improvements
  • The fine-tuned model outperformed GPT-4o by 20% in regulatory gap identification accuracy.
  • OptiML’s optimization techniques enabled a 25x improvement in throughput, allowing the firm to process larger datasets in less time.
2. Cost Efficiency
  • 10x Reduction in Token Processing Costs: Thanks to OptiML’s compression and quantization techniques, token processing costs were reduced by 10x, making large-scale compliance analysis affordable.
  • Lowered Computational Resources: The system required significantly fewer computational resources due to sparsity and quantization, allowing the firm to scale operations without the need for costly infrastructure upgrades.
  • Memory Usage Reduction: 2:4 sparsity lowered the memory required for processing, leading to a substantial reduction in energy consumption and operational costs.
3. Scalability and Speed
  • Full 128k Context Utilization: The system could process entire corporate policies and regulatory documents in one pass, ensuring no critical information was lost.
  • Dynamic Batching: Enabled the firm to process multiple documents simultaneously, scaling operations for multinational clients needing real-time compliance checks across various jurisdictions.
4. Accuracy Retained
  • The fine-tuned and compressed model retained the same 20% improvement in gap identification accuracy, ensuring precise compliance analysis across multiple jurisdictions while reducing resource use.

Conclusion

By first fine-tuning Mistral Large 2 and then applying OptiML's advanced compression and inference optimization techniques, the firm successfully created a highly efficient, scalable, and cost-effective compliance solution. The fine-tuned model outperformed GPT-4o in accuracy, and OptiML’s optimizations reduced complexity, resource usage, and operational costs without sacrificing quality. OptiML's techniques, such as sparsity compression and quantization, enabled the firm to maintain high accuracy while significantly lowering computational overhead and scaling compliance services globally. This solution allowed the firm to handle more clients, reduce processing times, and offer competitive pricing, resulting in better compliance outcomes and greater client satisfaction.

About LogTwo

LogTwo specializes in optimizing AI models for cost-effective, scalable performance. With cutting-edge solutions like OptiML, FlashAttention, and vLLM, LogTwo enables businesses to compress, fine-tune, and optimize their models for both training and inference, ensuring fast, accurate, and real-time results.

Let us show you how we can help you on your AI journey.

Contact Us