How Sanas AI unlocked real-time voicebot performance with near-zero accuracy loss at half the cost
Accuracy Retained
99.99%
Cost Reduction
53%
Company
Sanas
Use Case
Real-time, production-grade voicebot requiring ultra-low-latency inference with strict accuracy and cost constraints.
Highlights
  • Near-zero-loss INT4 quantization for the Gemma 3 model.
  • Latency-optimized inference on cost-efficient NVIDIA A10G GPUs.
  • Global replication & geo-based routing for lowest latency path.

Company Background

Sanas builds real-time speech AI technology deployed in production contact-center environments, where voicebot workloads demand strict latency guarantees, high conversational accuracy, and global availability.

The Gemma 3 model powering their voicebot initially ran on high-end NVIDIA H100 GPUs to meet real-time performance expectations.

The Problem

Sanas faced a multi-dimensional challenge:

  • High inference cost: H100s met the SLA, but scaling them globally was expensive.
  • Latency sensitivity: Conversational systems cannot tolerate even small delays.
  • Accuracy constraints: Any quantization or optimization must preserve accuracy to near-perfect levels.
  • Geo-distributed users: Calls originate globally, so responses need to come from the nearest region.

They needed a solution that preserved routing quality, kept latency low, reduced hardware cost, and scaled cleanly.

Solution

Simplismart partnered with Sanas to optimize both model inference and infrastructure topology while maintaining model accuracy.

1. Ultra-precise INT4 quantization

The Simplismart team performed a custom INT4 quantization pass on Sanas’s Gemma 3 model, enabling it to run efficiently on A10G GPUs while achieving:

  • ~0.01% accuracy drop on their specific task, effectively imperceptible.
  • A low memory footprint, enabling cost-efficient GPU deployment.

This allowed inference to run on much cheaper NVIDIA A10G GPUs without hurting quality.
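To make the idea concrete, weight-only INT4 quantization can be sketched as symmetric per-channel rounding of each weight row into the 16-level INT4 range. This is a generic illustration, not Simplismart's actual quantization pass; all function names and values here are hypothetical.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-output-channel weight-only INT4 quantization.

    The symmetric INT4 range is [-8, 7]; each output channel (row)
    gets its own scale so large rows don't crush small ones.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit codes stored in int8
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from codes and scales."""
    return q.astype(np.float32) * scale

# Illustrative check: reconstruction error on a random weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative reconstruction error: {rel_err:.4f}")
```

Production pipelines typically go further (calibration data, activation-aware scaling, evaluation on the target task), but the storage win is the same: 4 bits per weight instead of 16, which is what makes a Gemma 3 model fit comfortably on an A10G.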

2. Global replication + geo-based routing

  • Deployed the model across multiple globally distributed regions.
  • Built intelligent geo- and latency-aware routing that directs each request to the region delivering the lowest real-time latency, not just the nearest zone.
  • Reduced round-trip latency for users across continents.

3. Rapid autoscaling for spiky conversational traffic

The Simplismart team designed autoscaling specifically tuned for real-time voice workloads:

  • Event-driven triggers enable instant scaling on traffic surges.
  • State-aware logic optimizes resource decisions under load.
  • Zero cold-start delay to maintain consistent low-latency performance.

Together, these kept the system responsive and cost-efficient.
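One way to express the scaling decision behind these properties: size the replica pool from live session counts with a warm headroom, so a traffic surge lands on capacity that already exists rather than waiting on a cold start. This is an illustrative sketch only; the function name and all parameter defaults are hypothetical, not Simplismart's actual policy.

```python
import math

def desired_replicas(active_sessions: int,
                     sessions_per_replica: int = 20,
                     headroom: float = 0.25,
                     min_replicas: int = 2) -> int:
    """Event-driven scaling decision for spiky voice traffic.

    Keeps a warm buffer above current demand so new calls never
    pay a cold-start penalty, and never scales below a warm floor.
    """
    needed = active_sessions / sessions_per_replica
    with_headroom = math.ceil(needed * (1 + headroom))
    return max(min_replicas, with_headroom)

print(desired_replicas(0))    # quiet period: hold the warm floor
print(desired_replicas(400))  # surge: scale out ahead of demand
```

The key design choice is that scaling is driven by session events rather than lagging CPU metrics, which is what keeps latency flat while call volume swings.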

Results

  • 99.99% accuracy retention after INT4 quantization.
  • SLA met on NVIDIA A10G GPUs.
  • 53% cost savings by shifting inference to A10G GPUs.
  • Global low-latency serving through geo-based routing.
  • Real-time performance under rapidly changing traffic loads.

With Simplismart, Sanas transformed its real-time voice AI stack from a costly, high-end GPU deployment into a globally scalable, latency-optimized system. Inference now runs reliably on cost-efficient A10G GPUs without sacrificing conversational accuracy or SLA guarantees.

Traffic bursts are handled effortlessly through rapid autoscaling, inference stays consistently fast through geo-based routing, and quantization preserves accuracy at near-perfect levels.

The result? A voicebot that feels instant everywhere, costs significantly less to operate, and scales cleanly as adoption grows.

Find out what tailor-made inference looks like for you.