How Sanas AI unlocked real-time voicebot performance with near-zero accuracy loss at half the cost
Accuracy Retained
99.99%
Cost Reduction
53%
Company
Sanas
Use Case
Real-time, production-grade voicebot requiring ultra-low-latency inference with strict accuracy and cost constraints.
Highlights
  • Near-zero-loss INT4 quantization for the Gemma 3 model.
  • Latency-optimized inference on cost-efficient NVIDIA A10G GPUs.
  • Global replication & geo-based routing for lowest latency path.

Company Background

Sanas builds real-time speech AI technology deployed in production contact-center environments, where voicebot workloads demand strict latency guarantees, high conversational accuracy, and global availability.

The Gemma 3 model powering their voicebot initially ran on high-end NVIDIA H100 GPUs to meet real-time performance expectations.

The Problem

Sanas faced a multi-dimensional challenge:

  • High inference cost: H100s met the SLA, but scaling them globally was expensive.
  • Latency sensitivity: Conversational systems cannot tolerate even small delays.
  • Accuracy constraints: Any quantization or optimization must preserve accuracy to near-perfect levels.
  • Geo-distributed users: Calls originate globally, so responses need to come from the nearest region.

They needed a solution that preserved routing quality, kept latency low, reduced hardware cost, and scaled cleanly.

Solution

Simplismart partnered with Sanas to optimize both model inference and infrastructure topology while maintaining model accuracy.

1. Ultra-precise INT4 quantization

The Simplismart team performed a custom INT4 quantization pass on Sanas’s Gemma 3 model, enabling it to run efficiently on A10G GPUs while achieving:

  • ~0.01% accuracy drop on their specific task, effectively imperceptible.
  • A low memory footprint, enabling cost-efficient GPU deployment.

This allowed inference to run on much cheaper NVIDIA A10G GPUs without hurting quality.
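To make the idea concrete, weight-only INT4 quantization can be sketched as symmetric per-channel rounding of each weight row into the 16-level INT4 range. This is a generic illustration, not Simplismart's actual quantization pass; all function names and values here are hypothetical.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-output-channel weight-only INT4 quantization.

    The symmetric INT4 range is [-8, 7]; each output channel (row)
    gets its own scale so large rows don't crush small ones.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit codes stored in int8
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from codes and scales."""
    return q.astype(np.float32) * scale

# Illustrative check: reconstruction error on a random weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative reconstruction error: {rel_err:.4f}")
```

Production pipelines typically go further (calibration data, activation-aware scaling, evaluation on the target task), but the storage win is the same: 4 bits per weight instead of 16, which is what makes a Gemma 3 model fit comfortably on an A10G.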

2. Global replication + geo-based routing

  • Deployed the model across multiple globally distributed regions.
  • Built intelligent geo- and latency-aware routing that directs each request to the region delivering the lowest real-time latency, not just the nearest zone.
  • Reduced round-trip latency for users across continents.

3. Rapid autoscaling for spiky conversational traffic

The Simplismart team designed autoscaling specifically tuned for real-time voice workloads:

  • Event-driven triggers enable instant scaling on traffic surges.
  • State-aware logic optimizes resource decisions under load.
  • Zero cold-start delay to maintain consistent low-latency performance.

Together, these kept the system responsive and cost-efficient.
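One way to express the scaling decision behind these properties: size the replica pool from live session counts with a warm headroom, so a traffic surge lands on capacity that already exists rather than waiting on a cold start. This is an illustrative sketch only; the function name and all parameter defaults are hypothetical, not Simplismart's actual policy.

```python
import math

def desired_replicas(active_sessions: int,
                     sessions_per_replica: int = 20,
                     headroom: float = 0.25,
                     min_replicas: int = 2) -> int:
    """Event-driven scaling decision for spiky voice traffic.

    Keeps a warm buffer above current demand so new calls never
    pay a cold-start penalty, and never scales below a warm floor.
    """
    needed = active_sessions / sessions_per_replica
    with_headroom = math.ceil(needed * (1 + headroom))
    return max(min_replicas, with_headroom)

print(desired_replicas(0))    # quiet period: hold the warm floor
print(desired_replicas(400))  # surge: scale out ahead of demand
```

The key design choice is that scaling is driven by session events rather than lagging CPU metrics, which is what keeps latency flat while call volume swings.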

Results

  • 99.99% accuracy retention after INT4 quantization.
  • SLA met on NVIDIA A10G GPUs.
  • 53% cost savings by shifting inference to A10G GPUs.
  • Global low-latency serving through geo-based routing.
  • Real-time performance under rapidly changing traffic loads.

With Simplismart, Sanas transformed its real-time voice AI stack from a costly, high-end GPU deployment into a globally scalable, latency-optimized system. Inference now runs reliably on cost-efficient A10G GPUs without sacrificing conversational accuracy or SLA guarantees.

Traffic bursts are handled effortlessly through rapid autoscaling, inference stays consistently fast through geo-based routing, and quantization preserves accuracy at near-perfect levels.

The result? A voicebot that feels instant everywhere, costs significantly less to operate, and scales cleanly as adoption grows.

Find out what tailor-made inference looks like for you.