From 10 seconds to <1s: How Ema made agentic routing feel instant with Simplismart
End-to-end latency: 10s → 0.9s
Scale-up time: 6 min → <1 min
Combined router latency: <900 ms end-to-end
Company: Ema
Use case: Enterprise agentic AI platform needing ultra-low-latency routing across a multi-model stack
Highlights
  • Continuous batching for high GPU utilization
  • Overlap scheduling to parallelize compute and I/O
  • CUDA kernel & compilation optimizations tailored to LLM inference

Company Background 

Ema is building an enterprise-facing AI agent platform that plugs into internal systems and gives customers the flexibility to pick from 12+ foundation models (e.g., GPT, DeepSeek, and others) for downstream tasks. On top of the foundation layer, Ema runs a proprietary intelligence layer composed of two custom LLMs (~1B and ~8B parameters) that act as routing models: they parse enterprise queries, apply a reward signal, and decide which model in the stack should process each request. This architecture lets Ema deliver both control and performance for complex, multi-tool enterprise flows.
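Conceptually, the routing flow resembles the sketch below: the lightweight 1B router scores the model stack first, and the 8B router is consulted when confidence is low. This is a minimal illustration only; the names (RouteDecision, score_models) and the confidence-threshold escalation policy are assumptions, not Ema's actual design.

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    model_id: str      # which foundation model should serve the request
    confidence: float  # reward-model score backing the decision

def route(query: str, router_1b, router_8b, threshold: float = 0.8) -> RouteDecision:
    """Cheap 1B pass first; escalate to the 8B router when confidence is low."""
    decision = router_1b.score_models(query)  # fast first pass over the model stack
    if decision.confidence >= threshold:
        return decision
    return router_8b.score_models(query)      # higher-quality second opinion
```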

The Problem

Because Ema’s routing LLMs allocate work rather than generate final responses, the latency budget was effectively zero: every extra millisecond degraded perceived performance and user trust. In production, however, the team observed:

  • ~3.5s latency for the 1B router
  • ~6.5s latency for the 8B router
  • Sequential execution pushed end-to-end latency beyond 10s, missing enterprise SLAs that required <1s model selection
  • Compounding the issue, scale-up time for new workloads was about 6 minutes, which hurt handling of spiky traffic

Solution

To break through the latency wall without sacrificing accuracy or routing quality, Simplismart partnered with Ema on a three-layer optimization plan spanning serving, inference, and scaling, with careful change isolation and A/B validation at each step.

1) Serving-layer throughput: continuous batching

  • Implemented continuous batching to keep GPUs hot, reduce idle gaps between microbatches, and smooth out queueing variance under bursty traffic (a minimal sketch follows this list).
  • Batched work across similar sequence lengths to avoid tail-latency cliffs.
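To make the idea concrete, here is a minimal sketch of a continuous-batching loop, assuming a hypothetical engine.decode_step interface rather than Simplismart's actual API:

```python
import queue
import time

def continuous_batching_loop(requests: queue.Queue, engine, max_batch: int = 32):
    """Sketch only: new requests join the in-flight batch at every decode step
    instead of waiting for the current batch to drain (engine.decode_step is
    an assumed interface)."""
    active = []  # sequences currently being decoded
    while True:
        # Admit waiting requests up to capacity without stalling the GPU.
        while len(active) < max_batch:
            try:
                active.append(requests.get_nowait())
            except queue.Empty:
                break
        if not active:
            time.sleep(0.001)  # brief idle wait when nothing is queued
            continue
        # One decode step advances every active sequence by one token.
        finished = engine.decode_step(active)
        for req in finished:  # completed sequences free their slots immediately
            active.remove(req)
```

Because slots are refilled at every step, short requests never wait behind long ones, which is what keeps utilization high and tail latency flat under bursty traffic.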

2) Pipeline concurrency: overlap scheduling

  • Introduced overlap scheduling to parallelize CPU preprocessing, GPU compute, and host↔device transfers, minimizing stalls (see the sketch after this list).
  • Re-sequenced the 1B/8B router duo to remove unnecessary sequential steps, cutting end-to-end routing time even when both models are consulted.
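Below is a sketch of the overlap pattern using CUDA streams in PyTorch; it shows the general technique (copying the next batch while computing on the current one), not Simplismart's scheduler:

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for host-to-device copies

def overlapped_step(model, current_gpu_batch, next_cpu_batch):
    """Overlap the H2D transfer of the next batch with compute on the current one."""
    with torch.cuda.stream(copy_stream):
        # Pinned memory plus non_blocking=True makes the copy truly asynchronous.
        next_gpu_batch = next_cpu_batch.pin_memory().to("cuda", non_blocking=True)
    out = model(current_gpu_batch)  # runs on the default stream, concurrently with the copy
    # Ensure the copy has landed before the next step consumes the batch.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return out, next_gpu_batch
```

The same principle applies to the router duo: issuing the 1B and 8B calls concurrently instead of sequentially removes the additive latency when both are consulted.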

3) Inference-kernel & graph compilation improvements

  • Applied CUDA kernel and compilation optimizations purpose-built for the LLM inference paths used by the routers (a compilation sketch follows this list).
  • Tuned launch configs and operator fusion points to raise effective SM occupancy under real-world prompt/sequence distributions.
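As one illustration of the compilation side, PyTorch's torch.compile with its "reduce-overhead" mode applies operator fusion and CUDA-graph capture, removing the per-kernel launch overhead that dominates latency on small models. The stand-in model below is an assumption for demonstration, not Ema's router:

```python
import torch
from torch import nn

# Tiny stand-in for a router model (the real architecture is not shown here).
router = nn.Sequential(nn.Embedding(32_000, 256), nn.Linear(256, 12)).cuda().eval()

# "reduce-overhead" enables CUDA-graph capture on top of operator fusion.
compiled_router = torch.compile(router, mode="reduce-overhead")

input_ids = torch.randint(0, 32_000, (1, 128), device="cuda")
with torch.inference_mode():
    logits = compiled_router(input_ids)  # first call compiles; later calls replay cached graphs
```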

4) Fast elasticity: scale-time reduction

  • Removed pipeline bottlenecks and tuned autoscaling triggers/limits so the system could scale in under one minute without cold-start spikes (an illustrative trigger sketch follows).
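As a simplified illustration of trigger tuning, an autoscaler can target in-flight requests with headroom so bursts do not queue while new replicas warm up; all values below are assumptions, not Ema's production settings:

```python
import math

def desired_replicas(current: int, inflight: int,
                     per_replica_capacity: int = 24, headroom: float = 1.2) -> int:
    """Scale on in-flight requests with headroom (illustrative parameters)."""
    target = math.ceil(inflight * headroom / per_replica_capacity)
    return max(target, current)  # scale up eagerly; scale-down uses a separate cooldown path
```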

Collectively, these upgrades lifted GPU utilization, eliminated sequential head-of-line blocking, and cut copy/compute bubbles, translating directly into lower p50 latency and steadier p95 under load.

Results

  • p50 end-to-end latency: 10s → 0.9s (–91%)
  • Scale-up (cold-to-ready): 6 min → <1 min (–85%)
  • Combined router latency (1B + 8B): <900 ms end-to-end

With Simplismart, Ema’s routing layer transformed from a sequential system into a high-speed, adaptive engine. Model selection now feels instant with sub-second routing across a multi-model stack, even under enterprise-scale load.

Engineering teams can deploy new routing variants without latency regressions or scale-up lag. GPU clusters stay fully utilized, autoscaling reacts within seconds, and users experience truly real-time agentic AI.

Simplismart helped Ema meet its SLAs and achieve true real-time performance.

“Simplismart’s team partnered with us to optimize every layer, from batching and pipeline concurrency to CUDA-level inference and fast autoscaling. Their engineering support made sub-second routing a reality.”

Soham Shah

Founding ML Engineer, Ema

Find out what tailor-made inference looks like for you.