Company Background
Ema is building an enterprise-facing AI agent platform that plugs into internal systems and gives customers the flexibility to pick from 12+ foundation models (e.g., GPT, DeepSeek, and others) for downstream tasks. On top of the foundation layer, Ema runs a proprietary intelligence layer composed of two custom LLMs (~1B and ~8B parameters) that act as routing models: they parse enterprise queries, apply a reward signal, and decide which model in the stack should process each request. This architecture lets Ema deliver both control and performance for complex, multi-tool enterprise flows.
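To make the routing idea concrete, here is a hedged sketch of what model selection looks like in miniature. The `score` callback stands in for the router LLMs' reward signal, and the candidate names are placeholders; this is an illustration of the pattern, not Ema's implementation.

```python
from typing import Callable

def route(query: str,
          candidates: list[str],
          score: Callable[[str, str], float]) -> str:
    """Pick the foundation model whose routing score is highest for
    this query. `score` stands in for the router LLMs' reward signal."""
    return max(candidates, key=lambda model: score(query, model))

# Toy usage with a placeholder scorer (here: prefer longer model names).
models = ["gpt-4", "deepseek", "in-house-8b"]
print(route("Summarize this contract.", models, lambda q, m: float(len(m))))
```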
The Problem
Because Ema’s routing LLMs only allocate work (they don’t generate final responses), their latency budget was effectively zero: every millisecond spent on routing was pure overhead on top of the downstream model’s own generation time, and degraded perceived performance and user trust. In production, however, the team observed:
- ~3.5s latency for the 1B router
- ~6.5s latency for the 8B router
- Running the two routers sequentially pushed end-to-end latency beyond 10s, missing enterprise SLAs that required <1s model selection.
- Compounding the issue, scale-up time for new workloads was about 6 minutes, leaving the system poorly equipped to absorb traffic spikes.
Solution
To break through the latency wall without sacrificing accuracy or routing quality, Simplismart partnered with Ema on an optimization plan spanning three layers (serving, inference, and scaling), with careful change isolation and A/B validation at every step.
1) Serving-layer throughput: continuous batching
- Implemented continuous batching to keep GPUs hot, reduce idle gaps between microbatches, and smooth out queueing variance under bursty traffic.
- Batched work across similar sequence lengths to avoid tail-latency cliffs (see the sketch after this list).
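Below is a minimal, hedged sketch of the continuous-batching loop. The `model_step` callable is a hypothetical stand-in for one batched decode step on the GPU (returning the requests that finished); it illustrates the technique, not Simplismart's serving code.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list                  # tokenized routing prompt
    output_ids: list = field(default_factory=list)

class ContinuousBatcher:
    def __init__(self, model_step, max_batch: int = 32):
        self.model_step = model_step  # runs ONE decode step for a batch,
        self.max_batch = max_batch    # returning the list of finished requests
        self.waiting = deque()
        self.active = []              # requests currently in flight on the GPU

    def submit(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Refill freed batch slots on every iteration, so the GPU never
        # drains between fixed batches ("keep GPUs hot").
        while self.waiting and len(self.active) < self.max_batch:
            self.active.append(self.waiting.popleft())
        if not self.active:
            return
        # Ordering by similar prompt length limits padding waste and the
        # tail-latency cliffs mentioned above.
        self.active.sort(key=lambda r: len(r.prompt_ids))
        finished = self.model_step(self.active)
        # Evict finished sequences immediately: short requests never
        # wait behind long ones in the same fixed batch.
        self.active = [r for r in self.active if r not in finished]
```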
2) Pipeline concurrency: overlap scheduling
- Introduced overlap scheduling to parallelize CPU preprocessing, GPU compute, and host↔device transfers, minimizing stalls.
- Re-sequenced the 1B/8B router duo to remove unnecessary sequential steps, cutting end-to-end routing time even when both models are consulted (illustrated in the sketch below).
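As a hedged illustration of the overlap idea (not the actual scheduler), the sketch below pipelines three stages with threads and bounded queues, so request N+1 is being preprocessed on the CPU while request N runs on the GPU. The `preprocess`, `infer`, and `postprocess` callables are hypothetical stand-ins.

```python
import threading
import queue

def pipeline(requests, preprocess, infer, postprocess, depth: int = 4):
    """Overlap CPU preprocessing, GPU compute, and result handling by
    giving each stage its own thread, linked by bounded queues for
    backpressure, so no stage stalls waiting on the one before it."""
    pre_q, gpu_q, results = queue.Queue(depth), queue.Queue(depth), []

    def stage_pre():
        for r in requests:
            pre_q.put(preprocess(r))     # CPU work (tokenization etc.)
        pre_q.put(None)                  # sentinel: stream is finished

    def stage_gpu():
        while (item := pre_q.get()) is not None:
            gpu_q.put(infer(item))       # GPU forward pass
        gpu_q.put(None)

    def stage_post():
        while (item := gpu_q.get()) is not None:
            results.append(postprocess(item))

    threads = [threading.Thread(target=fn)
               for fn in (stage_pre, stage_gpu, stage_post)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Toy usage with identity-ish stages:
print(pipeline(range(5), lambda x: x, lambda x: x * 2, lambda x: x + 1))
```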
3) Inference-kernel & graph compilation improvements
- Applied CUDA kernel and compilation optimizations purpose-built for LLM inference paths used by the routers.
- Tuned launch configurations and operator fusion points to raise effective SM occupancy under real-world prompt/sequence distributions (see the sketch below).
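One common way to get this class of win in PyTorch-based stacks is `torch.compile` with CUDA-graph capture; the sketch below shows that pattern on a placeholder model. This is an assumption about tooling for illustration only, not the specific kernels Simplismart shipped.

```python
import torch

# Placeholder two-layer model standing in for a router LLM's hot path.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda().half().eval()

# mode="reduce-overhead" replays steady-state iterations as CUDA graphs,
# amortizing kernel-launch cost; operator fusion comes from the Inductor
# backend. fullgraph=True fails loudly if anything falls back to eager.
compiled = torch.compile(model, mode="reduce-overhead", fullgraph=True)

with torch.inference_mode():
    x = torch.randn(8, 4096, device="cuda", dtype=torch.half)
    for _ in range(3):        # warm-up iterations trigger compile + capture
        compiled(x)
    y = compiled(x)           # steady state: fused, graph-replayed kernels
```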
4) Fast elasticity: scale-time reduction
- Removed pipeline bottlenecks and tuned autoscaling triggers/limits so the system could scale in under one minute without cold-start spikes (a reactive-scaling sketch follows).
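A hedged sketch of the reactive-scaling idea: size the fleet from queue depth (a leading signal) rather than GPU utilization (a lagging one), so capacity is added before latency degrades. All three callbacks are hypothetical hooks into the orchestrator, not a real API.

```python
import time

def autoscale_loop(get_queue_depth, get_replicas, set_replicas,
                   target_per_replica: int = 4,
                   min_r: int = 1, max_r: int = 16,
                   cooldown_s: float = 15.0):
    """Reactive autoscaler sketch driven by request-queue depth.
    The three callbacks are hypothetical orchestrator hooks."""
    last_change = 0.0
    while True:
        depth = get_queue_depth()
        replicas = get_replicas()
        # Ceiling division: enough replicas to keep per-replica queues short.
        desired = max(min_r, min(max_r, -(-depth // target_per_replica)))
        # Scale up immediately; scale down only after a cooldown,
        # so bursty traffic doesn't cause thrashing.
        if desired > replicas or (
            desired < replicas and time.time() - last_change > cooldown_s
        ):
            set_replicas(desired)
            last_change = time.time()
        time.sleep(1.0)
```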
Collectively, these upgrades lifted GPU utilization, eliminated sequential head-of-line blocking, and cut copy/compute bubbles, translating directly into lower p50 latency and steadier p95 under load.
Results
- p50 end-to-end latency: 10s → 0.9s (–91%)
- Scale-up (cold-to-ready): 6 min → <1 min (–85%)
- Combined router latency (1B + 8B): <900 ms end-to-end
With Simplismart, Ema’s routing layer transformed from a sequential system into a high-speed, adaptive engine. Model selection now feels instant with sub-second routing across a multi-model stack, even under enterprise-scale load.
Engineering teams can deploy new routing variants without latency regressions or scale-up lag. GPU clusters stay fully utilized, autoscaling reacts within seconds, and users experience truly real-time agentic AI.
“Simplismart helped Ema meet SLAs and achieve true real-time performance.”
Soham Shah
Founding ML Engineer, Ema

