Company Background
Ema is building an enterprise-facing AI agent platform that plugs into internal systems and gives customers the flexibility to pick from 12+ foundation models (e.g., GPT, DeepSeek, and others) for downstream tasks. On top of the foundation layer, Ema runs a proprietary intelligence layer composed of two custom LLMs (~1B and ~8B parameters) that act as routing models: they parse enterprise queries, apply a reward signal, and decide which model in the stack should process each request. This architecture lets Ema deliver both control and performance for complex, multi-tool enterprise flows.
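To make the routing idea concrete, here is a hedged sketch of what model selection looks like in miniature. The `score` callback stands in for the router LLMs' reward signal, and the candidate names are placeholders; this is an illustration of the pattern, not Ema's implementation.

```python
from typing import Callable

def route(query: str,
          candidates: list[str],
          score: Callable[[str, str], float]) -> str:
    """Pick the foundation model whose routing score is highest for
    this query. `score` stands in for the router LLMs' reward signal."""
    return max(candidates, key=lambda model: score(query, model))

# Toy usage with a placeholder scorer (here: prefer longer model names).
models = ["gpt-4", "deepseek", "in-house-8b"]
print(route("Summarize this contract.", models, lambda q, m: float(len(m))))
```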
The Problem
Because Ema’s routing LLMs only allocate work (they don’t generate final responses), their latency budget was effectively zero: every millisecond spent on routing was pure overhead on top of the downstream model’s own generation time, and degraded perceived performance and user trust. In production, however, the team observed:
- ~3.5s latency for the 1B router
- ~6.5s latency for the 8B router
- Running the two routers sequentially pushed end-to-end latency beyond 10s, missing enterprise SLAs that required <1s model selection.
- Compounding the issue, scale-up time for new workloads was about 6 minutes, leaving the system poorly equipped to absorb traffic spikes.
Solution
To break through the latency wall without sacrificing accuracy or routing quality, Simplismart partnered with Ema on an optimization plan spanning three layers (serving, inference, and scaling), with careful change isolation and A/B validation at every step.
1) Serving-layer throughput: continuous batching
- Implemented continuous batching to keep GPUs hot, reduce idle gaps between microbatches, and smooth out queueing variance under bursty traffic.
- Batched work across similar sequence lengths to avoid tail-latency cliffs (see the sketch after this list).
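Below is a minimal, hedged sketch of the continuous-batching loop. The `model_step` callable is a hypothetical stand-in for one batched decode step on the GPU (returning the requests that finished); it illustrates the technique, not Simplismart's serving code.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list                  # tokenized routing prompt
    output_ids: list = field(default_factory=list)

class ContinuousBatcher:
    def __init__(self, model_step, max_batch: int = 32):
        self.model_step = model_step  # runs ONE decode step for a batch,
        self.max_batch = max_batch    # returning the list of finished requests
        self.waiting = deque()
        self.active = []              # requests currently in flight on the GPU

    def submit(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Refill freed batch slots on every iteration, so the GPU never
        # drains between fixed batches ("keep GPUs hot").
        while self.waiting and len(self.active) < self.max_batch:
            self.active.append(self.waiting.popleft())
        if not self.active:
            return
        # Ordering by similar prompt length limits padding waste and the
        # tail-latency cliffs mentioned above.
        self.active.sort(key=lambda r: len(r.prompt_ids))
        finished = self.model_step(self.active)
        # Evict finished sequences immediately: short requests never
        # wait behind long ones in the same fixed batch.
        self.active = [r for r in self.active if r not in finished]
```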
2) Pipeline concurrency: overlap scheduling
- Introduced overlap scheduling to parallelize CPU preprocessing, GPU compute, and host↔device transfers, minimizing stalls.
- Re-sequenced the 1B/8B router duo to remove unnecessary sequential steps, cutting end-to-end routing time even when both models are consulted (illustrated in the sketch below).
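As a hedged illustration of the overlap idea (not the actual scheduler), the sketch below pipelines three stages with threads and bounded queues, so request N+1 is being preprocessed on the CPU while request N runs on the GPU. The `preprocess`, `infer`, and `postprocess` callables are hypothetical stand-ins.

```python
import threading
import queue

def pipeline(requests, preprocess, infer, postprocess, depth: int = 4):
    """Overlap CPU preprocessing, GPU compute, and result handling by
    giving each stage its own thread, linked by bounded queues for
    backpressure, so no stage stalls waiting on the one before it."""
    pre_q, gpu_q, results = queue.Queue(depth), queue.Queue(depth), []

    def stage_pre():
        for r in requests:
            pre_q.put(preprocess(r))     # CPU work (tokenization etc.)
        pre_q.put(None)                  # sentinel: stream is finished

    def stage_gpu():
        while (item := pre_q.get()) is not None:
            gpu_q.put(infer(item))       # GPU forward pass
        gpu_q.put(None)

    def stage_post():
        while (item := gpu_q.get()) is not None:
            results.append(postprocess(item))

    threads = [threading.Thread(target=fn)
               for fn in (stage_pre, stage_gpu, stage_post)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Toy usage with identity-ish stages:
print(pipeline(range(5), lambda x: x, lambda x: x * 2, lambda x: x + 1))
```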
3) Inference-kernel & graph compilation improvements
- Applied CUDA kernel and compilation optimizations purpose-built for LLM inference paths used by the routers.
- Tuned launch configurations and operator fusion points to raise effective SM occupancy under real-world prompt/sequence distributions (see the sketch below).
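One common way to get this class of win in PyTorch-based stacks is `torch.compile` with CUDA-graph capture; the sketch below shows that pattern on a placeholder model. This is an assumption about tooling for illustration only, not the specific kernels Simplismart shipped.

```python
import torch

# Placeholder two-layer model standing in for a router LLM's hot path.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda().half().eval()

# mode="reduce-overhead" replays steady-state iterations as CUDA graphs,
# amortizing kernel-launch cost; operator fusion comes from the Inductor
# backend. fullgraph=True fails loudly if anything falls back to eager.
compiled = torch.compile(model, mode="reduce-overhead", fullgraph=True)

with torch.inference_mode():
    x = torch.randn(8, 4096, device="cuda", dtype=torch.half)
    for _ in range(3):        # warm-up iterations trigger compile + capture
        compiled(x)
    y = compiled(x)           # steady state: fused, graph-replayed kernels
```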
4) Fast elasticity: scale-time reduction
- Removed pipeline bottlenecks and tuned autoscaling triggers/limits so the system could scale in under one minute without cold-start spikes (a reactive-scaling sketch follows).
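A hedged sketch of the reactive-scaling idea: size the fleet from queue depth (a leading signal) rather than GPU utilization (a lagging one), so capacity is added before latency degrades. All three callbacks are hypothetical hooks into the orchestrator, not a real API.

```python
import time

def autoscale_loop(get_queue_depth, get_replicas, set_replicas,
                   target_per_replica: int = 4,
                   min_r: int = 1, max_r: int = 16,
                   cooldown_s: float = 15.0):
    """Reactive autoscaler sketch driven by request-queue depth.
    The three callbacks are hypothetical orchestrator hooks."""
    last_change = 0.0
    while True:
        depth = get_queue_depth()
        replicas = get_replicas()
        # Ceiling division: enough replicas to keep per-replica queues short.
        desired = max(min_r, min(max_r, -(-depth // target_per_replica)))
        # Scale up immediately; scale down only after a cooldown,
        # so bursty traffic doesn't cause thrashing.
        if desired > replicas or (
            desired < replicas and time.time() - last_change > cooldown_s
        ):
            set_replicas(desired)
            last_change = time.time()
        time.sleep(1.0)
```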
Collectively, these upgrades lifted GPU utilization, eliminated sequential head-of-line blocking, and cut copy/compute bubbles, translating directly into lower p50 latency and steadier p95 under load.
Results
- p50 end-to-end latency: 10s → 0.9s (–91%)
- Scale-up (cold-to-ready): 6 min → <1 min (–85%)
- Combined router latency (1B + 8B): <900 ms end-to-end
With Simplismart, Ema’s routing layer transformed from a sequential system into a high-speed, adaptive engine. Model selection now feels instant with sub-second routing across a multi-model stack, even under enterprise-scale load.
Engineering teams can deploy new routing variants without latency regressions or scale-up lag. GPU clusters stay fully utilized, autoscaling reacts within seconds, and users experience truly real-time agentic AI.
“Simplismart helped Ema meet SLAs and achieve true real-time performance.”
Soham Shah
Founding ML Engineer, Ema

