Company Background
Invideo is a global AI video creation platform used by 25M+ creators. As user expectations shifted toward higher-fidelity visuals and instant rendering, the platform's AI pipeline, which spans image enhancement, live portrait animation, and speech recognition, needed to deliver studio-grade quality without exploding compute costs. The company's North Star: consistent, high-quality outputs with fast, predictable render times at massive scale.
The Problem
Invideo’s growth brought bursty, heterogeneous traffic (consumer surges, enterprise batch jobs). The existing pipeline struggled with:
- Latency spikes during peak hours that eroded user trust and SLA compliance.
- High and unpredictable GPU costs, driven by redundant compute and suboptimal utilization.
- Long lead times (roughly two weeks) to take AI features from POC to stable production, slowing iteration velocity.
- Inconsistent image quality that required manual tuning, adding operational drag.
Solution
Partnering with Invideo’s infra and research teams, we executed a production-first optimization program across model execution, memory, and scaling. The approach prioritized measurable latency/cost gains while improving visual consistency.
1) Accelerate the hot path with fused kernels
- Fused GPU kernels replaced multiple small operators in the image/video stages, cutting launch overhead and memory traffic.
- Operator fusion targeted the most time-consuming steps in Clarity Upscaler and Live Portrait to lift throughput without altering outputs (a minimal fusion sketch follows this list).
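Invideo's production kernels are proprietary, so the sketch below is only a rough illustration of the idea: it uses PyTorch's torch.compile to collapse a chain of elementwise post-processing ops into a single kernel. The postprocess function and its constants are hypothetical stand-ins for real enhancement steps.

```python
import torch

def postprocess(x: torch.Tensor) -> torch.Tensor:
    # In eager mode each op below launches its own kernel and round-trips
    # the full frame through GPU global memory.
    x = x * 1.5                   # hypothetical contrast stretch
    x = torch.clamp(x, 0.0, 1.0)  # keep values in displayable range
    return x.pow(1.0 / 2.2)       # gamma correction

# torch.compile traces the chain and emits one fused kernel, removing the
# per-op launch overhead and the intermediate memory traffic.
fused_postprocess = torch.compile(postprocess)

device = "cuda" if torch.cuda.is_available() else "cpu"
frame = torch.rand(1, 3, 1080, 1920, device=device)
out = fused_postprocess(frame)
```

The gain comes from exactly what fusion targets: fewer kernel launches and no intermediate tensors round-tripping through global memory.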
2) Eliminate redundant compute with smart caching
- Introduced feature map / embedding caching between dependent stages so repeated requests and multi-pass enhancements reused intermediate tensors instead of recomputing them (sketched below).
- Result: lower compute per render and steadier tail latency under repeated edits/export attempts.
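The caching layer itself isn't described publicly; a minimal sketch of the pattern, assuming a content-hash key and a process-local store, might look like this:

```python
import hashlib
import torch

# Hypothetical process-local cache; a production version would add LRU
# eviction and size limits so the cache doesn't become a memory problem.
_feature_cache: dict[str, torch.Tensor] = {}

def _cache_key(image: torch.Tensor, stage: str, params: str) -> str:
    # Key on image content plus the stage configuration, so a repeated edit
    # or re-export of the same asset reuses the stored features.
    digest = hashlib.sha256(image.cpu().numpy().tobytes()).hexdigest()
    return f"{stage}:{params}:{digest}"

def encode_with_cache(image, encoder, stage: str, params: str) -> torch.Tensor:
    key = _cache_key(image, stage, params)
    if key not in _feature_cache:
        with torch.no_grad():  # inference only; no autograd graph needed
            _feature_cache[key] = encoder(image)
    return _feature_cache[key]
```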
3) Fit more work per GPU with memory-efficient attention
- Deployed chunked (memory-efficient) attention, reducing peak VRAM and enabling larger effective batch sizes on the same hardware tier (a pure-PyTorch sketch of the chunking idea follows this list).
- Lower memory pressure also reduced OOM retries, often the hidden culprit behind p90/p99 spikes.
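FlashAttention-style fused kernels are the usual production route; the sketch below shows just the chunking idea, computing exact attention over query slices so the full score matrix is never resident at once:

```python
import torch

def chunked_attention(q, k, v, chunk_size: int = 1024) -> torch.Tensor:
    """Exact attention computed over query chunks: the full (seq_q x seq_k)
    score matrix is never materialized, so peak VRAM scales with chunk_size
    rather than seq_q. Shapes: (batch, heads, seq, head_dim)."""
    scale = q.shape[-1] ** -0.5
    outputs = []
    for start in range(0, q.shape[2], chunk_size):
        q_chunk = q[:, :, start:start + chunk_size]  # slice of queries
        scores = torch.matmul(q_chunk, k.transpose(-2, -1)) * scale
        outputs.append(torch.matmul(scores.softmax(dim=-1), v))
    return torch.cat(outputs, dim=2)

# Example: an 8k-token sequence would need an 8k x 8k score matrix per head
# in the naive version, but only 1k x 8k per step here.
q = k = v = torch.randn(1, 8, 8192, 64)
out = chunked_attention(q, k, v)
```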
4) Make scale feel instant with rapid autoscaling
- Tuned autoscaling policies to spin up capacity near-instantly during flash traffic (creator launches, promotions).
- Resource-aware load balancing routed heterogeneous jobs (short/long, light/heavy) to the right replicas to keep utilization high without starving small jobs (a simplified routing sketch follows).
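The autoscaling policies live in the orchestration layer and are platform-specific, but the routing half can be sketched. The best-fit router below is a hypothetical illustration; the replica names and the VRAM-only cost model are assumptions, not Invideo's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    vram_gb: float             # total GPU memory on this replica
    in_flight_gb: float = 0.0  # memory claimed by jobs currently running

    @property
    def free_gb(self) -> float:
        return self.vram_gb - self.in_flight_gb

def route(job_gb: float, replicas: list[Replica]) -> Replica:
    # Best-fit placement: choose the replica whose free VRAM most tightly
    # fits the job, so heavy renders keep access to large replicas and
    # small jobs aren't queued behind them.
    candidates = [r for r in replicas if r.free_gb >= job_gb]
    if not candidates:
        raise RuntimeError("no capacity; signal the autoscaler to add replicas")
    chosen = min(candidates, key=lambda r: r.free_gb - job_gb)
    chosen.in_flight_gb += job_gb
    return chosen

pool = [Replica("a10-1", 24.0), Replica("a10-2", 24.0), Replica("h100-1", 80.0)]
print(route(5.0, pool).name)   # lands on a 24 GB replica, keeping the H100 free
print(route(60.0, pool).name)  # heavy job goes to the H100
```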
5) Engineer for resilience under spikes
- Stress-tested the pipeline for steep traffic ramps, validating stable response times (p50 around 20s, p90 around 40s under synthetic worst-case load) before the full optimization rollout, then re-benchmarked after the changes to confirm tail improvements and SLA headroom (a minimal load-test sketch follows).
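A stress test in this spirit can be as simple as firing concurrent render requests and reading percentiles off the sorted latencies; the endpoint, payload, and concurrency figures below are placeholders rather than the actual harness:

```python
import concurrent.futures
import statistics
import time

import requests

ENDPOINT = "https://render.example.internal/v1/render"  # placeholder URL

def render_once(payload: dict) -> float:
    # Time a single end-to-end render request.
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

def stress(concurrency: int, total: int) -> None:
    # Fire `total` requests with `concurrency` in flight to simulate a ramp,
    # then read percentiles off the sorted latencies.
    jobs = [{"job_id": i} for i in range(total)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(render_once, jobs))
    p50 = statistics.median(latencies)
    p90 = latencies[int(0.9 * len(latencies))]
    print(f"p50={p50:.1f}s  p90={p90:.1f}s  n={len(latencies)}")

if __name__ == "__main__":
    stress(concurrency=64, total=500)
```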
Results
- Latency: 20 seconds → 11 seconds (–45%) at p50, with steadier tails under load.
- Cost: –56% overall serving/GPU costs for the image/video pipeline.
- Quality & consistency: Visual quality became consistently higher and no longer required manual tuning for most presets.
- Reliability: Low-latency, cost-optimized, SLA-compliant performance during peak events.
- Velocity: POC → prod in 3–4 days (down from ~2 weeks), enabling faster iteration on features and A/B tests.
Invideo’s partnership with Simplismart shows how smart infrastructure tuning can change the game for GenAI at scale. With fused kernels, intelligent caching, and memory-efficient scaling, they cut latency, boosted quality, and reduced costs, all without disrupting creative workflows. Today, Invideo delivers studio-grade AI videos faster, more reliably, and at a fraction of the cost.
Shivam R.
Senior Director of Engineering, Invideo

