How Physics Wallah delivers on-time transcription for massive video workloads with Simplismart
Turnaround SLA
<15 min / video
Whisper Large v3
900 RTF
USE CASE
Large-scale transcription of long-form video lectures (ranging from 1–12 hours) with strict, predictable turnaround SLAs
Highlights
  • Advanced batching using VAD for efficient GPU packing
  • Kernel-level optimizations & fused GPU ops for ultra-fast Whisper Large v3 inference
  • Rapid autoscaling + scale-to-zero to manage sinusoidal workloads with cost efficiency

Company Background

Physics Wallah serves millions of students across India with massive volumes of long-form educational video content uploaded throughout the day. Their internal SLA requires that every video, no matter the length, be fully transcribed within 15 minutes of upload to YouTube.


The workload is inherently sinusoidal: during class-ending windows, thousands of hours of content hit the system at once; at other times, traffic falls to near zero. PW needed a system that could:

  1. handle highly bursty workloads without queuing delays,
  2. process extremely long-duration videos (8–12 hours), and
  3. keep costs low during idle periods.

The Problem

PW’s engineering team struggled with four core constraints:

  • Strict SLA: The team had to guarantee <15-minute transcription for any video, long or short.
  • Long inputs: Videos often exceeded 10 hours, requiring scaled and memory-optimized Whisper execution.
  • Highly variable load: When classes ended, thousands of videos arrived together; outside peak hours, load dropped to zero.
  • Cost inefficiency: Traditional Whisper deployments were expensive to run at scale, GPUs sat idle during low-traffic periods, and heavy overprovisioning during spikes drove costs up significantly.

Solution

Simplismart deployed a fully optimized Whisper Large v3 serving pipeline paired with intelligent autoscaling and high-efficiency batching designed specifically for extreme spikes and long-duration audio.

1. High-performance Whisper Large v3 with kernel & fused op optimizations

Simplismart applied custom CUDA kernels, fused GPU operations, and memory-efficient attention paths to accelerate Whisper inference.


This pushed throughput to ~900 RTF, enabling even multi-hour videos to complete within SLA.
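A quick sanity check shows what ~900 RTF (real-time factor, i.e. audio processed ~900x faster than it plays) implies for the SLA. The sketch below is illustrative arithmetic, not Simplismart's internal tooling:

```python
# Back-of-the-envelope check: at ~900x real-time factor (RTF),
# even the longest lectures fit comfortably inside the 15-minute SLA.

def transcription_seconds(video_hours: float, rtf: float = 900.0) -> float:
    """Pure inference time for a video at a given real-time factor."""
    return video_hours * 3600 / rtf

for hours in (1, 8, 12):
    t = transcription_seconds(hours)
    print(f"{hours:>2}h video -> ~{t:.0f}s of inference (SLA margin: {15 * 60 - t:.0f}s)")
```

Even a 12-hour lecture needs only ~48 seconds of raw inference, leaving the rest of the 15-minute window for download, queuing, and post-processing.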

2. VAD-based advanced batching for GPU saturation

PW’s workload involved bursts of long video uploads that required intelligent batching, not basic queue-based processing.


Simplismart used VAD (Voice Activity Detection)–driven batch construction to:

  • Segment long audio into meaningful speech chunks instead of raw time-based splits
  • Pack batches tightly by speech-dense segments, minimizing padding waste
  • Keep GPUs saturated with uniform-length work units for peak throughput
  • Reduce tail latency by preventing long clips from blocking the batch
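The packing idea above can be sketched in a few lines. This is a simplified illustration, not Simplismart's batching code: the segment tuples stand in for the output of a real VAD pass (e.g. a model like Silero VAD), and `batch_size` would be tuned per GPU in practice.

```python
# Sketch of duration-aware batch packing over VAD speech segments.
# Grouping similar-length chunks means each batch pads only to its own
# max length, instead of every chunk padding to the global max.

def pack_batches(segments, batch_size=8):
    """Group speech segments of similar duration to minimize padding.

    segments: list of (start_s, end_s) tuples from a VAD pass.
    """
    ordered = sorted(segments, key=lambda s: s[1] - s[0])  # sort by duration
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_waste(batch):
    """Fraction of compute spent on padding within one batch."""
    durations = [end - start for start, end in batch]
    return 1 - sum(durations) / (max(durations) * len(durations))

# Mix of short and long speech chunks (hypothetical timestamps)
batches = pack_batches([(0, 1), (1, 9), (9, 10.5), (10.5, 18)], batch_size=2)
```

Because short chunks batch with short chunks and long with long, padding waste per batch stays low and GPU utilization stays high.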

3. Async Whisper batch-job execution

To process large videos efficiently and handle sudden spikes, Simplismart ran Whisper inference as asynchronous batch jobs rather than synchronous request-response calls.

This enabled:

  • Non-blocking execution, allowing multiple long jobs to be processed concurrently.
  • Queue-less scheduling, where ready batches run immediately without waiting for slow jobs.
  • Higher GPU occupancy, keeping compute pipelines continuously fed.
  • Elastic scaling, since async jobs distribute cleanly across any number of GPUs.

This execution model was critical in supporting PW’s multi-hour audio workloads and extreme throughput targets.
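The execution model can be sketched with Python's `asyncio`: jobs run concurrently under a concurrency cap, and a fast batch never waits behind a slow one. This is a minimal illustration of the pattern, not the production scheduler; `transcribe_batch` is a stand-in for the real Whisper inference call.

```python
# Minimal sketch of async batch-job execution: ready batches start
# immediately instead of waiting behind slow jobs in a FIFO queue.
import asyncio

async def transcribe_batch(batch_id: int, duration_s: float) -> str:
    await asyncio.sleep(duration_s)  # simulated GPU inference time
    return f"batch-{batch_id} done"

async def run_jobs(jobs, max_concurrency: int = 4):
    sem = asyncio.Semaphore(max_concurrency)  # cap in-flight GPU work

    async def guarded(batch_id, duration_s):
        async with sem:
            return await transcribe_batch(batch_id, duration_s)

    # All jobs progress concurrently; none blocks the others
    return await asyncio.gather(*(guarded(i, d) for i, d in jobs))

results = asyncio.run(run_jobs([(0, 0.02), (1, 0.01), (2, 0.03)]))
```

The semaphore plays the role of available GPU capacity: raising `max_concurrency` (or adding workers across GPUs) scales throughput without changing the job logic.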

4. Rapid autoscaling + scale-to-zero for sinusoidal workloads

Traffic patterns at PW are bursty and unpredictable. Simplismart implemented:

  • Rapid autoscaling to spin up capacity instantly during peaks.
  • Scale-to-zero to eliminate cost when no videos were being uploaded.

This combination ensured PW met SLAs during load bursts and saved significant GPU cost during idle windows.
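A scale-to-zero policy of this shape can be sketched as a simple control function. The thresholds and capacity numbers below are invented for illustration; a real controller would derive them from serving-layer metrics and the SLA budget.

```python
# Illustrative autoscaling decision: replicas follow queue depth,
# and the deployment drops to zero after an idle window.
import math

def desired_replicas(queued_minutes: float, idle_seconds: float,
                     minutes_per_replica: float = 120.0,   # assumed clearance rate
                     idle_timeout_s: float = 300.0,        # assumed idle window
                     max_replicas: int = 32) -> int:
    if queued_minutes == 0:
        # Scale to zero once the idle window elapses; idle GPUs cost nothing
        return 0 if idle_seconds >= idle_timeout_s else 1
    # Each replica clears `minutes_per_replica` of audio within the SLA window
    return min(max_replicas, math.ceil(queued_minutes / minutes_per_replica))
```

A post-class spike of 3,000 queued minutes would fan out to 25 replicas, while five quiet minutes brings the deployment all the way down to zero.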

Results

  • Guaranteed <15 min transcription SLA on all videos, including 10–12 hour lectures
  • ~900 RTF throughput using optimized Whisper Large v3
  • Zero idle cost via scale-to-zero
  • Smooth handling of sinusoidal workloads, no queue accumulation

With Simplismart, Physics Wallah turned a burst-heavy transcription pipeline into a predictable, high-throughput system that meets SLA at any scale. Transcription now completes within 15 minutes even for 10–12 hour videos, with Whisper Large v3 sustaining ~900 RTF and no queue buildup.

Workload spikes are handled smoothly with rapid autoscaling, VAD-optimized batching, and async job execution, while scale-to-zero eliminates GPU waste during idle periods.

The result: reliable on-time transcription, zero idle cost, and a system that scales effortlessly with PW’s growth without compromising performance or budget.

Find out what tailor-made inference can do for you.