Streaming & batch ASR

On-device streaming speech recognition

Speech turns into text as people speak — on the device, with no cloud, no streamed audio, and no waiting for a sentence to finish. Partial transcripts refine as the user talks, so live captions and voice input keep up in real time.

→ the quick → the quick brown → the quick brown fox

Capabilities

What you get

Streaming partials

Partial transcripts refine as you speak — cache-aware streaming, no batch step.

CTC and RNN-T

Both decoders exposed for your latency/accuracy trade-off.

32M FastConformer

Research-grade accuracy running on-device.

WER parity

Word error rate within floating-point noise of the upstream NeMo reference.

Fully on-device

No cloud, no audio leaving the device.

Overview

True streaming, fully on-device

Most "on-device" speech recognition is really batch transcription with a delay. VoxRT streams: cache-aware chunks with an 80 ms attention look-ahead emit partial transcripts as the user speaks, so you can drive live captions, voice input, and barge-in without waiting for an utterance to finish — and without sending audio to a server. It's free for commercial use, with no per-user fees.

The model is NVIDIA NeMo's stt_en_fastconformer_hybrid_medium_streaming_80ms (CC-BY-4.0), ported tensor-by-tensor onto the VoxRT runtime. Accuracy stays within floating-point noise of the upstream Python NeMo reference — research-grade quality in a practical mobile footprint.

How it works

Streaming partials, finalized on endpoint

Open a streaming session and pick a decoder. VoxRT processes cache-aware ~1.1 s chunks with an 80 ms attention look-ahead, emitting partial transcripts that refine in place as the user speaks, then finalizes the line when speech ends — no offline batch pass.

# streaming session
asr.decoder = "rnnt"   # or "ctc"

# streaming partials
 "the quick"
 "the quick brown"
 "the quick brown fox"  (final)

The runtime

The runtime is the product

VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on constrained, low-power hardware.

Streaming ASR is one product on that runtime, alongside VAD and wake word. All three share the same Rust runtime crate and NEON kernel set — the runtime is the product; the models are what it runs.

Performance

Real-time with headroom to spare

Measured on real devices — a 2020 budget Snapdragon 662 and an iPhone 13 Pro Max — not a desktop or a simulator.

  • 0.08–0.10: streaming RTF on iPhone 13 Pro Max (~90 ms per 1.12 s chunk)
  • 0.30: streaming real-time factor on a Snapdragon 662
  • 32M: parameters (FastConformer hybrid)
  • ~90%: of one core free during live transcription (at RTF ≈ 0.10)
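
For readers checking the arithmetic, real-time factor is processing time divided by audio duration, and the share of a core left free is one minus that. The sketch below uses the iPhone figures quoted above; the function names are illustrative, and it assumes the work runs on a single core.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent computing / duration of audio covered. Below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def core_headroom(rtf: float) -> float:
    """Fraction of one core left idle, assuming the work runs on a single core."""
    return 1.0 - rtf


# ~90 ms of compute per 1.12 s chunk (iPhone 13 Pro Max figure from this page)
print(f"{real_time_factor(0.090, 1.12):.3f}")  # 0.080
print(f"{core_headroom(0.10):.0%}")            # 90%
```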

Model quality

Pick your accuracy/latency trade-off

Word error rate on LibriSpeech test-clean, within floating-point noise of the upstream Python NeMo reference. Two decoders are exposed.

Decoder               LibriSpeech test-clean WER   Per-chunk cost   Notes
RNN-T (recommended)   3.267%                       ~50 ms           Higher accuracy; LSTM state survives chunk boundaries
CTC                   4.895%                       ~5 ms            ~15% cheaper per chunk; marginally lower accuracy
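
If you want to sanity-check numbers like these on your own transcripts, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a generic implementation of that definition, not VoxRT's or NeMo's scoring script. It assumes whitespace tokenization and case-sensitive matching; benchmark scoring typically also normalizes text, which this omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```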

Footprint

What it costs on disk and in memory

  • Swift wrapper source: ~20 KB
  • Native xcframework, compressed (device slice): ~5 MB
  • Streaming model on disk (fp16): ~61 MB
  • Native heap at runtime (steady-state): ~150 MB

The honest trade-off: the v1 model holds ~150 MB of native heap at steady state. If you need ultra-low-memory ASR, ask us about the int8 roadmap (v0.2 work). v1 is also English-only today; additional languages are on the roadmap and prioritized by customer demand.
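
As a back-of-the-envelope check on the on-disk figure, weight storage is roughly parameter count times bytes per parameter. The sketch below assumes exactly 32 million parameters and reads "MB" as 2^20 bytes. The int8 line is plain arithmetic for weights only, not a measured or promised VoxRT number, and it says nothing about runtime heap.

```python
def weights_size_mib(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in MiB (2**20 bytes), ignoring file overhead."""
    return num_params * bytes_per_param / 2**20


PARAMS = 32_000_000  # assumption: the "32M" on this page taken as exactly 32 million

print(f"fp16: {weights_size_mib(PARAMS, 2):.1f} MiB")  # fp16: 61.0 MiB
print(f"int8: {weights_size_mib(PARAMS, 1):.1f} MiB")  # int8: 30.5 MiB (arithmetic only)
```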

How it compares

Why teams choose VoxRT over generic on-device ASR

Whisper.cpp is popular, but its realistic mobile variants are either less accurate or too heavy to stream on cheap Android. Vosk's small mobile model is materially less accurate. Picovoice Cheetah is closed commercial software that publishes desktop benchmarks; VoxRT publishes measured mobile RTF on a Snapdragon 662 and an iPhone. Sherpa-onnx's central docs don't publish WER and RTF figures you can evaluate against.

VoxRT gives you a measured mobile path: 3.27% WER on LibriSpeech test-clean, 0.30 RTF on a Snapdragon 662, and true cache-aware streaming — and, unlike most on-device ASR pages, that real-time factor is measured on cheap Android hardware, not just desktop or a flagship demo.

FAQ

On-device speech recognition, answered

Is it really streaming, or batch with a delay?

Really streaming: the model processes cache-aware chunks with an 80 ms attention look-ahead and refines partial transcripts as the user speaks, finalizing when speech ends — no offline batch pass, and no audio ever leaves the device.

How accurate is VoxRT's on-device speech recognition?

3.27% word-error rate on LibriSpeech test-clean with the RNN-T decoder (4.90% with the cheaper CTC decoder) — within floating-point noise of the upstream NVIDIA NeMo reference, and the lowest published WER in the on-device field we've surveyed. See the full comparison.

What languages are supported?

English (with punctuation and capitalization) at v1. Additional languages are on the roadmap and customer-driven — talk to us if you need one.

How much memory does it use?

The model is ~61 MB on disk (fp16) and holds ~150 MB of native heap at steady state. If you need ultra-low-memory ASR, ask us about the int8 roadmap.

Is it free for commercial use?

Yes. The SDK wrappers are Apache-2.0, the model weights are CC-BY-4.0 (NVIDIA attribution), and the VoxRT runtime is free for commercial use in production with no per-user fees. Custom domain models, additional languages, and OEM or bulk-device deployments are paid engagements. See licensing.

Build it on-device with VoxRT

Tell us what you're transcribing and which devices it has to run on.
