On-device voice activity detection
Know the instant a person starts and stops speaking — the building block that makes every voice feature faster and more battery-friendly, running entirely on the device. Under the hood: the proven Silero v5 model on the VoxRT runtime, at about 1.7 MB of app-size impact.
What you get
Per-frame decisions
Streaming, low-latency speech / no-speech on every audio frame.
Proven model
Silero v5 weights, MIT-licensed, running without ONNX Runtime or PyTorch Mobile.
Tiny footprint
~1.7 MB net app-size impact; runs comfortably even on modest, low-power hardware.
Power gate
Gates wake-word and keyword-spotting models so they only run on speech.
Fully on-device
No network, no microphone audio leaving the device.
The foundational voice primitive
Voice activity detection tells you, frame by frame, whether someone is speaking. It's the building block the rest of a voice pipeline sits on: it gates wake-word and keyword-spotting models so they only run when there's speech, drives barge-in and interruption logic, and stands alone for record-trimming and turn-taking.
VoxRT ships the well-known Silero v5 weights (MIT-licensed) on its own from-scratch inference runtime — the accuracy of a proven model with a tiny binary footprint and no heavyweight ML framework dependency. It's free for commercial use, with no per-user fees.
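The gating role described above can be sketched in a few lines. This is a minimal illustration, not the VoxRT API: the `SpeechGate` class, the 0.5 threshold, and the hangover length are all hypothetical names and values chosen for the example. It forwards a frame to a downstream model (a wake-word detector, say) only while the VAD reports speech, plus a short hangover so trailing syllables are not cut off.

```python
from typing import Any, Callable


class SpeechGate:
    """Hypothetical gate: forward frames downstream only around detected speech."""

    def __init__(self, downstream: Callable[[Any], None],
                 threshold: float = 0.5, hangover_frames: int = 8) -> None:
        self.downstream = downstream            # e.g. a wake-word model's process()
        self.threshold = threshold              # illustrative value, not a VoxRT default
        self.hangover_frames = hangover_frames  # extra frames forwarded after speech
        self._frames_left = 0
        self.forwarded = 0
        self.skipped = 0

    def push(self, frame: Any, speech_probability: float) -> bool:
        """Return True if the frame was forwarded to the downstream model."""
        if speech_probability >= self.threshold:
            # Speech frame: forward it and re-arm the hangover counter.
            self._frames_left = self.hangover_frames
        elif self._frames_left > 0:
            # Non-speech frame inside the hangover window: still forward.
            self._frames_left -= 1
        else:
            # Silence outside the hangover window: the downstream model stays idle.
            self.skipped += 1
            return False
        self.downstream(frame)
        self.forwarded += 1
        return True


if __name__ == "__main__":
    seen = []
    gate = SpeechGate(seen.append, hangover_frames=2)
    for index, probability in enumerate([0.1, 0.9, 0.1, 0.1, 0.1]):
        gate.push(index, probability)
    print(seen, gate.forwarded, gate.skipped)
```

The `skipped` counter is where the battery saving comes from: every skipped frame is one the heavier downstream model never has to process.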
Per-frame speech, segment events
Feed 32 ms frames at 16 kHz (512 samples per frame). VAD returns a speech probability on every frame and emits speech-start / speech-end events you can use to gate the rest of your pipeline: wake word, keyword spotting, or recording.
# 16 kHz, 32 ms frames
vad.process(frame)
→ { speech: true, probability: 0.98 }

# segment events
→ speech_start @ 0.42s
→ speech_end @ 2.15s
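To make the frame arithmetic and the segment events concrete, here is a rough sketch of how audio could be cut into 32 ms frames and how start/end events could be derived from per-frame probabilities. It is not how VoxRT computes its events: `split_frames`, `segment_events`, the two thresholds, and the minimum-silence count are assumptions made for illustration, using a common hysteresis scheme.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 32
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 16 000 * 0.032 = 512 samples


def split_frames(samples: Sequence[float]) -> Iterator[Sequence[float]]:
    """Yield consecutive full 512-sample frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        yield samples[start:start + FRAME_SAMPLES]


@dataclass
class SegmentEvent:
    kind: str      # "speech_start" or "speech_end"
    time_s: float  # start time of the frame where the state changed


def segment_events(probabilities: Iterable[float],
                   start_threshold: float = 0.5,   # illustrative, not a VoxRT default
                   end_threshold: float = 0.35,    # lower than start: hysteresis
                   min_silence_frames: int = 3) -> List[SegmentEvent]:
    """Turn per-frame speech probabilities into speech_start / speech_end events."""
    events: List[SegmentEvent] = []
    in_speech = False
    silent_run = 0
    for index, probability in enumerate(probabilities):
        if not in_speech:
            if probability >= start_threshold:
                in_speech = True
                silent_run = 0
                events.append(SegmentEvent("speech_start", index * FRAME_MS / 1000))
        elif probability < end_threshold:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Enough consecutive quiet frames: report the end at the
                # first frame of the silent run.
                first_silent = index - silent_run + 1
                events.append(SegmentEvent("speech_end", first_silent * FRAME_MS / 1000))
                in_speech = False
                silent_run = 0
        else:
            # Still speech (or in the hysteresis band): reset the silence counter.
            silent_run = 0
    return events


if __name__ == "__main__":
    for event in segment_events([0.1, 0.1, 0.9, 0.95, 0.9, 0.2, 0.1, 0.1, 0.1]):
        print(f"{event.kind} @ {event.time_s:.3f}s")
```

Using a lower threshold to end a segment than to start one keeps a probability that hovers near the boundary from producing a burst of start/end events.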
The runtime is the product
VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on constrained, low-power hardware.
VAD is the free, open-source showcase of that runtime: it runs Silero v5 with state-of-the-art per-frame latency and sits alongside the wake-word and streaming ASR SDKs. All three share the same Rust runtime crate and NEON kernel set.
Negligible cost per stream
About 1.7 MB in your app
- Swift wrapper source: ~17 KB
- Native xcframework, compressed (device slice): ~500 KB
- Silero VAD weights (fp16): 1.2 MB
- Net app-size impact: ~1.7 MB
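As a quick check of the breakdown above, the three components can be summed directly. The sketch assumes decimal units (1 MB = 1000 KB) and treats the approximate figures as exact, so the result is only as precise as the rounded inputs.

```python
# Component sizes from the breakdown above, in KB (assuming 1 MB = 1000 KB).
COMPONENTS_KB = {
    "Swift wrapper source": 17,
    "Native xcframework, compressed (device slice)": 500,
    "Silero VAD weights (fp16)": 1200,  # listed as 1.2 MB
}

total_kb = sum(COMPONENTS_KB.values())  # 17 + 500 + 1200 = 1717 KB
total_mb = total_kb / 1000

print(f"Net app-size impact: ~{total_mb:.1f} MB")  # prints "Net app-size impact: ~1.7 MB"
```

The model weights account for roughly 70% of the total; the runtime binary and the wrapper together come to about half a megabyte.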
On-device VAD, answered
What is voice activity detection used for?
Anything that needs to know when speech starts and stops: gating a wake word or transcriber so it only runs on speech, trimming silence from recordings, driving turn-taking and barge-in in voice assistants, and keeping always-on features battery-friendly.
Does any audio leave the device?
No. All processing happens locally on the device's CPU — no network connection is needed and no microphone audio is ever uploaded.
What platforms are supported?
iOS 16+ (Swift Package) and Android 8.0+ / API 26 (Gradle via JitPack) today, on the same runtime that powers the wake-word and ASR SDKs.
Is VoxRT VAD free for commercial use?
Yes. The SDK wrappers are Apache-2.0, the Silero v5 weights are MIT, and the VoxRT runtime is free for commercial use in production with no per-user fees. If you need more than the published model, paid engagements cover custom work — tuned models, additional platforms, and OEM or bulk-device deployments. See licensing.
Which Silero version do I get?
Silero v5 — the version with the published ROC-AUC scores. Silero has since released v6 upstream; VoxRT tests upstream releases before porting, and because the weights ride on the VoxRT runtime, moving versions is a runtime update rather than a re-integration on your side.