Voice Activity Detection

On-device voice activity detection

Know the instant a person starts and stops speaking — the building block that makes every voice feature faster and more battery-friendly, running entirely on the device. Under the hood: the proven Silero v5 model on the VoxRT runtime, at about 1.7 MB of app-size impact.


Capabilities

What you get

Per-frame decisions

Streaming, low-latency speech / no-speech on every audio frame.

Proven model

Silero v5 weights, MIT-licensed, running without ONNX Runtime or PyTorch Mobile.

Tiny footprint

~1.7 MB net app-size impact; runs comfortably even on modest, low-power hardware.

Power gate

Gates wake-word and keyword-spotting models so they only run on speech.

Fully on-device

No network, no microphone audio leaving the device.

Overview

The foundational voice primitive

Voice activity detection tells you, frame by frame, whether someone is speaking. It's the building block the rest of a voice pipeline sits on: it gates wake-word and keyword-spotting models so they only run when there's speech, drives barge-in and interruption logic, and stands alone for record-trimming and turn-taking.
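The gating pattern described above can be sketched in a few lines. This is a minimal illustration and not the VoxRT API: `is_speech` is a hypothetical stand-in for a per-frame VAD decision, `expensive_model` is a hypothetical stand-in for a wake-word or keyword-spotting model, and the toy amplitude check exists only so the sketch runs on its own.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]


def gate(
    frames: Iterable[Frame],
    is_speech: Callable[[Frame], bool],       # hypothetical VAD decision
    expensive_model: Callable[[Frame], str],  # hypothetical downstream model
) -> List[str]:
    """Run the downstream model only on frames the VAD marks as speech."""
    results = []
    for frame in frames:
        if is_speech(frame):
            results.append(expensive_model(frame))
    return results


# Toy stand-ins so the sketch is self-contained.
calls = []


def toy_vad(frame: Frame) -> bool:
    # Crude amplitude check; a real VAD is a trained model, not a threshold.
    return max(abs(sample) for sample in frame) > 0.1


def toy_model(frame: Frame) -> str:
    calls.append(frame)
    return "scored"


silence = [0.0] * 512
speech = [0.5] * 512
outputs = gate([silence, speech, speech, silence], toy_vad, toy_model)
print(len(outputs))  # 2: the model ran on the two speech frames only
```

The point of the design is that the cheap check runs on every frame while the costly model runs only when there is something to hear.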

VoxRT ships the well-known Silero v5 weights (MIT-licensed) on its own from-scratch inference runtime — the accuracy of a proven model with a tiny binary footprint and no heavyweight ML framework dependency. It's free for commercial use, with no per-user fees.

How it works

Per-frame speech, segment events

Feed 32 ms frames of 16 kHz audio (512 samples each). VAD returns a speech probability for every frame and emits speech-start / speech-end events that you can use to gate the rest of your pipeline: wake word, keyword spotting, or recording.

# 16 kHz, 32 ms frames
vad.process(frame) → {
  speech: true,
  probability: 0.98
}

# segment events
→ speech_start  @ 0.42s
→ speech_end    @ 2.15s
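To make the relationship between per-frame probabilities and segment events concrete, here is a rough sketch of one common way to derive them: a two-threshold (hysteresis) state machine. It is not VoxRT's implementation. The function names and both threshold values are assumptions chosen for illustration; only the 16 kHz sample rate and 32 ms frame length come from the text above.

```python
from typing import Iterator, List, Sequence, Tuple

SAMPLE_RATE = 16_000
FRAME_MS = 32
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 512 samples per frame


def split_frames(samples: Sequence[float]) -> Iterator[Sequence[float]]:
    """Yield consecutive full frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        yield samples[start:start + FRAME_SAMPLES]


def segment_events(
    probabilities: Sequence[float],
    start_threshold: float = 0.5,   # assumed value, not from VoxRT
    end_threshold: float = 0.35,    # assumed value, not from VoxRT
) -> List[Tuple[str, float]]:
    """Turn per-frame probabilities into (event, time_in_seconds) pairs."""
    events = []
    in_speech = False
    for index, probability in enumerate(probabilities):
        time_s = index * FRAME_MS / 1000  # start time of this frame
        if not in_speech and probability >= start_threshold:
            in_speech = True
            events.append(("speech_start", time_s))
        elif in_speech and probability < end_threshold:
            in_speech = False
            events.append(("speech_end", time_s))
    return events


probs = [0.02, 0.10, 0.91, 0.98, 0.45, 0.97, 0.20, 0.03]
print(segment_events(probs))
# [('speech_start', 0.064), ('speech_end', 0.192)]
```

Using a lower threshold to end a segment than to start one keeps a brief dip (the 0.45 frame here) from splitting a single utterance in two.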

The runtime

The runtime is the product

VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on constrained, low-power hardware.

VAD is the free, open-source showcase of that runtime: it runs Silero v5 at about 0.6 ms per 32 ms frame on an iPhone 13 Pro Max. It sits alongside wake word and streaming ASR, and all three share the same Rust runtime crate and NEON kernel set.

Performance

Negligible cost per stream

1.85%

real-time factor on iPhone 13 Pro Max (~0.6 ms of compute per 32 ms frame)

3.05%

real-time factor on a Snapdragon 662

~54

parallel VAD streams on a single core

~1.7 MB

net app-size impact
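For readers unfamiliar with real-time factor, the arithmetic behind the figures above is short: it is processing time divided by the duration of audio processed, and its reciprocal gives an idealised ceiling on concurrent streams per core. The sketch below is only that arithmetic. The helper names are made up for illustration, the ceiling ignores scheduling and memory overhead, and the Snapdragon 662 stream count is derived here rather than a published figure.

```python
import math


def real_time_factor(processing_ms: float, frame_ms: float = 32.0) -> float:
    """Processing time divided by the audio duration it covers."""
    return processing_ms / frame_ms


def max_streams_per_core(rtf: float) -> int:
    """Idealised ceiling: streams that fit in one core's real-time budget."""
    return math.floor(1.0 / rtf)


print(round(0.0185 * 32.0, 3))             # 0.592 ms per 32 ms frame
print(round(real_time_factor(0.592), 4))   # 0.0185, i.e. 1.85%
print(max_streams_per_core(0.0185))        # 54
print(max_streams_per_core(0.0305))        # 32 (derived, not a published number)
```

The ~54 stream figure is consistent with the 1.85% real-time factor measured on the iPhone 13 Pro Max.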

Footprint

About 1.7 MB in your app

Swift wrapper source: ~17 KB
Native xcframework, compressed (device slice): ~500 KB
Silero VAD weights (fp16): 1.2 MB
Net app-size impact: ~1.7 MB

FAQ

On-device VAD, answered

What is voice activity detection used for?

Anything that needs to know when speech starts and stops: gating a wake word or transcriber so it only runs on speech, trimming silence from recordings, driving turn-taking and barge-in in voice assistants, and keeping always-on features battery-friendly.
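As one example, trimming silence from a recording reduces to keeping the span between the first and last speech frames. The sketch below assumes you already have one speech / no-speech flag per frame. The function name and the `padding_frames` parameter are hypothetical and not part of the VoxRT SDK.

```python
from typing import List, Sequence


def trim_silence(
    frames: Sequence[Sequence[float]],
    speech_flags: Sequence[bool],
    padding_frames: int = 1,  # assumed amount of context to keep on each side
) -> List[Sequence[float]]:
    """Keep frames from the first to the last speech frame, plus padding."""
    speech_indices = [i for i, flag in enumerate(speech_flags) if flag]
    if not speech_indices:
        return []  # no speech detected, nothing to keep
    first = max(speech_indices[0] - padding_frames, 0)
    last = min(speech_indices[-1] + padding_frames, len(frames) - 1)
    return list(frames[first:last + 1])


frames = [[float(i)] for i in range(8)]  # toy one-sample "frames"
flags = [False, False, False, True, True, False, False, False]
kept = trim_silence(frames, flags)
print([frame[0] for frame in kept])  # [2.0, 3.0, 4.0, 5.0]
```

A frame or two of padding on each side avoids clipping soft word onsets and trailing consonants.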

Does any audio leave the device?

No. All processing happens locally on the device's CPU — no network connection is needed and no microphone audio is ever uploaded.

What platforms are supported?

iOS 16+ (Swift Package) and Android 8.0+ / API 26 (Gradle via JitPack) today, on the same runtime that powers the wake-word and ASR SDKs.

Is VoxRT VAD free for commercial use?

Yes. The SDK wrappers are Apache-2.0, the Silero v5 weights are MIT, and the VoxRT runtime is free for commercial use in production with no per-user fees. If you need more than the published model, paid engagements cover custom work — tuned models, additional platforms, and OEM or bulk-device deployments. See licensing.

Which Silero version do I get?

Silero v5 — the version with the published ROC-AUC scores. Silero has since released v6 upstream; VoxRT tests upstream releases before porting, and because the weights ride on the VoxRT runtime, moving versions is a runtime update rather than a re-integration on your side.
