On-Device VAD Comparison — VoxRT vs Silero, Cobra & TEN VAD

Q: What is the best on-device VAD engine?

There is no independent third-party benchmark for on-device VAD, so any single "best" claim is vendor-self-reported. The practical choice comes down to license, published mobile performance and footprint. VoxRT ships the proven Silero v5 weights (MIT) on its own runtime, with measured real-time factors of 1.85% on an iPhone 13 Pro Max and 3.05% on a Snapdragon 662, and a license that's free for commercial use with no per-user fees.

Q: What is a good real-time factor (RTF) for on-device VAD?

Real-time factor is the fraction of one CPU core needed to keep up with audio in real time, so lower is better. VoxRT measures 1.85% (RTF 0.0185) on an iPhone 13 Pro Max and 3.05% on a Snapdragon 662 — low enough to run dozens of concurrent VAD streams on a single core.

Q: Am I getting the latest Silero weights?

VoxRT currently ships Silero v5, the version with the published ROC-AUC scores it's benchmarked on. Silero has since released v6 upstream; VoxRT tests new upstream releases before porting them into its runtime, and is evaluating whether and when to move to v6. Because VoxRT packages the weights on its own runtime, moving versions is a runtime update rather than a re-integration on your side.

Q: Is Silero VAD free for commercial use?

Yes. Silero VAD is MIT-licensed for both its code and its weights, so it can be used and redistributed commercially. VoxRT ships these same Silero v5 weights on its own runtime.

Q: How does VoxRT compare to Picovoice Cobra?

Picovoice Cobra is a closed commercial VAD engine with a free plan and a paid tier. It claims the largest AUC in its own three-way benchmark but publishes no concrete F1 or AUC number, and its only real-time-factor figures are on desktop and a Raspberry Pi Zero — not Android or iPhone. VoxRT is free for commercial use with no per-user fees and publishes measured mobile real-time factors.

Q: Is WebRTC VAD good enough?

WebRTC VAD is the lightweight, sub-100 KB, BSD-3-licensed baseline from 2010. It is fast and ubiquitous and fine as a free floor, but it is a generation behind modern neural VAD at separating speech from noise, which is why newer engines benchmark against it.

Why VoxRT

What you get with VoxRT VAD

Per-frame decisions

Streaming, low-latency speech / no-speech on every audio frame.

Proven model

Silero v5 weights, MIT-licensed — the same network many rivals benchmark against.

Tiny footprint

~1.7 MB net app-size impact; runs comfortably even on modest, low-power hardware.

Power gate

Gates wake-word and keyword-spotting models so they only run on speech.

Fully on-device

No network, no microphone audio leaving the device — and no per-use fees.
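
As a rough illustration of the power-gate idea above, the sketch below forwards frames to a downstream model only while the VAD reports speech, plus a short hangover so trailing syllables are not clipped. The `SpeechGate` class, the 0.5 threshold and the 8-frame hangover are illustrative assumptions, not VoxRT's API or defaults.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechGate:
    """Opens on speech and stays open for a few frames afterwards."""
    threshold: float = 0.5     # assumed speech-probability cutoff
    hangover_frames: int = 8   # assumed: ~256 ms of tail at 32 ms frames
    _remaining: int = field(default=0, init=False)

    def update(self, speech_prob: float) -> bool:
        """Return True if this frame should reach the downstream model."""
        if speech_prob >= self.threshold:
            self._remaining = self.hangover_frames
            return True
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False


def gated_frame_count(speech_probs, gate: SpeechGate) -> int:
    """Count the frames a wake-word or keyword model would actually process."""
    return sum(1 for p in speech_probs if gate.update(p))


# Hypothetical per-frame VAD outputs: silence, a short burst of speech, silence.
probs = [0.1] * 10 + [0.9] * 3 + [0.1] * 20
print(gated_frame_count(probs, SpeechGate()), "of", len(probs), "frames forwarded")
```

In this toy trace only the speech burst and its hangover tail reach the downstream model; the rest of the time the more expensive model stays idle.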

Side by side

The comparison table

Comparison of on-device Voice Activity Detection engines — VoxRT, Silero VAD, Picovoice Cobra, WebRTC VAD, TEN VAD and MarbleNet — by vendor-reported accuracy, measured mobile real-time factor on Android and iPhone, footprint and license.
Engine	Accuracy (vendor-reported)	Mobile RTF — Android	Mobile RTF — iPhone	Footprint	License
VoxRT (Silero v5 on the VoxRT runtime)	ROC-AUC 0.94–0.96 ^a (inherits Silero v5)	3.05%, Snapdragon 662 (A73)	1.85%, iPhone 13 Pro Max (A15)	~1.7 MB net app-size impact	Freemium: free for commercial use · no per-user fees
Silero VAD v5 (snakers4 · upstream)	ROC-AUC 0.94–0.96 ^a	Not published ^b	Not published ^b	~2 MB model file (runtime separate)	MIT code + weights
Picovoice Cobra (commercial)	"Largest AUC" claim ^c; no numeric F1/AUC published	Not published ^d	Not published ^d	Not published	Commercial: free plan + paid tier
WebRTC VAD (Google · legacy)	2010-era GMM baseline ^e	Not published	Not published	<100 KB	BSD-3
TEN VAD (Agora)	PR-curve claims ^f; no numeric F1/AUC published	4.9–5.7%, Snapdragon 425 / 450 ^g	0.5–2.1%, iPhone 6 / 8 (A8 / A11) ^g	320–532 KB full library (runtime + model)	Apache-2.0 + non-compete ^h
MarbleNet (NVIDIA NeMo)	Per-checkpoint ⁱ; no single headline metric	Not published	Not published	Varies by checkpoint	CC-BY-4.0

RTF = real-time factor: the fraction of one CPU core needed to keep up with audio in real time; lower is better. Accuracy and RTF figures are each vendor's own published numbers on different datasets and hardware, so they are directionally indicative, not directly comparable (see the methodology).

Sources:
^a Silero VAD wiki, Quality-Metrics: ROC-AUC 0.96 Multi-Domain / 0.96 AliMeeting / 0.94 VoxConverse / 0.79 MSDWild, on 31.25 ms segments; VoxRT ships these same v5 weights.
^b Silero publishes desktop CPU figures only (≈189 µs/chunk, Ryzen Threadripper 3960X), no mobile RTF.
^c Picovoice Cobra blog.
^d Picovoice publishes desktop (Ryzen 9 5900X) and Raspberry Pi Zero RTF (≈0.037–0.05, ARMv6), not Android/iOS.
^e WebRTC VAD has no first-party modern benchmark.
^f TEN VAD README: PR-curve plots only.
^g TEN VAD README, vendor-measured RTF table (iPhone numbers are on older A8/A11 silicon than VoxRT's A15 reference, so not a like-for-like).
^h TEN VAD LICENSE adds a condition barring deployment "in a way that competes with Agora's offerings" or that enables third parties to do so.
ⁱ NeMo model card reports metrics per checkpoint.

For technical evaluators

The technical details

VoxRT doesn't claim a new VAD network. It packages the proven Silero v5 weights on a mobile-first Rust runtime with a stateless C ABI, shipped today as native Android (JitPack) and iOS (Swift Package) modules with published mobile RTF, and built to reach desktop, embedded and IoT next. Below are the specifics we kept out of the summary above: measured performance and the exact footprint that lands in your app.

1.85%

real-time factor on iPhone 13 Pro Max — ~0.6 ms / 32 ms frame

3.05%

real-time factor on a Snapdragon 662

~54

parallel VAD streams on a single core

~1.7 MB

net app-size impact
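
The per-frame time and stream-count figures above follow from the RTF by simple arithmetic. A minimal sketch, assuming a 32 ms frame and treating the stream count as a theoretical ceiling that ignores scheduling overhead, memory bandwidth and thermal throttling (the helper names are illustrative, not part of any VoxRT API):

```python
import math


def ms_per_frame(rtf: float, frame_ms: float = 32.0) -> float:
    """Compute time spent on one audio frame at the given real-time factor."""
    return rtf * frame_ms


def max_streams_per_core(rtf: float) -> int:
    """Theoretical ceiling on concurrent streams one core can keep up with."""
    return math.floor(1.0 / rtf)


for device, rtf in [("iPhone 13 Pro Max", 0.0185), ("Snapdragon 662", 0.0305)]:
    print(f"{device}: {ms_per_frame(rtf):.2f} ms per 32 ms frame, "
          f"~{max_streams_per_core(rtf)} streams per core")
```

For the iPhone figure this works out to roughly 0.59 ms per frame and 54 streams, in line with the numbers above; the same arithmetic puts the Snapdragon 662 at about 32 streams.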

What it costs in your app

Swift wrapper source	~17 KB
Native xcframework, compressed (device slice)	~500 KB
Silero VAD weights (fp16)	1.2 MB
Net app-size impact	~1.7 MB

Engine by engine

How each one really compares

VoxRT

Freemium · Apache-2.0 SDK

Proven Silero accuracy, published mobile RTF, a tiny footprint and a license that's free for commercial use with no per-user fees — packaged as drop-in Android and iOS modules today, on a portable runtime built to reach embedded and IoT hardware next.

Silero VAD

MIT · the upstream model

The proven open model — and the one VoxRT ships. Upstream gives you an ONNX file and a Python wrapper; you build the mobile integration yourself, and there's no published Android/iOS RTF. VoxRT packages those same weights with measured mobile numbers and ready-made modules. Silero has since released v6 upstream; VoxRT ships the proven v5 today and is evaluating v6 for porting.

Picovoice Cobra

Commercial · free plan + paid tier

Picovoice Cobra is a closed commercial VAD engine with a free evaluation plan and a paid tier. It claims the largest AUC of its own three-way benchmark but publishes no concrete F1/AUC number, and its only RTF figures are desktop and Raspberry Pi Zero (≈0.037–0.05 RTF on ARMv6, not mobile-comparable), with no Android or iPhone number. VoxRT ships the Silero v5 weights on a freemium runtime that is free for commercial use, with no per-user fees, and publishes measured mobile RTF.

WebRTC VAD

BSD-3 · the free floor

The ubiquitous 2010-era GMM detector: smallest binary in the field and zero licensing friction, but a generation behind modern neural VAD on noisy speech. Useful as a baseline, rarely as a product choice.

TEN VAD

Apache-2.0 with added conditions

TEN VAD is Agora's open VAD, with a small full-library binary and vendor-measured mobile RTF. Its license is the deciding factor: an added clause bars deploying it in ways that compete with, or enable others to compete with, Agora. For a redistributable SDK this is more than a caveat — it can make TEN VAD unsuitable if your product competes with Agora, or enables third parties to build apps that might. VoxRT's license carries no such non-compete restriction.

MarbleNet

CC-BY-4.0 · NVIDIA NeMo

A NeMo checkpoint family we evaluated and set aside in favor of Silero. There's no single headline metric to anchor a row on, and it isn't packaged as a mobile SDK surface.

Read the numbers carefully

Why the figures aren't apples-to-apples

We'd rather hand you the caveats than a tidy leaderboard. Every accuracy and speed figure above is vendor-self-reported, and the field has no independent third-party VAD benchmark at the scale ASR enjoys.

Accuracy is measured on different datasets at different granularities — Silero on its own validation sets at 31.25 ms segments, Picovoice on LibriSpeech mixed with DEMAND noise at 0 dB SNR, TEN VAD on precision-recall curves with no published numbers, WebRTC with no first-party benchmark at all. The "who beats whom" ordering each vendor reports is directional, not commensurable.

Real-time factor is even less comparable: Silero publishes desktop Threadripper figures, Picovoice desktop Ryzen plus Raspberry Pi Zero, TEN VAD mobile on older A8/A11 iPhones, and VoxRT on a Snapdragon 662 and an A15 iPhone. The only near-overlap — TEN VAD's iPhone 8 against VoxRT's iPhone 13 Pro Max — runs on different silicon, so treat it as a lower bound, not a head-to-head.
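
Because published RTF figures come from different hardware, the most reliable comparison is to time each candidate on your own target device. A minimal sketch of such a harness follows; `measure_rtf` and the placeholder `energy_detector` are hypothetical stand-ins for a real engine call, and the 512-sample frame assumes 16 kHz audio at 32 ms per frame.

```python
import time
from typing import Callable, Iterable, Sequence


def measure_rtf(
    process_frame: Callable[[Sequence[float]], object],
    frames: Iterable[Sequence[float]],
    frame_ms: float = 32.0,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return processing time divided by the audio duration of the frames."""
    n = 0
    start = clock()
    for frame in frames:
        process_frame(frame)
        n += 1
    elapsed_s = clock() - start
    if n == 0:
        raise ValueError("no frames to measure")
    return elapsed_s / (n * frame_ms / 1000.0)


def energy_detector(frame: Sequence[float]) -> float:
    """Placeholder workload: mean energy, standing in for a real VAD call."""
    return sum(x * x for x in frame) / len(frame)


# 1000 frames of 512 samples each (assumed 16 kHz, so 32 ms per frame).
silence = [[0.0] * 512 for _ in range(1000)]
print(f"RTF: {measure_rtf(energy_detector, silence):.4f}")
```

Running the same harness, the same audio and the same frame size against each engine on one device gives numbers that can be compared directly, which the vendor figures cannot.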

What's solid and comparable is the rest of the table: license terms, footprint, and whether a vendor publishes mobile numbers at all.

Read the full benchmark methodology →
