On-device custom keyword spotting
Detect your custom spoken commands (for example: play, pause, next, louder, stop, shuffle) directly on the device. Hands-free control at a fraction of the latency and compute of full speech recognition.
What you get
Your vocabulary
A classifier trained on exactly the commands your product needs.
Per-keyword confidence
Tune each command independently for the precision you need.
Lighter than ASR
Lower latency and compute than full transcription for closed-vocabulary control.
Battery-aware
Voice-activity gated for always-listening devices.
On-device & offline
No network round-trips, no audio leaving the device.
Closed-vocabulary control, done efficiently
When your product only needs to recognize a known set of commands, running a full speech-to-text model is overkill. Keyword spotting trains a compact classifier on exactly your command set, so it responds faster, uses less battery, and is more accurate on those words than a general transcriber would be.
It runs on the same VoxRT on-device runtime as the rest of the stack, gated by voice activity detection so it only fires when there's speech. Like everything else in the stack, it sends no audio off the device. Each keyword model is trained and tuned for your product: tell us the commands and target devices, and we deliver a model built for that command set.
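The voice-activity gating described above can be sketched roughly as follows. This is an illustration of the idea only, not VoxRT's API: the energy-based `energy_vad` is a crude stand-in for a real VAD model, and `gate`, `classify`, the frame length, and the RMS threshold are all hypothetical names and values chosen for the example.

```python
import math
from typing import Callable, Iterable, Iterator, Sequence

Frame = Sequence[float]  # one short chunk of audio samples in [-1.0, 1.0]


def rms(frame: Frame) -> float:
    """Root-mean-square level of a frame; 0.0 for an empty frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(frame: Frame, threshold: float = 0.02) -> bool:
    """Stand-in VAD (assumption): treat a frame as speech if it is loud enough."""
    return rms(frame) >= threshold


def gate(
    frames: Iterable[Frame],
    is_speech: Callable[[Frame], bool],
    classify: Callable[[Frame], str],
) -> Iterator[str]:
    """Run the (more expensive) keyword classifier only on frames the VAD passes."""
    for frame in frames:
        if is_speech(frame):
            yield classify(frame)
```

The point of the design is that the cheap check runs on every frame and the classifier runs only on the small fraction that contains speech, which is where the battery saving on always-listening devices comes from.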
Trained on your command set
Tell us the commands you need to recognize and the devices they have to run on, and we train and tune a keyword model for your product. You define the vocabulary; the runtime returns structured detections with per-keyword confidence.
# your commands
keywords:
  - play
  - pause
  - next
  - louder
  - stop
threshold: 0.9

# at runtime
"...skip to the next track" → { keyword: "next", confidence: 0.97 }
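On the application side, acting on a detection usually comes down to comparing its confidence against a threshold. The sketch below assumes detections arrive with the `keyword` and `confidence` fields shown above; the `Detection` class, the `accept` helper, and the per-keyword override values are hypothetical and are not part of any published VoxRT SDK.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

DEFAULT_THRESHOLD = 0.9  # mirrors `threshold: 0.9` in the config above

# Hypothetical per-keyword overrides: accept "stop" more readily,
# and require more evidence before acting on "louder".
THRESHOLDS: Dict[str, float] = {"stop": 0.8, "louder": 0.95}


@dataclass(frozen=True)
class Detection:
    keyword: str
    confidence: float


def accept(
    detection: Detection,
    thresholds: Dict[str, float] = THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> bool:
    """True if the detection clears its keyword's threshold (or the default)."""
    return detection.confidence >= thresholds.get(detection.keyword, default)


def filter_detections(detections: Iterable[Detection]) -> List[Detection]:
    """Keep only the detections confident enough to act on."""
    return [d for d in detections if accept(d)]
```

With these illustrative values, `Detection("next", 0.97)` is accepted under the default threshold, while `Detection("louder", 0.93)` is rejected because that keyword carries a stricter override.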
The runtime is the product
VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on constrained, low-power hardware.
Keyword spotting runs on the same runtime as wake word, VAD, and streaming ASR: one Rust runtime crate and one NEON kernel set across the whole stack.
Keyword spotting, answered
What's the difference between keyword spotting and a wake word?
A wake word is one always-on phrase that brings your app to attention. Keyword spotting recognizes a set of commands — play, pause, next — typically after the app is already listening. They're often used together: wake word to activate, keywords to control.
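The "wake word to activate, keywords to control" pattern can be pictured as a small two-state machine. This is a sketch under stated assumptions, not a description of how VoxRT wires the two models together: the `VoiceControl` class, the `"hey vox"` wake phrase, and the step-counted listening window are all hypothetical.

```python
from enum import Enum
from typing import Optional, Set


class State(Enum):
    IDLE = "idle"            # only the wake word is acted on
    LISTENING = "listening"  # command keywords are acted on


class VoiceControl:
    def __init__(self, wake_word: str, commands: Set[str], window: int = 3):
        self.wake_word = wake_word
        self.commands = commands
        self.window = window  # steps without a command before going idle
        self.state = State.IDLE
        self.remaining = 0

    def step(self, label: Optional[str]) -> Optional[str]:
        """Feed one detector result (or None); return a command to run, if any."""
        if self.state is State.IDLE:
            if label == self.wake_word:
                self.state = State.LISTENING
                self.remaining = self.window
            return None  # commands are ignored until the wake word is heard
        if label in self.commands:
            self.remaining = self.window  # each command refreshes the window
            return label
        self.remaining -= 1
        if self.remaining <= 0:
            self.state = State.IDLE
        return None
```

For example, with `VoiceControl("hey vox", {"play", "pause", "next"})`, a stray "play" while idle does nothing, whereas "hey vox" followed by "next" returns `"next"` for the app to act on.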
When should I use keyword spotting instead of full speech recognition?
When your product only needs a known set of commands. A compact classifier trained on exactly those words responds faster, uses far less battery and memory, and is more accurate on those commands than a general transcriber. If you need open-ended text, use streaming ASR; if you need commands with parameters, look at speech-to-intent.
Does any audio leave the device?
No. Detection runs entirely on the device's CPU, works offline, and never uploads microphone audio.
Is it free for commercial use?
The SDK wrappers are Apache-2.0 and the VoxRT runtime is free for commercial use with no per-user fees. Keyword models are a paid engagement — trained and tuned on your specific command set, for your devices and acoustic environment. See licensing.