Custom Keyword Spotting

On-device custom keyword spotting

Detect your own spoken commands, such as play, pause, next, louder, stop, and shuffle, directly on the device. Get hands-free control at a fraction of the latency and compute of full speech recognition.

Capabilities

What you get

Your vocabulary

A classifier trained on exactly the commands your product needs.

Per-keyword confidence

Tune each command independently for the precision you need.

Lighter than ASR

Lower latency and compute than full transcription for closed-vocabulary control.

Battery-aware

Voice-activity gated for always-listening devices.

On-device & offline

No network round-trips, no audio leaving the device.

Overview

Closed-vocabulary control, done efficiently

When your product only needs to recognize a known set of commands, running a full speech-to-text model is overkill. Keyword spotting trains a compact classifier on exactly your command set, so it responds faster, uses less battery, and is more accurate on those words than a general transcriber would be.

It runs on the same VoxRT on-device runtime as the rest of the stack and is gated by voice activity detection, so it only fires when there's speech. Like everything else in the stack, no audio leaves the device. Keyword models are trained and tuned to your command set: tell us the commands and target devices, and we build the model for your product.
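As a rough illustration of what voice-activity gating means, the sketch below skips the keyword classifier on frames that look like silence. It is a minimal stand-in, not VoxRT's actual VAD or API: the RMS energy check, the `gated_detections` function, the `classify` callback, and the `energy_threshold` value are all assumptions made for this example.

```python
import math
from typing import Callable, Iterable, Iterator, Optional, Sequence

Frame = Sequence[float]  # one frame of PCM samples scaled to [-1.0, 1.0]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of a frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def gated_detections(
    frames: Iterable[Frame],
    classify: Callable[[Frame], Optional[str]],
    energy_threshold: float = 0.02,  # hypothetical value, tune per device
) -> Iterator[str]:
    """Run `classify` only on frames whose energy suggests speech.

    A simple energy gate stands in for a real VAD here; the point is
    that the (more expensive) classifier never runs on silent frames.
    """
    for frame in frames:
        if rms(frame) < energy_threshold:
            continue  # treated as silence: classifier is not invoked
        label = classify(frame)
        if label is not None:
            yield label
```

The saving comes from how rarely the classifier runs: on an always-listening device most frames are silence, so most frames cost only the cheap energy check.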

How it works

Trained on your command set

Tell us the commands you need to recognize and the devices they have to run on, and we train and tune a keyword model for your product. You define the vocabulary and get back structured detections with per-keyword confidence.

# your commands
keywords:
  - play
  - pause
  - next
  - louder
  - stop
threshold: 0.9

# at runtime
"...skip to the next track"  →  {
  keyword: "next",
  confidence: 0.97
}
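
One way an application might act on those detections is sketched below: compare each detection's confidence with a threshold, using a default of 0.9 as in the config above plus optional per-keyword overrides. The `Detection` class, the `accept` function, and the override values are hypothetical names and numbers chosen for illustration, not part of the VoxRT SDK.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Detection:
    """Mirrors the structured detection shown above (keyword + confidence)."""
    keyword: str
    confidence: float


# Assumed configuration: one default threshold plus per-keyword overrides.
DEFAULT_THRESHOLD = 0.9
PER_KEYWORD_THRESHOLDS: Dict[str, float] = {
    "stop": 0.8,     # hypothetical: favor recall for a safety-style command
    "louder": 0.95,  # hypothetical: favor precision for an annoying false trigger
}


def accept(
    detection: Detection,
    default: float = DEFAULT_THRESHOLD,
    overrides: Optional[Dict[str, float]] = None,
) -> bool:
    """Return True if the detection clears the threshold for its keyword."""
    table = PER_KEYWORD_THRESHOLDS if overrides is None else overrides
    threshold = table.get(detection.keyword, default)
    return detection.confidence >= threshold


if __name__ == "__main__":
    for d in (Detection("next", 0.97), Detection("louder", 0.92), Detection("stop", 0.85)):
        print(d.keyword, "accepted" if accept(d) else "rejected")
```

Keeping the thresholds in a plain table makes it possible to retune one command without touching the others.
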
The runtime

The runtime is the product

VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on constrained, low-power hardware.

Keyword spotting runs on the same runtime as wake word, VAD, and streaming ASR: one Rust runtime crate and one NEON kernel set across the whole stack.

FAQ

Keyword spotting, answered

What's the difference between keyword spotting and a wake word?

A wake word is one always-on phrase that brings your app to attention. Keyword spotting recognizes a set of commands — play, pause, next — typically after the app is already listening. They're often used together: wake word to activate, keywords to control.
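
The "wake word to activate, keywords to control" pattern can be sketched as a two-state machine. This is an illustrative assumption about application-side logic, not VoxRT behavior: the `VoiceController` class, the "hey device" wake phrase, the command set, and the `window` parameter are all hypothetical.

```python
from enum import Enum, auto
from typing import Iterable, Optional


class State(Enum):
    IDLE = auto()       # waiting for the wake word
    LISTENING = auto()  # wake word heard, commands are accepted


class VoiceController:
    """Accepts commands only after the wake word has been heard."""

    def __init__(
        self,
        wake_word: str = "hey device",  # hypothetical wake phrase
        commands: Iterable[str] = ("play", "pause", "next"),
        window: int = 3,  # non-command detections tolerated before going idle
    ) -> None:
        self.wake_word = wake_word
        self.commands = frozenset(commands)
        self.window = window
        self.state = State.IDLE
        self._remaining = 0

    def on_detection(self, label: str) -> Optional[str]:
        """Feed one detected label; return a command to execute, or None."""
        if self.state is State.IDLE:
            if label == self.wake_word:
                self.state = State.LISTENING
                self._remaining = self.window
            return None  # commands are ignored while idle
        if label in self.commands:
            self.state = State.IDLE  # one command per activation in this sketch
            return label
        self._remaining -= 1
        if self._remaining <= 0:
            self.state = State.IDLE  # listening window expired
        return None
```

Returning to idle after a single command is only one possible design; a product could equally stay in the listening state for a fixed time so that several commands can follow one activation.
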

When should I use keyword spotting instead of full speech recognition?

When your product only needs a known set of commands. A compact classifier trained on exactly those words responds faster, uses far less battery and memory, and is more accurate on those commands than a general transcriber. If you need open-ended text, use streaming ASR; if you need commands with parameters, look at speech-to-intent.

Does any audio leave the device?

No. Detection runs entirely on the device's CPU, works offline, and never uploads microphone audio.

Is it free for commercial use?

The SDK wrappers are Apache-2.0 and the VoxRT runtime is free for commercial use with no per-user fees. Keyword models are a paid engagement — trained and tuned on your specific command set, for your devices and acoustic environment. See licensing.

Build it on-device with VoxRT

Tell us your command set and which devices it has to run on. We reply within a business day.

Get started