On-device speech-to-intent
Turn what users say straight into actions — with no transcript in the middle. Describe the commands and details your app needs to understand, and VoxRT trains a single on-device model that maps speech directly to structured intents.
What you get
One inference
Audio mapped straight to structured intents and slots — no transcript stage.
YAML context spec
Declare your intents and slots in a few lines — we train the model.
Lower latency & memory
One model instead of ASR plus a separate NLU pipeline.
Domain accuracy
Higher accuracy on your domain, because the model is trained on your intents.
Fully on-device
No audio or transcript leaves the device.
Audio to intent in one inference
The usual pipeline runs speech-to-text, then feeds the transcript into a natural-language-understanding model to extract structured meaning. That's two models, two sources of latency, and two places to lose accuracy. Speech-to-intent collapses both into a single model that reads audio and emits structured intents directly.
Because it is a single model trained on your own intents and slots, it's both faster and more accurate on your domain than the two-stage pipeline, and it runs entirely on-device, with no audio or transcript leaving the user's hardware. Speech-to-intent models are trained per customer: declare your intents and slots, and we deliver a tuned model for your product.
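To make the structural difference concrete, here is a minimal Python sketch of the two call shapes. Every name in it (`Intent`, `two_stage`, `single_stage`, and the stub models) is hypothetical and is not the VoxRT API. The stubs return canned values so the sketch runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Intent:
    """Structured result: an intent name plus its slot values."""
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical model signatures, for illustration only.
Transcriber = Callable[[bytes], str]        # audio -> transcript
IntentParser = Callable[[str], Intent]      # transcript -> intent
SpeechToIntent = Callable[[bytes], Intent]  # audio -> intent


def two_stage(audio: bytes, asr: Transcriber, nlu: IntentParser) -> Intent:
    """Traditional pipeline: two inferences with a transcript in between."""
    transcript = asr(audio)   # first model; any mis-heard word is passed on
    return nlu(transcript)    # second model; sees only the transcript


def single_stage(audio: bytes, model: SpeechToIntent) -> Intent:
    """Speech-to-intent: one inference, no intermediate transcript."""
    return model(audio)


# Stub models that return canned values so the sketch runs standalone.
def stub_asr(audio: bytes) -> str:
    return "set the temperature to seventy-two degrees"


def stub_nlu(transcript: str) -> Intent:
    return Intent("set_temperature", {"value": 72, "unit": "°F"})


def stub_speech_to_intent(audio: bytes) -> Intent:
    return Intent("set_temperature", {"value": 72, "unit": "°F"})


audio = b"\x00\x00" * 160  # placeholder bytes standing in for PCM audio
via_pipeline = two_stage(audio, stub_asr, stub_nlu)
via_single = single_stage(audio, stub_speech_to_intent)
```

Both paths return the same kind of structured result. The difference is that the single-stage path never creates a transcript, so there is no intermediate text to hold in memory or hand to a second model.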
Define a context spec, get structured output
Declare your intents and slots in a few lines of YAML. We train a model that maps audio straight to structured intents in a single on-device inference — no transcript stage in between.
# your context
intents:
  set_temperature:
    slots: [value, unit]

# at runtime
"Set the temperature to seventy-two degrees"
→ { intent: "set_temperature", slots: { value: 72, unit: "°F" } }
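Once a result like the one above arrives, the app still has to act on it. The following is a minimal sketch of one way to route intents to handlers in Python. It assumes the result reaches application code as a plain dict with `intent` and `slots` keys; that shape, and the names `HANDLERS`, `handles`, and `dispatch`, are illustrative assumptions rather than part of any VoxRT SDK.

```python
from typing import Callable, Dict

# Result shaped like the example above; delivering it as a plain dict
# is an assumption made for this sketch.
result = {"intent": "set_temperature", "slots": {"value": 72, "unit": "°F"}}

# Registry mapping intent names to handler functions.
HANDLERS: Dict[str, Callable[..., str]] = {}


def handles(intent_name: str):
    """Decorator that registers a function as the handler for one intent."""
    def register(fn):
        HANDLERS[intent_name] = fn
        return fn
    return register


@handles("set_temperature")
def set_temperature(value, unit):
    # A real app would call its thermostat control code here.
    return f"thermostat set to {value}{unit}"


def dispatch(result: dict) -> str:
    """Look up the handler for the intent and pass the slots as keyword args."""
    handler = HANDLERS.get(result["intent"])
    if handler is None:
        raise KeyError(f"no handler for intent {result['intent']!r}")
    return handler(**result["slots"])


message = dispatch(result)
```

Passing slots as keyword arguments keeps each handler's signature aligned with the slot names declared in the spec, so a renamed slot fails loudly instead of being silently ignored.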
The runtime is the product
VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on constrained, low-power hardware.
Speech-to-intent runs on the same runtime as wake word, VAD, and streaming ASR: one Rust runtime crate and one NEON kernel set across the whole stack.
Speech-to-intent, answered
Does any audio or text leave the device?
No. The model runs entirely on the device — neither the audio nor any transcript is uploaded, and it works with no network connection.
How is speech-to-intent different from speech recognition plus NLU?
The traditional pipeline transcribes audio to text, then runs a second language-understanding model over the transcript. Speech-to-intent is one model that maps audio directly to a structured intent with slots — one inference instead of two, which means lower latency, a smaller memory footprint, and fewer places to lose accuracy.
How do I define what my app should understand?
Declare your intents and slots in a few lines of YAML — for example set_temperature with value and unit slots. We train a model on that spec and deliver it as a tuned on-device model for your product.
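As an illustration of how an app might use the same spec defensively, the sketch below checks a result against the declared intents and slots. It assumes the YAML has already been loaded into a plain Python dict with the shape shown; the `SPEC` layout and the `validate` helper are hypothetical and not part of the VoxRT tooling.

```python
# The example spec, loaded into a plain dict (assumed shape).
SPEC = {"intents": {"set_temperature": {"slots": ["value", "unit"]}}}


def validate(result: dict, spec: dict) -> list:
    """Return a list of problems; an empty list means the result fits the spec."""
    declared = spec["intents"].get(result.get("intent"))
    if declared is None:
        return [f"undeclared intent: {result.get('intent')!r}"]

    allowed = set(declared.get("slots", []))
    problems = []
    for name in result.get("slots", {}):
        if name not in allowed:
            problems.append(f"undeclared slot: {name!r}")
    return problems


ok = validate(
    {"intent": "set_temperature", "slots": {"value": 72, "unit": "°F"}}, SPEC
)
bad = validate({"intent": "set_temperature", "slots": {"room": "kitchen"}}, SPEC)
unknown = validate({"intent": "play_music", "slots": {}}, SPEC)
```

A check like this is cheap to run before dispatching, and it keeps application code and the context spec from drifting apart as intents are added.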
Is it free for commercial use?
The SDK wrappers are Apache-2.0 and the VoxRT runtime is free for commercial use with no per-user fees. Speech-to-intent models are a paid engagement — trained on your intents and slots, delivered as a model tuned to your domain and devices. See licensing.