Speech-to-Intent · end-to-end

On-device speech-to-intent

Turn what users say straight into actions, with no transcript in the middle. Describe the commands and details your app needs to understand, and VoxRT trains a single on-device model that maps speech directly to structured intents.

"set the temperature to 72" → { intent: "set_temperature", value: 72, unit: "°F" }

Capabilities

What you get

One inference

Audio mapped straight to structured intents and slots — no transcript stage.

YAML context spec

Declare your intents and slots in a few lines — we train the model.

Lower latency & memory

One model instead of ASR plus a separate NLU pipeline.

Domain accuracy

Higher accuracy on your domain, because the model is trained on your own intents and slots.

Fully on-device

No audio or transcript leaves the device.

Overview

Audio to intent in one inference

The usual pipeline runs speech-to-text, then feeds the transcript into a natural-language-understanding model to extract structured meaning. That's two models, two sources of latency, and two places to lose accuracy. Speech-to-intent collapses both into a single model that reads audio and emits structured intents directly.

Because it is a single model trained on your own intents and slots, it is both faster and more accurate on your domain than a two-stage pipeline, and it runs entirely on-device, with no audio or transcript leaving the user's hardware. Speech-to-intent models are trained per customer: declare your intents and slots, and we deliver a tuned model for your product.

How it works

Define a context spec, get structured output

Declare your intents and slots in a few lines of YAML. We train a model that maps audio straight to structured intents in a single on-device inference — no transcript stage in between.

# your context
intents:
  set_temperature:
    slots: [value, unit]

# at runtime
"Set the temperature to seventy-two degrees" → {
  intent: "set_temperature",
  slots: { value: 72, unit: "°F" }
}
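
Once your app receives a structured intent like the one above, acting on it is mostly a lookup. The following is a minimal Python sketch of that step, not VoxRT's API: the result is assumed to arrive as a plain dict shaped like the example output, and `dispatch`, `HANDLERS`, and the `set_temperature` handler are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict

# Assumed result shape, mirroring the example output above.
IntentResult = Dict[str, Any]


def set_temperature(value: int, unit: str) -> str:
    # Stand-in for real application logic, such as calling a thermostat driver.
    return f"thermostat set to {value}{unit}"


# Map each intent name from the context spec to the function that handles it.
HANDLERS: Dict[str, Callable[..., str]] = {
    "set_temperature": set_temperature,
}


def dispatch(result: IntentResult) -> str:
    """Route a structured intent to its handler, passing slots as keyword arguments."""
    handler = HANDLERS.get(result["intent"])
    if handler is None:
        raise KeyError(f"no handler registered for intent {result['intent']!r}")
    return handler(**result["slots"])


result = {"intent": "set_temperature", "slots": {"value": 72, "unit": "°F"}}
message = dispatch(result)
print(message)  # thermostat set to 72°F
```

Because the slots are already typed and named, there is no text parsing on the application side: each handler takes its slots as ordinary arguments.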

The runtime

The runtime is the product

VoxRT is a from-scratch inference runtime for on-device speech models — no ONNX Runtime, no PyTorch Mobile, no LiteRT. It's a custom Rust core sized and tuned for streaming voice workloads on constrained, low-power hardware.

Speech-to-intent runs on the same runtime as wake word, VAD, and streaming ASR: one Rust runtime crate and one NEON kernel set across the whole stack.

FAQ

Speech-to-intent, answered

Does any audio or text leave the device?

No. The model runs entirely on the device — neither the audio nor any transcript is uploaded, and it works with no network connection.

How is speech-to-intent different from speech recognition plus NLU?

The traditional pipeline transcribes audio to text, then runs a second language-understanding model over the transcript. Speech-to-intent is one model that maps audio directly to a structured intent with slots — one inference instead of two, which means lower latency, a smaller memory footprint, and fewer places to lose accuracy.

How do I define what my app should understand?

Declare your intents and slots in a few lines of YAML — for example set_temperature with value and unit slots. We train a model on that spec and deliver it as a tuned on-device model for your product.
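
A spec like this can also double as a contract in your application code. The sketch below is one hedged way to check a result against it in Python. The spec is written out as a dict that mirrors the YAML example, the `validate` helper is hypothetical, and it assumes slots are optional, so only unknown intents and undeclared slots are reported.

```python
from typing import Any, Dict, List

# The example context spec, written as a Python dict that mirrors the YAML.
SPEC: Dict[str, Any] = {
    "intents": {
        "set_temperature": {"slots": ["value", "unit"]},
    },
}


def validate(result: Dict[str, Any], spec: Dict[str, Any] = SPEC) -> List[str]:
    """Return a list of problems; an empty list means the result fits the spec."""
    intents = spec["intents"]
    name = result.get("intent")
    if name not in intents:
        return [f"unknown intent: {name!r}"]
    declared = set(intents[name]["slots"])
    received = set(result.get("slots", {}))
    # Assumption: a missing slot is allowed, an undeclared one is not.
    return [f"undeclared slot: {slot!r}" for slot in sorted(received - declared)]


ok = validate({"intent": "set_temperature", "slots": {"value": 72, "unit": "°F"}})
bad = validate({"intent": "set_temperature", "slots": {"value": 72, "room": "den"}})
print(ok, bad)
```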

Is it free for commercial use?

The SDK wrappers are Apache-2.0 and the VoxRT runtime is free for commercial use with no per-user fees. Speech-to-intent models are a paid engagement — trained on your intents and slots, delivered as a model tuned to your domain and devices. See licensing.

Build it on-device with VoxRT

Tell us your intents and which devices they have to run on.

Get started