Building a Real-time Local Voice Assistant: Tools, Models, and Best Practices for Open Speech-to-Speech

February 2, 2026

The quest for a fully local, low-latency, and interruptible voice assistant is a common goal for many developers and enthusiasts. While the promise of end-to-end (E2E) speech-to-speech (S2S) models is alluring, the current state of the art often points to a robust "streaming ASR + LLM + streaming TTS" pipeline as the most practical and flexible solution. This approach allows for greater control, integration with complex logic, and better overall performance for tasks beyond basic banter.

Current Best Practices: The Glued Pipeline Approach

Many practitioners find success in chaining together specialized models for each stage of the voice assistant workflow. This modularity is key for flexibility and handling diverse use cases; a minimal glue sketch follows the list below.

  • Speech-to-Text (ASR) Solutions:

    • Handy paired with Parakeet V3 is frequently praised for its near-instant transcription speed and user experience. While Parakeet V3 might have a slight accuracy drop compared to larger models, it's often negligible when the output feeds into an LLM capable of "reading between the lines."
    • Whisper and Distil-Whisper (especially with whisper.cpp for performance) remain popular choices for their high accuracy.
    • For those building web-first applications, Vosk-browser offers a browser-compatible option.
    • Nvidia Nemotron ASR is also used in advanced streaming setups.
    • For mobile devices, Apple's built-in speech model on iOS 26 / macOS 26 and later provides strong performance.
  • Large Language Model (LLM) Integration:

    • The choice of LLM is highly customizable, with smaller models (e.g., a Mistral 14B Q4_K_S GGUF) fitting within modest VRAM limits (16 GB is commonly cited). The intermediate text layer produced by the ASR stage is crucial here, enabling complex reasoning, Retrieval-Augmented Generation (RAG), and tool use that pure E2E models often lack.
  • Text-to-Speech (TTS) Models:

    • Pocket-TTS stands out for its compact size (100M parameters), impressive English speech quality, and, crucially, streaming audio output. Its easy installation via uv tool install is a significant usability advantage.
    • Even smaller, KittenTTS (25MB) offers good performance, though it currently lacks streaming audio and requires more integration effort for web projects.
    • Other notable TTS options include Piper (used by Home Assistant), Nvidia Magpie TTS (multilingual in some configurations), Vits-web for browser-based setups, and, for voice cloning, Qwen3 TTS (when used as part of a pipeline) or Kokoro TTS.
    • Supertonic is another high-quality option.
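
To make the glue concrete, here is a minimal, blocking sketch of the full pipeline using faster-whisper for ASR, llama-cpp-python for the LLM, and Piper's CLI for TTS. The model paths and voice name are placeholder assumptions, and a production setup would stream each stage rather than run them back to back; treat this as the skeleton the sections below build on.

```python
import subprocess
from faster_whisper import WhisperModel   # pip install faster-whisper
from llama_cpp import Llama               # pip install llama-cpp-python

# Load models once at startup (paths and names are placeholders).
asr = WhisperModel("distil-large-v3", device="auto", compute_type="int8")
llm = Llama(model_path="model.Q4_K_S.gguf", n_ctx=4096, n_gpu_layers=-1)

def transcribe(wav_path: str) -> str:
    """ASR stage: turn a recorded utterance into text."""
    segments, _info = asr.transcribe(wav_path)
    return " ".join(seg.text for seg in segments).strip()

def reply(user_text: str) -> str:
    """LLM stage: the intermediate text layer where RAG and tools hook in."""
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": user_text},
        ],
        max_tokens=256,
    )
    return out["choices"][0]["message"]["content"]

def speak(text: str) -> None:
    """TTS stage: pipe text to the Piper CLI (voice file is a placeholder)."""
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx",
         "--output_file", "reply.wav"],
        input=text.encode("utf-8"),
        check=True,
    )

if __name__ == "__main__":
    speak(reply(transcribe("utterance.wav")))
```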

Integration and Frameworks for Real-time Operation

Connecting these components for low-latency, real-time interaction requires robust frameworks:

  • Pipecat emerges as a strong contender for gluing together ASR, LLM, and TTS models. It supports various local models and facilitates streaming pipelines (see the asyncio sketch after this list for the underlying pattern).
  • Home Assistant offers a complete, pluggable, and customizable local voice assistant experience, leveraging models like Whisper and Piper, and even providing purpose-built hardware.
  • SaynaAI/sayna provides a flexible platform for switching between different STT/TTS providers and supports local models.
  • For Mac users, Sogni-AI/sogni-voice is an open-source REST API and framework specifically optimized for MLX on Apple Silicon, integrating Parakeet, Kokoro, and Qwen3 TTS.
  • Research into Kyutai Labs' Delayed Streams Modeling hints at future improvements for ultra-low latency.
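
Pipecat's actual API is best taken from its documentation, but the pattern these frameworks implement is easy to sketch. Below is a bare-bones asyncio version of the producer/consumer wiring they manage for you, with hypothetical stand-in functions in place of real ASR/LLM/TTS calls. The point is that each stage consumes and produces a stream via queues, so synthesized audio can start playing before the LLM has finished its sentence.

```python
import asyncio

# Hypothetical stand-ins for real streaming ASR / LLM / TTS calls.
def fake_transcribe(chunk: bytes) -> str:
    return chunk.decode("utf-8")

def fake_llm_tokens(text: str):
    yield from f"You said: {text}".split()

def fake_synthesize(token: str) -> bytes:
    return token.encode("utf-8")

async def asr_stage(audio_q, text_q):
    """Consume audio chunks, emit transcripts; None marks end-of-stream."""
    while (chunk := await audio_q.get()) is not None:
        await text_q.put(fake_transcribe(chunk))
    await text_q.put(None)

async def llm_stage(text_q, token_q):
    """Forward LLM tokens downstream as soon as each one is generated."""
    while (text := await text_q.get()) is not None:
        for token in fake_llm_tokens(text):
            await token_q.put(token)
    await token_q.put(None)

async def tts_stage(token_q):
    """Synthesize and 'play' per token, before the LLM finishes its reply."""
    while (token := await token_q.get()) is not None:
        print("playing:", fake_synthesize(token))

async def main():
    audio_q, text_q, token_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    await audio_q.put(b"turn on the lights")   # simulated mic chunk
    await audio_q.put(None)                    # end of utterance
    await asyncio.gather(asr_stage(audio_q, text_q),
                         llm_stage(text_q, token_q),
                         tts_stage(token_q))

asyncio.run(main())
```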

The Promise and Challenges of End-to-End Speech-to-Speech

While attractive, true E2E speech-to-speech models, such as the Qwen3 Omni model that prompted the original question, are still a developing area.

  • Current State: Many find that existing E2E S2S models are "pretty weak" for anything beyond basic conversational turns. Proprietary models like "GPT-realtime" have shown mixed results.
  • Nvidia Personaplex (Persona-Powered Conversational AI) represents a promising development with dual-channel input/output and a permissive license, suggesting a move towards more integrated solutions.
  • The "Secret Sauce": A significant challenge for pure E2E models is the loss of the intermediate text layer. This text is invaluable for implementing complex agent logic, RAG, and tool use, which are often essential for practical voice assistants. Until E2E models can effectively expose or replicate this structural intermediate state, glued pipelines are likely to remain dominant for sophisticated applications.
  • Latency vs. Smarts: For true "voice assistant" use cases (like Alexa), low latency often outweighs raw intelligence. E2E models might excel at speed but currently fall short on complexity.
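
The value of that intermediate text layer is easiest to see in code. The sketch below shows the kind of routing step a glued pipeline can insert between ASR and TTS: because the request exists as text, it can be matched against tools before (or instead of) being handed to the LLM, a seam E2E S2S models do not expose. The tool names and matching logic are illustrative assumptions, not a real framework.

```python
# Hypothetical tool registry; in a real assistant these would call home
# automation, search, RAG retrieval, and so on.
TOOLS = {
    "set_timer": lambda minutes: f"Timer set for {minutes} minutes.",
    "lights_off": lambda: "Lights turned off.",
}

def ask_llm(transcript: str) -> str:
    return f"(LLM answer to: {transcript})"   # placeholder LLM call

def route(transcript: str) -> str:
    """The seam E2E models lack: inspect the text between ASR and TTS."""
    text = transcript.lower()
    if "timer" in text:
        # Naive slot extraction, just to show where structured logic hooks in.
        minutes = next((int(w) for w in text.split() if w.isdigit()), 5)
        return TOOLS["set_timer"](minutes)
    if "lights" in text and "off" in text:
        return TOOLS["lights_off"]()
    return ask_llm(transcript)   # no tool matched: open-ended conversation

print(route("set a timer for 10 minutes"))   # -> Timer set for 10 minutes.
```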

Hardware Considerations and Performance Optimizations

Achieving local, low-latency performance requires mindful hardware choices and optimization:

  • GPU Power: A single modern GPU is generally necessary for running larger models at real-time speeds. Mac M-class chips with MLX are highly efficient for this.
  • VRAM: 16GB VRAM can comfortably host a full ASR + LLM + TTS pipeline with moderately sized models.
  • Latency Tuning: The bottleneck is often the minimum batch of audio a model requires before inference can start, not the inference time itself. Tuning chunk sizes (e.g., down to 200 ms of audio for ASR) is critical; a capture sketch follows this list.
  • Wake Word Detection: This initial, always-on component can be offloaded to extremely low-power microcontrollers (like an ESP32 or specialized ADC/DSPs) to avoid keeping a powerful GPU constantly active.
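
To make the chunk-size point concrete, here is a sketch using the sounddevice library to deliver microphone audio in 200 ms blocks at 16 kHz, roughly the granularity a streaming ASR front end wants. The consumer loop is a placeholder; in practice each block would be handed to your ASR's streaming API.

```python
import queue
import sounddevice as sd   # pip install sounddevice

SAMPLE_RATE = 16_000                            # typical ASR input rate
BLOCK_MS = 200                                  # the latency knob discussed above
BLOCK_FRAMES = SAMPLE_RATE * BLOCK_MS // 1000   # 3200 frames per chunk

chunks: "queue.Queue[bytes]" = queue.Queue()

def on_audio(indata, frames, time_info, status):
    """Runs on the audio thread every 200 ms; keep it light."""
    if status:
        print("capture status:", status)
    chunks.put(bytes(indata))   # hand off to the consumer thread

with sd.InputStream(samplerate=SAMPLE_RATE, blocksize=BLOCK_FRAMES,
                    channels=1, dtype="int16", callback=on_audio):
    for _ in range(25):         # ~5 seconds of audio for this demo
        chunk = chunks.get()    # placeholder: feed chunk to a streaming ASR
        print(f"got {len(chunk)} bytes")
```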

Practical Tips for Building Your Setup

  • Confirm Understanding: Encourage the AI to restate what it understood ("I always ask it to restate back to me what it understood...") to confirm accuracy and guide the agent; a prompt-level example follows this list.
  • Packaging Matters: Prefer models and tools that offer easy installation (e.g., uv tool install) as it drastically improves immediate usability.
  • Multilingual Needs: Explore models like Nvidia Magpie TTS (multilingual) or services like Speechmatics for robust multilingual and code-switching support. Canary models are also being integrated for Spanish.
  • Voice Customization: Use voice plugins to tailor the speaking style and "vibe" of your assistant.
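
The restate-back tip is cheap to implement at the prompt level. A hypothetical system prompt along these lines makes the assistant confirm its understanding before acting, which also gives the user a natural point to interrupt:

```python
# Hypothetical system prompt implementing the "restate what you understood" tip.
SYSTEM_PROMPT = (
    "You are a local voice assistant. Before carrying out any request, "
    "restate in one short sentence what you understood the user to want. "
    "If the transcription seems garbled or ambiguous, ask for clarification "
    "instead of guessing."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "turn of the living room lights"},  # ASR typo
]
# Expected behavior: the assistant replies with a confirmation such as
# "You want the living room lights turned off; doing that now."
```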

In summary, while dedicated end-to-end speech models are evolving, the most "works today" local and open voice assistant setups successfully integrate streaming ASR, an LLM, and streaming TTS. The key is to select fast, efficient models for each stage and use a robust framework to manage the data flow and maintain low latency.
