
Voice Pipeline

The voice pipeline turns “Hey GLaDOS” into a response played through the living room speakers — entirely on local hardware, with no cloud services in the loop. Three distinct pipelines handle different invocation contexts: the hardware satellite, the Mac terminal, and the web dashboard.

All voice processing runs on two machines: the Pi 5 (Caroline) hosts the Wyoming STT/TTS/wake-word containers and orchestrates the HA Assist pipeline; Atlas (the M4 Pro) runs Ollama for LLM inference. nightwatch (the AMD GPU machine) provides specialty TTS backends on demand, woken via Wake-on-LAN when needed.

Voice pipeline — Satellite wake word detection through STT, n8n processing, LLM inference, TTS, and Sonos output

[Satellite hardware]
|
[openWakeWord: "Hey GLaDOS"]
|
[Whisper: speech to text]
|
[n8n webhook → Ollama on Atlas]
|
[Piper TTS: text to speech]
|
[Sonos speakers: audio output]

The Satellite1 is a custom ESPHome device with a microphone array that listens continuously for wake words. It runs server-side wake word detection, meaning the raw audio stream is forwarded to the Pi’s openWakeWord container rather than running detection on-device. This keeps the hardware simple and the models upgradeable.
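On the wire, that forwarding uses the Wyoming protocol: each message is a single JSON header line, optionally followed by binary audio. A wake-word hit comes back as a detection event; the sketch below assumes a minimal field set (the real event carries more metadata):

```shell
# Hedged sketch of a Wyoming detection event as openWakeWord would emit it;
# the exact field set here is an assumption.
detection_event() {
  printf '{"type":"detection","data":{"name":"%s"}}\n' "$1"
}

detection_event "hey_glados"
```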

Once the wake word fires:

  1. The audio stream goes to Whisper (faster-whisper small-int8, English) for transcription
  2. The transcript reaches Home Assistant’s Assist pipeline, which routes it through the m_agent custom component to n8n via a local webhook
  3. n8n calls Ollama on Atlas for response generation
  4. The response goes to Piper TTS for synthesis
  5. Audio is returned to the Satellite, which has no built-in speaker — playback routes to the nearest Sonos
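Step 3 above is an ordinary HTTP POST from HA's conversation agent to n8n. A hedged sketch, where the webhook path and JSON field names are assumptions rather than the repo's actual contract:

```shell
# Assumed local n8n webhook; the real path is defined in the n8n workflow.
N8N_WEBHOOK="http://localhost:5678/webhook/voice-assist"

build_payload() {
  # Assemble the JSON body for the webhook; $1 is the Whisper transcript.
  printf '{"transcript":"%s","source":"satellite"}' "$1"
}

# Fire the request (sketch):
# curl -s -X POST -H 'Content-Type: application/json' \
#      -d "$(build_payload "turn off the lab lights")" "$N8N_WEBHOOK"
```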

The satellite is on the IoT network VLAN, isolated from the main LAN. The wake word detection, STT, and TTS containers listen only on loopback ports; HA reaches them via localhost because it runs in host network mode.
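The loopback-only binding comes down to the compose file's port mappings. A minimal sketch; 10300 is the conventional wyoming-whisper port, but this is illustrative, not copied from pi/docker-compose.ha.yml:

```yaml
# Sketch of a loopback-only port mapping; not the repo's actual file.
services:
  wyoming-whisper:
    image: rhasspy/wyoming-whisper
    ports:
      - "127.0.0.1:10300:10300"  # bound to loopback, unreachable from the LAN
```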

scripts/glados-say.sh is a command-line script that sends text to any of the voice backends on nightwatch and plays the audio locally. It selects the backend by name and logs the interaction to the dashboard API.

| Backend | Technology | Approximate latency |
|---|---|---|
| glados | Forward Tacotron + HiFiGAN (Wyoming) | ~1s |
| kokoro | Kokoro-82M (OpenAI-compat HTTP) | ~0.2s |
| xtts | XTTS v2, GLaDOS fine-tune (Wyoming) | 2-5s |
| m | Chatterbox Turbo (Judi Dench voice, Wyoming) | varies |
| peter | Peter Griffin RVC v2 (HTTP) | varies |
| peter2 | Peter Griffin GPT-SoVITS (HTTP) | varies |
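The script's backend selection can be pictured as a name-to-endpoint lookup. The hostnames and ports below are assumptions for illustration, not glados-say.sh's actual values:

```shell
# Hypothetical backend registry; ports are placeholders.
backend_url() {
  case "$1" in
    kokoro) echo "http://nightwatch:8880/v1/audio/speech" ;;  # OpenAI-compat HTTP
    peter)  echo "http://nightwatch:9001/tts" ;;              # plain HTTP backend
    *)      echo "" ;;  # Wyoming backends need a Wyoming client, not plain HTTP
  esac
}

# Usage sketch: synthesize with kokoro and play locally.
# curl -s -X POST "$(backend_url kokoro)" \
#      -H 'Content-Type: application/json' \
#      -d '{"input":"Hello there."}' | mpv -
```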

The dashboard’s /chat page uses the Web Speech API for voice input and useTTS for synthesis output. Transcribed speech goes to n8n, which routes to Ollama or Claude depending on the request, and the response plays back in-browser via Web Audio.

Five containers make up the on-Pi voice stack, co-deployed with Home Assistant.

| Container | Image | Role |
|---|---|---|
| wyoming-whisper | rhasspy/wyoming-whisper | STT — faster-whisper small-int8 (loopback only) |
| wyoming-piper | rhasspy/wyoming-piper | TTS — Piper en_US-lessac-medium (loopback only) |
| wyoming-openwakeword | rhasspy/wyoming-openwakeword | Wake word detection, TFLite (loopback only) |
| homeassistant | ghcr.io/home-assistant/home-assistant | Pipeline orchestrator (port 8123) |
| esphome | ghcr.io/esphome/esphome | Satellite firmware management |

Custom wake word models are TFLite format, trained on nightwatch’s AMD GPU using tools/wake-words/train_all.sh. They live in ha-data/openwakeword-custom/.

| Model | Type |
|---|---|
| hey_glados | Custom (primary active wake word) |
| glados | Custom |
| claude | Custom |
| hudson | Custom |
| maude / hey_maude | Custom |
| jarvis | Community |
| computer | Community |
| ok_computer | Community |
| okay_nabu, hey_jarvis, hey_mycroft, alexa, hey_rhasspy | Built-in (always available) |

The Satellite1 is a FutureProofHomes ESPHome device with a microphone array. It connects to the IoT VLAN and communicates with the Pi over the Wyoming protocol.

Key properties:

  • Wake word processing: server-side (audio streamed to Pi; no on-device inference)
  • Speaker: none built in — all TTS audio routes to Sonos
  • Active wake word: hey_glados
  • Firmware config: ha-config/esphome/satellite1-voice-patch.yaml and device config
  • OTA flashing: via ESPHome dashboard (on-demand only, not always running)
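The server-side arrangement corresponds, roughly, to an ESPHome voice_assistant block like the following. This is a hedged sketch of the kind of override satellite1-voice-patch.yaml applies, not its actual contents:

```yaml
# Sketch only: stream mic audio to HA so wake word detection runs
# in the Pi's openWakeWord container instead of on-device.
voice_assistant:
  use_wake_word: true
```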

The ESPHome repository also manages three BLE proxy devices (bathroom and kitchen) and three MTR1 presence and temperature sensors (bedroom, garage, living room).

nightwatch (the AMD Radeon 7900 XTX machine) hosts all the specialty TTS backends. It is not running continuously — a Wake-on-LAN automation pre-wakes it when the satellite detects a wake word, with a 12-second budget from suspend to service-ready.
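The wake itself is just a magic packet. Here is a sketch of its construction with a placeholder MAC address; the real automation sends it from Home Assistant rather than a shell script:

```shell
# Build a Wake-on-LAN magic packet as a hex string: 6 bytes of 0xFF,
# then the target MAC repeated 16 times. The MAC below is a placeholder.
NIGHTWATCH_MAC="aa:bb:cc:dd:ee:ff"

wol_packet_hex() {
  mac_hex=$(printf '%s' "$1" | tr -d ':')
  printf 'ffffffffffff'
  i=0
  while [ "$i" -lt 16 ]; do
    printf '%s' "$mac_hex"
    i=$((i + 1))
  done
}

# One way to actually send it: wakeonlan "$NIGHTWATCH_MAC"
```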

Idle management:

  • An activity monitor script stamps input_datetime.nightwatch_last_active every 60 seconds via a systemd timer
  • If idle for 5 minutes with no session active, HA fires the idle shutdown automation
  • A nightwatch_keep_alive boolean overrides the idle timeout when needed
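The activity stamp is a standard HA REST service call. A hedged sketch of what nightwatch/scripts/activity-monitor.sh might send; the base URL and token handling are assumptions, while the service endpoint is HA's documented REST API:

```shell
# Assumed HA base URL; HA_TOKEN would be a long-lived access token.
HA_URL="http://caroline.local:8123"

stamp_payload() {
  # JSON body for input_datetime.set_datetime; $1 is a "YYYY-MM-DD HH:MM:SS" stamp.
  printf '{"entity_id":"input_datetime.nightwatch_last_active","datetime":"%s"}' "$1"
}

# curl -s -X POST "$HA_URL/api/services/input_datetime/set_datetime" \
#      -H "Authorization: Bearer $HA_TOKEN" -H 'Content-Type: application/json' \
#      -d "$(stamp_payload "$(date '+%Y-%m-%d %H:%M:%S')")"
```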

Every satellite voice interaction is automatically archived. The push_voice_interaction HA automation sends the dialogue to an n8n webhook, which writes it to two PostgreSQL tables: voice_interactions (session metadata) and dialogue (the full transcript). This creates a permanent record of all voice interactions for analysis and memory retrieval.

The M voice is a custom voice clone built with Chatterbox Turbo, targeting a Judi Dench-inspired voice for a personalized assistant experience. A dataset of 1,137 audio clips is prepared and ready. The M voice backend is already deployed on nightwatch and accessible from glados-say.sh; full integration into the satellite pipeline is the next milestone.

| File | Purpose |
|---|---|
| pi/docker-compose.ha.yml | Pi HA stack: HA + all Wyoming containers |
| docker-compose.voice.yml | Mac dev voice stack (Wyoming only) |
| scripts/glados-say.sh | Terminal TTS script, all backends |
| ha-config/custom_components/m_agent/ | Custom n8n-routing conversation agent |
| ha-config/esphome/satellite1-voice-patch.yaml | Server-side wake word patch for Satellite1 |
| nightwatch/scripts/activity-monitor.sh | Idle detection for nightwatch power management |
| tools/wake-words/train_all.sh | Wake word model training runner |