homeai/homeai-voice/PLAN.md
Aodhan Collins 38247d7cc4 Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00


P3: homeai-voice — Speech Pipeline

Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6


Goal

Full end-to-end voice pipeline running on Mac Mini: wake word detection → speech-to-text → (handoff to P4 agent) → text-to-speech → audio out. Wyoming protocol bridges STT and TTS into Home Assistant.

Test with a desktop USB mic before ESP32 hardware arrives (P6).


Pipeline Architecture

[USB Mic / ESP32 satellite]
        ↓
openWakeWord (always-on, local)
        ↓ wake detected
Wyoming Satellite / Audio capture
        ↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
        ↓ transcribed text
Home Assistant Voice Pipeline
        ↓ text
OpenClaw Agent (P4)            ← intent + LLM response
        ↓ response text
Wyoming TTS Server (Kokoro)
        ↓ audio
[Speaker / ESP32 satellite]

Components

1. Whisper.cpp — Speech-to-Text

Why Whisper.cpp over Python Whisper:

  • Native Apple Silicon build — uses Neural Engine + Metal
  • Significantly lower latency than Python implementation
  • Runs as a server process, not one-shot per request

Installation:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu)      # compiles with Metal support on macOS

# Download model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en for faster fallback
bash ./models/download-ggml-model.sh medium.en

Move the downloaded ggml-*.bin files into ~/models/whisper/ (the download script saves them under whisper.cpp/models/ by default).

Wyoming-Whisper adapter:

Use wyoming-faster-whisper (note: this wraps the CTranslate2-based faster-whisper, not the whisper.cpp build above) or a Wyoming-compatible Whisper.cpp server:

pip install wyoming-faster-whisper
wyoming-faster-whisper \
  --model large-v3 \
  --language en \
  --uri tcp://0.0.0.0:10300 \
  --data-dir ~/models/whisper \
  --download-dir ~/models/whisper

launchd plist: ~/Library/LaunchAgents/com.homeai.wyoming-stt.plist
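This plan references the plist by path only; a minimal sketch of what it might contain, assuming the entrypoint lives in a venv at /Users/homeai/.venv (adjust to the real install path). Note that launchd does not expand ~, so ProgramArguments and model paths must be absolute:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.homeai.wyoming-stt</string>
  <key>ProgramArguments</key>
  <array>
    <!-- Assumed venv path; point at wherever pip installed the entrypoint -->
    <string>/Users/homeai/.venv/bin/wyoming-faster-whisper</string>
    <string>--model</string><string>large-v3</string>
    <string>--language</string><string>en</string>
    <string>--uri</string><string>tcp://0.0.0.0:10300</string>
    <!-- launchd will not expand ~, hence the absolute model path -->
    <string>--data-dir</string><string>/Users/homeai/models/whisper</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/tmp/com.homeai.wyoming-stt.log</string>
  <key>StandardErrorPath</key><string>/tmp/com.homeai.wyoming-stt.err</string>
</dict>
</plist>
```

Load with `launchctl load -w ~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`; the TTS and wake word plists follow the same pattern.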

2. Kokoro TTS — Primary Text-to-Speech

Why Kokoro:

  • Very low latency (~200ms for short phrases)
  • High quality voice output
  • Runs efficiently on Apple Silicon
  • No GPU required (MPS optional)

Installation:

pip install kokoro-onnx

Wyoming-Kokoro adapter:

pip install wyoming-kokoro   # community adapter, or write thin wrapper
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
# af_heart is the default voice; overridden by character config
# (an inline comment after a trailing backslash would break the command)

launchd plist: ~/Library/LaunchAgents/com.homeai.wyoming-tts.plist

3. Chatterbox TTS — Voice Cloning Engine

Used when a character voice clone is active (character config from P5 sets tts_engine: chatterbox).

# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts

# Test voice clone
python -c "
import os
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
# Python does not expand ~, so expanduser is required here
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=os.path.expanduser('~/voices/aria.wav'))
"

Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains in place for the standard HA pipeline.
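The engine-selection rule described across sections 2–4 can be sketched as follows. The `tts_engine` key comes from this plan; the function itself is illustrative, not OpenClaw's actual API:

```python
# Sketch of the TTS engine dispatch rule: Kokoro is the default for the
# HA pipeline, and the character config (P5) may select chatterbox/qwen3.
DEFAULT_ENGINE = "kokoro"
KNOWN_ENGINES = {"kokoro", "chatterbox", "qwen3"}

def select_tts_engine(character_config: dict) -> str:
    """Return the TTS engine name for a given character config."""
    engine = character_config.get("tts_engine", DEFAULT_ENGINE)
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown tts_engine: {engine!r}")
    return engine
```

With no config, this yields "kokoro"; `{"tts_engine": "chatterbox"}` routes to the voice-clone path.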

4. Qwen3-TTS — MLX Fallback

pip install mlx mlx-lm
# Pull Qwen3-TTS model via mlx-lm or HuggingFace

Used as a fallback if Chatterbox quality is insufficient. Activated via character config tts_engine: qwen3.

5. openWakeWord — Always-On Detection

Runs continuously, listens for wake word, triggers pipeline.

pip install openwakeword

# Test with default "hey_jarvis" model
python -c "
from openwakeword.model import Model
model = Model(wakeword_models=['hey_jarvis'])
# ... audio loop
"
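The elided audio loop could look roughly like this. `sounddevice` is an assumed extra dependency, the 0.5 threshold is a guess to tune per room, and the `Model`/`predict` calls should be checked against the installed openWakeWord version (newer releases require downloading the pretrained models first):

```python
SAMPLE_RATE = 16_000   # openWakeWord expects 16 kHz mono 16-bit PCM
CHUNK = 1280           # 80 ms of audio per prediction frame
THRESHOLD = 0.5        # score cutoff; tune per environment

def is_wake(scores: dict) -> bool:
    """True if any wake word model's score crosses the threshold."""
    return any(s >= THRESHOLD for s in scores.values())

def listen(on_wake) -> None:
    """Block forever, calling on_wake(scores) on each detection."""
    # Heavy deps imported lazily so the module loads without them installed.
    import numpy as np
    import sounddevice as sd
    from openwakeword.model import Model
    model = Model(wakeword_models=["hey_jarvis"])
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        dtype="int16") as stream:
        while True:
            frame, _overflowed = stream.read(CHUNK)
            scores = model.predict(np.squeeze(frame))  # {model_name: score}
            if is_wake(scores):
                on_wake(scores)
```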

Custom wake word (later):

  • Record 30–50 utterances of the character's name
  • Train via openWakeWord training toolkit
  • Drop model file into ~/models/wakeword/

launchd plist: ~/Library/LaunchAgents/com.homeai.wakeword.plist

Wake word trigger → HTTP POST to OpenClaw (P4), or handoff via the Wyoming satellite flow.
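The HTTP handoff can be sketched with the stdlib. The endpoint URL matches the interface contract (http://localhost:8080/wake); the payload shape is an assumption until P4 defines the real contract:

```python
import json
import urllib.request

def wake_payload(scores: dict) -> bytes:
    """Serialise detection scores for the OpenClaw webhook (assumed shape)."""
    return json.dumps({"event": "wake", "scores": scores}).encode("utf-8")

def notify_openclaw(scores: dict,
                    url: str = "http://localhost:8080/wake") -> None:
    """Fire-and-forget POST; the agent (P4) takes over from here."""
    req = urllib.request.Request(
        url,
        data=wake_payload(scores),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=2)
```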

6. Wyoming Protocol Server

Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
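For orientation, the wire format can be sketched in a few lines: each event is one JSON header line followed by an optional binary payload (e.g. an audio chunk). Field names here follow the published protocol, but verify against the Wyoming spec/library rather than this sketch:

```python
import json
from typing import Optional, Tuple

def encode_event(event_type: str, data: Optional[dict] = None,
                 payload: bytes = b"") -> bytes:
    """One JSON header line, then payload bytes (Wyoming framing sketch)."""
    header = {
        "type": event_type,
        "data": data or {},
        "payload_length": len(payload) or None,  # null when no payload
    }
    return json.dumps(header).encode("utf-8") + b"\n" + payload

def decode_event(raw: bytes) -> Tuple[dict, bytes]:
    """Split a raw event back into (header, payload)."""
    line, _, rest = raw.partition(b"\n")
    header = json.loads(line)
    return header, rest[: header.get("payload_length") or 0]
```

This framing is why a single TCP port (10300/10301) can carry both control events and raw audio.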

HA integration:

  1. Home Assistant → Settings → Add Integration → Wyoming Protocol
  2. Add STT: host <mac-mini-ip>, port 10300
  3. Add TTS: host <mac-mini-ip>, port 10301
  4. Create Voice Assistant pipeline in HA using these providers
  5. Assign pipeline to Assist dashboard and later to ESP32 satellites (P6)

launchd Services

Three launchd plists under ~/Library/LaunchAgents/:

Plist                           Service               Port
com.homeai.wyoming-stt.plist    Whisper.cpp Wyoming   10300
com.homeai.wyoming-tts.plist    Kokoro Wyoming        10301
com.homeai.wakeword.plist       openWakeWord          — (no port, triggers internally)

Templates stored in scripts/launchd/.


Directory Layout

homeai-voice/
├── whisper/
│   ├── install.sh          # clone, compile whisper.cpp, download models
│   └── README.md
├── tts/
│   ├── install-kokoro.sh
│   ├── install-chatterbox.sh
│   ├── install-qwen3.sh
│   └── test-tts.sh         # quick audio playback test
├── wyoming/
│   ├── install.sh
│   └── test-pipeline.sh    # end-to-end text→audio test
└── scripts/
    ├── launchd/
    │   ├── com.homeai.wyoming-stt.plist
    │   ├── com.homeai.wyoming-tts.plist
    │   └── com.homeai.wakeword.plist
    └── load-all-launchd.sh

Interface Contracts

Exposes:

  • Wyoming STT: tcp://0.0.0.0:10300 — consumed by HA, P6 (ESP32 satellites)
  • Wyoming TTS: tcp://0.0.0.0:10301 — consumed by HA, P6
  • Chatterbox: Python API, invoked directly by P4 skills
  • openWakeWord: triggers HTTP POST to http://localhost:8080/wake (P4 OpenClaw)

Add to .env.services:

WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301

Implementation Steps

  • Compile Whisper.cpp with Metal support
  • Download large-v3 and medium.en Whisper models to ~/models/whisper/
  • Install wyoming-faster-whisper, test STT from audio file
  • Install Kokoro, test TTS to audio file
  • Install Wyoming-Kokoro adapter, verify Wyoming protocol works
  • Write launchd plists for STT and TTS services
  • Load plists, verify both services start on reboot
  • Connect HA Wyoming integration — STT port 10300, TTS port 10301
  • Create HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
  • Test HA Assist from browser: type query → hear spoken response
  • Install openWakeWord, test wake detection with USB mic
  • Write and load openWakeWord launchd plist
  • Install Chatterbox, test voice clone with sample .wav
  • Install Qwen3-TTS via MLX (fallback, lower priority)
  • Write wyoming/test-pipeline.sh — full end-to-end smoke test

Success Criteria

  • wyoming/test-pipeline.sh passes: audio file → transcribed text → spoken response
  • HA Voice Assistant responds to typed query with Kokoro voice
  • openWakeWord detects "hey jarvis" (or chosen wake word) reliably
  • All three launchd services auto-start after reboot
  • STT latency <2s for 5-second utterances with large-v3
  • Kokoro TTS latency <300ms for a 10-word sentence