P3: homeai-voice — Speech Pipeline
Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6
Goal
Full end-to-end voice pipeline running on Mac Mini: wake word detection → speech-to-text → (handoff to P4 agent) → text-to-speech → audio out. Wyoming protocol bridges STT and TTS into Home Assistant.
Test with a desktop USB mic before ESP32 hardware arrives (P6).
Pipeline Architecture
[USB Mic / ESP32 satellite]
↓
openWakeWord (always-on, local)
↓ wake detected
Wyoming Satellite / Audio capture
↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
↓ transcribed text
Home Assistant Voice Pipeline
↓ text
OpenClaw Agent (P4) ← intent + LLM response
↓ response text
Wyoming TTS Server (Kokoro)
↓ audio
[Speaker / ESP32 satellite]
Components
1. Whisper.cpp — Speech-to-Text
Why Whisper.cpp over Python Whisper:
- Native Apple Silicon build — uses Neural Engine + Metal
- Significantly lower latency than Python implementation
- Runs as a server process, not one-shot per request
Installation:
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu) # compiles with Metal support on macOS
# Download model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en for faster fallback
bash ./models/download-ggml-model.sh medium.en
Move the downloaded models to ~/models/whisper/ (the download script writes into whisper.cpp's ./models/ directory by default).
Wyoming-Whisper adapter:
Use wyoming-faster-whisper (note: this runs faster-whisper/CTranslate2 under the hood, not the whisper.cpp binary built above) or a Wyoming-compatible whisper.cpp server:
pip install wyoming-faster-whisper
wyoming-faster-whisper \
--model large-v3 \
--language en \
--uri tcp://0.0.0.0:10300 \
--data-dir ~/models/whisper \
--download-dir ~/models/whisper
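With the service up, the STT socket can be smoke-tested from Python. A minimal sketch using the `wyoming` client library (assumed installed via `pip install wyoming`; `test.wav` is a placeholder 16-bit mono recording, and the event API is used as documented for recent `wyoming` releases):

```python
import asyncio
import wave

def chunk_pcm(pcm: bytes, samples_per_chunk: int = 1024, width: int = 2) -> list:
    """Split raw PCM bytes into fixed-size chunks for streaming."""
    step = samples_per_chunk * width
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]

async def transcribe(path: str, host: str = "localhost", port: int = 10300) -> str:
    # wyoming imports kept local so chunk_pcm stays importable without the package
    from wyoming.asr import Transcribe, Transcript
    from wyoming.audio import AudioChunk, AudioStart, AudioStop
    from wyoming.client import AsyncTcpClient

    with wave.open(path, "rb") as wav:
        rate, width, channels = wav.getframerate(), wav.getsampwidth(), wav.getnchannels()
        pcm = wav.readframes(wav.getnframes())

    async with AsyncTcpClient(host, port) as client:
        await client.write_event(Transcribe().event())
        await client.write_event(AudioStart(rate=rate, width=width, channels=channels).event())
        for chunk in chunk_pcm(pcm, width=width):
            await client.write_event(
                AudioChunk(audio=chunk, rate=rate, width=width, channels=channels).event())
        await client.write_event(AudioStop().event())
        while True:
            event = await client.read_event()
            if event is None:
                return ""  # server closed without a transcript
            if Transcript.is_type(event.type):
                return Transcript.from_event(event).text

# e.g. print(asyncio.run(transcribe("test.wav")))
```

If this prints the expected text, the HA integration step later only needs the host and port.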
launchd plist: ~/Library/LaunchAgents/com.homeai.wyoming-stt.plist
2. Kokoro TTS — Primary Text-to-Speech
Why Kokoro:
- Very low latency (~200ms for short phrases)
- High quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)
Installation:
pip install kokoro-onnx
Wyoming-Kokoro adapter:
pip install wyoming-kokoro # community adapter, or write thin wrapper
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
# af_heart is the default voice; overridden by character config (P5)
launchd plist: ~/Library/LaunchAgents/com.homeai.wyoming-tts.plist
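Kokoro can also be exercised directly in Python before the Wyoming adapter is wired up. A sketch against the kokoro-onnx API (the model and voices filenames are assumptions; kokoro-onnx requires both files to be downloaded separately from its releases):

```python
import struct
import wave

def save_wav(path: str, samples, rate: int) -> None:
    """Write mono float samples in [-1, 1] as 16-bit PCM WAV."""
    pcm = b"".join(
        struct.pack("<h", max(-32768, min(32767, int(s * 32767))))
        for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(pcm)

def synthesize(text: str, out_path: str = "kokoro-test.wav") -> None:
    # Imported locally so save_wav stays usable without kokoro-onnx installed.
    from kokoro_onnx import Kokoro
    # Model/voice file names below are assumptions; adjust to your downloads.
    kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
    samples, rate = kokoro.create(text, voice="af_heart", speed=1.0, lang="en-us")
    save_wav(out_path, samples, rate)

# e.g. synthesize("The voice pipeline is alive.")
```

On macOS, `afplay kokoro-test.wav` gives a quick audible check.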
3. Chatterbox TTS — Voice Cloning Engine
Used when a character voice clone is active (character config from P5 sets tts_engine: chatterbox).
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts
# Test voice clone (note: ~ is not expanded by Python, so expand it explicitly)
python -c "
import os
import torchaudio
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=os.path.expanduser('~/voices/aria.wav'))
torchaudio.save('clone-test.wav', wav, model.sr)
"
Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains for HA pipeline.
4. Qwen3-TTS — MLX Fallback
pip install mlx mlx-lm
# Pull Qwen3-TTS model via mlx-lm or HuggingFace
Used as a fallback if Chatterbox quality is insufficient. Activated via character config tts_engine: qwen3.
5. openWakeWord — Always-On Detection
Runs continuously, listens for wake word, triggers pipeline.
pip install openwakeword
# Test with default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... audio loop
"
Custom wake word (later):
- Record 30–50 utterances of the character's name
- Train via openWakeWord training toolkit
- Drop model file into
~/models/wakeword/
launchd plist: ~/Library/LaunchAgents/com.homeai.wakeword.plist
On wake-word trigger, the service sends an HTTP POST to OpenClaw (P4) or hands off via Wyoming.
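The HTTP handoff itself is small enough to do with the standard library. A sketch using urllib (the /wake endpoint comes from this plan's interface contract; the JSON payload shape is an assumption about what P4 will accept):

```python
import json
import urllib.request

def build_wake_event(wakeword: str, score: float) -> bytes:
    """Serialize a wake event payload (shape is an assumed P4 contract)."""
    return json.dumps(
        {"event": "wake", "wakeword": wakeword, "score": round(score, 3)}
    ).encode()

def notify_openclaw(wakeword: str, score: float,
                    url: str = "http://localhost:8080/wake") -> None:
    req = urllib.request.Request(
        url,
        data=build_wake_event(wakeword, score),
        headers={"Content-Type": "application/json"},
        method="POST")
    # Fire-and-forget with a short timeout; P4 drives the pipeline from here.
    urllib.request.urlopen(req, timeout=2)

# e.g. notify_openclaw("hey_jarvis", 0.93)
```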
6. Wyoming Protocol Server
Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
HA integration:
- Home Assistant → Settings → Add Integration → Wyoming Protocol
- Add STT: host <mac-mini-ip>, port 10300
- Add TTS: host <mac-mini-ip>, port 10301
- Create Voice Assistant pipeline in HA using these providers
- Assign pipeline to Assist dashboard and later to ESP32 satellites (P6)
launchd Services
Three launchd plists under ~/Library/LaunchAgents/:
| Plist | Service | Port |
|---|---|---|
| com.homeai.wyoming-stt.plist | Whisper.cpp Wyoming | 10300 |
| com.homeai.wyoming-tts.plist | Kokoro Wyoming | 10301 |
| com.homeai.wakeword.plist | openWakeWord | (no port; triggers internally) |
Templates stored in scripts/launchd/.
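As an illustration, a minimal template for the STT service might look like the following (the label matches this plan; the wyoming-faster-whisper binary path and log location are assumptions to adjust for the actual install):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.homeai.wyoming-stt</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/wyoming-faster-whisper</string>
        <string>--model</string><string>large-v3</string>
        <string>--language</string><string>en</string>
        <string>--uri</string><string>tcp://0.0.0.0:10300</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardErrorPath</key>
    <string>/tmp/wyoming-stt.err</string>
</dict>
</plist>
```

Load it with `launchctl load ~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`; the TTS and wake-word plists differ only in label and arguments.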
Directory Layout
homeai-voice/
├── whisper/
│ ├── install.sh # clone, compile whisper.cpp, download models
│ └── README.md
├── tts/
│ ├── install-kokoro.sh
│ ├── install-chatterbox.sh
│ ├── install-qwen3.sh
│ └── test-tts.sh # quick audio playback test
├── wyoming/
│ ├── install.sh
│ └── test-pipeline.sh # end-to-end text→audio test
└── scripts/
├── launchd/
│ ├── com.homeai.wyoming-stt.plist
│ ├── com.homeai.wyoming-tts.plist
│ └── com.homeai.wakeword.plist
└── load-all-launchd.sh
Interface Contracts
Exposes:
- Wyoming STT: tcp://0.0.0.0:10300 — consumed by HA, P6 (ESP32 satellites)
- Wyoming TTS: tcp://0.0.0.0:10301 — consumed by HA, P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers HTTP POST to http://localhost:8080/wake (P4 OpenClaw)
Add to .env.services:
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
Implementation Steps
- Compile Whisper.cpp with Metal support
- Download large-v3 and medium.en Whisper models to ~/models/whisper/
- Install wyoming-faster-whisper, test STT from an audio file
- Install Kokoro, test TTS to an audio file
- Install Wyoming-Kokoro adapter, verify the Wyoming protocol works
- Write launchd plists for STT and TTS services
- Load plists, verify both services start on reboot
- Connect HA Wyoming integration — STT port 10300, TTS port 10301
- Create HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- Test HA Assist from browser: type query → hear spoken response
- Install openWakeWord, test wake detection with USB mic
- Write and load openWakeWord launchd plist
- Install Chatterbox, test voice clone with a sample .wav
- Install Qwen3-TTS via MLX (fallback, lower priority)
- Write wyoming/test-pipeline.sh — full end-to-end smoke test
Success Criteria
- wyoming/test-pipeline.sh passes: audio file → transcribed text → spoken response
- HA Voice Assistant responds to typed query with Kokoro voice
- openWakeWord detects "hey jarvis" (or chosen wake word) reliably
- All three launchd services auto-start after reboot
- STT latency <2s for 5-second utterances with large-v3
- Kokoro TTS latency <300ms for a 10-word sentence