homeai/homeai-voice/PLAN.md
Commit 38247d7cc4 by Aodhan Collins — Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00

# P3: homeai-voice — Speech Pipeline
> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6
---
## Goal
Full end-to-end voice pipeline running on Mac Mini: wake word detection → speech-to-text → (handoff to P4 agent) → text-to-speech → audio out. Wyoming protocol bridges STT and TTS into Home Assistant.
Test with a desktop USB mic before ESP32 hardware arrives (P6).
---
## Pipeline Architecture
```
[USB Mic / ESP32 satellite]
    ↓
openWakeWord (always-on, local)
    ↓ wake detected
Wyoming Satellite / audio capture
    ↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
    ↓ transcribed text
Home Assistant Voice Pipeline
    ↓ text
OpenClaw Agent (P4) ← intent + LLM response
    ↓ response text
Wyoming TTS Server (Kokoro)
    ↓ audio
[Speaker / ESP32 satellite]
```
---
## Components
### 1. Whisper.cpp — Speech-to-Text
**Why Whisper.cpp over Python Whisper:**
- Native Apple Silicon build — Metal GPU offload by default; an optional Core ML build can use the Neural Engine
- Significantly lower latency than Python implementation
- Runs as a server process, not one-shot per request
**Installation:**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu) # compiles with Metal support on macOS
# Download model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en for faster fallback
bash ./models/download-ggml-model.sh medium.en
```
Models stored at `~/models/whisper/`.
**Wyoming-Whisper adapter:**
Use `wyoming-faster-whisper` (note: it wraps faster-whisper/CTranslate2, not the whisper.cpp build above) or a thin Wyoming-compatible wrapper around the whisper.cpp server:
```bash
pip install wyoming-faster-whisper
wyoming-faster-whisper \
  --model large-v3 \
  --language en \
  --uri tcp://0.0.0.0:10300 \
  --data-dir ~/models/whisper \
  --download-dir ~/models/whisper
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`
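A minimal sketch of that plist; the binary path, log paths, and the assumption that `wyoming-faster-whisper` is on a fixed path are illustrative, not final:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.homeai.wyoming-stt</string>
  <key>ProgramArguments</key>
  <array>
    <!-- assumed install path; adjust to your Python environment -->
    <string>/usr/local/bin/wyoming-faster-whisper</string>
    <string>--model</string>
    <string>large-v3</string>
    <string>--language</string>
    <string>en</string>
    <string>--uri</string>
    <string>tcp://0.0.0.0:10300</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardErrorPath</key>
  <string>/tmp/wyoming-stt.err.log</string>
</dict>
</plist>
```

The TTS and wake word plists follow the same shape with their own `Label` and `ProgramArguments`.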
### 2. Kokoro TTS — Primary Text-to-Speech
**Why Kokoro:**
- Very low latency (~200ms for short phrases)
- High quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)
**Installation:**
```bash
pip install kokoro-onnx
```
**Wyoming-Kokoro adapter:**
```bash
pip install wyoming-kokoro  # community adapter, or write a thin wrapper
# af_heart is the default voice; overridden by character config (P5)
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`
### 3. Chatterbox TTS — Voice Cloning Engine
Used when a character voice clone is active (character config from P5 sets `tts_engine: chatterbox`).
```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts
# Test voice clone
python -c "
import os
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
# expanduser: a literal '~' is not expanded by Python file APIs
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=os.path.expanduser('~/voices/aria.wav'))
"
```
Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains for HA pipeline.
### 4. Qwen3-TTS — MLX Fallback
```bash
pip install mlx mlx-lm
# Pull Qwen3-TTS model via mlx-lm or HuggingFace
```
Used as a fallback if Chatterbox quality is insufficient. Activated via character config `tts_engine: qwen3`.
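With three engines selected by the character's `tts_engine` field, the dispatch can be sketched as a simple lookup. This is illustrative, not the actual P4 skill API; the function names and config shape (a plain dict) are assumptions:

```python
# Hypothetical engine dispatch for the character config's tts_engine field.
# Engine names match the values used in this plan: kokoro (default),
# chatterbox (voice clone), qwen3 (MLX fallback).

def synth_kokoro(text: str) -> bytes:
    raise NotImplementedError  # would call the Wyoming Kokoro service

def synth_chatterbox(text: str) -> bytes:
    raise NotImplementedError  # would call ChatterboxTTS directly

def synth_qwen3(text: str) -> bytes:
    raise NotImplementedError  # would call the MLX Qwen3-TTS model

ENGINES = {
    "kokoro": synth_kokoro,
    "chatterbox": synth_chatterbox,
    "qwen3": synth_qwen3,
}

def pick_engine(character_config: dict):
    """Return the synth function for the configured engine, defaulting to Kokoro."""
    name = character_config.get("tts_engine", "kokoro")
    return ENGINES.get(name, synth_kokoro)
```

Defaulting to Kokoro keeps the HA pipeline working even if a character config omits or misspells the engine name.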
### 5. openWakeWord — Always-On Detection
Runs continuously, listens for wake word, triggers pipeline.
```bash
pip install openwakeword
# Test with default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... audio loop
"
```
**Custom wake word (later):**
- Record 30–50 utterances of the character's name
- Train via openWakeWord training toolkit
- Drop model file into `~/models/wakeword/`
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`
Wake word trigger → sends HTTP POST to OpenClaw (P4) or Wyoming handoff.
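The trigger step can be sketched as below. The threshold value and the `/wake` payload shape are assumptions; the audio loop itself (omitted) would feed 16 kHz mic frames into openWakeWord's `predict()` and pass the resulting score dict to `detect_wake`:

```python
import json
import urllib.request

WAKE_THRESHOLD = 0.5  # assumed starting point; tune against false positives


def detect_wake(scores: dict, threshold: float = WAKE_THRESHOLD):
    """Return the first wake word whose prediction score crosses the threshold."""
    for name, score in scores.items():
        if score >= threshold:
            return name
    return None


def notify_agent(wake_word: str, url: str = "http://localhost:8080/wake") -> None:
    """POST the detection to OpenClaw (P4). Payload shape is hypothetical."""
    body = json.dumps({"wake_word": wake_word}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    urllib.request.urlopen(req, timeout=2)

# In the real loop: scores = model.predict(frame); hit = detect_wake(scores);
# notify_agent(hit) whenever hit is not None.
```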
### 6. Wyoming Protocol Server
Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
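On the wire, Wyoming events are newline-terminated JSON headers, optionally followed by a binary payload (e.g. audio chunks). A minimal encoder sketch, assuming the header fields `type`, `data`, and `payload_length` as used by the protocol:

```python
import json


def encode_event(event_type, data=None, payload=b""):
    """Encode a Wyoming event: one JSON header line, then an optional binary payload."""
    header = {"type": event_type}
    if data:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

# A client would open tcp://<mac-mini-ip>:10300, send encode_event("describe"),
# and read back an "info" event describing the service's models/voices.
```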
**HA integration:**
1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host `<mac-mini-ip>`, port `10300`
3. Add TTS: host `<mac-mini-ip>`, port `10301`
4. Create Voice Assistant pipeline in HA using these providers
5. Assign pipeline to Assist dashboard and later to ESP32 satellites (P6)
---
## launchd Services
Three launchd plists under `~/Library/LaunchAgents/`:

| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port, triggers internally) |
Templates stored in `scripts/launchd/`.
---
## Directory Layout
```
homeai-voice/
├── whisper/
│   ├── install.sh              # clone, compile whisper.cpp, download models
│   └── README.md
├── tts/
│   ├── install-kokoro.sh
│   ├── install-chatterbox.sh
│   ├── install-qwen3.sh
│   └── test-tts.sh             # quick audio playback test
├── wyoming/
│   ├── install.sh
│   └── test-pipeline.sh        # end-to-end text→audio test
└── scripts/
    ├── launchd/
    │   ├── com.homeai.wyoming-stt.plist
    │   ├── com.homeai.wyoming-tts.plist
    │   └── com.homeai.wakeword.plist
    └── load-all-launchd.sh
```
---
## Interface Contracts
**Exposes:**
- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA, P6 (ESP32 satellites)
- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA, P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw)
**Add to `.env.services`:**
```dotenv
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
```
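Consumers of these values can read them with a minimal parser — a sketch that assumes plain `KEY=value` lines with `#` comments and no quoting or interpolation:

```python
def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comment lines."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# parse_dotenv(open(".env.services").read())["WYOMING_STT_URL"]
```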
---
## Implementation Steps
- [ ] Compile Whisper.cpp with Metal support
- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from audio file
- [ ] Install Kokoro, test TTS to audio file
- [ ] Install Wyoming-Kokoro adapter, verify Wyoming protocol works
- [ ] Write launchd plists for STT and TTS services
- [ ] Load plists, verify both services start on reboot
- [ ] Connect HA Wyoming integration — STT port 10300, TTS port 10301
- [ ] Create HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- [ ] Test HA Assist from browser: type query → hear spoken response
- [ ] Install openWakeWord, test wake detection with USB mic
- [ ] Write and load openWakeWord launchd plist
- [ ] Install Chatterbox, test voice clone with sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback, lower priority)
- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test
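The smoke test's first gate — are both Wyoming services reachable? — can be sketched in Python (ports from this plan; the full script would then stream a wav through STT and the resulting text through TTS):

```python
import socket

SERVICES = {"wyoming-stt": 10300, "wyoming-tts": 10301}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service name to its reachability."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```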
---
## Success Criteria
- [ ] `wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response
- [ ] HA Voice Assistant responds to typed query with Kokoro voice
- [ ] openWakeWord detects "hey jarvis" (or chosen wake word) reliably
- [ ] All three launchd services auto-start after reboot
- [ ] STT latency <2s for 5-second utterances with `large-v3`
- [ ] Kokoro TTS latency <300ms for a 10-word sentence