Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm, homeai-voice, homeai-agent, homeai-character, homeai-esp32, homeai-visual, homeai-images). Includes per-project PLAN.md files, top-level PROJECT_PLAN.md, and master TODO.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# P3: homeai-voice — Speech Pipeline

> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6

---

## Goal

Full end-to-end voice pipeline running on the Mac Mini: wake word detection → speech-to-text → (handoff to the P4 agent) → text-to-speech → audio out. The Wyoming protocol bridges STT and TTS into Home Assistant.

Test with a desktop USB mic before the ESP32 hardware arrives (P6).

---

## Pipeline Architecture
```
[USB Mic / ESP32 satellite]
            ↓
openWakeWord (always-on, local)
            ↓ wake detected
Wyoming Satellite / audio capture
            ↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
            ↓ transcribed text
Home Assistant Voice Pipeline
            ↓ text
OpenClaw Agent (P4) ← intent + LLM response
            ↓ response text
Wyoming TTS Server (Kokoro)
            ↓ audio
[Speaker / ESP32 satellite]
```
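The flow above can be sketched as a single handler that chains the stages together. This is a minimal illustration with stubbed components — every function below is a hypothetical stand-in for the real service (openWakeWord, Wyoming STT/TTS, the P4 agent), not their actual APIs:

```python
# Illustrative stand-ins for the real pipeline stages (names are not real APIs).
def wake_word_detected(audio: bytes) -> bool:
    return True  # openWakeWord would score audio frames here

def stt_transcribe(audio: bytes) -> str:
    return "turn on the kitchen lights"  # Wyoming STT (Whisper.cpp)

def agent_respond(text: str) -> str:
    return f"Okay: {text}"  # OpenClaw agent (P4)

def tts_speak(text: str) -> bytes:
    return text.encode()  # Wyoming TTS (Kokoro)

def handle_utterance(audio: bytes) -> bytes:
    """One pass through the pipeline: audio in → spoken reply out."""
    if not wake_word_detected(audio):
        return b""
    text = stt_transcribe(audio)
    reply = agent_respond(text)
    return tts_speak(reply)
```

In the real system each arrow in the diagram is a network hop (Wyoming TCP or HTTP), not a function call, but the data flow is the same.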
---

## Components
### 1. Whisper.cpp — Speech-to-Text

**Why Whisper.cpp over Python Whisper:**

- Native Apple Silicon build — uses the Neural Engine + Metal
- Significantly lower latency than the Python implementation
- Runs as a server process, not one-shot per request

**Installation:**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j"$(sysctl -n hw.logicalcpu)"   # compiles with Metal support on macOS

# Download the primary model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en as a faster fallback
bash ./models/download-ggml-model.sh medium.en
```
Models are stored at `~/models/whisper/`.

**Wyoming-Whisper adapter:**

Use `wyoming-faster-whisper` or a Wyoming-compatible Whisper.cpp server:
```bash
pip install wyoming-faster-whisper
wyoming-faster-whisper \
  --model large-v3 \
  --language en \
  --uri tcp://0.0.0.0:10300 \
  --data-dir ~/models/whisper \
  --download-dir ~/models/whisper
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`
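A minimal sketch of what that plist could look like. The binary path under `/usr/local/bin` and the log locations are assumptions — adjust them to wherever pip actually installed the entry point:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.homeai.wyoming-stt</string>
  <key>ProgramArguments</key>
  <array>
    <!-- Path is an assumption; check `which wyoming-faster-whisper` -->
    <string>/usr/local/bin/wyoming-faster-whisper</string>
    <string>--model</string><string>large-v3</string>
    <string>--language</string><string>en</string>
    <string>--uri</string><string>tcp://0.0.0.0:10300</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardErrorPath</key><string>/tmp/wyoming-stt.err</string>
</dict>
</plist>
```

Load it with `launchctl load ~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`; the TTS and wake word plists follow the same shape.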
### 2. Kokoro TTS — Primary Text-to-Speech

**Why Kokoro:**

- Very low latency (~200 ms for short phrases)
- High-quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)

**Installation:**
```bash
pip install kokoro-onnx
```
**Wyoming-Kokoro adapter:**
```bash
pip install wyoming-kokoro   # community adapter, or write a thin wrapper
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
# af_heart is the default voice; it is overridden by the character config (P5).
# Note: an inline comment after a trailing backslash would break the continuation.
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`
### 3. Chatterbox TTS — Voice Cloning Engine

Used when a character voice clone is active (the character config from P5 sets `tts_engine: chatterbox`).
```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts

# Test a voice clone (~ is not expanded inside Python strings, so expand it explicitly)
python -c "
import os
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=os.path.expanduser('~/voices/aria.wav'))
"
```
Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains in place for the HA pipeline.

### 4. Qwen3-TTS — MLX Fallback
```bash
pip install mlx mlx-lm
# Pull the Qwen3-TTS model via mlx-lm or Hugging Face
```
Used as a fallback if Chatterbox quality is insufficient. Activated via the character config `tts_engine: qwen3`.

### 5. openWakeWord — Always-On Detection

Runs continuously, listens for the wake word, and triggers the pipeline.
```bash
pip install openwakeword

# Test with the default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... feed 16 kHz, 16-bit mono audio frames to model.predict() in a loop
"
```
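The audio loop elided above can be sketched as follows. To keep the sketch self-contained it uses a dummy model; the real `openwakeword.Model` exposes a `predict(frame)` method with (to my understanding) the same shape — it takes a chunk of 16 kHz, 16-bit PCM samples and returns a dict of per-model scores:

```python
CHUNK = 1280  # 80 ms of 16 kHz mono audio, the frame size openWakeWord works with

class DummyWakeModel:
    """Stand-in mimicking openwakeword.Model's predict() interface (illustrative)."""
    def __init__(self):
        self.calls = 0

    def predict(self, frame):
        self.calls += 1
        # Pretend the wake word shows up on the third frame.
        return {"hey_jarvis": 0.9 if self.calls == 3 else 0.01}

def listen(model, frames, threshold=0.5):
    """Feed audio frames to the model; return the index of the wake frame, or -1."""
    for i, frame in enumerate(frames):
        scores = model.predict(frame)
        if max(scores.values()) >= threshold:
            return i  # here the real service would POST to OpenClaw (P4)
    return -1
```

The threshold of 0.5 is a starting point to tune against false triggers from the USB mic.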
**Custom wake word (later):**

- Record 30–50 utterances of the character's name
- Train via the openWakeWord training toolkit
- Drop the model file into `~/models/wakeword/`
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`

On wake word detection, the service sends an HTTP POST to OpenClaw (P4) or hands off via Wyoming.
### 6. Wyoming Protocol Server

Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
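For debugging the two services, it helps to know roughly how Wyoming frames messages: as I understand the protocol, each event is a JSON header on its own line, optionally followed by a binary payload whose length is declared in the header. A hedged sketch of that framing (the exact header fields may differ from the real `wyoming` library):

```python
import json

def encode_event(event_type, data=None, payload=b""):
    """Encode one Wyoming-style event: a JSON header line, then an optional payload."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

# A client would open a TCP socket to port 10300/10301 and send, e.g.:
describe = encode_event("describe")
```

Sending a `describe` event to a running Wyoming server should elicit an `info` reply describing the STT/TTS capabilities, which makes it a handy smoke test.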
**HA integration:**

1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host `<mac-mini-ip>`, port `10300`
3. Add TTS: host `<mac-mini-ip>`, port `10301`
4. Create a Voice Assistant pipeline in HA using these providers
5. Assign the pipeline to the Assist dashboard and later to the ESP32 satellites (P6)
---

## launchd Services

Three launchd plists under `~/Library/LaunchAgents/`:
| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port; triggers internally) |

Templates are stored in `scripts/launchd/`.
---

## Directory Layout

```
homeai-voice/
├── whisper/
│   ├── install.sh              # clone and compile whisper.cpp, download models
│   └── README.md
├── tts/
│   ├── install-kokoro.sh
│   ├── install-chatterbox.sh
│   ├── install-qwen3.sh
│   └── test-tts.sh             # quick audio playback test
├── wyoming/
│   ├── install.sh
│   └── test-pipeline.sh        # end-to-end text→audio test
└── scripts/
    ├── launchd/
    │   ├── com.homeai.wyoming-stt.plist
    │   ├── com.homeai.wyoming-tts.plist
    │   └── com.homeai.wakeword.plist
    └── load-all-launchd.sh
```
---

## Interface Contracts
**Exposes:**

- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA and P6 (ESP32 satellites)
- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA and P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers an HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw)
**Add to `.env.services`:**

```dotenv
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
```
---

## Implementation Steps
- [ ] Compile Whisper.cpp with Metal support
- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from an audio file
- [ ] Install Kokoro, test TTS to an audio file
- [ ] Install the Wyoming-Kokoro adapter, verify the Wyoming protocol works
- [ ] Write launchd plists for the STT and TTS services
- [ ] Load the plists, verify both services start on reboot
- [ ] Connect the HA Wyoming integration — STT port 10300, TTS port 10301
- [ ] Create an HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- [ ] Test HA Assist from the browser: type a query → hear a spoken response
- [ ] Install openWakeWord, test wake detection with the USB mic
- [ ] Write and load the openWakeWord launchd plist
- [ ] Install Chatterbox, test a voice clone with a sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback, lower priority)
- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test

---
## Success Criteria

- [ ] `wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response
- [ ] HA Voice Assistant responds to a typed query with the Kokoro voice
- [ ] openWakeWord detects "hey jarvis" (or the chosen wake word) reliably
- [ ] All three launchd services auto-start after a reboot
- [ ] STT latency < 2 s for 5-second utterances with `large-v3`
- [ ] Kokoro TTS latency < 300 ms for a 10-word sentence