Initial project structure and planning docs

Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: Aodhan Collins
Date: 2026-03-04 01:11:37 +00:00
Commit: 38247d7cc4
11 changed files with 3060 additions and 0 deletions

homeai-voice/PLAN.md
# P3: homeai-voice — Speech Pipeline
> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6
---
## Goal
Full end-to-end voice pipeline running on Mac Mini: wake word detection → speech-to-text → (handoff to P4 agent) → text-to-speech → audio out. Wyoming protocol bridges STT and TTS into Home Assistant.
Test with a desktop USB mic before ESP32 hardware arrives (P6).
---
## Pipeline Architecture
```
[USB Mic / ESP32 satellite]
↓ audio in
openWakeWord (always-on, local)
↓ wake detected
Wyoming Satellite / Audio capture
↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
↓ transcribed text
Home Assistant Voice Pipeline
↓ text
OpenClaw Agent (P4) ← intent + LLM response
↓ response text
Wyoming TTS Server (Kokoro)
↓ audio
[Speaker / ESP32 satellite]
```
---
## Components
### 1. Whisper.cpp — Speech-to-Text
**Why Whisper.cpp over Python Whisper:**
- Native Apple Silicon build — Metal acceleration out of the box (a Core ML build can additionally use the Neural Engine)
- Significantly lower latency than Python implementation
- Runs as a server process, not one-shot per request
**Installation:**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu) # compiles with Metal support on macOS
# Download model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en for faster fallback
bash ./models/download-ggml-model.sh medium.en
```
Models stored at `~/models/whisper/`.
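Whisper models expect 16 kHz mono 16-bit PCM input, so it is worth validating test clips before blaming the STT server for bad transcripts. A minimal stdlib check (the helper name is illustrative, not part of the plan):

```python
import wave

def check_whisper_input(path: str) -> None:
    """Raise if a WAV file is not 16 kHz mono 16-bit PCM, the format
    Whisper models are trained on."""
    with wave.open(path, "rb") as w:
        fmt = (w.getframerate(), w.getnchannels(), w.getsampwidth())
        if fmt != (16000, 1, 2):
            raise ValueError(
                f"{path}: {fmt[0]} Hz, {fmt[1]} ch, {8 * fmt[2]}-bit; "
                "expected 16000 Hz mono 16-bit")
```

Non-conforming clips can be converted with `ffmpeg -i in.wav -ar 16000 -ac 1 -c:a pcm_s16le out.wav`.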
**Wyoming-Whisper adapter:**
Use `wyoming-faster-whisper` (quickest to set up, though it wraps faster-whisper/CTranslate2 rather than the Metal-accelerated Whisper.cpp build) or run a Wyoming-compatible server in front of Whisper.cpp:
```bash
pip install wyoming-faster-whisper
wyoming-faster-whisper \
--model large-v3 \
--language en \
--uri tcp://0.0.0.0:10300 \
--data-dir ~/models/whisper \
--download-dir ~/models/whisper
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`
### 2. Kokoro TTS — Primary Text-to-Speech
**Why Kokoro:**
- Very low latency (~200ms for short phrases)
- High quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)
**Installation:**
```bash
pip install kokoro-onnx
```
**Wyoming-Kokoro adapter:**
```bash
pip install wyoming-kokoro  # community adapter, or write a thin wrapper
wyoming-kokoro \
    --uri tcp://0.0.0.0:10301 \
    --voice af_heart \
    --speed 1.0
# af_heart is the default voice; overridden by character config (P5)
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`
### 3. Chatterbox TTS — Voice Cloning Engine
Used when a character voice clone is active (character config from P5 sets `tts_engine: chatterbox`).
```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts
# Test voice clone
python -c "
import os
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=os.path.expanduser('~/voices/aria.wav'))
"
```
Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains for HA pipeline.
### 4. Qwen3-TTS — MLX Fallback
```bash
pip install mlx mlx-lm
# Pull Qwen3-TTS model via mlx-lm or HuggingFace
```
Used as a fallback if Chatterbox quality is insufficient. Activated via character config `tts_engine: qwen3`.
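The `tts_engine` switch across sections 2-4 could be handled by a small selection helper in the P4 TTS skill. A hypothetical sketch (the function name is an assumption; engine values match this plan), with Kokoro as the safe default:

```python
# Hypothetical sketch of how the P4 TTS skill might pick an engine from the
# character config (P5). Engine names follow this plan: kokoro / chatterbox / qwen3.
DEFAULT_ENGINE = "kokoro"  # Wyoming path; always available to the HA pipeline
KNOWN_ENGINES = {"kokoro", "chatterbox", "qwen3"}

def pick_tts_engine(character_config: dict) -> str:
    """Return the TTS engine for the active character, falling back to Kokoro."""
    engine = character_config.get("tts_engine", DEFAULT_ENGINE)
    if engine not in KNOWN_ENGINES:
        # Unknown or misspelled value in character config: fail safe to default.
        return DEFAULT_ENGINE
    return engine
```

Failing closed to Kokoro keeps the HA pipeline speaking even when a character config names an engine that is not installed yet.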
### 5. openWakeWord — Always-On Detection
Runs continuously, listens for wake word, triggers pipeline.
```bash
pip install openwakeword
# Test with default "hey_jarvis" model
python -c "
from openwakeword.model import Model
model = Model(wakeword_models=['hey_jarvis'])
# ... feed 16 kHz 16-bit frames to model.predict() in the audio loop
"
```
**Custom wake word (later):**
- Record 30-50 utterances of the character's name
- Train via openWakeWord training toolkit
- Drop model file into `~/models/wakeword/`
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`
Wake word trigger → sends HTTP POST to OpenClaw (P4) or Wyoming handoff.
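The trigger needs a score threshold and a cooldown so one utterance cannot fire the pipeline twice. A sketch of that logic (threshold and cooldown values are guesses to tune; the endpoint is the one from the interface contract):

```python
import json
import time
import urllib.request

WAKE_URL = "http://localhost:8080/wake"  # P4 OpenClaw endpoint (interface contract)
THRESHOLD = 0.5    # assumed openWakeWord score threshold; tune per mic/room
COOLDOWN_S = 2.0   # ignore re-triggers while the pipeline handles the utterance

_last_fired = 0.0

def maybe_fire(score: float, now=None, post=None) -> bool:
    """Fire the wake webhook once when the score crosses the threshold."""
    global _last_fired
    now = time.monotonic() if now is None else now
    if score < THRESHOLD or (now - _last_fired) < COOLDOWN_S:
        return False
    _last_fired = now
    (post or _post_wake)({"event": "wake", "score": score})
    return True

def _post_wake(payload: dict) -> None:
    req = urllib.request.Request(
        WAKE_URL, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=2)
```

In the audio loop this would be called per frame with something like the `hey_jarvis` score from `model.predict(frame)`.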
### 6. Wyoming Protocol Server
Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
**HA integration:**
1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host `<mac-mini-ip>`, port `10300`
3. Add TTS: host `<mac-mini-ip>`, port `10301`
4. Create Voice Assistant pipeline in HA using these providers
5. Assign pipeline to Assist dashboard and later to ESP32 satellites (P6)
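For debugging the two services without HA in the loop, it helps to know Wyoming's wire format: each event is one JSON header line (`type`, `data`), optionally followed by `payload_length` raw bytes. A rough sketch of the framing (illustrative; check the Wyoming spec for the authoritative details):

```python
import json

def encode_event(event_type: str, data: dict = None, payload: bytes = b"") -> bytes:
    """Frame a Wyoming event: one JSON header line, then optional raw payload."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

def decode_event(raw: bytes):
    """Split a framed event back into (header dict, payload bytes)."""
    line, _, rest = raw.partition(b"\n")
    header = json.loads(line)
    return header, rest[: header.get("payload_length", 0)]
```

Sending `encode_event("describe")` over TCP to port 10300 or 10301 should get back an `info` event listing the service's models and voices, which makes a quick health check for both services.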
---
## launchd Services
Three launchd plists under `~/Library/LaunchAgents/`:
| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port, triggers internally) |
Templates stored in `scripts/launchd/`.
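A template for the STT service might look like the sketch below. The program path is a placeholder for wherever pip installed the `wyoming-faster-whisper` entry point; launchd does not expand `~`, so every path must be absolute:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.homeai.wyoming-stt</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/YOUR_USER/.venv/bin/wyoming-faster-whisper</string>
        <string>--model</string><string>large-v3</string>
        <string>--language</string><string>en</string>
        <string>--uri</string><string>tcp://0.0.0.0:10300</string>
        <string>--data-dir</string><string>/Users/YOUR_USER/models/whisper</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardErrorPath</key>
    <string>/tmp/com.homeai.wyoming-stt.err</string>
</dict>
</plist>
```

The TTS and wake word plists follow the same shape with their own `Label` and `ProgramArguments`.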
---
## Directory Layout
```
homeai-voice/
├── whisper/
│ ├── install.sh # clone, compile whisper.cpp, download models
│ └── README.md
├── tts/
│ ├── install-kokoro.sh
│ ├── install-chatterbox.sh
│ ├── install-qwen3.sh
│ └── test-tts.sh # quick audio playback test
├── wyoming/
│ ├── install.sh
│ └── test-pipeline.sh # end-to-end text→audio test
└── scripts/
├── launchd/
│ ├── com.homeai.wyoming-stt.plist
│ ├── com.homeai.wyoming-tts.plist
│ └── com.homeai.wakeword.plist
└── load-all-launchd.sh
```
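`load-all-launchd.sh` can be a short loop over the templates. A sketch (the `LAUNCHCTL` override is an addition for dry runs, not part of the plan):

```shell
#!/usr/bin/env bash
# Sketch of scripts/load-all-launchd.sh: install plist templates and load them.
shopt -s nullglob   # an empty glob means the loop simply does nothing

load_all_launchd() {
    local launchctl="${LAUNCHCTL:-launchctl}"       # set LAUNCHCTL=echo to dry-run
    local src="${SRC_DIR:-$(pwd)/scripts/launchd}"
    local dest="${DEST_DIR:-$HOME/Library/LaunchAgents}"
    mkdir -p "$dest"
    local plist name
    for plist in "$src"/com.homeai.*.plist; do
        name="$(basename "$plist")"
        cp "$plist" "$dest/$name"
        # -w also clears any previously set "disabled" flag for the job
        "$launchctl" load -w "$dest/$name"
        echo "loaded $name"
    done
}

load_all_launchd
```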
---
## Interface Contracts
**Exposes:**
- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA, P6 (ESP32 satellites)
- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA, P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw)
**Add to `.env.services`:**
```dotenv
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
```
---
## Implementation Steps
- [ ] Compile Whisper.cpp with Metal support
- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from audio file
- [ ] Install Kokoro, test TTS to audio file
- [ ] Install Wyoming-Kokoro adapter, verify Wyoming protocol works
- [ ] Write launchd plists for STT and TTS services
- [ ] Load plists, verify both services start on reboot
- [ ] Connect HA Wyoming integration — STT port 10300, TTS port 10301
- [ ] Create HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- [ ] Test HA Assist from browser: type query → hear spoken response
- [ ] Install openWakeWord, test wake detection with USB mic
- [ ] Write and load openWakeWord launchd plist
- [ ] Install Chatterbox, test voice clone with sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback, lower priority)
- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test
---
## Success Criteria
- [ ] `wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response
- [ ] HA Voice Assistant responds to typed query with Kokoro voice
- [ ] openWakeWord detects "hey jarvis" (or chosen wake word) reliably
- [ ] All three launchd services auto-start after reboot
- [ ] STT latency <2s for 5-second utterances with `large-v3`
- [ ] Kokoro TTS latency <300ms for a 10-word sentence
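The two latency criteria can be checked the same way in the smoke test's harness. A small timing helper (illustrative, not prescribed by the plan; `transcribe` below stands in for whatever client function the test uses):

```python
import time

# Budgets from the success criteria above (seconds).
STT_BUDGET_S = 2.0   # 5-second utterance through large-v3
TTS_BUDGET_S = 0.3   # 10-word sentence through Kokoro

def measure(fn, *args, budget_s: float):
    """Run fn(*args); return (result, elapsed_s, within_budget)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed < budget_s
```

Usage in the harness would look like `measure(transcribe, "clip.wav", budget_s=STT_BUDGET_S)`, failing the run when the third element is false.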