Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm, homeai-voice, homeai-agent, homeai-character, homeai-esp32, homeai-visual, homeai-images). Includes per-project PLAN.md files, top-level PROJECT_PLAN.md, and master TODO.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# P3: homeai-voice — Speech Pipeline

> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6

---

## Goal

Full end-to-end voice pipeline running on the Mac Mini: wake word detection → speech-to-text → (handoff to the P4 agent) → text-to-speech → audio out. The Wyoming protocol bridges STT and TTS into Home Assistant.

Test with a desktop USB mic before the ESP32 hardware arrives (P6).

---

## Pipeline Architecture
```
[USB Mic / ESP32 satellite]
            ↓
openWakeWord (always-on, local)
            ↓ wake detected
Wyoming Satellite / audio capture
            ↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
            ↓ transcribed text
Home Assistant Voice Pipeline
            ↓ text
OpenClaw Agent (P4) ← intent + LLM response
            ↓ response text
Wyoming TTS Server (Kokoro)
            ↓ audio
[Speaker / ESP32 satellite]
```
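The flow above can be sketched as a single handler that chains the stages together. This is a minimal illustration with stubbed components — every function below is a hypothetical stand-in for the real service (openWakeWord, Wyoming STT/TTS, the P4 agent), not their actual APIs:

```python
# Illustrative stand-ins for the real pipeline stages (names are not real APIs).
def wake_word_detected(audio: bytes) -> bool:
    return True  # openWakeWord would score audio frames here

def stt_transcribe(audio: bytes) -> str:
    return "turn on the kitchen lights"  # Wyoming STT (Whisper.cpp)

def agent_respond(text: str) -> str:
    return f"Okay: {text}"  # OpenClaw agent (P4)

def tts_speak(text: str) -> bytes:
    return text.encode()  # Wyoming TTS (Kokoro)

def handle_utterance(audio: bytes) -> bytes:
    """One pass through the pipeline: audio in → spoken reply out."""
    if not wake_word_detected(audio):
        return b""
    text = stt_transcribe(audio)
    reply = agent_respond(text)
    return tts_speak(reply)
```

In the real system each arrow in the diagram is a network hop (Wyoming TCP or HTTP), not a function call, but the data flow is the same.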
---

## Components
### 1. Whisper.cpp — Speech-to-Text

**Why Whisper.cpp over Python Whisper:**

- Native Apple Silicon build — uses the Neural Engine + Metal
- Significantly lower latency than the Python implementation
- Runs as a server process, not one-shot per request

**Installation:**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j"$(sysctl -n hw.logicalcpu)"   # compiles with Metal support on macOS

# Download the primary model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en as a faster fallback
bash ./models/download-ggml-model.sh medium.en
```
Models are stored at `~/models/whisper/`.

**Wyoming-Whisper adapter:**

Use `wyoming-faster-whisper` or a Wyoming-compatible Whisper.cpp server:
```bash
pip install wyoming-faster-whisper
wyoming-faster-whisper \
  --model large-v3 \
  --language en \
  --uri tcp://0.0.0.0:10300 \
  --data-dir ~/models/whisper \
  --download-dir ~/models/whisper
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`
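A minimal sketch of what that plist could look like. The binary path under `/usr/local/bin` and the log locations are assumptions — adjust them to wherever pip actually installed the entry point:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.homeai.wyoming-stt</string>
  <key>ProgramArguments</key>
  <array>
    <!-- Path is an assumption; check `which wyoming-faster-whisper` -->
    <string>/usr/local/bin/wyoming-faster-whisper</string>
    <string>--model</string><string>large-v3</string>
    <string>--language</string><string>en</string>
    <string>--uri</string><string>tcp://0.0.0.0:10300</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardErrorPath</key><string>/tmp/wyoming-stt.err</string>
</dict>
</plist>
```

Load it with `launchctl load ~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`; the TTS and wake word plists follow the same shape.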
### 2. Kokoro TTS — Primary Text-to-Speech

**Why Kokoro:**

- Very low latency (~200 ms for short phrases)
- High-quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)

**Installation:**
```bash
pip install kokoro-onnx
```
**Wyoming-Kokoro adapter:**
```bash
pip install wyoming-kokoro   # community adapter, or write a thin wrapper
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
# af_heart is the default voice; it is overridden by the character config (P5).
# Note: an inline comment after a trailing backslash would break the continuation.
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`
### 3. Chatterbox TTS — Voice Cloning Engine

Used when a character voice clone is active (the character config from P5 sets `tts_engine: chatterbox`).
```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts

# Test a voice clone (~ is not expanded inside Python strings, so expand it explicitly)
python -c "
import os
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=os.path.expanduser('~/voices/aria.wav'))
"
```
Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains in place for the HA pipeline.

### 4. Qwen3-TTS — MLX Fallback
```bash
pip install mlx mlx-lm
# Pull the Qwen3-TTS model via mlx-lm or Hugging Face
```
Used as a fallback if Chatterbox quality is insufficient. Activated via the character config `tts_engine: qwen3`.

### 5. openWakeWord — Always-On Detection

Runs continuously, listens for the wake word, and triggers the pipeline.
```bash
pip install openwakeword

# Test with the default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... feed 16 kHz, 16-bit mono audio frames to model.predict() in a loop
"
```
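The audio loop elided above can be sketched as follows. To keep the sketch self-contained it uses a dummy model; the real `openwakeword.Model` exposes a `predict(frame)` method with (to my understanding) the same shape — it takes a chunk of 16 kHz, 16-bit PCM samples and returns a dict of per-model scores:

```python
CHUNK = 1280  # 80 ms of 16 kHz mono audio, the frame size openWakeWord works with

class DummyWakeModel:
    """Stand-in mimicking openwakeword.Model's predict() interface (illustrative)."""
    def __init__(self):
        self.calls = 0

    def predict(self, frame):
        self.calls += 1
        # Pretend the wake word shows up on the third frame.
        return {"hey_jarvis": 0.9 if self.calls == 3 else 0.01}

def listen(model, frames, threshold=0.5):
    """Feed audio frames to the model; return the index of the wake frame, or -1."""
    for i, frame in enumerate(frames):
        scores = model.predict(frame)
        if max(scores.values()) >= threshold:
            return i  # here the real service would POST to OpenClaw (P4)
    return -1
```

The threshold of 0.5 is a starting point to tune against false triggers from the USB mic.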
**Custom wake word (later):**

- Record 30–50 utterances of the character's name
- Train via the openWakeWord training toolkit
- Drop the model file into `~/models/wakeword/`
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`

On wake word detection, the service sends an HTTP POST to OpenClaw (P4) or hands off via Wyoming.
### 6. Wyoming Protocol Server

Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
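For debugging the two services, it helps to know roughly how Wyoming frames messages: as I understand the protocol, each event is a JSON header on its own line, optionally followed by a binary payload whose length is declared in the header. A hedged sketch of that framing (the exact header fields may differ from the real `wyoming` library):

```python
import json

def encode_event(event_type, data=None, payload=b""):
    """Encode one Wyoming-style event: a JSON header line, then an optional payload."""
    header = {"type": event_type, "data": data or {}}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

# A client would open a TCP socket to port 10300/10301 and send, e.g.:
describe = encode_event("describe")
```

Sending a `describe` event to a running Wyoming server should elicit an `info` reply describing the STT/TTS capabilities, which makes it a handy smoke test.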
**HA integration:**

1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host `<mac-mini-ip>`, port `10300`
3. Add TTS: host `<mac-mini-ip>`, port `10301`
4. Create a Voice Assistant pipeline in HA using these providers
5. Assign the pipeline to the Assist dashboard and later to the ESP32 satellites (P6)
---

## launchd Services

Three launchd plists under `~/Library/LaunchAgents/`:
| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port; triggers internally) |

Templates are stored in `scripts/launchd/`.
---

## Directory Layout

```
homeai-voice/
├── whisper/
│   ├── install.sh              # clone and compile whisper.cpp, download models
│   └── README.md
├── tts/
│   ├── install-kokoro.sh
│   ├── install-chatterbox.sh
│   ├── install-qwen3.sh
│   └── test-tts.sh             # quick audio playback test
├── wyoming/
│   ├── install.sh
│   └── test-pipeline.sh        # end-to-end text→audio test
└── scripts/
    ├── launchd/
    │   ├── com.homeai.wyoming-stt.plist
    │   ├── com.homeai.wyoming-tts.plist
    │   └── com.homeai.wakeword.plist
    └── load-all-launchd.sh
```
---

## Interface Contracts
**Exposes:**

- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA and P6 (ESP32 satellites)
- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA and P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers an HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw)
**Add to `.env.services`:**

```dotenv
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
```
---

## Implementation Steps
- [ ] Compile Whisper.cpp with Metal support
- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from an audio file
- [ ] Install Kokoro, test TTS to an audio file
- [ ] Install the Wyoming-Kokoro adapter, verify the Wyoming protocol works
- [ ] Write launchd plists for the STT and TTS services
- [ ] Load the plists, verify both services start on reboot
- [ ] Connect the HA Wyoming integration — STT port 10300, TTS port 10301
- [ ] Create an HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- [ ] Test HA Assist from the browser: type a query → hear a spoken response
- [ ] Install openWakeWord, test wake detection with the USB mic
- [ ] Write and load the openWakeWord launchd plist
- [ ] Install Chatterbox, test a voice clone with a sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback, lower priority)
- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test

---
## Success Criteria

- [ ] `wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response
- [ ] HA Voice Assistant responds to a typed query with the Kokoro voice
- [ ] openWakeWord detects "hey jarvis" (or the chosen wake word) reliably
- [ ] All three launchd services auto-start after a reboot
- [ ] STT latency < 2 s for 5-second utterances with `large-v3`
- [ ] Kokoro TTS latency < 300 ms for a 10-word sentence