homeai/homeai-voice/PLAN.md
Commit 38247d7cc4 by Aodhan Collins — Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00

# P3: homeai-voice — Speech Pipeline
> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6
---
## Goal
Full end-to-end voice pipeline running on Mac Mini: wake word detection → speech-to-text → (handoff to P4 agent) → text-to-speech → audio out. Wyoming protocol bridges STT and TTS into Home Assistant.
Test with a desktop USB mic before ESP32 hardware arrives (P6).
---
## Pipeline Architecture
```
[USB Mic / ESP32 satellite]
    ↓
openWakeWord (always-on, local)
    ↓ wake detected
Wyoming Satellite / audio capture
    ↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
    ↓ transcribed text
Home Assistant Voice Pipeline
    ↓ text
OpenClaw Agent (P4) ← intent + LLM response
    ↓ response text
Wyoming TTS Server (Kokoro)
    ↓ audio
[Speaker / ESP32 satellite]
```
---
## Components
### 1. Whisper.cpp — Speech-to-Text
**Why Whisper.cpp over Python Whisper:**
- Native Apple Silicon build — Metal GPU offload by default; an optional Core ML build can use the Neural Engine
- Significantly lower latency than Python implementation
- Runs as a server process, not one-shot per request
**Installation:**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu) # compiles with Metal support on macOS
# Download model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en for faster fallback
bash ./models/download-ggml-model.sh medium.en
```
Models stored at `~/models/whisper/`.
**Wyoming-Whisper adapter:**
Use `wyoming-faster-whisper` (note: it wraps faster-whisper/CTranslate2, not the whisper.cpp build above) or a thin Wyoming-compatible wrapper around the whisper.cpp server:
```bash
pip install wyoming-faster-whisper
wyoming-faster-whisper \
  --model large-v3 \
  --language en \
  --uri tcp://0.0.0.0:10300 \
  --data-dir ~/models/whisper \
  --download-dir ~/models/whisper
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`
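A minimal sketch of that plist; the binary path, log paths, and the assumption that `wyoming-faster-whisper` is on a fixed path are illustrative, not final:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.homeai.wyoming-stt</string>
  <key>ProgramArguments</key>
  <array>
    <!-- assumed install path; adjust to your Python environment -->
    <string>/usr/local/bin/wyoming-faster-whisper</string>
    <string>--model</string>
    <string>large-v3</string>
    <string>--language</string>
    <string>en</string>
    <string>--uri</string>
    <string>tcp://0.0.0.0:10300</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardErrorPath</key>
  <string>/tmp/wyoming-stt.err.log</string>
</dict>
</plist>
```

The TTS and wake word plists follow the same shape with their own `Label` and `ProgramArguments`.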
### 2. Kokoro TTS — Primary Text-to-Speech
**Why Kokoro:**
- Very low latency (~200ms for short phrases)
- High quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)
**Installation:**
```bash
pip install kokoro-onnx
```
**Wyoming-Kokoro adapter:**
```bash
pip install wyoming-kokoro  # community adapter, or write a thin wrapper
# af_heart is the default voice; overridden by character config (P5)
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`
### 3. Chatterbox TTS — Voice Cloning Engine
Used when a character voice clone is active (character config from P5 sets `tts_engine: chatterbox`).
```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts
# Test voice clone
python -c "
import os
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
# expanduser: a literal '~' is not expanded by Python file APIs
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=os.path.expanduser('~/voices/aria.wav'))
"
```
Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains for HA pipeline.
### 4. Qwen3-TTS — MLX Fallback
```bash
pip install mlx mlx-lm
# Pull Qwen3-TTS model via mlx-lm or HuggingFace
```
Used as a fallback if Chatterbox quality is insufficient. Activated via character config `tts_engine: qwen3`.
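With three engines selected by the character's `tts_engine` field, the dispatch can be sketched as a simple lookup. This is illustrative, not the actual P4 skill API; the function names and config shape (a plain dict) are assumptions:

```python
# Hypothetical engine dispatch for the character config's tts_engine field.
# Engine names match the values used in this plan: kokoro (default),
# chatterbox (voice clone), qwen3 (MLX fallback).

def synth_kokoro(text: str) -> bytes:
    raise NotImplementedError  # would call the Wyoming Kokoro service

def synth_chatterbox(text: str) -> bytes:
    raise NotImplementedError  # would call ChatterboxTTS directly

def synth_qwen3(text: str) -> bytes:
    raise NotImplementedError  # would call the MLX Qwen3-TTS model

ENGINES = {
    "kokoro": synth_kokoro,
    "chatterbox": synth_chatterbox,
    "qwen3": synth_qwen3,
}

def pick_engine(character_config: dict):
    """Return the synth function for the configured engine, defaulting to Kokoro."""
    name = character_config.get("tts_engine", "kokoro")
    return ENGINES.get(name, synth_kokoro)
```

Defaulting to Kokoro keeps the HA pipeline working even if a character config omits or misspells the engine name.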
### 5. openWakeWord — Always-On Detection
Runs continuously, listens for wake word, triggers pipeline.
```bash
pip install openwakeword
# Test with default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... audio loop
"
```
**Custom wake word (later):**
- Record 30–50 utterances of the character's name
- Train via openWakeWord training toolkit
- Drop model file into `~/models/wakeword/`
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`
Wake word trigger → sends HTTP POST to OpenClaw (P4) or Wyoming handoff.
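The trigger step can be sketched as below. The threshold value and the `/wake` payload shape are assumptions; the audio loop itself (omitted) would feed 16 kHz mic frames into openWakeWord's `predict()` and pass the resulting score dict to `detect_wake`:

```python
import json
import urllib.request

WAKE_THRESHOLD = 0.5  # assumed starting point; tune against false positives


def detect_wake(scores: dict, threshold: float = WAKE_THRESHOLD):
    """Return the first wake word whose prediction score crosses the threshold."""
    for name, score in scores.items():
        if score >= threshold:
            return name
    return None


def notify_agent(wake_word: str, url: str = "http://localhost:8080/wake") -> None:
    """POST the detection to OpenClaw (P4). Payload shape is hypothetical."""
    body = json.dumps({"wake_word": wake_word}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    urllib.request.urlopen(req, timeout=2)

# In the real loop: scores = model.predict(frame); hit = detect_wake(scores);
# notify_agent(hit) whenever hit is not None.
```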
### 6. Wyoming Protocol Server
Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
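On the wire, Wyoming events are newline-terminated JSON headers, optionally followed by a binary payload (e.g. audio chunks). A minimal encoder sketch, assuming the header fields `type`, `data`, and `payload_length` as used by the protocol:

```python
import json


def encode_event(event_type, data=None, payload=b""):
    """Encode a Wyoming event: one JSON header line, then an optional binary payload."""
    header = {"type": event_type}
    if data:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

# A client would open tcp://<mac-mini-ip>:10300, send encode_event("describe"),
# and read back an "info" event describing the service's models/voices.
```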
**HA integration:**
1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host `<mac-mini-ip>`, port `10300`
3. Add TTS: host `<mac-mini-ip>`, port `10301`
4. Create Voice Assistant pipeline in HA using these providers
5. Assign pipeline to Assist dashboard and later to ESP32 satellites (P6)
---
## launchd Services
Three launchd plists under `~/Library/LaunchAgents/`:

| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port, triggers internally) |
Templates stored in `scripts/launchd/`.
---
## Directory Layout
```
homeai-voice/
├── whisper/
│   ├── install.sh              # clone, compile whisper.cpp, download models
│   └── README.md
├── tts/
│   ├── install-kokoro.sh
│   ├── install-chatterbox.sh
│   ├── install-qwen3.sh
│   └── test-tts.sh             # quick audio playback test
├── wyoming/
│   ├── install.sh
│   └── test-pipeline.sh        # end-to-end text→audio test
└── scripts/
    ├── launchd/
    │   ├── com.homeai.wyoming-stt.plist
    │   ├── com.homeai.wyoming-tts.plist
    │   └── com.homeai.wakeword.plist
    └── load-all-launchd.sh
```
---
## Interface Contracts
**Exposes:**
- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA, P6 (ESP32 satellites)
- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA, P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw)
**Add to `.env.services`:**
```dotenv
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
```
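Consumers of these values can read them with a minimal parser — a sketch that assumes plain `KEY=value` lines with `#` comments and no quoting or interpolation:

```python
def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comment lines."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# parse_dotenv(open(".env.services").read())["WYOMING_STT_URL"]
```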
---
## Implementation Steps
- [ ] Compile Whisper.cpp with Metal support
- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from audio file
- [ ] Install Kokoro, test TTS to audio file
- [ ] Install Wyoming-Kokoro adapter, verify Wyoming protocol works
- [ ] Write launchd plists for STT and TTS services
- [ ] Load plists, verify both services start on reboot
- [ ] Connect HA Wyoming integration — STT port 10300, TTS port 10301
- [ ] Create HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- [ ] Test HA Assist from browser: type query → hear spoken response
- [ ] Install openWakeWord, test wake detection with USB mic
- [ ] Write and load openWakeWord launchd plist
- [ ] Install Chatterbox, test voice clone with sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback, lower priority)
- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test
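The smoke test's first gate — are both Wyoming services reachable? — can be sketched in Python (ports from this plan; the full script would then stream a wav through STT and the resulting text through TTS):

```python
import socket

SERVICES = {"wyoming-stt": 10300, "wyoming-tts": 10301}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service name to its reachability."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```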
---
## Success Criteria
- [ ] `wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response
- [ ] HA Voice Assistant responds to typed query with Kokoro voice
- [ ] openWakeWord detects "hey jarvis" (or chosen wake word) reliably
- [ ] All three launchd services auto-start after reboot
- [ ] STT latency <2s for 5-second utterances with `large-v3`
- [ ] Kokoro TTS latency <300ms for a 10-word sentence