# P3: homeai-voice — Speech Pipeline

> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6

---

## Goal

Full end-to-end voice pipeline running on the Mac Mini: wake word detection → speech-to-text → (handoff to P4 agent) → text-to-speech → audio out. The Wyoming protocol bridges STT and TTS into Home Assistant. Test with a desktop USB mic before the ESP32 hardware arrives (P6).

---

## Pipeline Architecture

```
[USB Mic / ESP32 satellite]
        ↓
openWakeWord (always-on, local)
        ↓ wake detected
Wyoming Satellite / audio capture
        ↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
        ↓ transcribed text
Home Assistant Voice Pipeline
        ↓ text
OpenClaw Agent (P4) ← intent + LLM response
        ↓ response text
Wyoming TTS Server (Kokoro)
        ↓ audio
[Speaker / ESP32 satellite]
```

---

## Components

### 1. Whisper.cpp — Speech-to-Text

**Why Whisper.cpp over Python Whisper:**
- Native Apple Silicon build — uses the Neural Engine + Metal
- Significantly lower latency than the Python implementation
- Runs as a server process, not one-shot per request

**Installation:**

```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu)   # compiles with Metal support on macOS

# Download models
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en as a faster fallback
bash ./models/download-ggml-model.sh medium.en
```

Models are stored at `~/models/whisper/`.

**Wyoming-Whisper adapter:** use `wyoming-faster-whisper` (note: it runs on faster-whisper/CTranslate2 rather than the whisper.cpp binary built above) or a Wyoming-compatible Whisper.cpp server:

```bash
pip install wyoming-faster-whisper

wyoming-faster-whisper \
  --model large-v3 \
  --language en \
  --uri tcp://0.0.0.0:10300 \
  --data-dir ~/models/whisper \
  --download-dir ~/models/whisper
```

**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`

### 2.
Kokoro TTS — Primary Text-to-Speech

**Why Kokoro:**
- Very low latency (~200 ms for short phrases)
- High-quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)

**Installation:**

```bash
pip install kokoro-onnx
```

**Wyoming-Kokoro adapter:**

```bash
pip install wyoming-kokoro   # community adapter, or write a thin wrapper

# af_heart is the default voice; overridden by character config
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
```

**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`

### 3. Chatterbox TTS — Voice Cloning Engine

Used when a character voice clone is active (character config from P5 sets `tts_engine: chatterbox`).

```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts

# Test voice clone (note: ~ is not expanded inside Python strings,
# so expand it explicitly)
python -c "
import os
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
wav = model.generate(
    'Hello, I am your assistant.',
    audio_prompt_path=os.path.expanduser('~/voices/aria.wav'),
)
"
```

Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains the provider for the HA pipeline.

### 4. Qwen3-TTS — MLX Fallback

```bash
pip install mlx mlx-lm
# Pull the Qwen3-TTS model via mlx-lm or HuggingFace
```

Used as a fallback if Chatterbox quality is insufficient. Activated via character config `tts_engine: qwen3`.

### 5. openWakeWord — Always-On Detection

Runs continuously, listens for the wake word, and triggers the pipeline.

```bash
pip install openwakeword

# Test with the default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... audio loop
"
```

**Custom wake word (later):**
- Record 30–50 utterances of the character's name
- Train via the openWakeWord training toolkit
- Drop the model file into `~/models/wakeword/`

**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`

Wake word trigger → sends an HTTP POST to OpenClaw (P4), or hands off via Wyoming.

### 6.
Wyoming Protocol Server

Wyoming is Home Assistant's standard protocol for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.

**HA integration:**
1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host ``, port `10300`
3. Add TTS: host ``, port `10301`
4. Create a Voice Assistant pipeline in HA using these providers
5. Assign the pipeline to the Assist dashboard and later to the ESP32 satellites (P6)

---

## launchd Services

Three launchd plists under `~/Library/LaunchAgents/`:

| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port; triggers internally) |

Templates are stored in `scripts/launchd/`.

---

## Directory Layout

```
homeai-voice/
├── whisper/
│   ├── install.sh              # clone, compile whisper.cpp, download models
│   └── README.md
├── tts/
│   ├── install-kokoro.sh
│   ├── install-chatterbox.sh
│   ├── install-qwen3.sh
│   └── test-tts.sh             # quick audio playback test
├── wyoming/
│   ├── install.sh
│   └── test-pipeline.sh        # end-to-end text→audio test
└── scripts/
    ├── launchd/
    │   ├── com.homeai.wyoming-stt.plist
    │   ├── com.homeai.wyoming-tts.plist
    │   └── com.homeai.wakeword.plist
    └── load-all-launchd.sh
```

---

## Interface Contracts

**Exposes:**
- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA, P6 (ESP32 satellites)
- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA, P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw)

**Add to `.env.services`:**

```dotenv
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
```

---

## Implementation Steps

- [ ] Compile Whisper.cpp with Metal support
- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from an audio file
- [ ] Install Kokoro, test TTS to
an audio file
- [ ] Install the Wyoming-Kokoro adapter, verify the Wyoming protocol works
- [ ] Write launchd plists for the STT and TTS services
- [ ] Load the plists, verify both services start on reboot
- [ ] Connect the HA Wyoming integration — STT port 10300, TTS port 10301
- [ ] Create an HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- [ ] Test HA Assist from the browser: type a query → hear a spoken response
- [ ] Install openWakeWord, test wake detection with the USB mic
- [ ] Write and load the openWakeWord launchd plist
- [ ] Install Chatterbox, test a voice clone with a sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback, lower priority)
- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test

---

## Success Criteria

- [ ] `wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response
- [ ] HA Voice Assistant responds to a typed query with the Kokoro voice
- [ ] openWakeWord detects "hey jarvis" (or the chosen wake word) reliably
- [ ] All three launchd services auto-start after reboot
- [ ] STT latency < 2 s for 5-second utterances with `large-v3`
- [ ] Kokoro TTS latency < 300 ms for a 10-word sentence
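The three launchd plists referenced above share the same shape. A minimal sketch of the STT one — assuming `wyoming-faster-whisper` was installed to `/usr/local/bin/` (check the real path with `which wyoming-faster-whisper`), and noting that launchd does not expand `~`, so all paths must be absolute with your actual username substituted:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.homeai.wyoming-stt</string>
  <!-- Binary path is an assumption; verify with `which wyoming-faster-whisper` -->
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/wyoming-faster-whisper</string>
    <string>--model</string>
    <string>large-v3</string>
    <string>--language</string>
    <string>en</string>
    <string>--uri</string>
    <string>tcp://0.0.0.0:10300</string>
    <string>--data-dir</string>
    <string>/Users/USERNAME/models/whisper</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/tmp/wyoming-stt.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/wyoming-stt.err.log</string>
</dict>
</plist>
```

Load with `launchctl load ~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`; the TTS and wakeword plists differ only in `Label`, `ProgramArguments`, and log paths.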
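Before streaming audio through the full pipeline, `wyoming/test-pipeline.sh` could start with a reachability check on both Wyoming services. A minimal sketch — bash-specific (it uses the `/dev/tcp` pseudo-device), with port variables mirroring `.env.services`; the env var names are illustrative:

```shell
#!/usr/bin/env bash
# Prerequisite check for the end-to-end smoke test: verify that the
# Wyoming STT and TTS services are accepting TCP connections.
set -u

STT_PORT="${WYOMING_STT_PORT:-10300}"
TTS_PORT="${WYOMING_TTS_PORT:-10301}"

# Return 0 if something accepts a TCP connection on localhost:$1.
# The subshell opens /dev/tcp and closes the connection on exit.
check_port() {
  (exec 3<>"/dev/tcp/localhost/$1") 2>/dev/null
}

for port in "$STT_PORT" "$TTS_PORT"; do
  if check_port "$port"; then
    echo "OK: Wyoming service listening on port $port"
  else
    echo "SKIP: nothing listening on port $port — load the launchd services first"
  fi
done
```

The full script would then feed a sample `.wav` through STT, pass the transcript to TTS, and play back the result.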