feat: upgrade voice pipeline — MLX Whisper STT (20x faster), Qwen3.5 MoE LLM, fix HA tool calling

- Replace faster-whisper with wyoming-mlx-whisper (whisper-large-v3-turbo, MLX Metal GPU)
  STT latency: 8.4s → 400ms for short voice commands
- Add Qwen3.5-35B-A3B (MoE, 3B active params, Q8_0) to Ollama — 26.7 tok/s vs 5.4 tok/s (70B)
- Add model preload launchd service to pin voice model in VRAM permanently
- Fix HA tool calling: set commands.native=true, symlink ha-ctl to PATH
- Add pipeline benchmark script (STT/LLM/TTS latency profiling)
- Add service restart buttons and STT endpoint to dashboard
- Bind Vite dev server to 0.0.0.0 for LAN access

Total estimated pipeline latency: ~27s → ~4s
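The benchmark script itself is not shown in this excerpt; a minimal sketch of what per-stage STT/LLM/TTS latency profiling could look like (the stage functions are stubs standing in for the real Wyoming/Ollama calls, and all names here are hypothetical, not the repo's actual `benchmark_pipeline.py`):

```python
import time

def timed(fn, *args):
    """Run one pipeline stage, returning (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stubs standing in for the real STT, LLM, and TTS calls.
def stt(audio): time.sleep(0.01); return "turn on the reading lamp"
def llm(text): time.sleep(0.02); return "Okay, turning on the reading lamp."
def tts(text): time.sleep(0.01); return b"\x00" * 160  # fake PCM audio

def profile_pipeline(audio):
    """Chain the three stages, recording per-stage and total latency."""
    timings = {}
    text, timings["stt"] = timed(stt, audio)
    reply, timings["llm"] = timed(llm, text)
    _, timings["tts"] = timed(tts, reply)
    timings["total"] = sum(timings.values())
    return timings

if __name__ == "__main__":
    for stage, secs in profile_pipeline(b"").items():
        print(f"{stage:6s} {secs * 1000:7.1f} ms")
```

Swapping the stubs for real service calls gives the per-stage breakdown that justifies the ~27s → ~4s estimate above.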

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Aodhan Collins
Date: 2026-03-13 18:03:12 +00:00
Commit: af6b7bd945 (parent: 1bfd7fbd08)
10 changed files with 721 additions and 27 deletions

TODO.md

@@ -25,9 +25,11 @@
- [x] Write and load launchd plist (`com.homeai.ollama.plist`) — `/opt/homebrew/bin/ollama`
- [x] Register local GGUF models via Modelfiles (no download): llama3.3:70b, qwen3:32b, codestral:22b, qwen2.5:7b
- [x] Register additional models: EVA-LLaMA-3.33-70B, Midnight-Miqu-70B, QwQ-32B, Qwen3.5-35B, Qwen3-Coder-30B, Qwen3-VL-30B, GLM-4.6V-Flash, DeepSeek-R1-8B, gemma-3-27b
- [x] Add qwen3.5:35b-a3b (MoE, Q8_0) — 26.7 tok/s, recommended for voice pipeline
- [x] Write model preload script + launchd service (keeps voice model in VRAM permanently)
- [x] Deploy Open WebUI via Docker compose (port 3030)
- [x] Verify Open WebUI connected to Ollama, all models available
- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- [x] Run pipeline benchmark (homeai-voice/scripts/benchmark_pipeline.py) — STT/LLM/TTS latency profiled
- [ ] Add Ollama + Open WebUI to Uptime Kuma monitors
---
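The preload service ticked off above could take roughly this shape as a launchd job. This is a sketch, not the repo's actual plist: the label, model tag, and Ollama port are assumptions, while `"keep_alive": -1` is Ollama's documented way to keep a model resident in memory indefinitely.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Hypothetical label; the repo's actual naming may differ. -->
  <key>Label</key><string>com.homeai.preload</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/curl</string>
    <string>-s</string>
    <string>http://127.0.0.1:11434/api/generate</string>
    <string>-d</string>
    <!-- keep_alive: -1 pins the model in memory until Ollama restarts. -->
    <string>{"model": "qwen3.5:35b-a3b", "keep_alive": -1}</string>
  </array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
```

Loaded with `launchctl load`, this fires once at boot; pairing it with Ollama's own launchd dependency ordering (or a retry loop) would be needed if Ollama isn't up yet when it runs.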
@@ -37,6 +39,7 @@
### P3 · homeai-voice
- [x] Install `wyoming-faster-whisper` — model: faster-whisper-large-v3 (auto-downloaded)
- [x] Upgrade STT to wyoming-mlx-whisper (whisper-large-v3-turbo, MLX Metal GPU) — 20x faster (8s → 400ms)
- [x] Install Kokoro ONNX TTS — models at `~/models/kokoro/`
- [x] Write Wyoming-Kokoro adapter server (`homeai-voice/tts/wyoming_kokoro_server.py`)
- [x] Write + load launchd plists for Wyoming STT (10300) and TTS (10301)
@@ -67,10 +70,11 @@
- [x] Fix context window: set `contextWindow=32768` for llama3.3:70b in `openclaw.json`
- [x] Fix Llama 3.3 Modelfile: add tool-calling TEMPLATE block
- [x] Verify `openclaw agent --message "..." --agent main` → completed
- [x] Write `skills/home-assistant` SKILL.md — HA REST API control
- [x] Write `skills/home-assistant` SKILL.md — HA REST API control via ha-ctl CLI
- [x] Write `skills/voice-assistant` SKILL.md — voice response style guide
- [x] Wire HASS_TOKEN — create `~/.homeai/hass_token` or set env in launchd plist
- [x] Test home-assistant skill: "turn on/off the reading lamp"
- [x] Fix HA tool calling: set commands.native=true, symlink ha-ctl to PATH, update TOOLS.md
- [x] Test home-assistant skill: "turn on/off the reading lamp" — verified exec→ha-ctl→HA action
- [x] Set up mem0 with Chroma backend, test semantic recall
- [x] Write memory backup launchd job
- [x] Build morning briefing n8n workflow
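`ha-ctl` itself is not shown in this diff; a minimal sketch of an equivalent client for the verified exec→ha-ctl→HA path, using Home Assistant's documented REST endpoint `POST /api/services/<domain>/<service>` with a long-lived access token. The default base URL and the `~/.homeai/hass_token` path are assumptions taken from the checklist above, not confirmed values.

```python
import json
import os
import sys
import urllib.request

# Assumed default; the real deployment's HA address may differ.
HA_URL = os.environ.get("HASS_URL", "http://homeassistant.local:8123")

def build_service_call(domain, service, entity_id):
    """Build the URL and payload for HA's documented service-call endpoint."""
    url = f"{HA_URL}/api/services/{domain}/{service}"
    payload = {"entity_id": entity_id}
    return url, payload

def call_service(domain, service, entity_id, token):
    """POST the service call with Bearer auth and return the JSON response."""
    url, payload = build_service_call(domain, service, entity_id)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # e.g. ha-ctl light turn_on light.reading_lamp
    domain, service, entity = sys.argv[1:4]
    token = open(os.path.expanduser("~/.homeai/hass_token")).read().strip()
    print(call_service(domain, service, entity, token))
```

Symlinking a script like this into PATH is what lets the agent's native `exec` command reach HA, per the tool-calling fix above.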