feat: upgrade voice pipeline — MLX Whisper STT (20x faster), Qwen3.5 MoE LLM, fix HA tool calling

- Replace faster-whisper with wyoming-mlx-whisper (whisper-large-v3-turbo, MLX Metal GPU)
  STT latency: 8.4s → 400ms for short voice commands
- Add Qwen3.5-35B-A3B (MoE, 3B active params, Q8_0) to Ollama: 26.7 tok/s vs 5.4 tok/s (70B)
- Add model preload launchd service to pin voice model in VRAM permanently
- Fix HA tool calling: set commands.native=true, symlink ha-ctl to PATH
- Add pipeline benchmark script (STT/LLM/TTS latency profiling)
- Add service restart buttons and STT endpoint to dashboard
- Bind Vite dev server to 0.0.0.0 for LAN access

Total estimated pipeline latency: ~27s → ~4s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 18:03:12 +00:00
parent 1bfd7fbd08
commit af6b7bd945
10 changed files with 721 additions and 27 deletions
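The speedups quoted in the commit message can be checked with a few lines of shell. This is only a sketch: `speedup` is a hypothetical helper, not part of the commit, and the inputs are the figures quoted above (with 400ms written as 0.4s).

```shell
#!/bin/bash
# Hypothetical helper: ratio of two measurements, printed to two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 8.4 0.4     # STT latency, old / new (seconds)
speedup 26.7 5.4    # LLM throughput, MoE / 70B dense (tok/s)
speedup 27 4        # estimated end-to-end latency, old / new (seconds)
```

The first ratio comes out at about 21, in line with the "20x faster" in the title; the throughput gain is a little under 5x.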
--- a/homeai-llm/scripts/preload-models.sh
+++ b/homeai-llm/scripts/preload-models.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Pre-load voice pipeline models into Ollama with infinite keep_alive.
+# Run after Ollama starts (called by launchd or manually).
+# Only pins lightweight/MoE models — large dense models (70B) use default expiry.
+
+OLLAMA_URL="http://localhost:11434"
+
+# Wait for Ollama to be ready
+for i in $(seq 1 30); do
+    curl -sf "$OLLAMA_URL/api/tags" > /dev/null 2>&1 && break
+    sleep 2
+done
+
+# Pin qwen3.5:35b-a3b (MoE, 38.7GB VRAM, voice pipeline default)
+echo "[preload] Loading qwen3.5:35b-a3b with keep_alive=-1..."
+curl -sf "$OLLAMA_URL/api/generate" \
+    -d '{"model":"qwen3.5:35b-a3b","prompt":"ready","stream":false,"keep_alive":-1,"options":{"num_ctx":512}}' \
+    > /dev/null 2>&1 || { echo "[preload] Failed to load qwen3.5:35b-a3b" >&2; exit 1; }
+echo "[preload] qwen3.5:35b-a3b pinned in memory"
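One way to confirm that the model actually stayed resident after the preload request is to look for it in Ollama's list of loaded models. The sketch below assumes the `GET /api/ps` endpoint returns JSON with a `name` field per loaded model; `is_loaded` is a hypothetical helper and is not part of this commit.

```shell
#!/bin/bash
# Hypothetical helper: reads Ollama /api/ps JSON on stdin and succeeds if the
# given model name appears as a "name" value.
# Whitespace is stripped first so compact and pretty-printed JSON both match.
is_loaded() {
    local model="$1"
    tr -d '[:space:]' | grep -qF "\"name\":\"$model\""
}

# Example use against a running Ollama instance (assumed URL):
#   curl -sf "http://localhost:11434/api/ps" | is_loaded "qwen3.5:35b-a3b" \
#       && echo "[preload] model is resident"
```

A plain-text match keeps the check free of extra dependencies; a JSON-aware tool would be more robust if the response format changes.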