- Replace faster-whisper with wyoming-mlx-whisper (whisper-large-v3-turbo, MLX Metal GPU) — STT latency: 8.4s → 400ms for short voice commands
- Add Qwen3.5-35B-A3B (MoE, 3B active params, Q8_0) to Ollama — 26.7 tok/s vs 5.4 tok/s (70B)
- Add model preload launchd service to pin voice model in VRAM permanently
- Fix HA tool calling: set commands.native=true, symlink ha-ctl to PATH
- Add pipeline benchmark script (STT/LLM/TTS latency profiling)
- Add service restart buttons and STT endpoint to dashboard
- Bind Vite dev server to 0.0.0.0 for LAN access

Total estimated pipeline latency: ~27s → ~4s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
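The "model preload launchd service" in the commit message would typically be a plist that runs the preload script at load. A minimal sketch follows; the label, script path, and log paths are illustrative placeholders, not the actual values from this commit:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Label and paths are placeholders; substitute the real ones -->
  <key>Label</key>
  <string>local.ollama-preload</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/ollama-preload.sh</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/tmp/ollama-preload.log</string>
  <key>StandardErrPath</key>
  <string>/tmp/ollama-preload.log</string>
</dict>
</plist>
```

Loaded with `launchctl load ~/Library/LaunchAgents/local.ollama-preload.plist`, this runs the script once per login session; the script's own retry loop handles Ollama not being up yet.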
20 lines · 749 B · Bash · Executable File
#!/bin/bash
# Pre-load voice pipeline models into Ollama with infinite keep_alive.
# Run after Ollama starts (called by launchd or manually).
# Only pins lightweight/MoE models — large dense models (70B) use default expiry.

OLLAMA_URL="http://localhost:11434"

# Wait up to 60s for Ollama to be ready
ready=false
for i in $(seq 1 30); do
  if curl -sf "$OLLAMA_URL/api/tags" > /dev/null 2>&1; then
    ready=true
    break
  fi
  sleep 2
done

if [ "$ready" != true ]; then
  echo "[preload] Ollama not reachable at $OLLAMA_URL after 60s; giving up" >&2
  exit 1
fi

# Pin qwen3.5:35b-a3b (MoE, 38.7GB VRAM, voice pipeline default).
# keep_alive=-1 keeps the model resident until Ollama restarts;
# num_ctx=512 keeps this warm-up request cheap.
echo "[preload] Loading qwen3.5:35b-a3b with keep_alive=-1..."
if curl -sf "$OLLAMA_URL/api/generate" \
  -d '{"model":"qwen3.5:35b-a3b","prompt":"ready","stream":false,"keep_alive":-1,"options":{"num_ctx":512}}' \
  > /dev/null 2>&1; then
  echo "[preload] qwen3.5:35b-a3b pinned in memory"
else
  echo "[preload] Failed to load qwen3.5:35b-a3b" >&2
  exit 1
fi
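After the service has run, the pin can be checked against Ollama's `/api/ps` endpoint, which lists currently loaded models. A small helper sketch, assuming `python3` is available for JSON parsing (the function name `check_loaded` is illustrative, not part of this commit):

```shell
# check_loaded MODEL — reads an /api/ps JSON response on stdin and reports
# whether MODEL is among the currently loaded models.
check_loaded() {
  python3 -c '
import json, sys
model = sys.argv[1]
names = {m.get("name") for m in json.load(sys.stdin).get("models", [])}
print("pinned" if model in names else "not loaded")
' "$1"
}

# Usage against a live server:
#   curl -sf http://localhost:11434/api/ps | check_loaded qwen3.5:35b-a3b
```

Because the model is pinned with `keep_alive=-1`, it should still report as loaded long after the last request, whereas an unpinned model drops out of `/api/ps` once its keep-alive window expires.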