Full project plan across 8 sub-projects (homeai-infra, homeai-llm, homeai-voice, homeai-agent, homeai-character, homeai-esp32, homeai-visual, homeai-images). Includes per-project PLAN.md files, top-level PROJECT_PLAN.md, and master TODO.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P2: homeai-llm — Local LLM Runtime
Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing
Goal
Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).
Why Native (not Docker)
Ollama must run natively — not in Docker — because:
- Docker on Mac cannot access Apple Metal GPU (runs in a Linux VM)
- Native Ollama uses Metal for GPU acceleration, giving 3–5× faster inference
- A native Ollama process can be managed by launchd, keeping it alive across reboots (configured below)
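The Metal claim above is checkable once a model has been pulled: `ollama ps` reports a PROCESSOR column, which should read `100% GPU` when Metal offload is active. A guarded sketch:

```bash
# Check whether loaded models are running on the GPU (Metal) or CPU.
# "ollama ps" prints a PROCESSOR column: "100% GPU" means Metal is active.
if command -v ollama >/dev/null 2>&1; then
    GPU_CHECK="$(ollama ps 2>/dev/null || echo 'ollama server not reachable')"
else
    GPU_CHECK="ollama not installed; skipping GPU check"
fi
echo "$GPU_CHECK"
```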
Deliverables
1. Ollama Installation
```bash
# Install via Homebrew (preferred on macOS)
brew install ollama
```

Note: the `curl -fsSL https://ollama.com/install.sh | sh` one-liner is the Linux installer; on macOS use Homebrew or the app bundle from https://ollama.com/download.
Ollama runs as a background process. Configure it as a launchd service for reboot survival. (If installed via Homebrew, `brew services start ollama` registers an equivalent launchd agent; the manual plist below gives explicit control over logging and paths.)
launchd plist: ~/Library/LaunchAgents/com.ollama.ollama.plist
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>
```
Load it with `launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist`. Note: on Apple Silicon, Homebrew installs to `/opt/homebrew/bin/ollama`, not `/usr/local/bin`; set `ProgramArguments` to the output of `which ollama`.
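A quick way to confirm the agent took effect (the `com.ollama.ollama` label matches the plist above):

```bash
# Is the LaunchAgent registered? (launchctl exists only on macOS)
if launchctl list 2>/dev/null | grep -q com.ollama.ollama; then
    AGENT_STATUS="loaded"
else
    AGENT_STATUS="not loaded"
fi
echo "launchagent: $AGENT_STATUS"

# Liveness: /api/tags answers only when the server is up
API_STATUS="$(curl -sf --max-time 5 http://localhost:11434/api/tags \
    >/dev/null 2>&1 && echo up || echo down)"
echo "ollama API: $API_STATUS"
```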
2. Model Manifest — ollama-models.txt
Pinned models pulled to Mac Mini:
```text
# Primary — high quality responses
llama3.3:70b
qwen2.5:72b

# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b

# Code — for n8n/skill writing assistance
qwen2.5-coder:32b

# Embedding — for mem0 semantic search
nomic-embed-text
```
Pull script (`scripts/pull-models.sh`) — resolves the manifest relative to its own location, so it works regardless of the current directory:

```bash
#!/usr/bin/env bash
set -euo pipefail

# The manifest lives one directory above scripts/
MANIFEST="$(cd "$(dirname "$0")/.." && pwd)/ollama-models.txt"

while IFS= read -r model; do
    # Skip comments and blank lines
    [[ "$model" =~ ^#.*$ || -z "$model" ]] && continue
    echo "Pulling $model..."
    ollama pull "$model"
done < "$MANIFEST"
```
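After pulling, it is worth cross-checking the manifest against `ollama list`. `manifest_models` mirrors the comment/blank-line filter used by the pull script; the loop is guarded so it only runs where Ollama and the manifest actually exist:

```bash
# Print the model names from a manifest, skipping comments and blank lines
manifest_models() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

if [ -f ollama-models.txt ] && command -v ollama >/dev/null 2>&1; then
    # Model names contain no whitespace, so word splitting is safe here
    for model in $(manifest_models ollama-models.txt); do
        if ollama list | grep -qF "$model"; then
            echo "present: $model"
        else
            echo "MISSING: $model"
        fi
    done
fi
```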
3. Open WebUI — Docker
Open WebUI connects to Ollama over the Docker-to-host bridge (host.docker.internal):
docker/open-webui/docker-compose.yml:
```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - ./open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3030:8080"
    networks:
      - homeai
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  homeai:
    external: true
```
Port 3030 chosen to avoid conflict with Gitea (3000).
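Once the container is up, Open WebUI can be probed from the host. This assumes Open WebUI's `/health` endpoint (worth verifying against the deployed version; probing `/` works as a fallback):

```bash
# Probe Open WebUI on its host port (3030 per the compose file above)
WEBUI_STATUS="$(curl -sf --max-time 5 http://localhost:3030/health \
    >/dev/null 2>&1 && echo up || echo down)"
echo "open-webui: $WEBUI_STATUS"
```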
4. Benchmark Script — scripts/benchmark.sh
Measures tokens/sec for each chat model to inform model selection per task. `time` alone only reports wall-clock time, so the script uses `ollama run --verbose`, which prints a timing block (including `eval rate` in tokens/s) after each response:

```bash
#!/usr/bin/env bash
set -euo pipefail

PROMPT="Tell me a joke about computers."

# --verbose prints prompt/eval timings, including eval rate (tokens/s)
for model in llama3.3:70b qwen2.5:72b qwen2.5:7b qwen2.5-coder:32b; do
    echo "=== $model ==="
    ollama run "$model" "$PROMPT" --verbose --nowordwrap
done
```

(`nomic-embed-text` is excluded: embedding models cannot be run as chat models.)
Results documented in scripts/benchmark-results.md.
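Since `--verbose` writes its timing block to stderr, filling in `benchmark-results.md` is easier with a small extractor. `extract_rate` is a hypothetical helper, not part of Ollama; it pulls the number out of the `eval rate: ... tokens/s` line:

```bash
# Extract the tokens/s figure from "ollama run --verbose" timing output,
# e.g. "eval rate:            41.87 tokens/s" -> "41.87"
extract_rate() {
    grep 'eval rate' | tail -n 1 | grep -Eo '[0-9]+(\.[0-9]+)?' | head -n 1
}

# Usage (timing stats arrive on stderr, hence 2>&1):
#   ollama run qwen2.5:7b "hi" --verbose 2>&1 | extract_rate
echo "eval rate:            41.87 tokens/s" | extract_rate   # -> 41.87
```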
5. API Verification
```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Test the OpenAI-compatible endpoint (used by P3, P4)
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": "Hello"}]
    }'
```
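Downstream consumers parse the reply out of the OpenAI-shaped JSON; the sketch below does the same with `jq` (assumed installed) and is guarded so it degrades gracefully when Ollama is down:

```bash
# Pull the assistant reply out of .choices[0].message.content
REPLY="$(curl -sf --max-time 60 http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}' \
    2>/dev/null | jq -r '.choices[0].message.content' 2>/dev/null)"
echo "${REPLY:-no reply (is Ollama running with qwen2.5:7b pulled?)}"
```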
6. Model Selection Guide
Document in scripts/benchmark-results.md after benchmarking:
| Task | Model | Reason |
|---|---|---|
| Main conversation | llama3.3:70b | Best quality |
| Quick/real-time tasks | qwen2.5:7b | Lowest latency |
| Code generation (skills) | qwen2.5-coder:32b | Best code quality |
| Embeddings (mem0) | nomic-embed-text | Compact, fast |
Interface Contract
- Ollama API: `http://localhost:11434` (native Ollama)
- OpenAI-compatible API: `http://localhost:11434/v1` — used by P3, P4, P7
- Open WebUI: `http://localhost:3030`
Add to ~/server/.env.services:
```bash
OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
```
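Consumers can pick these values up by sourcing the file; `set -a` exports every assignment so child processes (compose, n8n, scripts) inherit them. A sketch, assuming the `~/server/.env.services` path above:

```bash
# Export everything defined in the shared env file (if present)
if [ -f ~/server/.env.services ]; then
    set -a
    . ~/server/.env.services
    set +a
fi
echo "OLLAMA_URL=${OLLAMA_URL:-unset}"
```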
Implementation Steps
- Install Ollama via brew
- Verify `ollama serve` starts and responds at port 11434
- Write launchd plist, load it, verify auto-start on reboot
- Write `ollama-models.txt` with model list
- Run `scripts/pull-models.sh` — pull all models (allow time for large downloads)
- Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- Deploy Open WebUI via Docker compose
- Verify Open WebUI can chat with all models
- Add `OLLAMA_URL` and `OPEN_WEBUI_URL` to `.env.services`
- Add Ollama and Open WebUI monitors to Uptime Kuma
Success Criteria
- `curl http://localhost:11434/api/tags` returns all expected models
- `llama3.3:70b` generates a coherent response in Open WebUI
- Ollama survives Mac Mini reboot without manual intervention
- Benchmark results documented — at least one model achieving >10 tok/s
- Open WebUI accessible at `http://localhost:3030` via Tailscale
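The reachability criteria above can be folded into one smoke-test sketch (`check` is a hypothetical helper; add further endpoints as services land):

```bash
# Report OK/FAIL for each HTTP endpoint named in the success criteria
check() {
    # $1 = label, $2 = URL
    if curl -sf --max-time 5 "$2" >/dev/null 2>&1; then
        echo "OK   $1"
    else
        echo "FAIL $1"
    fi
}

check "ollama /api/tags" http://localhost:11434/api/tags
check "open-webui"       http://localhost:3030
```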