# P2: homeai-llm — Local LLM Runtime

> Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing

---

## Goal
Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).

---

## Why Native (not Docker)
Ollama must run natively — not in Docker — because:

- Docker on Mac cannot access the Apple Metal GPU (containers run in a Linux VM)
- Native Ollama uses Metal for GPU acceleration, giving 3–5× faster inference
- A launchd service keeps native Ollama alive across reboots

---

## Deliverables

### 1. Ollama Installation
```bash
# Install via Homebrew
brew install ollama

# Or install the macOS app from https://ollama.com/download
# (the https://ollama.com/install.sh script targets Linux only)
```

Ollama runs as a background process. Configure it as a launchd service so it survives reboots.

**launchd plist:** `~/Library/LaunchAgents/com.ollama.ollama.plist`
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.ollama.ollama</string>
  <key>ProgramArguments</key>
  <array>
    <!-- Homebrew on Apple Silicon installs to /opt/homebrew/bin; confirm the path with `which ollama` -->
    <string>/usr/local/bin/ollama</string>
    <string>serve</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/tmp/ollama.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/ollama.err</string>
</dict>
</plist>
```

Load: `launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist`
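
After loading the service, it can take a moment before the API answers. A minimal sketch of a readiness check follows; the helper names `retry` and `ollama_is_up` are hypothetical and not part of Ollama, and the probe assumes `curl` is available and that Ollama listens on its default port.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS runs are used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical probe: succeeds once the Ollama API answers on port 11434
ollama_is_up() {
  curl -fsS --max-time 2 http://localhost:11434/api/tags >/dev/null 2>&1
}
```

A possible invocation is `retry 10 1 ollama_is_up && echo "Ollama is up"`, which polls once per second for up to ten attempts. The same check could be repeated after a reboot to confirm the launchd service started on its own.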
### 2. Model Manifest — `ollama-models.txt`
Pinned models pulled to Mac Mini:

```
# Primary — high quality responses
llama3.3:70b
qwen2.5:72b

# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b

# Code — for n8n/skill writing assistance
qwen2.5-coder:32b

# Embedding — for mem0 semantic search
nomic-embed-text
```

Pull script (`scripts/pull-models.sh`):
```bash
#!/usr/bin/env bash
# Resolve the manifest relative to this script, not the caller's working directory
MANIFEST="$(cd "$(dirname "$0")" && pwd)/../ollama-models.txt"

while IFS= read -r model; do
  # Skip comments and blank lines
  [[ "$model" =~ ^#.*$ || -z "$model" ]] && continue
  echo "Pulling $model..."
  ollama pull "$model"
done < "$MANIFEST"
```
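
Once the pulls finish, the manifest can be compared against what is actually installed. The sketch below is one way to do that; the function names are hypothetical, and it assumes installed names are supplied one per line with tags, treating an untagged manifest entry as `:latest`.

```bash
#!/usr/bin/env bash
# list_manifest_models FILE
# Prints one model name per line, dropping blank lines and '#' comments.
list_manifest_models() {
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
      -e '/^#/d' -e '/^$/d' "$1"
}

# missing_models MANIFEST INSTALLED_FILE
# Prints every manifest model that is absent from INSTALLED_FILE.
missing_models() {
  list_manifest_models "$1" | while IFS= read -r model; do
    # Assumption: an entry without a tag is listed as "<name>:latest"
    case "$model" in
      *:*) wanted="$model" ;;
      *)   wanted="$model:latest" ;;
    esac
    grep -Fxq -- "$wanted" "$2" || printf '%s\n' "$model"
  done
}
```

For example, `ollama list | awk 'NR > 1 {print $1}' > /tmp/installed.txt` followed by `missing_models ollama-models.txt /tmp/installed.txt` should print nothing when every pinned model is present. This assumes the first column of `ollama list` holds the tagged model name.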
### 3. Open WebUI — Docker
Open WebUI connects to Ollama over the Docker-to-host bridge (`host.docker.internal`):

**`docker/open-webui/docker-compose.yml`:**

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - ./open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3030:8080"
    networks:
      - homeai
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  homeai:
    external: true
```

Port `3030` chosen to avoid conflict with Gitea (3000).

### 4. Benchmark Script — `scripts/benchmark.sh`
Measures tokens/sec for each model to inform model selection per task:

```bash
#!/usr/bin/env bash
PROMPT="Tell me a joke about computers."

for model in llama3.3:70b qwen2.5:72b qwen2.5:7b; do
  echo "=== $model ==="
  # --verbose prints timing stats, including the eval rate in tokens/s
  time ollama run "$model" "$PROMPT" --nowordwrap --verbose
done
```

Results documented in `scripts/benchmark-results.md`.
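
Wall-clock `time` also includes model load, so a tokens/sec figure is more reliably derived from generation counters. Ollama's `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The sketch below turns those two numbers into a rate; the helper name `tokens_per_sec` is hypothetical.

```bash
#!/usr/bin/env bash
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Prints generated tokens per second, rounded to one decimal place.
tokens_per_sec() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    # Guard against a zero or missing duration
    if (ns <= 0) { print "0.0"; exit }
    printf "%.1f\n", count / (ns / 1e9)
  }'
}

# Example: 120 tokens generated in 4 seconds (4e9 ns)
tokens_per_sec 120 4000000000   # prints 30.0
```

Assuming `jq` is installed, the two fields could be pulled from a non-streaming request with `curl -s http://localhost:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"Hello","stream":false}' | jq -r '"\(.eval_count) \(.eval_duration)"'` and passed to the helper.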
### 5. API Verification
```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Test OpenAI-compatible endpoint (used by P3, P4, P7)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

### 6. Model Selection Guide
Document in `scripts/benchmark-results.md` after benchmarking:

| Task | Model | Reason |
|---|---|---|
| Main conversation | `llama3.3:70b` | Best quality |
| Quick/real-time tasks | `qwen2.5:7b` | Lowest latency |
| Code generation (skills) | `qwen2.5-coder:32b` | Best code quality |
| Embeddings (mem0) | `nomic-embed-text` | Compact, fast |

---

## Interface Contract
- **Ollama API:** `http://localhost:11434` (native Ollama)
- **OpenAI-compatible API:** `http://localhost:11434/v1` — used by P3, P4, P7
- **Open WebUI:** `http://localhost:3030`

Add to `~/server/.env.services`:
```dotenv
OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
```
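
A downstream script running on the host might consume these values as sketched below. The helper name `load_env` is hypothetical, and the approach assumes the file contains only plain `KEY=VALUE` lines that are safe to source as shell. Inside a container, `localhost` refers to the container itself, so containerized consumers would need a different host name.

```bash
#!/usr/bin/env bash
# load_env FILE
# Exports every KEY=VALUE assignment in FILE into the current shell.
load_env() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
  set -a   # auto-export variables assigned while sourcing
  . "$1"
  set +a
}

# Usage (paths and variable names as defined in the contract above):
#   load_env "$HOME/server/.env.services"
#   curl "$OLLAMA_URL/api/tags"
```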
---

## Implementation Steps
- [ ] Install Ollama via brew
- [ ] Verify `ollama serve` starts and responds at port 11434
- [ ] Write launchd plist, load it, verify auto-start on reboot
- [ ] Write `ollama-models.txt` with model list
- [ ] Run `scripts/pull-models.sh` — pull all models (allow time for large downloads)
- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- [ ] Deploy Open WebUI via Docker compose
- [ ] Verify Open WebUI can chat with all models
- [ ] Add `OLLAMA_URL`, `OLLAMA_API_URL`, and `OPEN_WEBUI_URL` to `.env.services`
- [ ] Add Ollama and Open WebUI monitors to Uptime Kuma

---

## Success Criteria
- [ ] `curl http://localhost:11434/api/tags` returns all expected models
- [ ] `llama3.3:70b` generates a coherent response in Open WebUI
- [ ] Ollama survives Mac Mini reboot without manual intervention
- [ ] Benchmark results documented — at least one model achieving >10 tok/s
- [ ] Open WebUI accessible at `http://localhost:3030` locally, and on port 3030 at the Mac Mini's Tailscale address from other tailnet devices