homeai/homeai-llm/PLAN.md
Aodhan Collins 38247d7cc4 Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00


# P2: homeai-llm — Local LLM Runtime
> Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing
---
## Goal
Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).
---
## Why Native (not Docker)
Ollama must run natively — not in Docker — because:
- Docker on Mac cannot access Apple Metal GPU (runs in a Linux VM)
- Native Ollama uses Metal for GPU acceleration, giving 35× faster inference
- Ollama's launchd integration keeps it alive across reboots
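One way to sanity-check that Metal is actually being used: with a model loaded, the `PROCESSOR` column of `ollama ps` should read `100% GPU`. A minimal check, shown here against stubbed sample output (the sample table is illustrative; swap in the real command):

```shell
# Stubbed `ollama ps` output for illustration; in practice use:
#   ps_output="$(ollama ps)"
ps_output='NAME         ID      SIZE    PROCESSOR    UNTIL
qwen2.5:7b   abc123  6.0 GB  100% GPU     4 minutes from now'

# If a loaded model reports "100% GPU", Metal offload is working
if grep -q '100% GPU' <<<"$ps_output"; then
  echo "running on GPU"
else
  echo "WARNING: model appears to be running on CPU"
fi
```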
---
## Deliverables
### 1. Ollama Installation
```bash
# Install
brew install ollama
# (The curl install script at https://ollama.com/install.sh targets Linux;
# on macOS, use Homebrew or the desktop app download from ollama.com.)
```
Ollama runs as a background process. Configure it as a launchd service so it survives reboots (`brew services start ollama` also works for Homebrew installs, but a custom plist gives explicit control over logging).
**launchd plist:** `~/Library/LaunchAgents/com.ollama.ollama.plist`
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <!-- Homebrew on Apple Silicon; use /usr/local/bin/ollama on Intel Macs -->
        <string>/opt/homebrew/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>
```
Load: `launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist`
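After loading the plist, it helps to confirm the API actually comes up before moving on. A small polling helper, as a sketch (the URL is the plan's default; raise the timeout for slow cold starts):

```shell
# Poll a URL until it responds or the timeout (in seconds) expires
wait_for_url() {
  local url=$1 timeout=${2:-30} i=0
  until curl -fsS "$url" >/dev/null 2>&1; do
    (( ++i > timeout )) && return 1
    sleep 1
  done
}

wait_for_url http://localhost:11434/api/tags 5 \
  && echo "Ollama is up" \
  || echo "Ollama did not respond in time"
```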
### 2. Model Manifest — `ollama-models.txt`
Pinned models pulled to Mac Mini:
```
# Primary — high quality responses
llama3.3:70b
qwen2.5:72b
# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b
# Code — for n8n/skill writing assistance
qwen2.5-coder:32b
# Embedding — for mem0 semantic search
nomic-embed-text
```
Pull script (`scripts/pull-models.sh`):
```bash
#!/usr/bin/env bash
set -euo pipefail
# Resolve the manifest relative to this script so it works from any CWD
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
while IFS= read -r model; do
  # Skip comments and blank lines
  [[ "$model" =~ ^#.*$ || -z "$model" ]] && continue
  echo "Pulling $model..."
  ollama pull "$model"
done < "$script_dir/../ollama-models.txt"
```
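A companion check (hypothetical helper, not part of the plan) can diff the manifest against what `ollama list` reports, so a partially failed pull run is caught. The manifest and the `ollama list` output are stubbed with here-doc samples here; in practice read the real file and pipe the real command:

```shell
# Manifest entries with comments and blank lines stripped
# (stubbed; in practice read from ollama-models.txt)
manifest_models=$(grep -Ev '^[[:space:]]*(#|$)' <<'EOF' | sort
# Primary
llama3.3:70b
qwen2.5:7b
EOF
)

# Installed models: first column of `ollama list`, header row skipped
# (stubbed; in practice: installed=$(ollama list | awk 'NR>1 {print $1}' | sort))
installed=$(awk 'NR>1 {print $1}' <<'EOF' | sort
NAME            ID              SIZE      MODIFIED
qwen2.5:7b      845dbda0ea48    4.7 GB    2 days ago
EOF
)

# comm -23: lines only in the manifest, i.e. models still missing
missing=$(comm -23 <(printf '%s\n' "$manifest_models") <(printf '%s\n' "$installed"))
echo "missing: $missing"
```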
### 3. Open WebUI — Docker
Open WebUI runs in Docker and reaches the native Ollama through the container-to-host gateway name `host.docker.internal`:
**`docker/open-webui/docker-compose.yml`:**
```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - ./open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3030:8080"
    networks:
      - homeai
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  homeai:
    external: true
```
Port `3030` chosen to avoid conflict with Gitea (3000). Note that the `homeai` network is declared `external`, so create it once with `docker network create homeai` before the first `docker compose up`.
### 4. Benchmark Script — `scripts/benchmark.sh`
Measures tokens/sec for each model to inform model selection per task:
```bash
#!/usr/bin/env bash
set -euo pipefail
PROMPT="Tell me a joke about computers."
for model in llama3.3:70b qwen2.5:72b qwen2.5:7b; do
  echo "=== $model ==="
  # --verbose prints timing stats, including "eval rate" in tokens/s;
  # `time` alone only gives wall-clock, which includes model load time
  time ollama run "$model" "$PROMPT" --verbose --nowordwrap
done
```
Results documented in `scripts/benchmark-results.md`.
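Tokens/sec can also be computed from the API: a non-streaming `POST /api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). The arithmetic, with sample values standing in for fields extracted from a real response:

```shell
# Sample values from an /api/generate response; in practice extract them
# from the JSON body of the curl response
eval_count=128            # tokens generated
eval_duration=8000000000  # generation time in nanoseconds (8 s)

# tokens/sec = eval_count / eval_duration * 1e9
tok_per_s=$(awk -v c="$eval_count" -v d="$eval_duration" \
  'BEGIN { printf "%.1f", c / d * 1e9 }')
echo "$tok_per_s tok/s"   # 16.0 tok/s
```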
### 5. API Verification
```bash
# Check Ollama is running
curl http://localhost:11434/api/tags
# Test OpenAI-compatible endpoint (used by P3, P4)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
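When scripting against the OpenAI-compatible endpoint, the assistant text lives at `choices[0].message.content` in a non-streaming response. A sketch that extracts it (the response body is stubbed; pipe the real curl output instead):

```shell
# Stubbed chat.completion response; in practice pipe curl's output in
reply=$(python3 -c '
import json, sys
print(json.load(sys.stdin)["choices"][0]["message"]["content"])
' <<'EOF'
{"id":"chatcmpl-1","object":"chat.completion","model":"qwen2.5:7b","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I help?"},"finish_reason":"stop"}]}
EOF
)
echo "$reply"   # Hello! How can I help?
```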
### 6. Model Selection Guide
Document in `scripts/benchmark-results.md` after benchmarking:
| Task | Model | Reason |
|---|---|---|
| Main conversation | `llama3.3:70b` | Best quality |
| Quick/real-time tasks | `qwen2.5:7b` | Lowest latency |
| Code generation (skills) | `qwen2.5-coder:32b` | Best code quality |
| Embeddings (mem0) | `nomic-embed-text` | Compact, fast |
---
## Interface Contract
- **Ollama API:** `http://localhost:11434` (native Ollama)
- **OpenAI-compatible API:** `http://localhost:11434/v1` — used by P3, P4, P7
- **Open WebUI:** `http://localhost:3030`
Add to `~/server/.env.services`:
```dotenv
OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
```
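Downstream consumers can load these variables by sourcing the file with auto-export enabled. A self-contained sketch using a throwaway copy (the real path is `~/server/.env.services` per the plan):

```shell
# Write a throwaway copy of the env file for illustration
envfile=$(mktemp)
cat > "$envfile" <<'EOF'
OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
EOF

set -a              # export every variable the sourced file defines
source "$envfile"
set +a

echo "$OLLAMA_API_URL"   # http://localhost:11434/v1
```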
---
## Implementation Steps
- [ ] Install Ollama via brew
- [ ] Verify `ollama serve` starts and responds at port 11434
- [ ] Write launchd plist, load it, verify auto-start on reboot
- [ ] Write `ollama-models.txt` with model list
- [ ] Run `scripts/pull-models.sh` — pull all models (allow time for large downloads)
- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- [ ] Deploy Open WebUI via Docker compose
- [ ] Verify Open WebUI can chat with all models
- [ ] Add `OLLAMA_URL` and `OPEN_WEBUI_URL` to `.env.services`
- [ ] Add Ollama and Open WebUI monitors to Uptime Kuma
---
## Success Criteria
- [ ] `curl http://localhost:11434/api/tags` returns all expected models
- [ ] `llama3.3:70b` generates a coherent response in Open WebUI
- [ ] Ollama survives Mac Mini reboot without manual intervention
- [ ] Benchmark results documented — at least one model achieving >10 tok/s
- [ ] Open WebUI reachable on port `3030`, both locally and via the Mac Mini's Tailscale address