homeai/homeai-llm/PLAN.md

# P2: homeai-llm — Local LLM Runtime

> Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing

---

## Goal

Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).

---

## Why Native (not Docker)

Ollama must run natively — not in Docker — because:
- Docker on Mac cannot access Apple Metal GPU (runs in a Linux VM)
- Native Ollama uses Metal for GPU acceleration, giving 3–5× faster inference than CPU-only execution inside the Docker VM
- Ollama's launchd integration keeps it alive across reboots

---

## Deliverables

### 1. Ollama Installation

```bash
# Install
brew install ollama

# Or direct install
curl -fsSL https://ollama.com/install.sh | sh
```

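Ollama runs as a background process. Configure it as a launchd user agent so it restarts after crashes and reboots. An agent under `~/Library/LaunchAgents` starts at user login, so unattended recovery assumes automatic login is enabled on the Mac Mini.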

**launchd plist:** `~/Library/LaunchAgents/com.ollama.ollama.plist`

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.ollama</string>
    <key>ProgramArguments</key>
    <array>
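        <!-- Homebrew on Apple Silicon installs to /opt/homebrew/bin/ollama; confirm the path with `which ollama` -->
        <string>/usr/local/bin/ollama</string>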
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>
```

Load: `launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist`

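To confirm that the agent actually came up after `launchctl load` or a reboot, a small polling helper can wait for the API to answer. This is a minimal sketch: the function name `wait_for_ollama`, the retry count, and the delay are illustrative assumptions, not part of Ollama. It relies only on `curl` and the `/api/tags` endpoint used later in this plan.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll the Ollama API until it answers or attempts run out.
# Arguments (all optional): base URL, number of attempts, delay in seconds.
wait_for_ollama() {
  local url="${1:-http://localhost:11434}"
  local attempts="${2:-30}"
  local delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each probe
    if curl -fsS --max-time 2 "$url/api/tags" >/dev/null 2>&1; then
      echo "Ollama is up at $url (attempt $i)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Ollama did not respond at $url after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_ollama http://localhost:11434 60 2` retries 60 times with a 2-second pause, which leaves room for a slow start after a reboot.
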
### 2. Model Manifest — `ollama-models.txt`

Pinned models pulled to Mac Mini:

```
# Primary — high quality responses
llama3.3:70b
qwen2.5:72b

# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b

# Code — for n8n/skill writing assistance
qwen2.5-coder:32b

# Embedding — for mem0 semantic search
nomic-embed-text
```

Pull script (`scripts/pull-models.sh`):
```bash
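#!/usr/bin/env bash
set -euo pipefail
# Resolve the manifest relative to this script, not the caller's working directory
MANIFEST="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)/ollama-models.txt"

while IFS= read -r model || [[ -n "$model" ]]; do
  # Skip comments and blank lines
  [[ "$model" =~ ^#.*$ || -z "$model" ]] && continue
  echo "Pulling $model..."
  ollama pull "$model"
done < "$MANIFEST"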
```
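After pulling, it can help to check the manifest against what is actually installed. The sketch below is one possible approach, not part of the plan's scripts: the function names are hypothetical, and it assumes installed model names arrive on stdin, one per line. Ollama lists untagged models with a `:latest` suffix, so `nomic-embed-text` is normalized before comparing.

```bash
#!/usr/bin/env bash
# Print manifest entries, one per line, skipping comments and blank lines.
manifest_models() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Ollama lists untagged models as "<name>:latest", so add the tag when absent.
normalize_model() {
  case "$1" in
    *:*) printf '%s\n' "$1" ;;
    *)   printf '%s:latest\n' "$1" ;;
  esac
}

# Read installed model names on stdin; print manifest entries that are absent.
missing_models() {
  local manifest="$1" installed model
  installed="$(cat)"
  manifest_models "$manifest" | while IFS= read -r model; do
    printf '%s\n' "$installed" | grep -qxF "$(normalize_model "$model")" \
      || printf '%s\n' "$model"
  done
}
```

A possible invocation from the directory holding the manifest: `ollama list | awk 'NR > 1 { print $1 }' | missing_models ollama-models.txt`. Empty output means every pinned model is present.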

### 3. Open WebUI — Docker

Open WebUI connects to Ollama over the Docker-to-host bridge (`host.docker.internal`):

**`docker/open-webui/docker-compose.yml`:**

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - ./open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3030:8080"
    networks:
      - homeai
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  homeai:
    external: true
```

Host port `3030` is used to avoid a conflict with Gitea (3000); inside the container Open WebUI listens on 8080.

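Because the compose file declares `homeai` as an external network, `docker compose up` fails if that network does not exist yet. The sketch below assumes P1 normally creates the network and only adds a guard for the case where it is missing. The function names are hypothetical, and the compose path is assumed to be relative to the repository root.

```bash
#!/usr/bin/env bash
# Create the external network only if it does not exist yet.
ensure_network() {
  local name="${1:-homeai}"
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name" >/dev/null
    echo "network $name created"
  fi
}

# Start Open WebUI using the compose file named in this plan.
start_open_webui() {
  ensure_network homeai
  docker compose -f docker/open-webui/docker-compose.yml up -d
}
```

Running `start_open_webui` a second time is harmless: the network check is skipped over and `up -d` leaves an unchanged container running.
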
### 4. Benchmark Script — `scripts/benchmark.sh`

Measures tokens/sec for each model to inform model selection per task:

```bash
#!/usr/bin/env bash
PROMPT="Tell me a joke about computers."
for model in llama3.3:70b qwen2.5:72b qwen2.5:7b; do
  echo "=== $model ==="
  # --verbose prints timing stats after the response, including "eval rate" in tokens/s
  ollama run "$model" "$PROMPT" --verbose --nowordwrap
done
```

Results documented in `scripts/benchmark-results.md`.

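Tokens/sec can also be computed from the API instead of read off the terminal. With `"stream": false`, Ollama's native `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The following is a sketch under assumptions: `tokens_per_sec` and `bench_model` are hypothetical names, and `bench_model` requires `jq`, which this plan does not otherwise install.

```bash
#!/usr/bin/env bash
# tokens/sec = generated tokens / generation time in seconds.
# $1 = eval_count (tokens), $2 = eval_duration (nanoseconds).
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.2f\n", n / (ns / 1e9) }'
}

# Hypothetical wrapper: run one non-streaming generation and print its rate.
# Assumes jq is installed and the prompt contains no double quotes.
bench_model() {
  local model="$1" prompt="$2" body
  body="$(curl -fsS http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false}")" || return 1
  tokens_per_sec \
    "$(printf '%s' "$body" | jq -r '.eval_count')" \
    "$(printf '%s' "$body" | jq -r '.eval_duration')"
}
```

As a worked example, 100 tokens generated in 5,000,000,000 ns is 100 / 5 s, so `tokens_per_sec 100 5000000000` prints `20.00`.
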
### 5. API Verification

```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Test OpenAI-compatible endpoint (used by P3, P4, P7)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
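For repeatable checks, the two requests above can be wrapped so that each reports pass or fail by HTTP status. This is a minimal sketch: the function names are hypothetical, and the 120-second timeout is an assumed allowance for the first model load.

```bash
#!/usr/bin/env bash
# Print the HTTP status code for a request; extra arguments go to curl.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 120 "$@"
}

# Hypothetical check: report OK or FAIL for one named endpoint.
check_endpoint() {
  local name="$1" status
  shift
  status="$(http_status "$@")"
  if [ "$status" = "200" ]; then
    echo "OK   $name"
  else
    echo "FAIL $name (HTTP $status)"
    return 1
  fi
}

# Run both checks from this section; the model must already be pulled.
verify_api() {
  check_endpoint "native API" http://localhost:11434/api/tags &&
  check_endpoint "OpenAI-compatible API" http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello"}]}'
}
```

`verify_api` returns non-zero on the first failing endpoint, so it can gate later setup steps. When Ollama is not running at all, curl reports status `000`.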

### 6. Model Selection Guide

Document in `scripts/benchmark-results.md` after benchmarking:

| Task | Model | Reason |
|---|---|---|
| Main conversation | `llama3.3:70b` | Best quality |
| Quick/real-time tasks | `qwen2.5:7b` | Lowest latency |
| Code generation (skills) | `qwen2.5-coder:32b` | Best code quality |
| Embeddings (mem0) | `nomic-embed-text` | Compact, fast |

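If downstream scripts need the same mapping, the table could be mirrored as a small lookup. This is only an illustration: the function name and the task keys (`conversation`, `quick`, `code`, `embedding`) are hypothetical and should be updated once benchmark results are in.

```bash
#!/usr/bin/env bash
# Hypothetical lookup mirroring the selection table; task keys are illustrative.
model_for_task() {
  case "$1" in
    conversation) echo "llama3.3:70b" ;;
    quick)        echo "qwen2.5:7b" ;;
    code)         echo "qwen2.5-coder:32b" ;;
    embedding)    echo "nomic-embed-text" ;;
    *)            echo "unknown task: $1" >&2; return 1 ;;
  esac
}
```
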
---

## Interface Contract

- **Ollama API:** `http://localhost:11434` (native Ollama)
- **OpenAI-compatible API:** `http://localhost:11434/v1` — used by P3, P4, P7
- **Open WebUI:** `http://localhost:3030`

Add to `~/server/.env.services`:
```dotenv
OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
```
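A consumer script might load these values as follows. This is a sketch, not a required mechanism: `load_services_env` is a hypothetical helper, and it assumes the file contains only plain `KEY=value` lines, since sourcing executes the file as shell.

```bash
#!/usr/bin/env bash
# Hypothetical helper: export every variable defined in the services env file.
load_services_env() {
  local file="${1:-$HOME/server/.env.services}"
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 1
  fi
  set -a      # auto-export variables assigned while sourcing
  . "$file"
  set +a
}
```

After `load_services_env`, a check such as `curl "$OLLAMA_URL/api/tags"` uses the contract values instead of hard-coded URLs.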

---

## Implementation Steps

- [ ] Install Ollama via brew
- [ ] Verify `ollama serve` starts and responds at port 11434
- [ ] Write launchd plist, load it, verify auto-start on reboot
- [ ] Write `ollama-models.txt` with model list
- [ ] Run `scripts/pull-models.sh` — pull all models (allow time for large downloads)
- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- [ ] Deploy Open WebUI via Docker compose
- [ ] Verify Open WebUI can chat with all models
- [ ] Add `OLLAMA_URL`, `OLLAMA_API_URL`, and `OPEN_WEBUI_URL` to `.env.services`
- [ ] Add Ollama and Open WebUI monitors to Uptime Kuma

---

## Success Criteria

- [ ] `curl http://localhost:11434/api/tags` returns all expected models
- [ ] `llama3.3:70b` generates a coherent response in Open WebUI
- [ ] Ollama survives Mac Mini reboot without manual intervention
- [ ] Benchmark results documented — at least one model achieving >10 tok/s
- [ ] Open WebUI accessible at `http://localhost:3030` on the Mac Mini and on port 3030 of its Tailscale address from other devices