homeai/homeai-llm/PLAN.md
Aodhan Collins 38247d7cc4 Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00


P2: homeai-llm — Local LLM Runtime

Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing


Goal

Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).


Why Native (not Docker)

Ollama must run natively — not in Docker — because:

  • Docker on Mac cannot access Apple Metal GPU (runs in a Linux VM)
  • Native Ollama uses Metal for GPU acceleration, giving 35× faster inference
  • Ollama's launchd integration keeps it alive across reboots

Deliverables

1. Ollama Installation

# Install
brew install ollama

# Or download the macOS app from https://ollama.com/download
# (the install.sh one-liner targets Linux, not macOS)

Ollama runs as a background process. Configure as a launchd service for reboot survival.

launchd plist: ~/Library/LaunchAgents/com.ollama.ollama.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string> <!-- /usr/local/bin/ollama on Intel Macs -->
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>

Load: launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist (on current macOS, launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.ollama.ollama.plist is the non-deprecated equivalent)
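The plist hard-codes the ollama binary path, which differs between Homebrew prefixes. A small sketch (the helper name is mine, and it assumes default Homebrew install locations) that picks the right path by architecture when generating the plist:

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the Homebrew ollama path for the plist's
# ProgramArguments, assuming default Homebrew prefixes.
ollama_bin() {
  case "$(uname -m)" in
    arm64) echo /opt/homebrew/bin/ollama ;;  # Apple Silicon
    *)     echo /usr/local/bin/ollama ;;     # Intel
  esac
}
```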

2. Model Manifest — ollama-models.txt

Pinned models pulled to Mac Mini:

# Primary — high quality responses
llama3.3:70b
qwen2.5:72b

# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b

# Code — for n8n/skill writing assistance
qwen2.5-coder:32b

# Embedding — for mem0 semantic search
nomic-embed-text

Pull script (scripts/pull-models.sh):

#!/usr/bin/env bash
set -euo pipefail
# Resolve the manifest relative to this script, not the caller's cwd
MANIFEST="$(cd "$(dirname "$0")/.." && pwd)/ollama-models.txt"
while IFS= read -r model; do
  # Skip comment and blank lines
  [[ "$model" =~ ^# || -z "$model" ]] && continue
  echo "Pulling $model..."
  ollama pull "$model"
done < "$MANIFEST"
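After pulling, it is worth diffing the manifest against what Ollama actually has before marking this step done. A sketch (the helper name is mine; it assumes `ollama list` prints one model per line with the name in the first column):

```shell
#!/usr/bin/env bash
# Hypothetical check: print manifest models absent from `ollama list` output.
#   usage: missing_models ollama-models.txt <(ollama list)
missing_models() {
  local manifest="$1" listed="$2"
  while IFS= read -r model; do
    [[ "$model" =~ ^# || -z "$model" ]] && continue   # skip comments/blanks
    grep -q "^${model}" "$listed" || echo "$model"
  done < "$manifest"
}
```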

3. Open WebUI — Docker

Open WebUI connects to Ollama over the Docker-to-host bridge (host.docker.internal):

docker/open-webui/docker-compose.yml:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - ./open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3030:8080"
    networks:
      - homeai
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  homeai:
    external: true

Port 3030 chosen to avoid conflict with Gitea (3000).
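Since the host port was picked specifically to dodge Gitea, a pre-flight check that it is actually free can save a failed `docker compose up` (bash-only sketch using the `/dev/tcp` pseudo-device; the helper name is mine):

```shell
#!/usr/bin/env bash
# Succeeds if nothing is listening on the given localhost TCP port.
port_free() {
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# e.g. guard the deploy:
# port_free 3030 || { echo "port 3030 already in use" >&2; exit 1; }
```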

4. Benchmark Script — scripts/benchmark.sh

Measures tokens/sec for each model to inform model selection per task:

#!/usr/bin/env bash
set -euo pipefail
PROMPT="Tell me a joke about computers."
for model in llama3.3:70b qwen2.5:72b qwen2.5:7b qwen2.5-coder:32b; do
  echo "=== $model ==="
  # --verbose prints timing stats after the response, including "eval rate" in tokens/s
  # (plain `time` only gives wall-clock, not tokens/sec)
  ollama run --verbose "$model" "$PROMPT"
done

Results documented in scripts/benchmark-results.md.
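`ollama run --verbose` emits its stats on stderr in lines like `eval rate: 34.21 tokens/s`; a small parser (the line format is an assumption about current Ollama output) makes the numbers easy to log into benchmark-results.md:

```shell
#!/usr/bin/env bash
# Extract generation speed (tokens/s) from `ollama run --verbose` stats on stdin.
# Assumes the "eval rate:   NN.NN tokens/s" line format of current Ollama builds.
eval_rate() {
  awk '/^eval rate:/ { print $(NF-1) }'
}

# usage: ollama run --verbose qwen2.5:7b "hi" 2>&1 >/dev/null | eval_rate
```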

5. API Verification

# Check Ollama is running
curl http://localhost:11434/api/tags

# Test OpenAI-compatible endpoint (used by P3, P4)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
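These verification curls will fail spuriously right after a reboot while Ollama is still coming up; a small retry gate (the function name is mine) avoids that in scripted checks:

```shell
#!/usr/bin/env bash
# Poll /api/tags until Ollama answers, up to ATTEMPTS one-second tries.
#   usage: wait_for_ollama [URL] [ATTEMPTS]
wait_for_ollama() {
  local url="${1:-http://localhost:11434}" attempts="${2:-60}" i
  for ((i = 0; i < attempts; i++)); do
    curl -fsS "$url/api/tags" >/dev/null 2>&1 && return 0
    sleep 1
  done
  echo "Ollama not reachable at $url" >&2
  return 1
}
```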

6. Model Selection Guide

Document in scripts/benchmark-results.md after benchmarking:

| Task                     | Model             | Reason            |
|--------------------------|-------------------|-------------------|
| Main conversation        | llama3.3:70b      | Best quality      |
| Quick/real-time tasks    | qwen2.5:7b        | Lowest latency    |
| Code generation (skills) | qwen2.5-coder:32b | Best code quality |
| Embeddings (mem0)        | nomic-embed-text  | Compact, fast     |
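Downstream consumers (P3, P4, P7) can encode this selection once instead of hard-coding model names at every call site; a sketch (the function name and task labels are mine):

```shell
#!/usr/bin/env bash
# Hypothetical task→model router mirroring the selection guide.
model_for() {
  case "$1" in
    chat)  echo llama3.3:70b ;;       # main conversation
    fast)  echo qwen2.5:7b ;;         # quick/real-time tasks
    code)  echo qwen2.5-coder:32b ;;  # skill/code generation
    embed) echo nomic-embed-text ;;   # mem0 embeddings
    *)     echo "unknown task: $1" >&2; return 1 ;;
  esac
}
```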

Interface Contract

  • Ollama API: http://localhost:11434 (native Ollama)
  • OpenAI-compatible API: http://localhost:11434/v1 — used by P3, P4, P7
  • Open WebUI: http://localhost:3030

Add to ~/server/.env.services:

OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
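Scripts that source .env.services should fail fast when a URL is missing rather than curl an empty string. A bash sketch (the helper name is mine):

```shell
#!/usr/bin/env bash
# Fail with a message if any named environment variable is unset or empty.
require_env() {
  local name
  for name in "$@"; do
    [ -n "${!name:-}" ] || { echo "missing required env var: $name" >&2; return 1; }
  done
}

# usage after `source ~/server/.env.services`:
# require_env OLLAMA_URL OLLAMA_API_URL OPEN_WEBUI_URL
```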

Implementation Steps

  • Install Ollama via brew
  • Verify ollama serve starts and responds at port 11434
  • Write launchd plist, load it, verify auto-start on reboot
  • Write ollama-models.txt with model list
  • Run scripts/pull-models.sh — pull all models (allow time for large downloads)
  • Run scripts/benchmark.sh — record results in benchmark-results.md
  • Deploy Open WebUI via Docker compose
  • Verify Open WebUI can chat with all models
  • Add OLLAMA_URL and OPEN_WEBUI_URL to .env.services
  • Add Ollama and Open WebUI monitors to Uptime Kuma

Success Criteria

  • curl http://localhost:11434/api/tags returns all expected models
  • llama3.3:70b generates a coherent response in Open WebUI
  • Ollama survives Mac Mini reboot without manual intervention
  • Benchmark results documented — at least one model achieving >10 tok/s
  • Open WebUI accessible on port 3030, both locally and from other devices over Tailscale
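The >10 tok/s criterion can be checked mechanically once benchmark-results.md exists; a sketch assuming a simple `model tokens_per_sec` line format (that file layout is an assumption, not something the plan fixes):

```shell
#!/usr/bin/env bash
# Succeeds if any line's second column exceeds the minimum tokens/s.
#   usage: any_model_faster_than 10 scripts/benchmark-results.md
any_model_faster_than() {
  awk -v min="$1" '$2 + 0 > min { found = 1 } END { exit !found }' "$2"
}
```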