homeai/homeai-llm/PLAN.md
Aodhan Collins 38247d7cc4 Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00


P2: homeai-llm — Local LLM Runtime

Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing


Goal

Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).


Why Native (not Docker)

Ollama must run natively — not in Docker — because:

  • Docker on Mac cannot access Apple Metal GPU (runs in a Linux VM)
  • Native Ollama uses Metal for GPU acceleration, giving 35× faster inference
  • Ollama's launchd integration keeps it alive across reboots

Deliverables

1. Ollama Installation

# Install
brew install ollama

# Or download the macOS app from https://ollama.com/download
# (the install.sh one-liner targets Linux, not macOS)

Ollama runs as a background process. Configure as a launchd service for reboot survival.

launchd plist: ~/Library/LaunchAgents/com.ollama.ollama.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string> <!-- /usr/local/bin/ollama on Intel Macs -->
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>

Load: launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist (on current macOS, launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.ollama.ollama.plist is the non-deprecated equivalent)
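The plist hard-codes the ollama binary path, which differs between Homebrew prefixes. A small sketch (the helper name is mine, and it assumes default Homebrew install locations) that picks the right path by architecture when generating the plist:

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the Homebrew ollama path for the plist's
# ProgramArguments, assuming default Homebrew prefixes.
ollama_bin() {
  case "$(uname -m)" in
    arm64) echo /opt/homebrew/bin/ollama ;;  # Apple Silicon
    *)     echo /usr/local/bin/ollama ;;     # Intel
  esac
}
```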

2. Model Manifest — ollama-models.txt

Pinned models pulled to Mac Mini:

# Primary — high quality responses
llama3.3:70b
qwen2.5:72b

# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b

# Code — for n8n/skill writing assistance
qwen2.5-coder:32b

# Embedding — for mem0 semantic search
nomic-embed-text

Pull script (scripts/pull-models.sh):

#!/usr/bin/env bash
set -euo pipefail
# Resolve the manifest relative to this script, not the caller's cwd
MANIFEST="$(cd "$(dirname "$0")/.." && pwd)/ollama-models.txt"
while IFS= read -r model; do
  # Skip comment and blank lines
  [[ "$model" =~ ^# || -z "$model" ]] && continue
  echo "Pulling $model..."
  ollama pull "$model"
done < "$MANIFEST"
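After pulling, it is worth diffing the manifest against what Ollama actually has before marking this step done. A sketch (the helper name is mine; it assumes `ollama list` prints one model per line with the name in the first column):

```shell
#!/usr/bin/env bash
# Hypothetical check: print manifest models absent from `ollama list` output.
#   usage: missing_models ollama-models.txt <(ollama list)
missing_models() {
  local manifest="$1" listed="$2"
  while IFS= read -r model; do
    [[ "$model" =~ ^# || -z "$model" ]] && continue   # skip comments/blanks
    grep -q "^${model}" "$listed" || echo "$model"
  done < "$manifest"
}
```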

3. Open WebUI — Docker

Open WebUI connects to Ollama over the Docker-to-host bridge (host.docker.internal):

docker/open-webui/docker-compose.yml:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - ./open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3030:8080"
    networks:
      - homeai
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  homeai:
    external: true

Port 3030 chosen to avoid conflict with Gitea (3000).
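Since the host port was picked specifically to dodge Gitea, a pre-flight check that it is actually free can save a failed `docker compose up` (bash-only sketch using the `/dev/tcp` pseudo-device; the helper name is mine):

```shell
#!/usr/bin/env bash
# Succeeds if nothing is listening on the given localhost TCP port.
port_free() {
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# e.g. guard the deploy:
# port_free 3030 || { echo "port 3030 already in use" >&2; exit 1; }
```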

4. Benchmark Script — scripts/benchmark.sh

Measures tokens/sec for each model to inform model selection per task:

#!/usr/bin/env bash
set -euo pipefail
PROMPT="Tell me a joke about computers."
for model in llama3.3:70b qwen2.5:72b qwen2.5:7b qwen2.5-coder:32b; do
  echo "=== $model ==="
  # --verbose prints timing stats after the response, including "eval rate" in tokens/s
  # (plain `time` only gives wall-clock, not tokens/sec)
  ollama run --verbose "$model" "$PROMPT"
done

Results documented in scripts/benchmark-results.md.
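`ollama run --verbose` emits its stats on stderr in lines like `eval rate: 34.21 tokens/s`; a small parser (the line format is an assumption about current Ollama output) makes the numbers easy to log into benchmark-results.md:

```shell
#!/usr/bin/env bash
# Extract generation speed (tokens/s) from `ollama run --verbose` stats on stdin.
# Assumes the "eval rate:   NN.NN tokens/s" line format of current Ollama builds.
eval_rate() {
  awk '/^eval rate:/ { print $(NF-1) }'
}

# usage: ollama run --verbose qwen2.5:7b "hi" 2>&1 >/dev/null | eval_rate
```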

5. API Verification

# Check Ollama is running
curl http://localhost:11434/api/tags

# Test OpenAI-compatible endpoint (used by P3, P4)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
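These verification curls will fail spuriously right after a reboot while Ollama is still coming up; a small retry gate (the function name is mine) avoids that in scripted checks:

```shell
#!/usr/bin/env bash
# Poll /api/tags until Ollama answers, up to ATTEMPTS one-second tries.
#   usage: wait_for_ollama [URL] [ATTEMPTS]
wait_for_ollama() {
  local url="${1:-http://localhost:11434}" attempts="${2:-60}" i
  for ((i = 0; i < attempts; i++)); do
    curl -fsS "$url/api/tags" >/dev/null 2>&1 && return 0
    sleep 1
  done
  echo "Ollama not reachable at $url" >&2
  return 1
}
```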

6. Model Selection Guide

Document in scripts/benchmark-results.md after benchmarking:

| Task                     | Model             | Reason            |
|--------------------------|-------------------|-------------------|
| Main conversation        | llama3.3:70b      | Best quality      |
| Quick/real-time tasks    | qwen2.5:7b        | Lowest latency    |
| Code generation (skills) | qwen2.5-coder:32b | Best code quality |
| Embeddings (mem0)        | nomic-embed-text  | Compact, fast     |
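Downstream consumers (P3, P4, P7) can encode this selection once instead of hard-coding model names at every call site; a sketch (the function name and task labels are mine):

```shell
#!/usr/bin/env bash
# Hypothetical task→model router mirroring the selection guide.
model_for() {
  case "$1" in
    chat)  echo llama3.3:70b ;;       # main conversation
    fast)  echo qwen2.5:7b ;;         # quick/real-time tasks
    code)  echo qwen2.5-coder:32b ;;  # skill/code generation
    embed) echo nomic-embed-text ;;   # mem0 embeddings
    *)     echo "unknown task: $1" >&2; return 1 ;;
  esac
}
```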

Interface Contract

  • Ollama API: http://localhost:11434 (native Ollama)
  • OpenAI-compatible API: http://localhost:11434/v1 — used by P3, P4, P7
  • Open WebUI: http://localhost:3030

Add to ~/server/.env.services:

OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
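Scripts that source .env.services should fail fast when a URL is missing rather than curl an empty string. A bash sketch (the helper name is mine):

```shell
#!/usr/bin/env bash
# Fail with a message if any named environment variable is unset or empty.
require_env() {
  local name
  for name in "$@"; do
    [ -n "${!name:-}" ] || { echo "missing required env var: $name" >&2; return 1; }
  done
}

# usage after `source ~/server/.env.services`:
# require_env OLLAMA_URL OLLAMA_API_URL OPEN_WEBUI_URL
```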

Implementation Steps

  • Install Ollama via brew
  • Verify ollama serve starts and responds at port 11434
  • Write launchd plist, load it, verify auto-start on reboot
  • Write ollama-models.txt with model list
  • Run scripts/pull-models.sh — pull all models (allow time for large downloads)
  • Run scripts/benchmark.sh — record results in benchmark-results.md
  • Deploy Open WebUI via Docker compose
  • Verify Open WebUI can chat with all models
  • Add OLLAMA_URL and OPEN_WEBUI_URL to .env.services
  • Add Ollama and Open WebUI monitors to Uptime Kuma

Success Criteria

  • curl http://localhost:11434/api/tags returns all expected models
  • llama3.3:70b generates a coherent response in Open WebUI
  • Ollama survives Mac Mini reboot without manual intervention
  • Benchmark results documented — at least one model achieving >10 tok/s
  • Open WebUI accessible on port 3030, both locally and from other devices over Tailscale
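The >10 tok/s criterion can be checked mechanically once benchmark-results.md exists; a sketch assuming a simple `model tokens_per_sec` line format (that file layout is an assumption, not something the plan fixes):

```shell
#!/usr/bin/env bash
# Succeeds if any line's second column exceeds the minimum tokens/s.
#   usage: any_model_faster_than 10 scripts/benchmark-results.md
any_model_faster_than() {
  awk -v min="$1" '$2 + 0 > min { found = 1 } END { exit !found }' "$2"
}
```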