Aodhan Collins 6a0bae2a0b feat(phase-04): Wyoming Satellite integration + OpenClaw HA components
## Voice Pipeline (P3)
- Replace openWakeWord daemon with Wyoming Satellite approach
- Add Wyoming Satellite service on port 10700 for HA voice pipeline
- Update setup.sh with cross-platform sed compatibility (macOS/Linux)
- Add version field to Kokoro TTS voice info
- Update launchd service loader to use Wyoming Satellite

## Home Assistant Integration (P4)
- Add custom conversation agent component (openclaw_conversation)
  - Fix: Use IntentResponse instead of plain strings (HA API requirement)
  - Support both HTTP API and CLI fallback modes
  - Config flow for easy HA UI setup
- Add OpenClaw bridge scripts (Python + Bash)
- Add ha-ctl utility for HA entity control
  - Fix: Use context manager for token file reading
- Add HA configuration examples and documentation

## Infrastructure
- Add mem0 backup automation (launchd + script)
- Add n8n workflow templates (morning briefing, notification router)
- Add VS Code workspace configuration
- Reorganize model files into categorized folders:
  - lmstudio-community/
  - mlx-community/
  - bartowski/
  - mradermacher/

## Documentation
- Update PROJECT_PLAN.md with Wyoming Satellite architecture
- Update TODO.md with completed Wyoming integration tasks
- Add OPENCLAW_INTEGRATION.md for HA setup guide

## Testing
- Verified Wyoming services running (STT:10300, TTS:10301, Satellite:10700)
- Verified OpenClaw CLI accessibility
- Confirmed cross-platform compatibility fixes
2026-03-08 02:06:37 +00:00


HomeAI — Full Project Plan

Last updated: 2026-03-04


Overview

This project builds a self-hosted, always-on AI assistant running entirely on a Mac Mini M4 Pro. It is decomposed into 8 sub-projects that can be developed in parallel where dependencies allow, then bridged via well-defined interfaces.

The guiding principle: each sub-project exposes a clean API/config surface. No project hard-codes knowledge of another's internals.


Sub-Project Map

| ID | Name | Description | Primary Language |
|----|------|-------------|------------------|
| P1 | homeai-infra | Docker stack, networking, monitoring, secrets | YAML / Shell |
| P2 | homeai-llm | Ollama + Open WebUI setup, model management | YAML / Shell |
| P3 | homeai-voice | STT, TTS, Wyoming bridge, wake word | Python / Shell |
| P4 | homeai-agent | OpenClaw config, skills, n8n workflows, mem0 | Python / JSON |
| P5 | homeai-character | Character Manager UI, persona JSON schema, voice clone | React / JSON |
| P6 | homeai-esp32 | ESPHome firmware, Wyoming Satellite, LVGL face | C++ / YAML |
| P7 | homeai-visual | VTube Studio bridge, Live2D expression mapping | Python / JSON |
| P8 | homeai-images | ComfyUI workflows, model management, ControlNet | Python / JSON |

All repos live under ~/gitea/homeai/ on the Mac Mini and are mirrored to the self-hosted Gitea instance (set up in P1).


Phase 1 — Foundation (P1 + P2)

Goal: Everything containerised, stable, accessible remotely. LLM responsive via browser.

P1: homeai-infra

Deliverables:

  • docker-compose.yml — master compose file (or per-service files under ~/server/docker/)
  • Services: Home Assistant, Portainer, Uptime Kuma, Gitea, code-server
  • Tailscale installed on Mac Mini, all services on Tailnet
  • Gitea repos initialised, SSH keys configured
  • Uptime Kuma monitors all service endpoints
  • Docker restart policies: unless-stopped on all containers
  • Documented .env file pattern (secrets never committed)

Key decisions:

  • Single docker-compose.yml vs per-service compose files — recommend per-service files in ~/server/docker/<service>/ orchestrated by a root Makefile
  • Tailscale as sole remote access method (no public port forwarding)
  • Authelia deferred to Phase 4 polish (internal LAN services don't need 2FA immediately)

Interface contract: Exposes service URLs as env vars (e.g. HA_URL, GITEA_URL) written to ~/server/.env.services — consumed by all other projects.
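As an illustration, a consumer project might read that file with a few lines of Python. The parser below is a sketch; the actual key names are whatever P1 writes to `.env.services`:

```python
from pathlib import Path

def load_service_env(path: str = "~/server/.env.services") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in Path(path).expanduser().read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env
```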


P2: homeai-llm

Deliverables:

  • Ollama installed natively on Mac Mini (not Docker — needs Metal GPU access)
  • Models pulled: llama3.3:70b, qwen2.5:72b (and a fast small model: qwen2.5:7b for low-latency tasks)
  • Open WebUI running as Docker container, connected to Ollama
  • Model benchmark script — measures tokens/sec per model
  • ollama-models.txt — pinned model manifest for reproducibility
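The benchmark arithmetic is simple: Ollama's generate stats report `eval_count` (tokens produced) and `eval_duration` (nanoseconds), so the core of the benchmark script reduces to one division. Field names below follow Ollama's `/api/generate` response and should be verified against the installed version:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """tok/s from Ollama generate stats: eval_count tokens over
    eval_duration nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)
```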

Key decisions:

  • Ollama runs as a launchd service (~/Library/LaunchAgents/) to survive reboots
  • Open WebUI exposed only on Tailnet
  • API endpoint: http://localhost:11434 (Ollama default)

Interface contract: Ollama OpenAI-compatible API at http://localhost:11434/v1 — used by P3, P4, P7.


Phase 2 — Voice Pipeline (P3)

Goal: Full end-to-end voice: speak → transcribe → LLM → TTS → hear response. No ESP32 yet — test with a USB mic on Mac Mini.

P3: homeai-voice

Deliverables:

  • Whisper.cpp compiled for Apple Silicon, model downloaded (medium.en or large-v3)
  • Kokoro TTS installed, tested, latency benchmarked
  • Chatterbox TTS installed (MPS optimised build), voice reference .wav ready
  • Qwen3-TTS via MLX installed as fallback
  • openWakeWord running on Mac Mini, detecting wake word (initial approach; superseded by the Wyoming satellite, see note below)
  • Wyoming protocol server running — bridges STT+TTS into Home Assistant
  • Home Assistant voice_assistant pipeline configured end-to-end
  • Test script: test_voice_pipeline.sh — mic in → spoken response out

Sub-components:

[Mic] → Wyoming Satellite (port 10700) → Home Assistant Voice Pipeline → Wyoming STT (Whisper)
                                                                      ↓
[Speaker] ← Wyoming TTS (Kokoro) ← OpenClaw Agent ← transcribed text

Note: The original openWakeWord daemon has been replaced by the Wyoming satellite approach, which handles wake word detection through Home Assistant's voice pipeline.

Key decisions:

  • Whisper.cpp runs as a Wyoming STT provider (via wyoming-faster-whisper)
  • Kokoro is primary TTS; Chatterbox used when voice cloning is active (P5)
  • Wyoming satellite runs on port 10700 — handles audio I/O and connects to HA voice pipeline
  • openWakeWord daemon disabled — wake word detection now handled by HA via Wyoming satellite
  • Wyoming server ports: 10300 (STT), 10301 (TTS), 10700 (Satellite) — standard Wyoming ports

Interface contract:

  • Wyoming STT: tcp://localhost:10300 (Whisper large-v3)
  • Wyoming TTS: tcp://localhost:10301 (Kokoro ONNX)
  • Wyoming Satellite: tcp://localhost:10700 (Mac Mini audio I/O)
  • Direct Python API for P4 (agent bypasses Wyoming for non-HA calls)
  • OpenClaw Bridge: homeai-agent/skills/home-assistant/openclaw_bridge.py (HA integration)

Phase 3 — AI Agent & Character (P4 + P5)

Goal: OpenClaw receives voice/text input, applies character persona, calls tools, returns rich responses.

P4: homeai-agent

Deliverables:

  • OpenClaw installed and configured
  • Connected to Ollama (llama3.3:70b as primary model)
  • Connected to Home Assistant (long-lived access token in config)
  • mem0 installed, configured with local storage backend
  • mem0 backup job: daily git commit to Gitea
  • Core skills written:
    • home_assistant.py — call HA services (lights, switches, scenes)
    • memory.py — read/write mem0 memories
    • weather.py — local weather via HA sensor data
    • timer.py — set timers/reminders
    • music.py — stub for Music Assistant (Phase 7)
  • n8n running as Docker container, webhook trigger from OpenClaw
  • Sample n8n workflow: morning briefing (time + weather + calendar)
  • System prompt template: loads character JSON from P5

Key decisions:

  • OpenClaw config at ~/.openclaw/config.yaml
  • Skills at ~/.openclaw/skills/ — one file per skill, auto-discovered
  • System prompt: ~/.openclaw/characters/<active>.json loaded at startup
  • mem0 store: local file backend at ~/.openclaw/memory/ (SQLite)
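Auto-discovery of a one-file-per-skill directory can be sketched with `importlib`. This is an illustrative pattern under the conventions above, not OpenClaw's actual loader:

```python
import importlib.util
from pathlib import Path

def discover_skills(skills_dir: str) -> dict:
    """Load every *.py file in the skills directory as a module,
    keyed by file stem (e.g. weather.py -> 'weather')."""
    skills = {}
    for path in sorted(Path(skills_dir).expanduser().glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs top-level skill code
        skills[path.stem] = module
    return skills
```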

Interface contract:

  • OpenClaw exposes a local HTTP API (default port 8080) — used by P3 (voice pipeline hands off transcribed text here)
  • Consumes character JSON from P5

P5: homeai-character

Deliverables:

  • Character Manager UI (character-manager.jsx) — already exists, needs wiring
  • Character JSON schema v1 defined and documented
  • Export produces ~/.openclaw/characters/<name>.json
  • Fields: name, system_prompt, voice_ref_path, tts_engine, live2d_expressions, vtube_ws_triggers, custom_rules, model_overrides
  • Validation: schema validator script rejects malformed exports
  • Sample character: aria.json (default assistant persona)
  • Voice clone: reference .wav recorded/sourced, placed at ~/voices/<name>.wav
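The validator deliverable could start as small as this sketch. The field list is taken from the schema fields above; the real validator should check against character.schema.json:

```python
REQUIRED_FIELDS = {
    "schema_version", "name", "system_prompt", "voice_ref_path",
    "tts_engine", "live2d_expressions", "vtube_ws_triggers",
    "custom_rules", "model_overrides",
}

def validate_character(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the export is valid."""
    problems = []
    if doc.get("schema_version") != 1:
        problems.append(f"unsupported schema_version: {doc.get('schema_version')!r}")
    missing = REQUIRED_FIELDS - doc.keys()
    problems.extend(f"missing field: {f}" for f in sorted(missing))
    return problems
```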

Key decisions:

  • JSON schema is versioned ("schema_version": 1) — pipeline components check version before loading
  • Character Manager is a local React app (served by Vite dev server or built to static files)
  • Single active character at a time; OpenClaw watches the file for changes (hot reload)

Interface contract:

  • Output: ~/.openclaw/characters/<name>.json — consumed by P4, P3 (TTS voice selection), P7 (expression mapping)
  • Schema published in homeai-character/schema/character.schema.json

Phase 4 — Hardware Satellites (P6)

Goal: ESP32-S3-BOX-3 units act as room presence nodes — wake word, mic input, audio output, animated face.

P6: homeai-esp32

Deliverables:

  • ESPHome config for ESP32-S3-BOX-3 (esphome/s3-box-living-room.yaml, etc.)
  • Wyoming Satellite component configured — streams mic audio to Mac Mini Wyoming STT
  • Audio playback: receives TTS audio from Mac Mini, plays via built-in speaker
  • LVGL face: animated idle/speaking/thinking states
  • Wake word: either on-device (microWakeWord via ESPHome) or streamed to the Mac Mini and handled by the HA voice pipeline (the standalone openWakeWord daemon was retired in P3)
  • OTA update mechanism configured
  • One unit per room — config templated with room name as variable

LVGL Face States:

| State | Animation |
|-------|-----------|
| Idle | Slow blink, gentle sway |
| Listening | Eyes wide, mic indicator |
| Thinking | Eyes narrow, loading dots |
| Speaking | Mouth animation synced to audio |
| Error | Red eyes, shake |

Key decisions:

  • Wake word on-device preferred (lower latency, no always-on network stream)
  • microWakeWord model: hey_jarvis or custom trained word
  • LVGL animations compiled into ESPHome firmware (no runtime asset loading)
  • Each unit has a unique device name for HA entity naming

Interface contract:

  • Wyoming Satellite → Mac Mini Wyoming STT server (tcp://<mac-mini-ip>:10300)
  • Receives audio back via Wyoming TTS response
  • LVGL state driven by Home Assistant entity state (HA → ESPHome event)

Phase 5 — Visual Layer (P7)

Goal: VTube Studio shows Live2D model on desktop/mobile; expressions driven by AI pipeline state.

P7: homeai-visual

Deliverables:

  • VTube Studio installed on Mac Mini (macOS app)
  • Live2D model loaded (sourced from nizima.com or booth.pm)
  • VTube Studio WebSocket API enabled (port 8001)
  • OpenClaw skill: vtube_studio.py
    • Connects to VTube Studio WebSocket
    • Auth token exchange and persistence
    • Methods: trigger_expression(name), trigger_hotkey(name), set_parameter(name, value)
  • Expression map in character JSON → VTube hotkey IDs
  • Lip sync: driven by audio envelope or TTS phoneme timing
  • Mobile: VTube Studio on iOS/Android connected to same model via Tailscale
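The trigger_expression / trigger_hotkey methods ultimately send a hotkey-trigger message over the WebSocket. A sketch of the message builder follows; the shape is based on the VTube Studio public API, so the messageType and field names should be verified against the installed version:

```python
import json

def build_hotkey_request(hotkey_id: str, request_id: str = "homeai-1") -> str:
    """JSON message for the VTube Studio WebSocket API (shape assumed
    from the public API docs; confirm before relying on it)."""
    return json.dumps({
        "apiName": "VTubeStudioPublicAPI",
        "apiVersion": "1.0",
        "requestID": request_id,
        "messageType": "HotkeyTriggerRequest",
        "data": {"hotkeyID": hotkey_id},
    })
```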

Key decisions:

  • Expression trigger events: idle, speaking, thinking, happy, sad, error
  • Lip sync approach: simple amplitude-based (fast) rather than phoneme-based (complex) initially
  • Auth token stored at ~/.openclaw/vtube_token.json
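Amplitude-based lip sync reduces to mapping each audio frame's RMS level onto a 0..1 mouth-open parameter. A minimal sketch, with an illustrative noise floor; real use would smooth the value across frames:

```python
import math

def mouth_open(samples: list[float], floor: float = 0.01) -> float:
    """Map an audio frame's RMS amplitude to a 0..1 mouth-open value."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms <= floor:
        return 0.0  # below the noise floor: mouth closed
    return min(1.0, rms)
```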

Interface contract:

  • OpenClaw calls vtube_studio.trigger_expression(event) from within response pipeline
  • Event names defined in character JSON live2d_expressions field

Phase 6 — Image Generation (P8)

Goal: ComfyUI online with character-consistent image generation workflows.

P8: homeai-images

Deliverables:

  • ComfyUI installed at ~/ComfyUI/, running via launchd
  • Models downloaded: SDXL base, Flux.1-dev (or schnell), ControlNet (canny, depth)
  • Character LoRA: trained on character reference images for consistent appearance
  • Saved workflows:
    • workflows/portrait.json — character portrait, controllable expression
    • workflows/scene.json — character in scene with ControlNet pose
    • workflows/quick.json — fast draft via Flux.1-schnell
  • OpenClaw skill: comfyui.py — submits workflow via ComfyUI REST API, returns image path
  • ComfyUI API port: 8188
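The comfyui.py skill essentially wraps ComfyUI's POST /prompt endpoint, which expects the workflow graph under a "prompt" key plus a client id. A sketch of the request-body builder, with the endpoint shape assumed from ComfyUI's HTTP API and worth confirming against the installed version:

```python
import json
from pathlib import Path

def build_prompt_submission(workflow_path: str, client_id: str = "homeai") -> dict:
    """Load a saved (API-format) workflow JSON and wrap it in the body
    ComfyUI's POST /prompt endpoint expects."""
    graph = json.loads(Path(workflow_path).expanduser().read_text())
    return {"prompt": graph, "client_id": client_id}
```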

Interface contract:

  • OpenClaw calls comfyui.generate(workflow_name, params) → returns local image path
  • ComfyUI REST API: http://localhost:8188

Phase 7 — Extended Integrations & Polish

Deliverables:

  • Music Assistant — Docker container, integrated with HA, OpenClaw music.py skill updated
  • Snapcast — server on Mac Mini, clients on ESP32 units (multi-room sync)
  • Authelia — 2FA in front of all web UIs exposed via Tailscale
  • n8n advanced workflows: daily briefing, calendar reminders, notification routing
  • iOS Shortcuts companion: trigger OpenClaw from iPhone widget
  • Uptime Kuma alerts: pushover/ntfy notifications on service down
  • Backup automation: daily Gitea commits of mem0, character configs, n8n workflows

Dependency Graph

P1 (infra) ─────────────────────────────┐
P2 (llm)   ──────────────────────┐      │
P3 (voice) ────────────────┐     │      │
P5 (character) ──────┐     │     │      │
                      ↓     ↓     ↓      ↓
                      P4 (agent) ─────→ HA
                      ↓
            P6 (esp32) ← Wyoming
            P7 (visual) ← vtube skill
            P8 (images) ← comfyui skill

Hard dependencies:

  • P4 requires P1 (HA URL), P2 (Ollama), P5 (character JSON)
  • P3 requires P2 (LLM), P4 (agent endpoint)
  • P6 requires P3 (Wyoming server), P1 (HA)
  • P7 requires P4 (OpenClaw skill runner), P5 (expression map)
  • P8 requires P4 (OpenClaw skill runner)

Can be done in parallel:

  • P1 + P5 (infra and character manager are independent)
  • P2 + P5 (LLM setup and character UI are independent)
  • P7 + P8 (visual and images are both P4 dependents but independent of each other)

Interface Contracts Summary

| Contract | Type | Defined In | Consumed By |
|----------|------|------------|-------------|
| ~/server/.env.services | env file | P1 | All |
| Ollama API (localhost:11434/v1) | HTTP (OpenAI compat) | P2 | P3, P4, P7 |
| Wyoming STT (localhost:10300) | TCP/Wyoming | P3 | P6, HA |
| Wyoming TTS (localhost:10301) | TCP/Wyoming | P3 | P6, HA |
| Wyoming Satellite (localhost:10700) | TCP/Wyoming | P3 | HA |
| OpenClaw API (localhost:8080) | HTTP | P4 | P3, P7, P8 |
| Character JSON (~/.openclaw/characters/) | JSON file | P5 | P4, P3, P7 |
| character.schema.json v1 | JSON Schema | P5 | P4, P3, P7 |
| VTube Studio WS (localhost:8001) | WebSocket | VTube Studio | P7 |
| ComfyUI API (localhost:8188) | HTTP | ComfyUI | P8 |
| Home Assistant API | HTTP/WS | P1 (HA) | P4, P6 |

Repo Structure (Gitea)

~/gitea/homeai/
├── homeai-infra/          # P1
│   ├── docker/            # per-service compose files
│   ├── scripts/           # setup/teardown helpers
│   └── Makefile
├── homeai-llm/            # P2
│   ├── ollama-models.txt
│   └── scripts/
├── homeai-voice/          # P3
│   ├── whisper/
│   ├── tts/
│   ├── wyoming/
│   └── scripts/
├── homeai-agent/          # P4
│   ├── skills/
│   ├── workflows/         # n8n exports
│   └── config/
├── homeai-character/      # P5
│   ├── src/               # React character manager
│   ├── schema/
│   └── characters/        # exported JSONs
├── homeai-esp32/          # P6
│   └── esphome/
├── homeai-visual/         # P7
│   └── skills/
└── homeai-images/         # P8
    ├── workflows/          # ComfyUI workflow JSONs
    └── skills/

Suggested Build Order

| Week | Focus | Projects |
|------|-------|----------|
| 1 | Infrastructure up, LLM running | P1, P2 |
| 2 | Voice pipeline end-to-end (desktop mic test) | P3 |
| 3 | Character Manager wired, OpenClaw connected | P4, P5 |
| 4 | ESP32 firmware, first satellite running | P6 |
| 5 | VTube Studio live, expressions working | P7 |
| 6 | ComfyUI online, character LoRA trained | P8 |
| 7+ | Extended integrations, polish, Authelia | Phase 7 |

Open Questions / Decisions Needed

  • Which OpenClaw version/fork to use? (confirm it supports Ollama natively)
  • Wake word: hey_jarvis vs custom trained word — what should the character's name be?
  • Live2D model: commission custom or buy from nizima.com? Budget?
  • Snapcast: output to ESP32 speakers or separate audio hardware per room?
  • n8n: self-hosted Docker vs n8n Cloud (given local-first preference → Docker)
  • Authelia: local user store or LDAP backend? (local store is simpler)
  • mem0: local SQLite or run Qdrant vector DB for better semantic search?