Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm, homeai-voice, homeai-agent, homeai-character, homeai-esp32, homeai-visual, homeai-images). Includes per-project PLAN.md files, top-level PROJECT_PLAN.md, and master TODO.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# HomeAI — Full Project Plan

> Last updated: 2026-03-04

---

## Overview

This project builds a self-hosted, always-on AI assistant running entirely on a Mac Mini M4 Pro. It is decomposed into **8 sub-projects** that can be developed in parallel where dependencies allow, then bridged via well-defined interfaces.

The guiding principle: each sub-project exposes a clean API/config surface. No project hard-codes knowledge of another's internals.

---

## Sub-Project Map

| ID | Name | Description | Primary Language |
|---|---|---|---|
| P1 | `homeai-infra` | Docker stack, networking, monitoring, secrets | YAML / Shell |
| P2 | `homeai-llm` | Ollama + Open WebUI setup, model management | YAML / Shell |
| P3 | `homeai-voice` | STT, TTS, Wyoming bridge, wake word | Python / Shell |
| P4 | `homeai-agent` | OpenClaw config, skills, n8n workflows, mem0 | Python / JSON |
| P5 | `homeai-character` | Character Manager UI, persona JSON schema, voice clone | React / JSON |
| P6 | `homeai-esp32` | ESPHome firmware, Wyoming Satellite, LVGL face | C++ / YAML |
| P7 | `homeai-visual` | VTube Studio bridge, Live2D expression mapping | Python / JSON |
| P8 | `homeai-images` | ComfyUI workflows, model management, ControlNet | Python / JSON |

All repos live under `~/gitea/homeai/` on the Mac Mini and are mirrored to the self-hosted Gitea instance (set up in P1).

---

## Phase 1 — Foundation (P1 + P2)

**Goal:** Everything containerised, stable, accessible remotely. LLM responsive via browser.

### P1: `homeai-infra`

**Deliverables:**

- [ ] `docker-compose.yml` — master compose file (or per-service files under `~/server/docker/`)
- [ ] Services: Home Assistant, Portainer, Uptime Kuma, Gitea, code-server
- [ ] Tailscale installed on Mac Mini, all services on Tailnet
- [ ] Gitea repos initialised, SSH keys configured
- [ ] Uptime Kuma monitors all service endpoints
- [ ] Docker restart policies: `unless-stopped` on all containers
- [ ] Documented `.env` file pattern (secrets never committed)

**Key decisions:**

- Single `docker-compose.yml` vs per-service compose files — recommend per-service files in `~/server/docker/<service>/` orchestrated by a root `Makefile`
- Tailscale as sole remote access method (no public port forwarding)
- Authelia deferred to the Phase 7 polish pass (internal LAN services don't need 2FA immediately)

**Interface contract:** Exposes service URLs as env vars (e.g. `HA_URL`, `GITEA_URL`) written to `~/server/.env.services` — consumed by all other projects.
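
Consumers of this contract only need a tiny env-file parser. A minimal sketch (the `HA_URL`/`GITEA_URL` names are from the contract above; the quoting rules assumed here are simple `KEY=VALUE` lines):

```python
import os

def load_env_services(path: str) -> dict[str, str]:
    """Parse a KEY=VALUE env file; comments and blank lines are ignored."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip().strip('"')
    return env

if __name__ == "__main__":
    services = load_env_services(os.path.expanduser("~/server/.env.services"))
    ha_url = services.get("HA_URL", "http://localhost:8123")
```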

---

### P2: `homeai-llm`

**Deliverables:**

- [ ] Ollama installed natively on Mac Mini (not Docker — needs Metal GPU access)
- [ ] Models pulled: `llama3.3:70b`, `qwen2.5:72b` (and a fast small model, `qwen2.5:7b`, for low-latency tasks)
- [ ] Open WebUI running as Docker container, connected to Ollama
- [ ] Model benchmark script — measures tokens/sec per model
- [ ] `ollama-models.txt` — pinned model manifest for reproducibility
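
The benchmark script can lean on metrics Ollama already returns: `/api/generate` responses include `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A sketch:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds; convert to tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str = "Describe the Wyoming protocol briefly.") -> float:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])

if __name__ == "__main__":
    for model in ("qwen2.5:7b", "qwen2.5:72b", "llama3.3:70b"):
        print(f"{model}: {benchmark(model):.1f} tok/s")
```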

**Key decisions:**

- Ollama runs as a launchd service (`~/Library/LaunchAgents/`) to survive reboots
- Open WebUI exposed only on Tailnet
- API endpoint: `http://localhost:11434` (Ollama default)

**Interface contract:** Ollama OpenAI-compatible API at `http://localhost:11434/v1` — used by P3, P4, P7.
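
Downstream projects talk to this endpoint with plain OpenAI-style chat requests; a minimal stdlib sketch (no client library assumed):

```python
import json
import urllib.request

def build_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Standard OpenAI-style message list: persona first, then the user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]

def chat(model: str, messages: list[dict],
         base_url: str = "http://localhost:11434/v1") -> str:
    """POST to the OpenAI-compatible chat endpoint, return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("qwen2.5:7b", build_messages("You are a home assistant.", "Hello!")))
```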

---

## Phase 2 — Voice Pipeline (P3)

**Goal:** Full end-to-end voice: speak → transcribe → LLM → TTS → hear response. No ESP32 yet — test with a USB mic on the Mac Mini.

### P3: `homeai-voice`

**Deliverables:**

- [ ] Whisper.cpp compiled for Apple Silicon, model downloaded (`medium.en` or `large-v3`)
- [ ] Kokoro TTS installed, tested, latency benchmarked
- [ ] Chatterbox TTS installed (MPS-optimised build), voice reference `.wav` ready
- [ ] Qwen3-TTS via MLX installed as fallback
- [ ] openWakeWord running on Mac Mini, detecting wake word
- [ ] Wyoming protocol server running — bridges STT + TTS into Home Assistant
- [ ] Home Assistant `voice_assistant` pipeline configured end-to-end
- [ ] Test script: `test_voice_pipeline.sh` — mic in → spoken response out

**Sub-components:**

```
[Mic] → openWakeWord → Wyoming STT (Whisper.cpp) → [text out]
[text in] → Wyoming TTS (Kokoro) → [audio out]
```

**Key decisions:**

- Whisper.cpp runs as a Wyoming STT provider (via `wyoming-faster-whisper` or a native Wyoming adapter)
- Kokoro is primary TTS; Chatterbox used when voice cloning is active (P5)
- openWakeWord runs as a launchd service
- Wyoming server ports: `10300` (STT), `10301` (TTS) — the standard Wyoming ports

**Interface contract:**

- Wyoming STT: `tcp://localhost:10300`
- Wyoming TTS: `tcp://localhost:10301`
- Direct Python API for P4 (agent bypasses Wyoming for non-HA calls)

---

## Phase 3 — AI Agent & Character (P4 + P5)

**Goal:** OpenClaw receives voice/text input, applies character persona, calls tools, returns rich responses.

### P4: `homeai-agent`

**Deliverables:**

- [ ] OpenClaw installed and configured
- [ ] Connected to Ollama (`llama3.3:70b` as primary model)
- [ ] Connected to Home Assistant (long-lived access token in config)
- [ ] mem0 installed, configured with local storage backend
- [ ] mem0 backup job: daily git commit to Gitea
- [ ] Core skills written:
  - `home_assistant.py` — call HA services (lights, switches, scenes)
  - `memory.py` — read/write mem0 memories
  - `weather.py` — local weather via HA sensor data
  - `timer.py` — set timers/reminders
  - `music.py` — stub for Music Assistant (Phase 7)
- [ ] n8n running as Docker container, webhook trigger from OpenClaw
- [ ] Sample n8n workflow: morning briefing (time + weather + calendar)
- [ ] System prompt template: loads character JSON from P5

**Key decisions:**

- OpenClaw config at `~/.openclaw/config.yaml`
- Skills at `~/.openclaw/skills/` — one file per skill, auto-discovered
- System prompt: `~/.openclaw/characters/<active>.json` loaded at startup
- mem0 store: local file backend at `~/.openclaw/memory/` (SQLite)
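
The auto-discovery decision can be as small as a glob plus `importlib`; a sketch (the one-module-per-skill layout follows the decision above, but how OpenClaw actually registers skills is an assumption):

```python
import importlib.util
from pathlib import Path

def discover_skills(skills_dir: str) -> dict:
    """Import every *.py file in the skills directory; module name = file stem."""
    skills = {}
    for path in sorted(Path(skills_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        skills[path.stem] = module
    return skills

if __name__ == "__main__":
    loaded = discover_skills(str(Path.home() / ".openclaw" / "skills"))
    print("skills:", ", ".join(loaded))
```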

**Interface contract:**

- OpenClaw exposes a local HTTP API (default port `8080`) — used by P3 (voice pipeline hands off transcribed text here)
- Consumes character JSON from P5
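
The voice-pipeline handoff then reduces to one POST. A sketch — the `/message` path and the payload field names are placeholders until the OpenClaw API is confirmed:

```python
import json
import urllib.request

AGENT_URL = "http://localhost:8080"

def build_handoff(text: str, source: str = "voice") -> dict:
    """Payload the voice pipeline hands to the agent (field names are placeholders)."""
    return {"text": text, "source": source}

def send_to_agent(text: str) -> str:
    req = urllib.request.Request(
        f"{AGENT_URL}/message",  # endpoint path is an assumption; check OpenClaw's docs
        data=json.dumps(build_handoff(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["reply"]
```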

---

### P5: `homeai-character`

**Deliverables:**

- [ ] Character Manager UI (`character-manager.jsx`) — already exists, needs wiring
- [ ] Character JSON schema v1 defined and documented
- [ ] Export produces `~/.openclaw/characters/<name>.json`
- [ ] Fields: name, system_prompt, voice_ref_path, tts_engine, live2d_expressions, vtube_ws_triggers, custom_rules, model_overrides
- [ ] Validation: schema validator script rejects malformed exports
- [ ] Sample character: `aria.json` (default assistant persona)
- [ ] Voice clone: reference `.wav` recorded/sourced, placed at `~/voices/<name>.wav`

**Key decisions:**

- JSON schema is versioned (`"schema_version": 1`) — pipeline components check version before loading
- Character Manager is a local React app (served by Vite dev server or built to static files)
- Single active character at a time; OpenClaw watches the file for changes (hot reload)

**Interface contract:**

- Output: `~/.openclaw/characters/<name>.json` — consumed by P4, P3 (TTS voice selection), P7 (expression mapping)
- Schema published in `homeai-character/schema/character.schema.json`
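
The validator script can start as a stdlib field check before graduating to full JSON Schema; a sketch using the v1 fields listed above:

```python
REQUIRED_FIELDS = (
    "name", "system_prompt", "voice_ref_path", "tts_engine",
    "live2d_expressions", "vtube_ws_triggers", "custom_rules", "model_overrides",
)

def validate_character(character: dict) -> list[str]:
    """Return a list of problems; an empty list means the file is loadable."""
    errors = []
    # Pipeline components check the version before loading (key decision above).
    if character.get("schema_version") != 1:
        errors.append(f"unsupported schema_version: {character.get('schema_version')!r}")
    for field in REQUIRED_FIELDS:
        if field not in character:
            errors.append(f"missing field: {field}")
    return errors
```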

---

## Phase 4 — Hardware Satellites (P6)

**Goal:** ESP32-S3-BOX-3 units act as room presence nodes — wake word, mic input, audio output, animated face.

### P6: `homeai-esp32`

**Deliverables:**

- [ ] ESPHome config for ESP32-S3-BOX-3 (`esphome/s3-box-living-room.yaml`, etc.)
- [ ] Wyoming Satellite component configured — streams mic audio to Mac Mini Wyoming STT
- [ ] Audio playback: receives TTS audio from Mac Mini, plays via built-in speaker
- [ ] LVGL face: animated idle/speaking/thinking states
- [ ] Wake word: either on-device (microWakeWord via ESPHome) or forwarded to Mac Mini openWakeWord
- [ ] OTA update mechanism configured
- [ ] One unit per room — config templated with room name as variable

**LVGL Face States:**

| State | Animation |
|---|---|
| Idle | Slow blink, gentle sway |
| Listening | Eyes wide, mic indicator |
| Thinking | Eyes narrow, loading dots |
| Speaking | Mouth animation synced to audio |
| Error | Red eyes, shake |

**Key decisions:**

- Wake word on-device preferred (lower latency, no always-on network stream)
- microWakeWord model: `hey_jarvis` or custom trained word
- LVGL animations compiled into ESPHome firmware (no runtime asset loading)
- Each unit has a unique device name for HA entity naming

**Interface contract:**

- Wyoming Satellite → Mac Mini Wyoming STT server (`tcp://<mac-mini-ip>:10300`)
- Receives audio back via Wyoming TTS response
- LVGL state driven by Home Assistant entity state (HA → ESPHome event)
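
One way to satisfy the last contract line is for the agent side to push a state that HA can forward to ESPHome; a sketch using HA's REST states API (the `sensor.face_<room>` entity name is an invented placeholder — the real entity comes from the ESPHome config):

```python
import json
import urllib.request

FACE_STATES = {"idle", "listening", "thinking", "speaking", "error"}

def build_state_update(ha_url: str, entity_id: str, state: str) -> tuple[str, bytes]:
    """(url, body) for a POST to HA's /api/states endpoint."""
    return f"{ha_url}/api/states/{entity_id}", json.dumps({"state": state}).encode()

def set_face_state(ha_url: str, token: str, room: str, state: str) -> None:
    assert state in FACE_STATES
    url, body = build_state_update(ha_url, f"sensor.face_{room}", state)
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```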

---

## Phase 5 — Visual Layer (P7)

**Goal:** VTube Studio shows Live2D model on desktop/mobile; expressions driven by AI pipeline state.

### P7: `homeai-visual`

**Deliverables:**

- [ ] VTube Studio installed on Mac Mini (macOS app)
- [ ] Live2D model loaded (sourced from nizima.com or booth.pm)
- [ ] VTube Studio WebSocket API enabled (port `8001`)
- [ ] OpenClaw skill: `vtube_studio.py`
  - Connects to VTube Studio WebSocket
  - Auth token exchange and persistence
  - Methods: `trigger_expression(name)`, `trigger_hotkey(name)`, `set_parameter(name, value)`
- [ ] Expression map in character JSON → VTube hotkey IDs
- [ ] Lip sync: driven by audio envelope or TTS phoneme timing
- [ ] Mobile: VTube Studio on iOS/Android connected to same model via Tailscale

**Key decisions:**

- Expression trigger events: `idle`, `speaking`, `thinking`, `happy`, `sad`, `error`
- Lip sync approach: simple amplitude-based (fast) rather than phoneme-based (complex) initially
- Auth token stored at `~/.openclaw/vtube_token.json`
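
The amplitude-based approach reduces to an RMS envelope per audio frame; a sketch mapping 16-bit PCM to a 0..1 mouth-open parameter (the `gain` constant is an assumption to be tuned by ear):

```python
import math

def mouth_open(samples: list[int], gain: float = 2.0) -> float:
    """Map the RMS of one 16-bit PCM frame to a 0..1 MouthOpen value."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return min(1.0, gain * rms / 32768.0)
```

Called once per ~10 ms frame of the TTS output stream, this drives `set_parameter("MouthOpen", value)` with no phoneme analysis at all.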

**Interface contract:**

- OpenClaw calls `vtube_studio.trigger_expression(event)` from within response pipeline
- Event names defined in character JSON `live2d_expressions` field
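
Message envelopes for the VTube Studio public API are uniform, so the skill's methods mostly build JSON; a sketch (actually sending over the port-8001 WebSocket, and the initial authentication handshake, are left to the client code):

```python
import json
import uuid

def vts_request(message_type: str, data: dict) -> str:
    """Envelope for a VTube Studio public-API WebSocket message."""
    return json.dumps({
        "apiName": "VTubeStudioPublicAPI",
        "apiVersion": "1.0",
        "requestID": str(uuid.uuid4()),
        "messageType": message_type,
        "data": data,
    })

def trigger_expression(hotkey_map: dict[str, str], event: str) -> str:
    """Resolve a character event (from live2d_expressions) to a hotkey trigger."""
    return vts_request("HotkeyTriggerRequest", {"hotkeyID": hotkey_map[event]})
```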

---

## Phase 6 — Image Generation (P8)

**Goal:** ComfyUI online with character-consistent image generation workflows.

### P8: `homeai-images`

**Deliverables:**

- [ ] ComfyUI installed at `~/ComfyUI/`, running via launchd
- [ ] Models downloaded: SDXL base, Flux.1-dev (or schnell), ControlNet (canny, depth)
- [ ] Character LoRA: trained on character reference images for consistent appearance
- [ ] Saved workflows:
  - `workflows/portrait.json` — character portrait, controllable expression
  - `workflows/scene.json` — character in scene with ControlNet pose
  - `workflows/quick.json` — fast draft via Flux.1-schnell
- [ ] OpenClaw skill: `comfyui.py` — submits workflow via ComfyUI REST API, returns image path
- [ ] ComfyUI API port: `8188`

**Interface contract:**

- OpenClaw calls `comfyui.generate(workflow_name, params)` → returns local image path
- ComfyUI REST API: `http://localhost:8188`

---

## Phase 7 — Extended Integrations & Polish

**Deliverables:**

- [ ] Music Assistant — Docker container, integrated with HA, OpenClaw `music.py` skill updated
- [ ] Snapcast — server on Mac Mini, clients on ESP32 units (multi-room sync)
- [ ] Authelia — 2FA in front of all web UIs exposed via Tailscale
- [ ] n8n advanced workflows: daily briefing, calendar reminders, notification routing
- [ ] iOS Shortcuts companion: trigger OpenClaw from iPhone widget
- [ ] Uptime Kuma alerts: pushover/ntfy notifications on service down
- [ ] Backup automation: daily Gitea commits of mem0, character configs, n8n workflows

---

## Dependency Graph

```
P1 (infra) ─────────────────────────────┐
P2 (llm) ──────────────────────┐        │
P3 (voice) ────────────────┐   │        │
P5 (character) ──────┐     │   │        │
                     ↓     ↓   ↓        ↓
                    P4 (agent) ─────→ HA
                         ↓
P6 (esp32)  ← Wyoming
P7 (visual) ← vtube skill
P8 (images) ← comfyui skill
```

**Hard dependencies:**

- P4 requires P1 (HA URL), P2 (Ollama), P5 (character JSON)
- P3 requires P2 (LLM), P4 (agent endpoint)
- P6 requires P3 (Wyoming server), P1 (HA)
- P7 requires P4 (OpenClaw skill runner), P5 (expression map)
- P8 requires P4 (OpenClaw skill runner)

**Can be done in parallel:**

- P1 + P5 (infra and character manager are independent)
- P2 + P5 (LLM setup and character UI are independent)
- P7 + P8 (visual and images are both P4 dependents but independent of each other)
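
The hard-dependency list doubles as machine-checkable data; a sketch deriving a valid build order with the stdlib's `graphlib`, using the dependencies exactly as listed:

```python
from graphlib import TopologicalSorter

# project -> set of prerequisites, straight from the hard-dependency list
DEPS = {
    "P1": set(), "P2": set(), "P5": set(),
    "P4": {"P1", "P2", "P5"},
    "P3": {"P2", "P4"},
    "P6": {"P3", "P1"},
    "P7": {"P4", "P5"},
    "P8": {"P4"},
}

order = list(TopologicalSorter(DEPS).static_order())
print(" → ".join(order))
```

Independent projects (P1, P2, P5) surface first, matching the parallel-work notes above; the sorter also raises `CycleError` if a future edit accidentally introduces a dependency loop.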

---

## Interface Contracts Summary

| Contract | Type | Defined In | Consumed By |
|---|---|---|---|
| `~/server/.env.services` | env file | P1 | All |
| Ollama API `localhost:11434/v1` | HTTP (OpenAI compat) | P2 | P3, P4, P7 |
| Wyoming STT `localhost:10300` | TCP/Wyoming | P3 | P6, HA |
| Wyoming TTS `localhost:10301` | TCP/Wyoming | P3 | P6, HA |
| OpenClaw API `localhost:8080` | HTTP | P4 | P3, P7, P8 |
| Character JSON `~/.openclaw/characters/` | JSON file | P5 | P4, P3, P7 |
| `character.schema.json` v1 | JSON Schema | P5 | P4, P3, P7 |
| VTube Studio WS `localhost:8001` | WebSocket | VTube Studio | P7 |
| ComfyUI API `localhost:8188` | HTTP | ComfyUI | P8 |
| Home Assistant API | HTTP/WS | P1 (HA) | P4, P6 |

---

## Repo Structure (Gitea)

```
~/gitea/homeai/
├── homeai-infra/          # P1
│   ├── docker/            # per-service compose files
│   ├── scripts/           # setup/teardown helpers
│   └── Makefile
├── homeai-llm/            # P2
│   ├── ollama-models.txt
│   └── scripts/
├── homeai-voice/          # P3
│   ├── whisper/
│   ├── tts/
│   ├── wyoming/
│   └── scripts/
├── homeai-agent/          # P4
│   ├── skills/
│   ├── workflows/         # n8n exports
│   └── config/
├── homeai-character/      # P5
│   ├── src/               # React character manager
│   ├── schema/
│   └── characters/        # exported JSONs
├── homeai-esp32/          # P6
│   └── esphome/
├── homeai-visual/         # P7
│   └── skills/
└── homeai-images/         # P8
    ├── workflows/         # ComfyUI workflow JSONs
    └── skills/
```

---

## Suggested Build Order

| Week | Focus | Projects |
|---|---|---|
| 1 | Infrastructure up, LLM running | P1, P2 |
| 2 | Voice pipeline end-to-end (desktop mic test) | P3 |
| 3 | Character Manager wired, OpenClaw connected | P4, P5 |
| 4 | ESP32 firmware, first satellite running | P6 |
| 5 | VTube Studio live, expressions working | P7 |
| 6 | ComfyUI online, character LoRA trained | P8 |
| 7+ | Extended integrations, polish, Authelia | Phase 7 |

---

## Open Questions / Decisions Needed

- [ ] Which OpenClaw version/fork to use? (confirm it supports Ollama natively)
- [ ] Wake word: `hey_jarvis` vs custom trained word — what should the character's name be?
- [ ] Live2D model: commission custom or buy from nizima.com? Budget?
- [ ] Snapcast: output to ESP32 speakers or separate audio hardware per room?
- [ ] n8n: self-hosted Docker vs n8n Cloud (given local-first preference → Docker)
- [ ] Authelia: local user store or LDAP backend? (local store is simpler)
- [ ] mem0: local SQLite or run Qdrant vector DB for better semantic search?