# HomeAI — Full Project Plan

Last updated: 2026-03-04

## Overview
This project builds a self-hosted, always-on AI assistant running entirely on a Mac Mini M4 Pro. It is decomposed into 8 sub-projects that can be developed in parallel where dependencies allow, then bridged via well-defined interfaces.
The guiding principle: each sub-project exposes a clean API/config surface. No project hard-codes knowledge of another's internals.
## Sub-Project Map

| ID | Name | Description | Primary Language |
|---|---|---|---|
| P1 | `homeai-infra` | Docker stack, networking, monitoring, secrets | YAML / Shell |
| P2 | `homeai-llm` | Ollama + Open WebUI setup, model management | YAML / Shell |
| P3 | `homeai-voice` | STT, TTS, Wyoming bridge, wake word | Python / Shell |
| P4 | `homeai-agent` | OpenClaw config, skills, n8n workflows, mem0 | Python / JSON |
| P5 | `homeai-character` | Character Manager UI, persona JSON schema, voice clone | React / JSON |
| P6 | `homeai-esp32` | ESPHome firmware, Wyoming Satellite, LVGL face | C++ / YAML |
| P7 | `homeai-visual` | VTube Studio bridge, Live2D expression mapping | Python / JSON |
| P8 | `homeai-images` | ComfyUI workflows, model management, ControlNet | Python / JSON |
All repos live under `~/gitea/homeai/` on the Mac Mini and are mirrored to the self-hosted Gitea instance (set up in P1).
## Phase 1 — Foundation (P1 + P2)

Goal: Everything containerised, stable, accessible remotely. LLM responsive via browser.

### P1: homeai-infra

Deliverables:

- `docker-compose.yml` — master compose file (or per-service files under `~/server/docker/`)
- Services: Home Assistant, Portainer, Uptime Kuma, Gitea, code-server
- Tailscale installed on Mac Mini, all services on Tailnet
- Gitea repos initialised, SSH keys configured
- Uptime Kuma monitors all service endpoints
- Docker restart policies: `unless-stopped` on all containers
- Documented `.env` file pattern (secrets never committed)
Key decisions:

- Single `docker-compose.yml` vs per-service compose files — recommend per-service files in `~/server/docker/<service>/` orchestrated by a root `Makefile`
- Tailscale as sole remote access method (no public port forwarding)
- Authelia deferred to Phase 7 polish (internal LAN services don't need 2FA immediately)
Interface contract: Exposes service URLs as env vars (e.g. `HA_URL`, `GITEA_URL`) written to `~/server/.env.services` — consumed by all other projects.
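As a consumer-side sketch of that contract, the file can be parsed with a few lines of stdlib Python (this assumes plain `KEY=VALUE` lines, which the plan does not specify — adjust if the real file uses quoting or `export` prefixes):

```python
from pathlib import Path

def load_service_env(path: str) -> dict[str, str]:
    """Parse a plain KEY=VALUE env file; comments and blank lines are ignored.

    Assumed format for ~/server/.env.services — not confirmed by the plan.
    """
    services: dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        services[key.strip()] = value.strip()
    return services
```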
### P2: homeai-llm

Deliverables:

- Ollama installed natively on Mac Mini (not Docker — needs Metal GPU access)
- Models pulled: `llama3.3:70b`, `qwen2.5:72b` (and a fast small model: `qwen2.5:7b` for low-latency tasks)
- Open WebUI running as Docker container, connected to Ollama
- Model benchmark script — measures tokens/sec per model
- `ollama-models.txt` — pinned model manifest for reproducibility

Key decisions:

- Ollama runs as a launchd service (`~/Library/LaunchAgents/`) to survive reboots
- Open WebUI exposed only on Tailnet
- API endpoint: `http://localhost:11434` (Ollama default)
Interface contract: Ollama OpenAI-compatible API at `http://localhost:11434/v1` — used by P3, P4, P7.
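Because the endpoint is OpenAI-compatible, consumers need nothing beyond the stdlib. A sketch that builds (but does not send) a chat request — model name and prompt are placeholders:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, user_text: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for local Ollama."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending is left to the caller, e.g.:
#   with urllib.request.urlopen(build_chat_request("qwen2.5:7b", "hello")) as r:
#       reply = json.load(r)["choices"][0]["message"]["content"]
```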
## Phase 2 — Voice Pipeline (P3)

Goal: Full end-to-end voice: speak → transcribe → LLM → TTS → hear response. No ESP32 yet — test with a USB mic on the Mac Mini.
### P3: homeai-voice

Deliverables:

- Whisper.cpp compiled for Apple Silicon, model downloaded (`medium.en` or `large-v3`)
- Kokoro TTS installed, tested, latency benchmarked
- Chatterbox TTS installed (MPS-optimised build), voice reference `.wav` ready
- Qwen3-TTS via MLX installed as fallback
- openWakeWord running on Mac Mini, detecting wake word
- Wyoming protocol server running — bridges STT + TTS into Home Assistant
- Home Assistant `voice_assistant` pipeline configured end-to-end
- Test script: `test_voice_pipeline.sh` — mic in → spoken response out
Sub-components:

```
[Mic] → Wyoming Satellite (port 10700) → Home Assistant Voice Pipeline → Wyoming STT (Whisper)
                                                      ↓
[Speaker] ← Wyoming TTS (Kokoro) ← OpenClaw Agent ← transcribed text
```

Note: The original openWakeWord daemon has been replaced by the Wyoming Satellite approach, which handles wake-word detection through Home Assistant's voice pipeline.
Key decisions:

- Whisper.cpp runs as a Wyoming STT provider (via `wyoming-faster-whisper`)
- Kokoro is primary TTS; Chatterbox used when voice cloning is active (P5)
- Wyoming Satellite runs on port `10700` — handles audio I/O and connects to the HA voice pipeline
- openWakeWord daemon disabled — wake-word detection now handled by HA via the Wyoming Satellite
- Wyoming server ports: `10300` (STT), `10301` (TTS), `10700` (Satellite) — standard Wyoming ports
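A quick liveness check against those three ports might look like this (a sketch for ad-hoc verification, not the `test_voice_pipeline.sh` deliverable):

```python
import socket

# Ports from the Wyoming decisions above.
WYOMING_PORTS = {"stt": 10300, "tts": 10301, "satellite": 10700}

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_wyoming_services(host: str = "localhost") -> dict[str, bool]:
    """Report which Wyoming services are currently accepting connections."""
    return {name: port_open(host, port) for name, port in WYOMING_PORTS.items()}
```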
Interface contract:

- Wyoming STT: `tcp://localhost:10300` (Whisper large-v3)
- Wyoming TTS: `tcp://localhost:10301` (Kokoro ONNX)
- Wyoming Satellite: `tcp://localhost:10700` (Mac Mini audio I/O)
- Direct Python API for P4 (agent bypasses Wyoming for non-HA calls)
- OpenClaw bridge: `homeai-agent/skills/home-assistant/openclaw_bridge.py` (HA integration)
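For orientation, Wyoming frames each event as one JSON header line optionally followed by binary payload bytes whose length the header declares. A simplified encoder/decoder sketch of that framing (illustrative only — the production `wyoming` Python package handles additional fields, so verify against it before relying on this):

```python
import json

def encode_event(event_type: str, data: dict, payload: bytes = b"") -> bytes:
    """Encode a Wyoming-style event: JSON header line, then raw payload bytes."""
    header: dict = {"type": event_type, "data": data}
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode() + b"\n" + payload

def decode_event(raw: bytes) -> tuple[dict, bytes]:
    """Split a single encoded event back into (header, payload)."""
    line, _, rest = raw.partition(b"\n")
    header = json.loads(line)
    payload = rest[: header.get("payload_length", 0)]
    return header, payload
```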
## Phase 3 — AI Agent & Character (P4 + P5)

Goal: OpenClaw receives voice/text input, applies the character persona, calls tools, and returns rich responses.
### P4: homeai-agent

Deliverables:

- OpenClaw installed and configured
- Connected to Ollama (`llama3.3:70b` as primary model)
- Connected to Home Assistant (long-lived access token in config)
- mem0 installed, configured with local storage backend
- mem0 backup job: daily git commit to Gitea
- Core skills written:
  - `home_assistant.py` — call HA services (lights, switches, scenes)
  - `memory.py` — read/write mem0 memories
  - `weather.py` — local weather via HA sensor data
  - `timer.py` — set timers/reminders
  - `music.py` — stub for Music Assistant (Phase 7)
- n8n running as Docker container, webhook trigger from OpenClaw
- Sample n8n workflow: morning briefing (time + weather + calendar)
- System prompt template: loads character JSON from P5
Key decisions:

- OpenClaw config at `~/.openclaw/config.yaml`
- Skills at `~/.openclaw/skills/` — one file per skill, auto-discovered
- System prompt: `~/.openclaw/characters/<active>.json` loaded at startup
- mem0 store: local file backend at `~/.openclaw/memory/` (SQLite)
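The "one file per skill, auto-discovered" convention could be implemented roughly as below (a sketch using `importlib`; OpenClaw's real loader is not described in this plan):

```python
import importlib.util
from pathlib import Path

def discover_skills(skills_dir: str) -> dict[str, object]:
    """Load every *.py file in skills_dir as a module, keyed by file stem."""
    skills: dict[str, object] = {}
    for path in sorted(Path(skills_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # execute the skill file as a module
        skills[path.stem] = module
    return skills
```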
Interface contract:

- OpenClaw exposes a local HTTP API (default port `8080`) — used by P3 (voice pipeline hands off transcribed text here)
- Consumes character JSON from P5
### P5: homeai-character

Deliverables:

- Character Manager UI (`character-manager.jsx`) — already exists, needs wiring
- Character JSON schema v1 defined and documented
- Export produces `~/.openclaw/characters/<name>.json`
- Fields: name, system_prompt, voice_ref_path, tts_engine, live2d_expressions, vtube_ws_triggers, custom_rules, model_overrides
- Validation: schema validator script rejects malformed exports
- Sample character: `aria.json` (default assistant persona)
- Voice clone: reference `.wav` recorded/sourced, placed at `~/voices/<name>.wav`
Key decisions:

- JSON schema is versioned (`"schema_version": 1`) — pipeline components check the version before loading
- Character Manager is a local React app (served by Vite dev server or built to static files)
- Single active character at a time; OpenClaw watches the file for changes (hot reload)
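The hot-reload decision can be approximated with simple mtime polling (a sketch; a production watcher might use FSEvents or the `watchdog` package instead):

```python
import json
from pathlib import Path

class CharacterWatcher:
    """Reload a character JSON file whenever its mtime changes."""

    def __init__(self, path: str):
        self.path = Path(path)
        self._mtime = 0.0
        self.character: dict = {}

    def poll(self) -> bool:
        """Reload if the file changed since the last poll; True on reload."""
        mtime = self.path.stat().st_mtime
        if mtime != self._mtime:
            self._mtime = mtime
            self.character = json.loads(self.path.read_text())
            return True
        return False
```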
Interface contract:

- Output: `~/.openclaw/characters/<name>.json` — consumed by P4, P3 (TTS voice selection), P7 (expression mapping)
- Schema published in `homeai-character/schema/character.schema.json`
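Consumers can gate on the version field before loading. A sketch of the check (the real validator should use the published JSON Schema; the required-field subset here is illustrative, drawn from the field list above):

```python
SUPPORTED_SCHEMA_VERSION = 1
REQUIRED_FIELDS = {"name", "system_prompt", "tts_engine"}  # subset, for illustration

def validate_character(character: dict) -> list[str]:
    """Return a list of validation errors; an empty list means usable."""
    errors: list[str] = []
    if character.get("schema_version") != SUPPORTED_SCHEMA_VERSION:
        errors.append(f"unsupported schema_version: {character.get('schema_version')!r}")
    for field in sorted(REQUIRED_FIELDS - character.keys()):
        errors.append(f"missing field: {field}")
    return errors
```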
## Phase 4 — Hardware Satellites (P6)

Goal: ESP32-S3-BOX-3 units act as room presence nodes — wake word, mic input, audio output, animated face.
### P6: homeai-esp32

Deliverables:

- ESPHome config for ESP32-S3-BOX-3 (`esphome/s3-box-living-room.yaml`, etc.)
- Wyoming Satellite component configured — streams mic audio to the Mac Mini Wyoming STT
- Audio playback: receives TTS audio from the Mac Mini, plays via built-in speaker
- LVGL face: animated idle/speaking/thinking states
- Wake word: either on-device (microWakeWord via ESPHome) or forwarded to Mac Mini openWakeWord
- OTA update mechanism configured
- One unit per room — config templated with room name as variable
LVGL Face States:
| State | Animation |
|---|---|
| Idle | Slow blink, gentle sway |
| Listening | Eyes wide, mic indicator |
| Thinking | Eyes narrow, loading dots |
| Speaking | Mouth animation synced to audio |
| Error | Red eyes, shake |
Key decisions:

- Wake word on-device preferred (lower latency, no always-on network stream)
- microWakeWord model: `hey_jarvis` or custom trained word
- LVGL animations compiled into ESPHome firmware (no runtime asset loading)
- Each unit has a unique device name for HA entity naming
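The face-state table above reduces to a small lookup keyed by pipeline event. A Python sketch of the equivalent mapping logic (the actual implementation would live in ESPHome/LVGL firmware, so this is purely illustrative):

```python
# Voice-pipeline event → LVGL face state, per the table above.
FACE_STATES = {
    "idle": "Idle",
    "listening": "Listening",
    "thinking": "Thinking",
    "speaking": "Speaking",
    "error": "Error",
}

def face_state_for(pipeline_event: str) -> str:
    """Map a voice-pipeline event to a face state, falling back to Idle."""
    return FACE_STATES.get(pipeline_event, "Idle")
```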
Interface contract:

- Wyoming Satellite → Mac Mini Wyoming STT server (`tcp://<mac-mini-ip>:10300`)
- Receives audio back via Wyoming TTS response
- LVGL state driven by Home Assistant entity state (HA → ESPHome event)
## Phase 5 — Visual Layer (P7)

Goal: VTube Studio shows a Live2D model on desktop/mobile; expressions driven by AI pipeline state.
### P7: homeai-visual

Deliverables:

- VTube Studio installed on Mac Mini (macOS app)
- Live2D model loaded (sourced from nizima.com or booth.pm)
- VTube Studio WebSocket API enabled (port `8001`)
- OpenClaw skill: `vtube_studio.py`
  - Connects to VTube Studio WebSocket
  - Auth token exchange and persistence
  - Methods: `trigger_expression(name)`, `trigger_hotkey(name)`, `set_parameter(name, value)`
- Expression map in character JSON → VTube hotkey IDs
- Lip sync: driven by audio envelope or TTS phoneme timing
- Mobile: VTube Studio on iOS/Android connected to the same model via Tailscale
Key decisions:

- Expression trigger events: `idle`, `speaking`, `thinking`, `happy`, `sad`, `error`
- Lip sync approach: simple amplitude-based (fast) rather than phoneme-based (complex) initially
- Auth token stored at `~/.openclaw/vtube_token.json`
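The amplitude-based approach amounts to mapping a windowed RMS of the outgoing TTS audio onto the model's mouth-open parameter. A sketch (the `gain` constant is an assumed tunable, and the 0..1 output range is an assumption about the target parameter):

```python
import math

def rms(samples: list[float]) -> float:
    """Root-mean-square amplitude of a window of audio samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mouth_open_value(samples: list[float], gain: float = 4.0) -> float:
    """Map window RMS to a 0..1 mouth-open parameter; gain boosts quiet speech."""
    return min(1.0, rms(samples) * gain)
```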
Interface contract:

- OpenClaw calls `vtube_studio.trigger_expression(event)` from within the response pipeline
- Event names defined in the character JSON `live2d_expressions` field
## Phase 6 — Image Generation (P8)

Goal: ComfyUI online with character-consistent image generation workflows.
### P8: homeai-images

Deliverables:

- ComfyUI installed at `~/ComfyUI/`, running via launchd
- Models downloaded: SDXL base, Flux.1-dev (or schnell), ControlNet (canny, depth)
- Character LoRA: trained on character reference images for consistent appearance
- Saved workflows:
  - `workflows/portrait.json` — character portrait, controllable expression
  - `workflows/scene.json` — character in scene with ControlNet pose
  - `workflows/quick.json` — fast draft via Flux.1-schnell
- OpenClaw skill: `comfyui.py` — submits workflow via ComfyUI REST API, returns image path
- ComfyUI API port: `8188`
Interface contract:

- OpenClaw calls `comfyui.generate(workflow_name, params)` → returns local image path
- ComfyUI REST API: `http://localhost:8188`
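The `comfyui.py` skill would essentially load a saved workflow graph and POST it to ComfyUI's `/prompt` endpoint. A stdlib-only sketch that builds (but does not send) the request — the `client_id` value is an illustrative assumption:

```python
import json
from pathlib import Path
import urllib.request

COMFYUI_URL = "http://localhost:8188"

def build_prompt_request(workflow_path: str, client_id: str = "openclaw") -> urllib.request.Request:
    """Load a saved workflow JSON and wrap it as a ComfyUI /prompt request."""
    workflow = json.loads(Path(workflow_path).read_text())
    payload = {"prompt": workflow, "client_id": client_id}
    return urllib.request.Request(
        f"{COMFYUI_URL}/prompt",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```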
## Phase 7 — Extended Integrations & Polish

Deliverables:

- Music Assistant — Docker container, integrated with HA; OpenClaw `music.py` skill updated
- Snapcast — server on Mac Mini, clients on ESP32 units (multi-room sync)
- Authelia — 2FA in front of all web UIs exposed via Tailscale
- n8n advanced workflows: daily briefing, calendar reminders, notification routing
- iOS Shortcuts companion: trigger OpenClaw from an iPhone widget
- Uptime Kuma alerts: Pushover/ntfy notifications on service down
- Backup automation: daily Gitea commits of mem0, character configs, n8n workflows
## Dependency Graph

```
P1 (infra) ─────────────────────────────┐
P2 (llm) ──────────────────────┐        │
P3 (voice) ────────────────┐   │        │
P5 (character) ──────┐     │   │        │
                     ↓     ↓   ↓        ↓
                     P4 (agent) ─────→ HA
                         ↓
P6 (esp32)  ← Wyoming
P7 (visual) ← vtube skill
P8 (images) ← comfyui skill
```
Hard dependencies:

- P4 requires P1 (HA URL), P2 (Ollama), P5 (character JSON)
- P3 requires P2 (LLM), P4 (agent endpoint)
- P6 requires P3 (Wyoming server), P1 (HA)
- P7 requires P4 (OpenClaw skill runner), P5 (expression map)
- P8 requires P4 (OpenClaw skill runner)

Can be done in parallel:

- P1 + P5 (infra and character manager are independent)
- P2 + P5 (LLM setup and character UI are independent)
- P7 + P8 (both depend on P4 but are independent of each other)
## Interface Contracts Summary

| Contract | Type | Defined In | Consumed By |
|---|---|---|---|
| `~/server/.env.services` | env file | P1 | All |
| Ollama API `localhost:11434/v1` | HTTP (OpenAI compat) | P2 | P3, P4, P7 |
| Wyoming STT `localhost:10300` | TCP/Wyoming | P3 | P6, HA |
| Wyoming TTS `localhost:10301` | TCP/Wyoming | P3 | P6, HA |
| Wyoming Satellite `localhost:10700` | TCP/Wyoming | P3 | HA |
| OpenClaw API `localhost:8080` | HTTP | P4 | P3, P7, P8 |
| Character JSON `~/.openclaw/characters/` | JSON file | P5 | P4, P3, P7 |
| `character.schema.json` v1 | JSON Schema | P5 | P4, P3, P7 |
| VTube Studio WS `localhost:8001` | WebSocket | VTube Studio | P7 |
| ComfyUI API `localhost:8188` | HTTP | ComfyUI | P8 |
| Home Assistant API | HTTP/WS | P1 (HA) | P4, P6 |
## Repo Structure (Gitea)

```
~/gitea/homeai/
├── homeai-infra/          # P1
│   ├── docker/            # per-service compose files
│   ├── scripts/           # setup/teardown helpers
│   └── Makefile
├── homeai-llm/            # P2
│   ├── ollama-models.txt
│   └── scripts/
├── homeai-voice/          # P3
│   ├── whisper/
│   ├── tts/
│   ├── wyoming/
│   └── scripts/
├── homeai-agent/          # P4
│   ├── skills/
│   ├── workflows/         # n8n exports
│   └── config/
├── homeai-character/      # P5
│   ├── src/               # React character manager
│   ├── schema/
│   └── characters/        # exported JSONs
├── homeai-esp32/          # P6
│   └── esphome/
├── homeai-visual/         # P7
│   └── skills/
└── homeai-images/         # P8
    ├── workflows/         # ComfyUI workflow JSONs
    └── skills/
```
## Suggested Build Order
| Week | Focus | Projects |
|---|---|---|
| 1 | Infrastructure up, LLM running | P1, P2 |
| 2 | Voice pipeline end-to-end (desktop mic test) | P3 |
| 3 | Character Manager wired, OpenClaw connected | P4, P5 |
| 4 | ESP32 firmware, first satellite running | P6 |
| 5 | VTube Studio live, expressions working | P7 |
| 6 | ComfyUI online, character LoRA trained | P8 |
| 7+ | Extended integrations, polish, Authelia | Phase 7 |
## Open Questions / Decisions Needed

- Which OpenClaw version/fork to use? (confirm it supports Ollama natively)
- Wake word: `hey_jarvis` vs a custom trained word — what should the character's name be?
- Live2D model: commission custom or buy from nizima.com? Budget?
- Snapcast: output to ESP32 speakers or separate audio hardware per room?
- n8n: self-hosted Docker vs n8n Cloud (given local-first preference → Docker)
- Authelia: local user store or LDAP backend? (local store is simpler)
- mem0: local SQLite or run a Qdrant vector DB for better semantic search?