commit 38247d7cc4d221eaa2fd53e699387bf68c1238fb Author: Aodhan Collins Date: Wed Mar 4 01:11:37 2026 +0000 Initial project structure and planning docs Full project plan across 8 sub-projects (homeai-infra, homeai-llm, homeai-voice, homeai-agent, homeai-character, homeai-esp32, homeai-visual, homeai-images). Includes per-project PLAN.md files, top-level PROJECT_PLAN.md, and master TODO.md. Co-Authored-By: Claude Sonnet 4.6 diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..352482e --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,153 @@ +# CLAUDE.md — Home AI Assistant Project + +## Project Overview + +A self-hosted, always-on personal AI assistant running on a **Mac Mini M4 Pro (64GB RAM, 1TB SSD)**. The goal is a modular, expandable system that replaces commercial smart home speakers (Google Home etc.) with a locally-run AI that has a defined personality, voice, visual representation, and full smart home integration. + +--- + +## Hardware + +| Component | Spec | +|---|---| +| Chip | Apple M4 Pro | +| CPU | 14-core | +| GPU | 20-core | +| Neural Engine | 16-core | +| RAM | 64GB unified memory | +| Storage | 1TB SSD | +| Network | Gigabit Ethernet | + +All AI inference runs locally on this machine. No cloud dependency required (cloud APIs optional). 
+ +--- + +## Core Stack + +### AI & LLM +- **Ollama** — local LLM runtime (target models: Llama 3.3 70B, Qwen 2.5 72B) +- **Open WebUI** — browser-based chat interface, runs as Docker container + +### Image Generation +- **ComfyUI** — primary image generation UI, node-based workflows +- Target models: SDXL, Flux.1, ControlNet +- Runs via Metal (Apple GPU API) + +### Speech +- **Whisper.cpp** — speech-to-text, optimised for Apple Silicon/Neural Engine +- **Kokoro TTS** — fast, lightweight text-to-speech (primary, low-latency) +- **Chatterbox TTS** — voice cloning engine (Apple Silicon MPS optimised) +- **Qwen3-TTS** — alternative voice cloning via MLX +- **openWakeWord** — always-on wake word detection + +### Smart Home +- **Home Assistant** — smart home control platform (Docker) +- **Wyoming Protocol** — bridges Whisper STT + Kokoro/Piper TTS into Home Assistant +- **Music Assistant** — self-hosted music control, integrates with Home Assistant +- **Snapcast** — multi-room synchronised audio output + +### AI Agent / Orchestration +- **OpenClaw** — primary AI agent layer; receives voice commands, calls tools, manages personality +- **n8n** — visual workflow automation (Docker), chains AI actions +- **mem0** — long-term memory layer for the AI character + +### Character & Personality +- **Character Manager** (built — see `character-manager.jsx`) — single config UI for personality, prompts, models, Live2D mappings, and notes +- Character config exports to JSON, consumed by OpenClaw system prompt and pipeline + +### Visual Representation +- **VTube Studio** — Live2D model display on desktop (macOS) and mobile (iOS/Android) +- VTube Studio WebSocket API used to drive expressions from the AI pipeline +- **LVGL** — simplified animated face on ESP32-S3-BOX-3 units +- Live2D model: to be sourced/commissioned (nizima.com or booth.pm) + +### Room Presence (Smart Speaker Replacement) +- **ESP32-S3-BOX-3** units — one per room +- Flashed with **ESPHome** +- Acts as Wyoming 
Satellite (mic input → Mac Mini → TTS audio back) +- LVGL display shows animated face + status info +- Communicates over local WiFi + +### Infrastructure +- **Docker Desktop for Mac** — containerises Home Assistant, Open WebUI, n8n, etc. +- **Tailscale** — secure remote access to all services, no port forwarding +- **Authelia** — 2FA authentication layer for exposed web UIs +- **Portainer** — Docker container management UI +- **Uptime Kuma** — service health monitoring and mobile alerts +- **Gitea** — self-hosted Git server for all project code and configs +- **code-server** — browser-based VS Code for remote development + +--- + +## Voice Pipeline (End-to-End) + +``` +ESP32-S3-BOX-3 (room) + → Wake word detected (openWakeWord, runs locally on device or Mac Mini) + → Audio streamed to Mac Mini via Wyoming Satellite + → Whisper.cpp transcribes speech to text + → OpenClaw receives text + context + → Ollama LLM generates response (with character persona from system prompt) + → mem0 updates long-term memory + → Response dispatched: + → Kokoro/Chatterbox renders TTS audio + → Audio sent back to ESP32-S3-BOX-3 (spoken response) + → VTube Studio API triggered (expression + lip sync on desktop/mobile) + → Home Assistant action called if applicable (lights, music, etc.) +``` + +--- + +## Character System + +The AI assistant has a defined personality managed via the Character Manager tool. + +Key config surfaces: +- **System prompt** — injected into every Ollama request +- **Voice clone reference** — `.wav` file path for Chatterbox/Qwen3-TTS +- **Live2D expression mappings** — idle, speaking, thinking, happy, error states +- **VTube Studio WebSocket triggers** — JSON map of events to expressions +- **Custom prompt rules** — trigger/response overrides for specific contexts +- **mem0** — persistent memory that evolves over time + +Character config JSON (exported from Character Manager) is the single source of truth consumed by all pipeline components. 
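The "single source of truth" contract above can be made concrete with a small loader sketch. This is illustrative only: `load_character` and `build_system_prompt` are hypothetical helpers, and the minimal checks stand in for the real schema validator built later; only `schema_version`, `name`, and `system_prompt` (fields named in this document) are validated here.

```python
import json
from pathlib import Path

# Fields taken from the Character System section above; this is a minimal
# stand-in for the real JSON Schema validator (a P5 deliverable).
REQUIRED_FIELDS = {"schema_version", "name", "system_prompt"}
SUPPORTED_SCHEMA_VERSION = 1

def load_character(path: str) -> dict:
    """Load an exported character config and enforce the versioned-schema rule."""
    config = json.loads(Path(path).read_text())
    missing = REQUIRED_FIELDS - config.keys()
    if missing:
        raise ValueError(f"character config missing fields: {sorted(missing)}")
    if config["schema_version"] != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {config['schema_version']}")
    return config

def build_system_prompt(config: dict) -> str:
    """The persona comes from the config — never hardcoded in pipeline components."""
    return config["system_prompt"]
```

Pipeline components would call `load_character(...)` at startup and reject configs whose `schema_version` they don't understand, which is exactly the versioned-spec rule stated in the notes above.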
+ +--- + +## Project Priorities + +1. **Foundation** — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma) +2. **LLM** — Ollama running with target models, Open WebUI connected +3. **Voice pipeline** — Whisper → Ollama → Kokoro → Wyoming → Home Assistant +4. **OpenClaw** — installed, onboarded, connected to Ollama and Home Assistant +5. **ESP32-S3-BOX-3** — ESPHome flash, Wyoming Satellite, LVGL face +6. **Character system** — system prompt wired up, mem0 integrated, voice cloned +7. **VTube Studio** — model loaded, WebSocket API bridge written as OpenClaw skill +8. **ComfyUI** — image generation online, character-consistent model workflows +9. **Extended integrations** — n8n workflows, Music Assistant, Snapcast, Gitea, code-server +10. **Polish** — Authelia, Tailscale hardening, mobile companion, iOS widgets + +--- + +## Key Paths & Conventions + +- All Docker compose files: `~/server/docker/` +- OpenClaw skills: `~/.openclaw/skills/` +- Character configs: `~/.openclaw/characters/` +- Whisper models: `~/models/whisper/` +- Ollama models: managed by Ollama at `~/.ollama/models/` +- ComfyUI models: `~/ComfyUI/models/` +- Voice reference audio: `~/voices/` +- Gitea repos root: `~/gitea/` + +--- + +## Notes for Planning + +- All services should survive a Mac Mini reboot (launchd or Docker restart policies) +- ESP32-S3-BOX-3 units are dumb satellites — all intelligence stays on Mac Mini +- The character JSON schema (from Character Manager) should be treated as a versioned spec; pipeline components read from it, never hardcode personality values +- OpenClaw skills are the primary extension mechanism — new capabilities = new skills +- Prefer local models; cloud API keys (Anthropic, OpenAI) are fallback only +- VTube Studio API bridge should be a standalone OpenClaw skill with clear event interface +- mem0 memory store should be backed up as part of regular Gitea commits diff --git a/PROJECT_PLAN.md b/PROJECT_PLAN.md new file mode 100644 index 
0000000..5f942f8 --- /dev/null +++ b/PROJECT_PLAN.md @@ -0,0 +1,371 @@ +# HomeAI — Full Project Plan + +> Last updated: 2026-03-04 + +--- + +## Overview + +This project builds a self-hosted, always-on AI assistant running entirely on a Mac Mini M4 Pro. It is decomposed into **8 sub-projects** that can be developed in parallel where dependencies allow, then bridged via well-defined interfaces. + +The guiding principle: each sub-project exposes a clean API/config surface. No project hard-codes knowledge of another's internals. + +--- + +## Sub-Project Map + +| ID | Name | Description | Primary Language | +|---|---|---|---| +| P1 | `homeai-infra` | Docker stack, networking, monitoring, secrets | YAML / Shell | +| P2 | `homeai-llm` | Ollama + Open WebUI setup, model management | YAML / Shell | +| P3 | `homeai-voice` | STT, TTS, Wyoming bridge, wake word | Python / Shell | +| P4 | `homeai-agent` | OpenClaw config, skills, n8n workflows, mem0 | Python / JSON | +| P5 | `homeai-character` | Character Manager UI, persona JSON schema, voice clone | React / JSON | +| P6 | `homeai-esp32` | ESPHome firmware, Wyoming Satellite, LVGL face | C++ / YAML | +| P7 | `homeai-visual` | VTube Studio bridge, Live2D expression mapping | Python / JSON | +| P8 | `homeai-images` | ComfyUI workflows, model management, ControlNet | Python / JSON | + +All repos live under `~/gitea/homeai/` on the Mac Mini and are mirrored to the self-hosted Gitea instance (set up in P1). + +--- + +## Phase 1 — Foundation (P1 + P2) + +**Goal:** Everything containerised, stable, accessible remotely. LLM responsive via browser. 
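As a sketch of how downstream projects might consume the `~/server/.env.services` contract, assuming it is a plain `KEY=VALUE` file (the exact format is a P1 decision), a simple reachability probe could look like this — `read_env_file` and `check_service` are hypothetical helper names:

```python
import urllib.error
import urllib.request
from pathlib import Path

def read_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    services = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        services[key.strip()] = value.strip()
    return services

def check_service(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers at all — any HTTP status counts as 'up'."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # reachable, just unhappy (e.g. 401 behind auth)
    except OSError:
        return False  # refused / unreachable / timed out
```

Uptime Kuma does the real monitoring; a probe like this is only useful for quick smoke tests in project scripts.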
+ +### P1: `homeai-infra` + +**Deliverables:** +- [ ] `docker-compose.yml` — master compose file (or per-service files under `~/server/docker/`) +- [ ] Services: Home Assistant, Portainer, Uptime Kuma, Gitea, code-server +- [ ] Tailscale installed on Mac Mini, all services on Tailnet +- [ ] Gitea repos initialised, SSH keys configured +- [ ] Uptime Kuma monitors all service endpoints +- [ ] Docker restart policies: `unless-stopped` on all containers +- [ ] Documented `.env` file pattern (secrets never committed) + +**Key decisions:** +- Single `docker-compose.yml` vs per-service compose files — recommend per-service files in `~/server/docker/<service>/` orchestrated by a root `Makefile` +- Tailscale as sole remote access method (no public port forwarding) +- Authelia deferred to Phase 4 polish (internal LAN services don't need 2FA immediately) + +**Interface contract:** Exposes service URLs as env vars (e.g. `HA_URL`, `GITEA_URL`) written to `~/server/.env.services` — consumed by all other projects. + +--- + +### P2: `homeai-llm` + +**Deliverables:** +- [ ] Ollama installed natively on Mac Mini (not Docker — needs Metal GPU access) +- [ ] Models pulled: `llama3.3:70b`, `qwen2.5:72b` (and a fast small model: `qwen2.5:7b` for low-latency tasks) +- [ ] Open WebUI running as Docker container, connected to Ollama +- [ ] Model benchmark script — measures tokens/sec per model +- [ ] `ollama-models.txt` — pinned model manifest for reproducibility + +**Key decisions:** +- Ollama runs as a launchd service (`~/Library/LaunchAgents/`) to survive reboots +- Open WebUI exposed only on Tailnet +- API endpoint: `http://localhost:11434` (Ollama default) + +**Interface contract:** Ollama OpenAI-compatible API at `http://localhost:11434/v1` — used by P3, P4, P7. + +--- + +## Phase 2 — Voice Pipeline (P3) + +**Goal:** Full end-to-end voice: speak → transcribe → LLM → TTS → hear response. No ESP32 yet — test with a USB mic on Mac Mini.
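The P2 benchmark-script deliverable could be sketched like this, using Ollama's native `/api/generate` endpoint, whose final response reports `eval_count` (decoded tokens) and `eval_duration` (nanoseconds). `benchmark_model` is a hypothetical helper, not the actual benchmark script:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama default, per the interface contract

def tokens_per_second(result: dict) -> float:
    """Ollama reports eval_count (tokens) and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)

def benchmark_model(model: str, prompt: str = "Describe a smart home in one sentence.") -> float:
    """Run one non-streaming generation; return decode speed in tokens/sec."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.load(resp))
```

Running this once per entry in `ollama-models.txt` would produce the per-model tokens/sec table the deliverable asks for.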
+ +### P3: `homeai-voice` + +**Deliverables:** +- [ ] Whisper.cpp compiled for Apple Silicon, model downloaded (`medium.en` or `large-v3`) +- [ ] Kokoro TTS installed, tested, latency benchmarked +- [ ] Chatterbox TTS installed (MPS optimised build), voice reference `.wav` ready +- [ ] Qwen3-TTS via MLX installed as fallback +- [ ] openWakeWord running on Mac Mini, detecting wake word +- [ ] Wyoming protocol server running — bridges STT+TTS into Home Assistant +- [ ] Home Assistant `voice_assistant` pipeline configured end-to-end +- [ ] Test script: `test_voice_pipeline.sh` — mic in → spoken response out + +**Sub-components:** + +``` +[Mic] → openWakeWord → Wyoming STT (Whisper.cpp) → [text out] +[text in] → Wyoming TTS (Kokoro) → [audio out] +``` + +**Key decisions:** +- Whisper.cpp runs as a Wyoming STT provider (via `wyoming-faster-whisper` or native Wyoming adapter) +- Kokoro is primary TTS; Chatterbox used when voice cloning is active (P5) +- openWakeWord runs as a launchd service +- Wyoming server port: `10300` (STT), `10301` (TTS) — standard Wyoming ports + +**Interface contract:** +- Wyoming STT: `tcp://localhost:10300` +- Wyoming TTS: `tcp://localhost:10301` +- Direct Python API for P4 (agent bypasses Wyoming for non-HA calls) + +--- + +## Phase 3 — AI Agent & Character (P4 + P5) + +**Goal:** OpenClaw receives voice/text input, applies character persona, calls tools, returns rich responses. 
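The voice-to-agent handoff (P3 passes transcribed text to the agent's local HTTP API) could be sketched as below. The `/chat` path, payload shape, and `response` field are assumptions to be confirmed once the OpenClaw API is pinned down (see Open Questions):

```python
import json
import urllib.request

OPENCLAW_URL = "http://localhost:8080"  # agent API port from the interface contracts

def build_chat_payload(text: str, room: str) -> dict:
    """Room name lets the agent route TTS audio back to the right satellite."""
    return {"input": text, "source": "voice", "room": room}

def hand_off_transcript(text: str, room: str = "living-room") -> str:
    """POST transcribed speech to the agent and return its reply text."""
    req = urllib.request.Request(
        f"{OPENCLAW_URL}/chat",  # hypothetical path — confirm against the real API
        data=json.dumps(build_chat_payload(text, room)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Whatever the real endpoint looks like, keeping the room identifier in the payload is what allows "audio playback to appropriate room" later in the pipeline.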
+ +### P4: `homeai-agent` + +**Deliverables:** +- [ ] OpenClaw installed and configured +- [ ] Connected to Ollama (`llama3.3:70b` as primary model) +- [ ] Connected to Home Assistant (long-lived access token in config) +- [ ] mem0 installed, configured with local storage backend +- [ ] mem0 backup job: daily git commit to Gitea +- [ ] Core skills written: + - `home_assistant.py` — call HA services (lights, switches, scenes) + - `memory.py` — read/write mem0 memories + - `weather.py` — local weather via HA sensor data + - `timer.py` — set timers/reminders + - `music.py` — stub for Music Assistant (Phase 7) +- [ ] n8n running as Docker container, webhook trigger from OpenClaw +- [ ] Sample n8n workflow: morning briefing (time + weather + calendar) +- [ ] System prompt template: loads character JSON from P5 + +**Key decisions:** +- OpenClaw config at `~/.openclaw/config.yaml` +- Skills at `~/.openclaw/skills/` — one file per skill, auto-discovered +- System prompt: `~/.openclaw/characters/<name>.json` loaded at startup +- mem0 store: local file backend at `~/.openclaw/memory/` (SQLite) + +**Interface contract:** +- OpenClaw exposes a local HTTP API (default port `8080`) — used by P3 (voice pipeline hands off transcribed text here) +- Consumes character JSON from P5 + +--- + +### P5: `homeai-character` + +**Deliverables:** +- [ ] Character Manager UI (`character-manager.jsx`) — already exists, needs wiring +- [ ] Character JSON schema v1 defined and documented +- [ ] Export produces `~/.openclaw/characters/<name>.json` +- [ ] Fields: name, system_prompt, voice_ref_path, tts_engine, live2d_expressions, vtube_ws_triggers, custom_rules, model_overrides +- [ ] Validation: schema validator script rejects malformed exports +- [ ] Sample character: `aria.json` (default assistant persona) +- [ ] Voice clone: reference `.wav` recorded/sourced, placed at `~/voices/<name>.wav` + +**Key decisions:** +- JSON schema is versioned (`"schema_version": 1`) — pipeline components check version before loading
+- Character Manager is a local React app (served by Vite dev server or built to static files) +- Single active character at a time; OpenClaw watches the file for changes (hot reload) + +**Interface contract:** +- Output: `~/.openclaw/characters/<name>.json` — consumed by P4, P3 (TTS voice selection), P7 (expression mapping) +- Schema published in `homeai-character/schema/character.schema.json` + +--- + +## Phase 4 — Hardware Satellites (P6) + +**Goal:** ESP32-S3-BOX-3 units act as room presence nodes — wake word, mic input, audio output, animated face. + +### P6: `homeai-esp32` + +**Deliverables:** +- [ ] ESPHome config for ESP32-S3-BOX-3 (`esphome/s3-box-living-room.yaml`, etc.) +- [ ] Wyoming Satellite component configured — streams mic audio to Mac Mini Wyoming STT +- [ ] Audio playback: receives TTS audio from Mac Mini, plays via built-in speaker +- [ ] LVGL face: animated idle/speaking/thinking states +- [ ] Wake word: either on-device (microWakeWord via ESPHome) or forwarded to Mac Mini openWakeWord +- [ ] OTA update mechanism configured +- [ ] One unit per room — config templated with room name as variable + +**LVGL Face States:** +| State | Animation | +|---|---| +| Idle | Slow blink, gentle sway | +| Listening | Eyes wide, mic indicator | +| Thinking | Eyes narrow, loading dots | +| Speaking | Mouth animation synced to audio | +| Error | Red eyes, shake | + +**Key decisions:** +- Wake word on-device preferred (lower latency, no always-on network stream) +- microWakeWord model: `hey_jarvis` or custom trained word +- LVGL animations compiled into ESPHome firmware (no runtime asset loading) +- Each unit has a unique device name for HA entity naming + +**Interface contract:** +- Wyoming Satellite → Mac Mini Wyoming STT server (`tcp://<mac-mini-host>:10300`) +- Receives audio back via Wyoming TTS response +- LVGL state driven by Home Assistant entity state (HA → ESPHome event) + +--- + +## Phase 5 — Visual Layer (P7) + +**Goal:** VTube Studio shows Live2D model on desktop/mobile;
expressions driven by AI pipeline state. + +### P7: `homeai-visual` + +**Deliverables:** +- [ ] VTube Studio installed on Mac Mini (macOS app) +- [ ] Live2D model loaded (sourced from nizima.com or booth.pm) +- [ ] VTube Studio WebSocket API enabled (port `8001`) +- [ ] OpenClaw skill: `vtube_studio.py` + - Connects to VTube Studio WebSocket + - Auth token exchange and persistence + - Methods: `trigger_expression(name)`, `trigger_hotkey(name)`, `set_parameter(name, value)` +- [ ] Expression map in character JSON → VTube hotkey IDs +- [ ] Lip sync: driven by audio envelope or TTS phoneme timing +- [ ] Mobile: VTube Studio on iOS/Android connected to same model via Tailscale + +**Key decisions:** +- Expression trigger events: `idle`, `speaking`, `thinking`, `happy`, `sad`, `error` +- Lip sync approach: simple amplitude-based (fast) rather than phoneme-based (complex) initially +- Auth token stored at `~/.openclaw/vtube_token.json` + +**Interface contract:** +- OpenClaw calls `vtube_studio.trigger_expression(event)` from within response pipeline +- Event names defined in character JSON `live2d_expressions` field + +--- + +## Phase 6 — Image Generation (P8) + +**Goal:** ComfyUI online with character-consistent image generation workflows. 
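The `comfyui.py` skill's submission path could be sketched as below, assuming ComfyUI's standard REST behaviour (`POST /prompt` with a workflow graph and `client_id`, returning a `prompt_id` to poll `/history` with); verify field names against the running ComfyUI instance before relying on them:

```python
import json
import urllib.request
import uuid
from pathlib import Path

COMFYUI_URL = "http://localhost:8188"  # ComfyUI REST API per the interface contract

def load_workflow(name: str, workflows_dir: str = "workflows") -> dict:
    """Saved workflows are kept as API-format JSON graphs (quick/portrait/scene)."""
    return json.loads((Path(workflows_dir) / f"{name}.json").read_text())

def submit_workflow(workflow: dict) -> str:
    """Queue a graph; ComfyUI replies with a prompt_id for polling /history."""
    body = json.dumps({"prompt": workflow, "client_id": str(uuid.uuid4())}).encode()
    req = urllib.request.Request(
        f"{COMFYUI_URL}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]
```

The skill's `generate(workflow_name, params)` wrapper would load the saved graph, patch `params` into the relevant nodes, submit it, then poll history until the output image path appears.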
+ +### P8: `homeai-images` + +**Deliverables:** +- [ ] ComfyUI installed at `~/ComfyUI/`, running via launchd +- [ ] Models downloaded: SDXL base, Flux.1-dev (or schnell), ControlNet (canny, depth) +- [ ] Character LoRA: trained on character reference images for consistent appearance +- [ ] Saved workflows: + - `workflows/portrait.json` — character portrait, controllable expression + - `workflows/scene.json` — character in scene with ControlNet pose + - `workflows/quick.json` — fast draft via Flux.1-schnell +- [ ] OpenClaw skill: `comfyui.py` — submits workflow via ComfyUI REST API, returns image path +- [ ] ComfyUI API port: `8188` + +**Interface contract:** +- OpenClaw calls `comfyui.generate(workflow_name, params)` → returns local image path +- ComfyUI REST API: `http://localhost:8188` + +--- + +## Phase 7 — Extended Integrations & Polish + +**Deliverables:** +- [ ] Music Assistant — Docker container, integrated with HA, OpenClaw `music.py` skill updated +- [ ] Snapcast — server on Mac Mini, clients on ESP32 units (multi-room sync) +- [ ] Authelia — 2FA in front of all web UIs exposed via Tailscale +- [ ] n8n advanced workflows: daily briefing, calendar reminders, notification routing +- [ ] iOS Shortcuts companion: trigger OpenClaw from iPhone widget +- [ ] Uptime Kuma alerts: pushover/ntfy notifications on service down +- [ ] Backup automation: daily Gitea commits of mem0, character configs, n8n workflows + +--- + +## Dependency Graph + +``` +P1 (infra) ─────────────────────────────┐ +P2 (llm) ──────────────────────┐ │ +P3 (voice) ────────────────┐ │ │ +P5 (character) ──────┐ │ │ │ + ↓ ↓ ↓ ↓ + P4 (agent) ─────→ HA + ↓ + P6 (esp32) ← Wyoming + P7 (visual) ← vtube skill + P8 (images) ← comfyui skill +``` + +**Hard dependencies:** +- P4 requires P1 (HA URL), P2 (Ollama), P5 (character JSON) +- P3 requires P2 (LLM), P4 (agent endpoint) +- P6 requires P3 (Wyoming server), P1 (HA) +- P7 requires P4 (OpenClaw skill runner), P5 (expression map) +- P8 requires P4 
(OpenClaw skill runner) + +**Can be done in parallel:** +- P1 + P5 (infra and character manager are independent) +- P2 + P5 (LLM setup and character UI are independent) +- P7 + P8 (visual and images are both P4 dependents but independent of each other) + +--- + +## Interface Contracts Summary + +| Contract | Type | Defined In | Consumed By | +|---|---|---|---| +| `~/server/.env.services` | env file | P1 | All | +| Ollama API `localhost:11434/v1` | HTTP (OpenAI compat) | P2 | P3, P4, P7 | +| Wyoming STT `localhost:10300` | TCP/Wyoming | P3 | P6, HA | +| Wyoming TTS `localhost:10301` | TCP/Wyoming | P3 | P6, HA | +| OpenClaw API `localhost:8080` | HTTP | P4 | P3, P7, P8 | +| Character JSON `~/.openclaw/characters/` | JSON file | P5 | P4, P3, P7 | +| `character.schema.json` v1 | JSON Schema | P5 | P4, P3, P7 | +| VTube Studio WS `localhost:8001` | WebSocket | VTube Studio | P7 | +| ComfyUI API `localhost:8188` | HTTP | ComfyUI | P8 | +| Home Assistant API | HTTP/WS | P1 (HA) | P4, P6 | + +--- + +## Repo Structure (Gitea) + +``` +~/gitea/homeai/ +├── homeai-infra/ # P1 +│ ├── docker/ # per-service compose files +│ ├── scripts/ # setup/teardown helpers +│ └── Makefile +├── homeai-llm/ # P2 +│ ├── ollama-models.txt +│ └── scripts/ +├── homeai-voice/ # P3 +│ ├── whisper/ +│ ├── tts/ +│ ├── wyoming/ +│ └── scripts/ +├── homeai-agent/ # P4 +│ ├── skills/ +│ ├── workflows/ # n8n exports +│ └── config/ +├── homeai-character/ # P5 +│ ├── src/ # React character manager +│ ├── schema/ +│ └── characters/ # exported JSONs +├── homeai-esp32/ # P6 +│ └── esphome/ +├── homeai-visual/ # P7 +│ └── skills/ +└── homeai-images/ # P8 + ├── workflows/ # ComfyUI workflow JSONs + └── skills/ +``` + +--- + +## Suggested Build Order + +| Week | Focus | Projects | +|---|---|---| +| 1 | Infrastructure up, LLM running | P1, P2 | +| 2 | Voice pipeline end-to-end (desktop mic test) | P3 | +| 3 | Character Manager wired, OpenClaw connected | P4, P5 | +| 4 | ESP32 firmware, first satellite running | 
P6 | +| 5 | VTube Studio live, expressions working | P7 | +| 6 | ComfyUI online, character LoRA trained | P8 | +| 7+ | Extended integrations, polish, Authelia | Phase 7 | + +--- + +## Open Questions / Decisions Needed + +- [ ] Which OpenClaw version/fork to use? (confirm it supports Ollama natively) +- [ ] Wake word: `hey_jarvis` vs custom trained word — what should the character's name be? +- [ ] Live2D model: commission custom or buy from nizima.com? Budget? +- [ ] Snapcast: output to ESP32 speakers or separate audio hardware per room? +- [ ] n8n: self-hosted Docker vs n8n Cloud (given local-first preference → Docker) +- [ ] Authelia: local user store or LDAP backend? (local store is simpler) +- [ ] mem0: local SQLite or run Qdrant vector DB for better semantic search? diff --git a/TODO.md b/TODO.md new file mode 100644 index 0000000..f568ea6 --- /dev/null +++ b/TODO.md @@ -0,0 +1,189 @@ +# HomeAI — Master TODO + +> Track progress across all sub-projects. See each sub-project `PLAN.md` for detailed implementation notes. 
+> Status: `[ ]` pending · `[~]` in progress · `[x]` done + +--- + +## Phase 1 — Foundation + +### P1 · homeai-infra + +- [ ] Install Docker Desktop for Mac, enable launch at login +- [ ] Create shared `homeai` Docker network +- [ ] Create `~/server/docker/` directory structure +- [ ] Write compose files: Home Assistant, Portainer, Uptime Kuma, Gitea, code-server, n8n +- [ ] Write `.env.secrets.example` and `Makefile` +- [ ] `make up-all` — bring all services up +- [ ] Home Assistant onboarding — generate long-lived access token +- [ ] Write `~/server/.env.services` with all service URLs +- [ ] Install Tailscale, verify all services reachable on Tailnet +- [ ] Gitea: create admin account, initialise all 8 sub-project repos, configure SSH +- [ ] Uptime Kuma: add monitors for all services, configure mobile alerts +- [ ] Verify all containers survive a cold reboot + +### P2 · homeai-llm + +- [ ] Install Ollama natively via brew +- [ ] Write and load launchd plist (`com.ollama.ollama.plist`) +- [ ] Write `ollama-models.txt` with model manifest +- [ ] Run `scripts/pull-models.sh` — pull all models +- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md` +- [ ] Deploy Open WebUI via Docker compose (port 3030) +- [ ] Verify Open WebUI connected to Ollama, all models available +- [ ] Add Ollama + Open WebUI to Uptime Kuma monitors +- [ ] Add `OLLAMA_URL` and `OPEN_WEBUI_URL` to `.env.services` + +--- + +## Phase 2 — Voice Pipeline + +### P3 · homeai-voice + +- [ ] Compile Whisper.cpp with Metal support +- [ ] Download Whisper models (`large-v3`, `medium.en`) to `~/models/whisper/` +- [ ] Install `wyoming-faster-whisper`, test STT from audio file +- [ ] Install Kokoro TTS, test output to audio file +- [ ] Install Wyoming-Kokoro adapter, verify Wyoming protocol +- [ ] Write + load launchd plists for Wyoming STT (10300) and TTS (10301) +- [ ] Connect Home Assistant Wyoming integration (STT + TTS) +- [ ] Create HA Voice Assistant pipeline +- [ ] Test HA 
Assist via browser: type query → hear spoken response +- [ ] Install openWakeWord, test wake detection with USB mic +- [ ] Write + load openWakeWord launchd plist +- [ ] Install Chatterbox TTS (MPS build), test with sample `.wav` +- [ ] Install Qwen3-TTS via MLX (fallback) +- [ ] Write `wyoming/test-pipeline.sh` — end-to-end smoke test +- [ ] Add Wyoming STT/TTS to Uptime Kuma monitors + +--- + +## Phase 3 — Agent & Character + +### P5 · homeai-character *(no runtime deps — can start alongside P1)* + +- [ ] Define and write `schema/character.schema.json` (v1) +- [ ] Write `characters/aria.json` — default character +- [ ] Set up Vite project in `src/`, install deps +- [ ] Integrate existing `character-manager.jsx` into Vite project +- [ ] Add schema validation on export (ajv) +- [ ] Add expression mapping UI section +- [ ] Add custom rules editor +- [ ] Test full edit → export → validate → load cycle +- [ ] Record or source voice reference audio for Aria (`~/voices/aria.wav`) +- [ ] Pre-process audio with ffmpeg, test with Chatterbox +- [ ] Update `aria.json` with voice clone path if quality is good +- [ ] Write `SchemaValidator.js` as standalone utility + +### P4 · homeai-agent + +- [ ] Confirm OpenClaw installation method and Ollama compatibility +- [ ] Install OpenClaw, write `~/.openclaw/config.yaml` +- [ ] Verify OpenClaw responds to basic text query via `/chat` +- [ ] Write `skills/home_assistant.py` — test lights on/off via voice +- [ ] Write `skills/memory.py` — test store and recall +- [ ] Write `skills/weather.py` — verify HA weather sensor data +- [ ] Write `skills/timer.py` — test set/fire a timer +- [ ] Write skill stubs: `music.py`, `vtube_studio.py`, `comfyui.py` +- [ ] Set up mem0 with Chroma backend, test semantic recall +- [ ] Write and load memory backup launchd job +- [ ] Symlink `homeai-agent/skills/` → `~/.openclaw/skills/` +- [ ] Build morning briefing n8n workflow +- [ ] Build notification router n8n workflow +- [ ] Verify full voice → agent 
→ HA action flow +- [ ] Add OpenClaw to Uptime Kuma monitors + +--- + +## Phase 4 — Hardware Satellites + +### P6 · homeai-esp32 + +- [ ] Install ESPHome: `pip install esphome` +- [ ] Write `esphome/secrets.yaml` (gitignored) +- [ ] Write `base.yaml`, `voice.yaml`, `display.yaml`, `animations.yaml` +- [ ] Write `s3-box-living-room.yaml` for first unit +- [ ] Flash first unit via USB +- [ ] Verify unit appears in HA device list +- [ ] Assign Wyoming voice pipeline to unit in HA +- [ ] Test full wake → STT → LLM → TTS → audio playback cycle +- [ ] Test LVGL face: idle → listening → thinking → speaking → error +- [ ] Verify OTA firmware update works wirelessly +- [ ] Flash remaining units (bedroom, kitchen, etc.) +- [ ] Document MAC address → room name mapping + +--- + +## Phase 5 — Visual Layer + +### P7 · homeai-visual + +- [ ] Install VTube Studio (Mac App Store) +- [ ] Enable WebSocket API on port 8001 +- [ ] Source/purchase a Live2D model (nizima.com or booth.pm) +- [ ] Load model in VTube Studio +- [ ] Create hotkeys for all six expression states (`idle`, `speaking`, `thinking`, `happy`, `sad`, `error`) +- [ ] Write `skills/vtube_studio.py` full implementation +- [ ] Run auth flow — click Allow in VTube Studio, save token +- [ ] Test all six expressions via test script +- [ ] Update `aria.json` with real VTube Studio hotkey IDs +- [ ] Write `lipsync.py` amplitude-based helper +- [ ] Integrate lip sync into OpenClaw TTS dispatch +- [ ] Symlink `skills/` → `~/.openclaw/skills/` +- [ ] Test full pipeline: voice → thinking expression → speaking with lip sync +- [ ] Set up VTube Studio mobile (iPhone/iPad) on Tailnet + +--- + +## Phase 6 — Image Generation + +### P8 · homeai-images + +- [ ] Clone ComfyUI to `~/ComfyUI/`, install deps in venv +- [ ] Verify MPS is detected at launch +- [ ] Write and load launchd plist (`com.homeai.comfyui.plist`) +- [ ] Download SDXL base model +- [ ] Download Flux.1-schnell +- [ ] Download ControlNet models (canny, depth) +- [ ] Test generation via ComfyUI web UI (port 8188) +- [ ] Build and
export `quick.json` workflow +- [ ] Build and export `portrait.json` workflow +- [ ] Build and export `scene.json` workflow (ControlNet) +- [ ] Build and export `upscale.json` workflow +- [ ] Write `skills/comfyui.py` full implementation +- [ ] Test skill: `comfyui.quick("test prompt")` → image file returned +- [ ] Collect character reference images for LoRA training +- [ ] Train SDXL LoRA with kohya_ss +- [ ] Load LoRA into `portrait.json`, verify character consistency +- [ ] Symlink `skills/` → `~/.openclaw/skills/` +- [ ] Test via OpenClaw: "Generate a portrait of Aria looking happy" +- [ ] Add ComfyUI to Uptime Kuma monitors + +--- + +## Phase 7 — Extended Integrations & Polish + +- [ ] Deploy Music Assistant (Docker), integrate with Home Assistant +- [ ] Complete `skills/music.py` in OpenClaw +- [ ] Deploy Snapcast server on Mac Mini +- [ ] Configure Snapcast clients on ESP32 units for multi-room audio +- [ ] Configure Authelia as 2FA layer in front of web UIs +- [ ] Build advanced n8n workflows (calendar reminders, daily briefing v2) +- [ ] Create iOS Shortcuts to trigger OpenClaw from iPhone widget +- [ ] Configure ntfy/Pushover alerts in Uptime Kuma for all services +- [ ] Automate mem0 + character config backup to Gitea (daily) +- [ ] Train custom wake word using character's name +- [ ] Document all service URLs, ports, and credentials in a private Gitea wiki +- [ ] Tailscale ACL hardening — restrict which devices can reach which services +- [ ] Stress test: reboot Mac Mini, verify all services recover in <2 minutes + +--- + +## Open Decisions + +- [ ] Confirm character name (determines wake word training) +- [ ] Confirm OpenClaw version/fork and Ollama compatibility +- [ ] Live2D model: purchase off-the-shelf or commission custom? +- [ ] mem0 backend: Chroma (simple) vs Qdrant Docker (better semantic search)? +- [ ] Snapcast output: ESP32 built-in speakers or dedicated audio hardware per room? +- [ ] Authelia user store: local file vs LDAP? 
diff --git a/homeai-agent/PLAN.md b/homeai-agent/PLAN.md new file mode 100644 index 0000000..cb9dc75 --- /dev/null +++ b/homeai-agent/PLAN.md @@ -0,0 +1,335 @@ +# P4: homeai-agent — AI Agent, Skills & Automation + +> Phase 3 | Depends on: P1 (HA), P2 (Ollama), P3 (Wyoming/TTS), P5 (character JSON) + +--- + +## Goal + +OpenClaw running as the primary AI agent: receives voice/text input, loads character persona, calls tools (skills), manages memory (mem0), dispatches responses (TTS, HA actions, VTube expressions). n8n handles scheduled/automated workflows. + +--- + +## Architecture + +``` +Voice input (text from P3 Wyoming STT) + ↓ +OpenClaw API (port 8080) + ↓ loads character JSON from P5 + System prompt construction + ↓ + Ollama LLM (P2) — llama3.3:70b + ↓ response + tool calls + Skill dispatcher + ├── home_assistant.py → HA REST API (P1) + ├── memory.py → mem0 (local) + ├── vtube_studio.py → VTube WS (P7) + ├── comfyui.py → ComfyUI API (P8) + ├── music.py → Music Assistant (Phase 7) + └── weather.py → HA sensor data + ↓ final response text + TTS dispatch: + ├── Chatterbox (voice clone, if active) + └── Kokoro (via Wyoming, fallback) + ↓ + Audio playback to appropriate room +``` + +--- + +## OpenClaw Setup + +### Installation + +```bash +# Confirm OpenClaw supports Ollama — check repo for latest install method +pip install openclaw +# or +git clone https://github.com/<org-or-user>/openclaw +pip install -e . +``` + +**Key question:** Verify OpenClaw's Ollama/OpenAI-compatible backend support before installation. If OpenClaw doesn't support local Ollama natively, use a thin adapter layer pointing its OpenAI endpoint at `http://localhost:11434/v1`.
+ +### Config — `~/.openclaw/config.yaml` + +```yaml +version: 1 + +llm: + provider: ollama # or openai-compatible + base_url: http://localhost:11434/v1 + model: llama3.3:70b + fast_model: qwen2.5:7b # used for quick intent classification + +character: + active: aria + config_dir: ~/.openclaw/characters/ + +memory: + provider: mem0 + store_path: ~/.openclaw/memory/ + embedding_model: nomic-embed-text + embedding_url: http://localhost:11434/v1 + +api: + host: 0.0.0.0 + port: 8080 + +tts: + primary: chatterbox # when voice clone active + fallback: kokoro-wyoming # Wyoming TTS endpoint + wyoming_tts_url: tcp://localhost:10301 + +wake: + endpoint: /wake # openWakeWord POSTs here to trigger listening +``` + +--- + +## Skills + +All skills live in `~/.openclaw/skills/` (symlinked from `homeai-agent/skills/`). + +### `home_assistant.py` + +Wraps the HA REST API for common smart home actions. + +**Functions:** +- `turn_on(entity_id, **kwargs)` — lights, switches, media players +- `turn_off(entity_id)` +- `toggle(entity_id)` +- `set_light(entity_id, brightness=None, color_temp=None, rgb_color=None)` +- `run_scene(scene_id)` +- `get_state(entity_id)` → returns state + attributes +- `list_entities(domain=None)` → returns entity list + +Uses `HA_URL` and `HA_TOKEN` from `.env.services`. + +### `memory.py` + +Wraps mem0 for persistent long-term memory. + +**Functions:** +- `remember(text, category=None)` — store a memory +- `recall(query, limit=5)` — semantic search over memories +- `forget(memory_id)` — delete a specific memory +- `list_recent(n=10)` — list most recent memories + +mem0 uses `nomic-embed-text` via Ollama for embeddings. + +### `weather.py` + +Pulls weather data from Home Assistant sensors (local weather station or HA weather integration). + +**Functions:** +- `get_current()` → temp, humidity, conditions +- `get_forecast(days=3)` → forecast array + +### `timer.py` + +Simple timer/reminder management. 
+ +**Functions:** +- `set_timer(duration_seconds, label=None)` → fires HA notification/TTS on expiry +- `set_reminder(datetime_str, message)` → schedules future TTS playback +- `list_timers()` +- `cancel_timer(timer_id)` + +### `music.py` (stub — completed in Phase 7) + +```python +def play(query: str): ... # "play jazz" → Music Assistant +def pause(): ... +def skip(): ... +def set_volume(level: int): ... # 0-100 +``` + +### `vtube_studio.py` (implemented in P7) + +Stub in P4, full implementation in P7: +```python +def trigger_expression(event: str): ... # "thinking", "happy", etc. +def set_parameter(name: str, value: float): ... +``` + +### `comfyui.py` (implemented in P8) + +Stub in P4, full implementation in P8: +```python +def generate(workflow: str, params: dict) -> str: ... # returns image path +``` + +--- + +## mem0 — Long-Term Memory + +### Setup + +```bash +pip install mem0ai +``` + +### Config + +```python +from mem0 import Memory + +config = { + "llm": { + "provider": "ollama", + "config": { + "model": "llama3.3:70b", + "ollama_base_url": "http://localhost:11434", + } + }, + "embedder": { + "provider": "ollama", + "config": { + "model": "nomic-embed-text", + "ollama_base_url": "http://localhost:11434", + } + }, + "vector_store": { + "provider": "chroma", + "config": { + "collection_name": "homeai_memory", + "path": "~/.openclaw/memory/chroma", + } + } +} + +memory = Memory.from_config(config) +``` + +> **Decision point:** Start with Chroma (local file-based). If semantic recall quality is poor, migrate to Qdrant (Docker container). + +### Backup + +Daily cron (via launchd) commits mem0 data to Gitea: + +```bash +#!/usr/bin/env bash +cd ~/.openclaw/memory +git add . +git commit -m "mem0 backup $(date +%Y-%m-%d)" +git push origin main +``` + +--- + +## n8n Workflows + +n8n runs in Docker (deployed in P1). Workflows exported as JSON and stored in `homeai-agent/workflows/`. 
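Before the individual workflows: the `home_assistant.py` skill described earlier mostly reduces to one generic call against Home Assistant's standard REST API (`POST /api/services/<domain>/<service>` with a long-lived access token). A sketch under that assumption, not the final skill:

```python
import json
import os
from urllib.request import Request, urlopen

# HA_URL / HA_TOKEN come from .env.services, as noted above.
HA_URL = os.environ.get("HA_URL", "http://localhost:8123")
HA_TOKEN = os.environ.get("HA_TOKEN", "")

def domain_of(entity_id: str) -> str:
    """'light.living_room' -> 'light'."""
    return entity_id.split(".", 1)[0]

def _call_service(domain: str, service: str, data: dict) -> None:
    """POST /api/services/<domain>/<service>, the standard HA REST endpoint."""
    req = Request(
        f"{HA_URL}/api/services/{domain}/{service}",
        data=json.dumps(data).encode(),
        headers={
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    urlopen(req, timeout=10).close()

def turn_on(entity_id: str, **kwargs) -> None:
    _call_service(domain_of(entity_id), "turn_on", {"entity_id": entity_id, **kwargs})

def turn_off(entity_id: str) -> None:
    _call_service(domain_of(entity_id), "turn_off", {"entity_id": entity_id})
```

`set_light` and `run_scene` follow the same pattern with `light.turn_on` kwargs (`brightness`, `rgb_color`) and `scene.turn_on` respectively.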
+ +### Starter Workflows + +**`morning-briefing.json`** +- Trigger: time-based (e.g., 7:30 AM on weekdays) +- Steps: fetch weather → fetch calendar events → compose briefing → POST to OpenClaw TTS → speak aloud + +**`notification-router.json`** +- Trigger: HA webhook (new notification) +- Steps: classify urgency → if high: TTS immediately; if low: queue for next interaction + +**`memory-backup.json`** +- Trigger: daily schedule +- Steps: commit mem0 data to Gitea + +### n8n ↔ OpenClaw Integration + +OpenClaw exposes a webhook endpoint that n8n can call to trigger TTS or run a skill: + +``` +POST http://localhost:8080/speak +{ + "text": "Good morning. It is 7:30 and the weather is...", + "room": "all" +} +``` + +--- + +## API Surface (OpenClaw) + +Key endpoints consumed by other projects: + +| Endpoint | Method | Description | +|---|---|---| +| `/chat` | POST | Send text, get response (+ fires skills) | +| `/wake` | POST | Wake word trigger from openWakeWord | +| `/speak` | POST | TTS only — no LLM, just speak text | +| `/skill/` | POST | Call a specific skill directly | +| `/memory` | GET/POST | Read/write memories | +| `/status` | GET | Health check | + +--- + +## Directory Layout + +``` +homeai-agent/ +├── skills/ +│ ├── home_assistant.py +│ ├── memory.py +│ ├── weather.py +│ ├── timer.py +│ ├── music.py # stub +│ ├── vtube_studio.py # stub +│ └── comfyui.py # stub +├── workflows/ +│ ├── morning-briefing.json +│ ├── notification-router.json +│ └── memory-backup.json +└── config/ + ├── config.yaml.example + └── mem0-config.py +``` + +--- + +## Interface Contracts + +**Consumes:** +- Ollama API: `http://localhost:11434/v1` +- HA API: `$HA_URL` with `$HA_TOKEN` +- Wyoming TTS: `tcp://localhost:10301` +- Character JSON: `~/.openclaw/characters/.json` (from P5) + +**Exposes:** +- OpenClaw HTTP API: `http://localhost:8080` — consumed by P3 (voice), P7 (visual triggers), P8 (image skill) + +**Add to `.env.services`:** +```dotenv +OPENCLAW_URL=http://localhost:8080 +``` 
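The API surface above can be smoke-tested with a tiny client. The `/speak` payload shape follows the n8n example; the body field for `/chat` is an assumption until OpenClaw's schema is confirmed:

```python
import json
from urllib.request import Request, urlopen

OPENCLAW_URL = "http://localhost:8080"  # matches OPENCLAW_URL in .env.services

def speak_payload(text: str, room: str = "all") -> dict:
    """Payload shape taken from the n8n /speak example above."""
    return {"text": text, "room": room}

def _post(path: str, payload: dict) -> dict:
    req = Request(
        f"{OPENCLAW_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)

def speak(text: str, room: str = "all") -> dict:
    return _post("/speak", speak_payload(text, room))

def chat(text: str) -> dict:
    # The "text" field name is an assumption; check OpenClaw's /chat schema.
    return _post("/chat", {"text": text})
```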
+ +--- + +## Implementation Steps + +- [ ] Confirm OpenClaw installation method and Ollama compatibility +- [ ] Install OpenClaw, write `config.yaml` pointing at Ollama and HA +- [ ] Verify OpenClaw responds to a basic text query via `/chat` +- [ ] Write `home_assistant.py` skill — test lights on/off via voice +- [ ] Write `memory.py` skill — test store and recall +- [ ] Write `weather.py` skill — verify HA weather sensor data +- [ ] Write `timer.py` skill — test set/fire a timer +- [ ] Write skill stubs: `music.py`, `vtube_studio.py`, `comfyui.py` +- [ ] Set up mem0 with Chroma backend, test semantic recall +- [ ] Write and test memory backup launchd job +- [ ] Deploy n8n via Docker (P1 task if not done) +- [ ] Build morning briefing n8n workflow +- [ ] Symlink `homeai-agent/skills/` → `~/.openclaw/skills/` +- [ ] Verify full voice → agent → HA action flow (with P3 pipeline) + +--- + +## Success Criteria + +- [ ] "Turn on the living room lights" → lights turn on via HA +- [ ] "Remember that I prefer jazz in the mornings" → mem0 stores it; "What do I like in the mornings?" → recalls it +- [ ] Morning briefing n8n workflow fires on schedule and speaks via TTS +- [ ] OpenClaw `/status` returns healthy +- [ ] OpenClaw survives Mac Mini reboot (launchd or Docker — TBD based on OpenClaw's preferred run method) diff --git a/homeai-character/PLAN.md b/homeai-character/PLAN.md new file mode 100644 index 0000000..022367b --- /dev/null +++ b/homeai-character/PLAN.md @@ -0,0 +1,300 @@ +# P5: homeai-character — Character System & Persona Config + +> Phase 3 | No hard runtime dependencies | Consumed by: P3, P4, P7 + +--- + +## Goal + +A single, authoritative character configuration that defines the AI assistant's personality, voice, visual expressions, and prompt rules. The Character Manager UI (already started as `character-manager.jsx`) provides a friendly editor. The exported JSON is the single source of truth for all pipeline components. 
+ +--- + +## Character JSON Schema v1 + +File: `schema/character.schema.json` + +```json +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "title": "HomeAI Character Config", + "version": "1", + "type": "object", + "required": ["schema_version", "name", "system_prompt", "tts"], + "properties": { + "schema_version": { "type": "integer", "const": 1 }, + "name": { "type": "string" }, + "display_name": { "type": "string" }, + "description": { "type": "string" }, + + "system_prompt": { "type": "string" }, + + "model_overrides": { + "type": "object", + "properties": { + "primary": { "type": "string" }, + "fast": { "type": "string" } + } + }, + + "tts": { + "type": "object", + "required": ["engine"], + "properties": { + "engine": { + "type": "string", + "enum": ["kokoro", "chatterbox", "qwen3"] + }, + "voice_ref_path": { "type": "string" }, + "kokoro_voice": { "type": "string" }, + "speed": { "type": "number", "default": 1.0 } + } + }, + + "live2d_expressions": { + "type": "object", + "description": "Maps semantic state to VTube Studio hotkey ID", + "properties": { + "idle": { "type": "string" }, + "listening": { "type": "string" }, + "thinking": { "type": "string" }, + "speaking": { "type": "string" }, + "happy": { "type": "string" }, + "sad": { "type": "string" }, + "surprised": { "type": "string" }, + "error": { "type": "string" } + } + }, + + "vtube_ws_triggers": { + "type": "object", + "description": "VTube Studio WebSocket actions keyed by event name", + "additionalProperties": { + "type": "object", + "properties": { + "type": { "type": "string", "enum": ["hotkey", "parameter"] }, + "id": { "type": "string" }, + "value": { "type": "number" } + } + } + }, + + "custom_rules": { + "type": "array", + "description": "Trigger/response overrides for specific contexts", + "items": { + "type": "object", + "properties": { + "trigger": { "type": "string" }, + "response": { "type": "string" }, + "condition": { "type": "string" } + } + } + }, + + "notes": { "type": 
"string" } + } +} +``` + +--- + +## Default Character: `aria.json` + +File: `characters/aria.json` + +```json +{ + "schema_version": 1, + "name": "aria", + "display_name": "Aria", + "description": "Default HomeAI assistant persona", + + "system_prompt": "You are Aria, a warm, curious, and helpful AI assistant living in the home. You speak naturally and conversationally — never robotic. You are knowledgeable but never condescending. You remember the people you live with and build on those memories over time. Keep responses concise when controlling smart home devices; be more expressive in casual conversation. Never break character.", + + "model_overrides": { + "primary": "llama3.3:70b", + "fast": "qwen2.5:7b" + }, + + "tts": { + "engine": "kokoro", + "kokoro_voice": "af_heart", + "voice_ref_path": null, + "speed": 1.0 + }, + + "live2d_expressions": { + "idle": "expr_idle", + "listening": "expr_listening", + "thinking": "expr_thinking", + "speaking": "expr_speaking", + "happy": "expr_happy", + "sad": "expr_sad", + "surprised": "expr_surprised", + "error": "expr_error" + }, + + "vtube_ws_triggers": { + "thinking": { "type": "hotkey", "id": "expr_thinking" }, + "speaking": { "type": "hotkey", "id": "expr_speaking" }, + "idle": { "type": "hotkey", "id": "expr_idle" } + }, + + "custom_rules": [ + { + "trigger": "good morning", + "response": "Good morning! How did you sleep?", + "condition": "time_of_day == morning" + } + ], + + "notes": "Default persona. Voice clone to be added once reference audio recorded." +} +``` + +--- + +## Character Manager UI + +### Status + +`character-manager.jsx` already exists — needs: +1. Schema validation before export (reject malformed JSONs) +2. File system integration: save/load from `characters/` directory +3. Live preview of system prompt +4. 
Expression mapping UI for Live2D states + +### Tech Stack + +- React + Vite (local dev server, not deployed) +- Tailwind CSS (or minimal CSS) +- Runs at `http://localhost:5173` during editing + +### File Structure + +``` +homeai-character/ +├── src/ +│ ├── character-manager.jsx ← existing, extend here +│ ├── SchemaValidator.js ← validate against character.schema.json +│ ├── ExpressionMapper.jsx ← UI for Live2D expression mapping +│ └── main.jsx +├── schema/ +│ └── character.schema.json +├── characters/ +│ ├── aria.json ← default character +│ └── .gitkeep +├── package.json +└── vite.config.js +``` + +### Character Manager Features + +| Feature | Description | +|---|---| +| Basic info | name, display name, description | +| System prompt | Multi-line editor with char count | +| Model overrides | Dropdown: primary + fast model | +| TTS config | Engine picker, voice selector, speed slider, voice ref path | +| Expression mapping | Table: state → VTube hotkey ID | +| VTube WS triggers | JSON editor for advanced triggers | +| Custom rules | Add/edit/delete trigger-response pairs | +| Notes | Free-text notes field | +| Export | Validates schema, writes to `characters/.json` | +| Import | Load existing character JSON for editing | + +### Schema Validation + +```javascript +import Ajv from 'ajv' +import schema from '../schema/character.schema.json' + +const ajv = new Ajv() +const validate = ajv.compile(schema) + +export function validateCharacter(config) { + const valid = validate(config) + if (!valid) throw new Error(ajv.errorsText(validate.errors)) + return true +} +``` + +--- + +## Voice Clone Workflow + +1. Record 30–60 seconds of clean speech at `~/voices/-raw.wav` + - Quiet room, consistent mic distance, natural conversational tone +2. Pre-process: `ffmpeg -i raw.wav -ar 22050 -ac 1 aria.wav` +3. Place at `~/voices/aria.wav` +4. Update character JSON: `"voice_ref_path": "~/voices/aria.wav"`, `"engine": "chatterbox"` +5. 
Test: run Chatterbox with the reference, verify voice quality +6. If unsatisfactory, try Qwen3-TTS as alternative + +--- + +## Pipeline Integration + +### How P4 (OpenClaw) loads the character + +```python +import json +from pathlib import Path + +def load_character(name: str) -> dict: + path = Path.home() / ".openclaw" / "characters" / f"{name}.json" + config = json.loads(path.read_text()) + assert config["schema_version"] == 1, "Unsupported schema version" + return config + +# System prompt injection +character = load_character("aria") +system_prompt = character["system_prompt"] +# Pass to Ollama as system message +``` + +OpenClaw hot-reloads the character JSON on file change — no restart required. + +### How P3 selects TTS engine + +```python +character = load_character(active_name) +tts_cfg = character["tts"] + +if tts_cfg["engine"] == "chatterbox": + tts = ChatterboxTTS(voice_ref=tts_cfg["voice_ref_path"]) +elif tts_cfg["engine"] == "qwen3": + tts = Qwen3TTS() +else: # kokoro (default) + tts = KokoroWyomingClient(voice=tts_cfg.get("kokoro_voice", "af_heart")) +``` + +--- + +## Implementation Steps + +- [ ] Define and write `schema/character.schema.json` (v1) +- [ ] Write `characters/aria.json` — default character with placeholder expression IDs +- [ ] Set up Vite project in `src/` (install deps: `npm install`) +- [ ] Integrate existing `character-manager.jsx` into new Vite project +- [ ] Add schema validation on export (`ajv`) +- [ ] Add expression mapping UI section +- [ ] Add custom rules editor +- [ ] Test full edit → export → validate → load cycle +- [ ] Record or source voice reference audio for Aria +- [ ] Pre-process audio and test with Chatterbox +- [ ] Update `aria.json` with voice clone path if quality is good +- [ ] Write `SchemaValidator.js` as standalone utility (used by P4 at runtime too) +- [ ] Document schema in `schema/README.md` + +--- + +## Success Criteria + +- [ ] `aria.json` validates against `character.schema.json` without errors +- [ ] 
Character Manager UI can load, edit, and export `aria.json` +- [ ] OpenClaw loads `aria.json` system prompt and applies it to Ollama requests +- [ ] P3 TTS engine selection correctly follows `tts.engine` field +- [ ] Schema version check in P4 fails gracefully with a clear error message +- [ ] Voice clone sounds natural (if Chatterbox path taken) diff --git a/homeai-esp32/PLAN.md b/homeai-esp32/PLAN.md new file mode 100644 index 0000000..4d406b5 --- /dev/null +++ b/homeai-esp32/PLAN.md @@ -0,0 +1,357 @@ +# P6: homeai-esp32 — Room Satellite Hardware + +> Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running) + +--- + +## Goal + +Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, local wake word detection, audio playback, and an LVGL animated face showing assistant state. All intelligence stays on the Mac Mini. + +--- + +## Hardware: ESP32-S3-BOX-3 + +| Feature | Spec | +|---|---| +| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) | +| RAM | 512KB SRAM + 16MB PSRAM | +| Flash | 16MB | +| Display | 2.4" IPS LCD, 320×240, touchscreen | +| Mic | Dual microphone array | +| Speaker | Built-in 1W speaker | +| Connectivity | WiFi 802.11b/g/n, BT 5.0 | +| USB | USB-C (programming + power) | + +--- + +## Architecture Per Unit + +``` +ESP32-S3-BOX-3 +├── microWakeWord (on-device, always listening) +│ └── triggers Wyoming Satellite on wake detection +├── Wyoming Satellite +│ ├── streams mic audio → Mac Mini Wyoming STT (port 10300) +│ └── receives TTS audio ← Mac Mini Wyoming TTS (port 10301) +├── LVGL Display +│ └── animated face, driven by HA entity state +└── ESPHome OTA + └── firmware updates over WiFi +``` + +--- + +## ESPHome Configuration + +### Base Config Template + +`esphome/base.yaml` — shared across all units: + +```yaml +esphome: + name: homeai-${room} + friendly_name: "HomeAI ${room_display}" + platform: esp32 + board: esp32-s3-box-3 + +wifi: + ssid: !secret wifi_ssid + password: !secret wifi_password + 
ap: + ssid: "HomeAI Fallback" + +api: + encryption: + key: !secret api_key + +ota: + password: !secret ota_password + +logger: + level: INFO +``` + +### Room-Specific Config + +`esphome/s3-box-living-room.yaml`: + +```yaml +substitutions: + room: living-room + room_display: "Living Room" + mac_mini_ip: "192.168.1.x" # or Tailscale IP + +packages: + base: !include base.yaml + voice: !include voice.yaml + display: !include display.yaml +``` + +One file per room, only the substitutions change. + +### Voice / Wyoming Satellite — `esphome/voice.yaml` + +```yaml +microphone: + - platform: esp_adf + id: mic + +speaker: + - platform: esp_adf + id: spk + +micro_wake_word: + model: hey_jarvis # or custom model path + on_wake_word_detected: + - voice_assistant.start: + +voice_assistant: + microphone: mic + speaker: spk + noise_suppression_level: 2 + auto_gain: 31dBFS + volume_multiplier: 2.0 + + on_listening: + - display.page.show: page_listening + - script.execute: animate_face_listening + + on_stt_vad_end: + - display.page.show: page_thinking + - script.execute: animate_face_thinking + + on_tts_start: + - display.page.show: page_speaking + - script.execute: animate_face_speaking + + on_end: + - display.page.show: page_idle + - script.execute: animate_face_idle + + on_error: + - display.page.show: page_error + - script.execute: animate_face_error +``` + +**Note:** ESPHome's `voice_assistant` component connects to HA, which routes to Wyoming STT/TTS on the Mac Mini. This is the standard ESPHome → HA → Wyoming path. 
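Before flashing a unit, it is worth confirming that the Mac Mini's Wyoming ports are reachable from the LAN at all. A quick TCP probe, using ports 10300/10301 from the interface contracts:

```python
import socket

# Ports per the P3/P4 interface contracts: Wyoming STT 10300, TTS 10301.
WYOMING_PORTS = {"stt": 10300, "tts": 10301}

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_wyoming(host: str) -> dict:
    """Service name -> reachability, e.g. {'stt': True, 'tts': False}."""
    return {name: port_open(host, port) for name, port in WYOMING_PORTS.items()}
```

Run `check_wyoming("<mac-mini-ip>")` from another machine on the LAN before blaming the ESP32 config.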
+ +### LVGL Display — `esphome/display.yaml` + +```yaml +display: + - platform: ili9xxx + model: ILI9341 + id: lcd + cs_pin: GPIO5 + dc_pin: GPIO4 + reset_pin: GPIO48 + +touchscreen: + - platform: tt21100 + id: touch + +lvgl: + displays: + - lcd + touchscreens: + - touch + + # Face widget — centered on screen + widgets: + - obj: + id: face_container + width: 320 + height: 240 + bg_color: 0x000000 + children: + # Eyes (two circles) + - obj: + id: eye_left + x: 90 + y: 90 + width: 50 + height: 50 + radius: 25 + bg_color: 0xFFFFFF + - obj: + id: eye_right + x: 180 + y: 90 + width: 50 + height: 50 + radius: 25 + bg_color: 0xFFFFFF + # Mouth (line/arc) + - arc: + id: mouth + x: 110 + y: 160 + width: 100 + height: 40 + start_angle: 180 + end_angle: 360 + arc_color: 0xFFFFFF + + pages: + - id: page_idle + - id: page_listening + - id: page_thinking + - id: page_speaking + - id: page_error +``` + +### LVGL Face State Animations — `esphome/animations.yaml` + +```yaml +script: + - id: animate_face_idle + then: + - lvgl.widget.modify: + id: eye_left + height: 50 # normal open + - lvgl.widget.modify: + id: eye_right + height: 50 + - lvgl.widget.modify: + id: mouth + arc_color: 0xFFFFFF + + - id: animate_face_listening + then: + - lvgl.widget.modify: + id: eye_left + height: 60 # wider eyes + - lvgl.widget.modify: + id: eye_right + height: 60 + - lvgl.widget.modify: + id: mouth + arc_color: 0x00BFFF # blue tint + + - id: animate_face_thinking + then: + - lvgl.widget.modify: + id: eye_left + height: 20 # squinting + - lvgl.widget.modify: + id: eye_right + height: 20 + + - id: animate_face_speaking + then: + - lvgl.widget.modify: + id: mouth + arc_color: 0x00FF88 # green speaking indicator + + - id: animate_face_error + then: + - lvgl.widget.modify: + id: eye_left + bg_color: 0xFF2200 # red eyes + - lvgl.widget.modify: + id: eye_right + bg_color: 0xFF2200 +``` + +> **Note:** True lip-sync animation (mouth moving with audio) is complex on ESP32. Phase 1: static states. 
Phase 2: amplitude-driven mouth height using speaker volume feedback. + +--- + +## Secrets File + +`esphome/secrets.yaml` (gitignored): + +```yaml +wifi_ssid: "YourNetwork" +wifi_password: "YourPassword" +api_key: "<32-byte base64 key>" +ota_password: "YourOTAPassword" +``` + +--- + +## Flash & Deployment Workflow + +```bash +# Install ESPHome +pip install esphome + +# Compile + flash via USB (first time) +esphome run esphome/s3-box-living-room.yaml + +# OTA update (subsequent) +esphome upload esphome/s3-box-living-room.yaml --device + +# View logs +esphome logs esphome/s3-box-living-room.yaml +``` + +--- + +## Home Assistant Integration + +After flashing: +1. HA discovers ESP32 automatically via mDNS +2. Add device in HA → Settings → Devices +3. Assign Wyoming voice assistant pipeline to the device +4. Set up room-specific automations (e.g., "Living Room" light control from that satellite) + +--- + +## Directory Layout + +``` +homeai-esp32/ +└── esphome/ + ├── base.yaml + ├── voice.yaml + ├── display.yaml + ├── animations.yaml + ├── s3-box-living-room.yaml + ├── s3-box-bedroom.yaml # template, fill in when hardware available + ├── s3-box-kitchen.yaml # template + └── secrets.yaml # gitignored +``` + +--- + +## Wake Word Decisions + +| Option | Latency | Privacy | Effort | +|---|---|---|---| +| `hey_jarvis` (built-in microWakeWord) | ~200ms | On-device | Zero | +| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings | +| Mac Mini openWakeWord (stream audio) | ~500ms | On Mac | Medium | + +**Recommendation:** Start with `hey_jarvis`. Train a custom word (character's name) once character name is finalised. 
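One small helper for the `secrets.yaml` above: ESPHome's API encryption key is 32 random bytes, base64-encoded, and can be generated locally:

```python
import base64
import secrets

def esphome_api_key() -> str:
    """32 random bytes, base64-encoded, the format ESPHome's api.encryption.key expects."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")

print(esphome_api_key())
```

Paste the printed value in as `api_key`; the same format works for each unit, but generate a distinct key per device if you want per-unit revocation.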
+ +--- + +## Implementation Steps + +- [ ] Install ESPHome: `pip install esphome` +- [ ] Write `esphome/secrets.yaml` (gitignored) +- [ ] Write `base.yaml`, `voice.yaml`, `display.yaml`, `animations.yaml` +- [ ] Write `s3-box-living-room.yaml` for first unit +- [ ] Flash first unit via USB: `esphome run s3-box-living-room.yaml` +- [ ] Verify unit appears in HA device list +- [ ] Assign Wyoming voice pipeline to unit in HA +- [ ] Test: speak wake word → transcription → LLM response → spoken reply +- [ ] Test: LVGL face cycles through idle → listening → thinking → speaking +- [ ] Verify OTA update works: change LVGL color, deploy wirelessly +- [ ] Write config templates for remaining rooms (bedroom, kitchen) +- [ ] Flash remaining units, verify each works independently +- [ ] Document final MAC address → room name mapping + +--- + +## Success Criteria + +- [ ] Wake word "hey jarvis" triggers pipeline reliably from 3m distance +- [ ] STT transcription accuracy >90% for clear speech in quiet room +- [ ] TTS audio plays clearly through ESP32 speaker +- [ ] LVGL face shows correct state for idle / listening / thinking / speaking / error +- [ ] OTA firmware updates work without USB cable +- [ ] Unit reconnects automatically after WiFi drop +- [ ] Unit survives power cycle and resumes normal operation diff --git a/homeai-images/PLAN.md b/homeai-images/PLAN.md new file mode 100644 index 0000000..2f28ba0 --- /dev/null +++ b/homeai-images/PLAN.md @@ -0,0 +1,393 @@ +# P8: homeai-images — Image Generation + +> Phase 6 | Depends on: P4 (OpenClaw skill runner) | Independent of P6, P7 + +--- + +## Goal + +ComfyUI running natively on Mac Mini with SDXL and Flux.1 models. A character LoRA trained for consistent appearance. OpenClaw skill exposes image generation as a callable tool. Saved workflows cover the most common use cases. + +--- + +## Why Native (not Docker) + +Same reasoning as Ollama: ComfyUI needs Metal GPU acceleration. Docker on Mac can't access the GPU. 
ComfyUI runs natively as a launchd service. 
+
+---
+
+## Installation
+
+```bash
+# Clone ComfyUI
+git clone https://github.com/comfyanonymous/ComfyUI ~/ComfyUI
+cd ~/ComfyUI
+
+# Install dependencies (Python 3.11+, venv recommended)
+python3 -m venv venv
+source venv/bin/activate
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
+pip install -r requirements.txt
+
+# Launch
+python main.py --listen 0.0.0.0 --port 8188
+```
+
+**Note:** Use the PyTorch MPS backend for Apple Silicon:
+
+```python
+# ComfyUI auto-detects MPS — no extra config needed
+# Verify by checking ComfyUI startup logs for "Using device: mps"
+```
+
+### launchd plist — `com.homeai.comfyui.plist`
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+  <key>Label</key>
+  <string>com.homeai.comfyui</string>
+  <key>ProgramArguments</key>
+  <array>
+    <string>/Users/USERNAME/ComfyUI/venv/bin/python</string>
+    <string>/Users/USERNAME/ComfyUI/main.py</string>
+    <string>--listen</string>
+    <string>0.0.0.0</string>
+    <string>--port</string>
+    <string>8188</string>
+  </array>
+  <key>WorkingDirectory</key>
+  <string>/Users/USERNAME/ComfyUI</string>
+  <key>RunAtLoad</key>
+  <true/>
+  <key>KeepAlive</key>
+  <true/>
+  <key>StandardOutPath</key>
+  <string>/tmp/comfyui.log</string>
+  <key>StandardErrorPath</key>
+  <string>/tmp/comfyui.err</string>
+</dict>
+</plist>
+```
+
+---
+
+## Model Downloads
+
+### Model Manifest
+
+`~/ComfyUI/models/` structure:
+
+```
+checkpoints/
+├── sd_xl_base_1.0.safetensors       # SDXL base
+├── flux1-dev.safetensors            # Flux.1-dev (high quality)
+└── flux1-schnell.safetensors        # Flux.1-schnell (fast drafts)
+
+vae/
+├── sdxl_vae.safetensors
+└── ae.safetensors                   # Flux VAE
+
+clip/
+├── clip_l.safetensors
+└── t5xxl_fp16.safetensors           # Flux text encoder
+
+controlnet/
+├── controlnet-canny-sdxl.safetensors
+└── controlnet-depth-sdxl.safetensors
+
+loras/
+└── aria-v1.safetensors              # Character LoRA (trained locally)
+```
+
+### Download Script — `scripts/download-models.sh`
+
+```bash
+#!/usr/bin/env bash
+MODELS_DIR=~/ComfyUI/models
+
+# HuggingFace downloads (requires huggingface-cli or wget)
+pip install huggingface_hub
+
+python3 -c "
+from huggingface_hub import hf_hub_download
+import os
+
+downloads = [
+    ('stabilityai/stable-diffusion-xl-base-1.0', 'sd_xl_base_1.0.safetensors', 
'checkpoints'), + ('black-forest-labs/FLUX.1-schnell', 'flux1-schnell.safetensors', 'checkpoints'), +] + +for repo, filename, subdir in downloads: + hf_hub_download( + repo_id=repo, + filename=filename, + local_dir=f'{os.path.expanduser(\"~/ComfyUI/models\")}/{subdir}' + ) +" +``` + +> Flux.1-dev requires accepting HuggingFace license agreement. Download manually if script fails. + +--- + +## Saved Workflows + +All workflows stored as ComfyUI JSON in `homeai-images/workflows/`. + +### `portrait.json` — Character Portrait + +Standard character portrait with expression control. + +Key nodes: +- **CheckpointLoader:** SDXL base +- **LoraLoader:** aria character LoRA +- **CLIPTextEncode:** positive prompt includes character description + expression +- **KSampler:** 25 steps, DPM++ 2M, CFG 7 +- **VAEDecode → SaveImage** + +Positive prompt template: +``` +aria, (character lora), 1girl, solo, portrait, looking at viewer, +soft lighting, detailed face, high quality, masterpiece, + +``` + +### `scene.json` — Character in Scene with ControlNet + +Uses ControlNet depth/canny for pose control. + +Key nodes: +- **LoadImage:** input pose reference image +- **ControlNetLoader:** canny or depth model +- **ControlNetApply:** apply to conditioning +- **KSampler** with ControlNet guidance + +### `quick.json` — Fast Draft via Flux.1-schnell + +Low-step, fast generation for quick previews. + +Key nodes: +- **CheckpointLoader:** flux1-schnell +- **KSampler:** 4 steps, Euler, CFG 1 (Flux uses CFG=1) +- Output: 512×512 or 768×768 + +### `upscale.json` — 2× Upscale + +Takes existing image, upscales 2× with detail enhancement. + +Key nodes: +- **LoadImage** +- **UpscaleModelLoader:** `4x_NMKD-Siax_200k.pth` (download separately) +- **ImageUpscaleWithModel** +- **KSampler img2img** for detail pass + +--- + +## `comfyui.py` Skill — OpenClaw Integration + +Full implementation (replaces stub from P4). 
+ +File: `homeai-images/skills/comfyui.py` + +```python +""" +ComfyUI image generation skill for OpenClaw. +Submits workflow JSON via ComfyUI REST API and returns generated image path. +""" + +import json +import time +import uuid +import requests +from pathlib import Path + +COMFYUI_URL = "http://localhost:8188" +WORKFLOWS_DIR = Path(__file__).parent.parent / "workflows" +OUTPUT_DIR = Path.home() / "ComfyUI" / "output" + +def generate(workflow_name: str, params: dict = None) -> str: + """ + Submit a named workflow to ComfyUI. + Returns the path of the generated image. + + Args: + workflow_name: Name of workflow JSON (without .json extension) + params: Dict of node overrides, e.g. {"positive_prompt": "...", "steps": 20} + + Returns: + Absolute path to generated image file + """ + workflow_path = WORKFLOWS_DIR / f"{workflow_name}.json" + if not workflow_path.exists(): + raise ValueError(f"Workflow '{workflow_name}' not found at {workflow_path}") + + workflow = json.loads(workflow_path.read_text()) + + # Apply param overrides + if params: + workflow = _apply_params(workflow, params) + + # Submit to ComfyUI queue + client_id = str(uuid.uuid4()) + prompt_id = _queue_prompt(workflow, client_id) + + # Poll for completion + image_path = _wait_for_output(prompt_id, client_id) + return str(image_path) + + +def _queue_prompt(workflow: dict, client_id: str) -> str: + resp = requests.post( + f"{COMFYUI_URL}/prompt", + json={"prompt": workflow, "client_id": client_id} + ) + resp.raise_for_status() + return resp.json()["prompt_id"] + + +def _wait_for_output(prompt_id: str, client_id: str, timeout: int = 120) -> Path: + start = time.time() + while time.time() - start < timeout: + resp = requests.get(f"{COMFYUI_URL}/history/{prompt_id}") + history = resp.json() + if prompt_id in history: + outputs = history[prompt_id]["outputs"] + for node_output in outputs.values(): + if "images" in node_output: + img = node_output["images"][0] + return OUTPUT_DIR / img["subfolder"] / 
img["filename"] + time.sleep(2) + raise TimeoutError(f"ComfyUI generation timed out after {timeout}s") + + +def _apply_params(workflow: dict, params: dict) -> dict: + """ + Apply parameter overrides to workflow nodes. + Expects workflow nodes to have a 'title' field for addressing. + e.g., params={"positive_prompt": "new prompt"} updates node titled "positive_prompt" + """ + for node_id, node in workflow.items(): + title = node.get("_meta", {}).get("title", "") + if title in params: + node["inputs"]["text"] = params[title] + return workflow + + +# Convenience wrappers for OpenClaw +def portrait(expression: str = "neutral", extra_prompt: str = "") -> str: + return generate("portrait", {"positive_prompt": f"aria, {expression}, {extra_prompt}"}) + +def quick(prompt: str) -> str: + return generate("quick", {"positive_prompt": prompt}) + +def scene(prompt: str, controlnet_image_path: str = None) -> str: + params = {"positive_prompt": prompt} + if controlnet_image_path: + params["controlnet_image"] = controlnet_image_path + return generate("scene", params) +``` + +--- + +## Character LoRA Training + +A LoRA trains the model to consistently generate the character's appearance. + +### Dataset Preparation + +1. Collect 20–50 reference images of the character (or commission a character sheet) +2. Consistent style, multiple angles/expressions +3. Resize to 1024×1024, square crop +4. Write captions: `aria, 1girl, solo, ` +5. 
Store in `~/lora-training/aria/`

### Training

Use **kohya_ss** or **SimpleTuner** for LoRA training on Apple Silicon:

```bash
# kohya_ss (SDXL LoRA)
git clone https://github.com/bmaltais/kohya_ss
cd kohya_ss
pip install -r requirements.txt

# Training config — key params for MPS
# (use $HOME rather than ~ — tilde is not expanded inside --flag=... arguments)
python train_network.py \
  --pretrained_model_name_or_path=$HOME/ComfyUI/models/checkpoints/sd_xl_base_1.0.safetensors \
  --train_data_dir=$HOME/lora-training/aria \
  --output_dir=$HOME/ComfyUI/models/loras \
  --output_name=aria-v1 \
  --network_module=networks.lora \
  --network_dim=32 \
  --network_alpha=16 \
  --max_train_epochs=10 \
  --learning_rate=1e-4
```

> Training on M4 Pro via MPS: expect 1–4 hours for a 20-image dataset at 10 epochs.

---

## Directory Layout

```
homeai-images/
├── workflows/
│   ├── portrait.json
│   ├── scene.json
│   ├── quick.json
│   └── upscale.json
└── skills/
    └── comfyui.py
```

---

## Interface Contracts

**Consumes:**
- ComfyUI REST API: `http://localhost:8188`
- Workflows from `homeai-images/workflows/`
- Character LoRA from `~/ComfyUI/models/loras/aria-v1.safetensors`

**Exposes:**
- `comfyui.generate(workflow, params)` → image path — called by P4 OpenClaw

**Add to `.env.services`:**
```dotenv
COMFYUI_URL=http://localhost:8188
```

---

## Implementation Steps

- [ ] Clone ComfyUI to `~/ComfyUI/`, install deps in venv
- [ ] Verify MPS is detected at launch (`Using device: mps` in logs)
- [ ] Write and load launchd plist
- [ ] Download SDXL base model via `scripts/download-models.sh`
- [ ] Download Flux.1-schnell
- [ ] Test basic generation via ComfyUI web UI (browse to port 8188)
- [ ] Build and save `quick.json` workflow in ComfyUI UI, export JSON
- [ ] Build and save `portrait.json` workflow, export JSON
- [ ] Build and save `scene.json` workflow with ControlNet, export JSON
- [ ] Write `skills/comfyui.py` full implementation
- [ ] Test skill: `comfyui.quick("a cat sitting on a couch")` → image file
- [ ] 
Collect character reference images for LoRA training +- [ ] Train SDXL LoRA with kohya_ss +- [ ] Load LoRA in `portrait.json` workflow, verify character consistency +- [ ] Symlink `skills/` to `~/.openclaw/skills/` +- [ ] Test via OpenClaw: "Generate a portrait of Aria looking happy" + +--- + +## Success Criteria + +- [ ] ComfyUI UI accessible at `http://localhost:8188` after reboot +- [ ] `quick.json` workflow generates an image in <30s on M4 Pro +- [ ] `portrait.json` with character LoRA produces consistent character appearance +- [ ] `comfyui.generate("quick", {"positive_prompt": "test"})` returns a valid image path +- [ ] Generated images are saved to `~/ComfyUI/output/` +- [ ] ComfyUI survives Mac Mini reboot via launchd diff --git a/homeai-infra/PLAN.md b/homeai-infra/PLAN.md new file mode 100644 index 0000000..0521340 --- /dev/null +++ b/homeai-infra/PLAN.md @@ -0,0 +1,191 @@ +# P1: homeai-infra — Infrastructure & Foundation + +> Phase 1 | No hard dependencies | Must complete before all other projects + +--- + +## Goal + +Get the Mac Mini running a stable, self-healing Docker stack accessible over Tailscale. All services should survive a reboot with no manual intervention. + +--- + +## Deliverables + +### 1. Directory Layout + +``` +~/server/ +├── docker/ +│ ├── home-assistant/ +│ │ └── docker-compose.yml +│ ├── open-webui/ +│ │ └── docker-compose.yml +│ ├── portainer/ +│ │ └── docker-compose.yml +│ ├── uptime-kuma/ +│ │ └── docker-compose.yml +│ ├── gitea/ +│ │ └── docker-compose.yml +│ ├── n8n/ +│ │ └── docker-compose.yml +│ └── code-server/ +│ └── docker-compose.yml +├── .env.services ← shared service URLs, written by this project +├── .env.secrets ← secrets, never committed +└── Makefile ← up/down/restart/logs per service +``` + +### 2. 
Services to Deploy

| Service | Image | Port | Purpose |
|---|---|---|---|
| Home Assistant | `ghcr.io/home-assistant/home-assistant:stable` | 8123 | Smart home platform |
| Portainer | `portainer/portainer-ce` | 9443 | Docker management UI |
| Uptime Kuma | `louislam/uptime-kuma` | 3001 | Service health monitoring |
| Gitea | `gitea/gitea` | 3000 (HTTP), 2222 (SSH) | Self-hosted Git |
| code-server | `codercom/code-server` | 8080 | Browser VS Code |
| n8n | `n8nio/n8n` | 5678 | Workflow automation |

> Open WebUI is deployed in P2 (it depends on Ollama being up first).

### 3. Docker Configuration Standards

Each compose file follows this pattern (angle-bracket values are per-service placeholders):

```yaml
services:
  <service>:
    image: <image>
    container_name: <service>
    restart: unless-stopped
    env_file:
      - ../../.env.secrets
    volumes:
      - ./<service>-data:/data
    networks:
      - homeai
    ports:
      - "<host-port>:<container-port>"

networks:
  homeai:
    external: true
```

- Shared `homeai` Docker network created once: `docker network create homeai`
- All data volumes stored in the service subdirectory (e.g., `home-assistant/ha-data/`)
- Never use `network_mode: host` unless the service requires it

### 4. `.env.services` — Interface Contract

Written by this project, sourced by all others:

```dotenv
HA_URL=http://localhost:8123
# HA_TOKEN lives in .env.secrets
PORTAINER_URL=https://localhost:9443
GITEA_URL=http://localhost:3000
N8N_URL=http://localhost:5678
CODE_SERVER_URL=http://localhost:8080
UPTIME_KUMA_URL=http://localhost:3001
```

### 5. `.env.secrets` (template, not committed)

```dotenv
HA_TOKEN=
GITEA_ADMIN_PASSWORD=
CODE_SERVER_PASSWORD=
N8N_ENCRYPTION_KEY=
```

Committed as `.env.secrets.example` with blank values.

### 6. Tailscale Setup

- Install Tailscale on the Mac Mini: `brew install tailscale`
- Run `tailscale up --accept-routes`
- All service URLs are LAN-only; Tailscale provides remote access without port forwarding
- No ports opened in the router/firewall

### 7. 
Makefile Targets + +```makefile +up-ha: # docker compose -f docker/home-assistant/docker-compose.yml up -d +down-ha: +logs-ha: +up-all: # bring up all services in dependency order +down-all: +restart-all: +status: # docker ps --format table +``` + +### 8. Gitea Initialisation + +- Admin user created, SSH key added +- Repos created for all 8 sub-projects +- SSH remote added to each local repo +- `.gitignore` templates: exclude `*.env.secrets`, `*-data/`, `__pycache__/` + +### 9. Uptime Kuma Monitors + +One monitor per service: +- Home Assistant HTTP check → `http://localhost:8123` +- Portainer HTTPS check → `https://localhost:9443` +- Gitea HTTP check → `http://localhost:3000` +- n8n HTTP check → `http://localhost:5678` +- Ollama HTTP check → `http://localhost:11434` (set up after P2) +- Wyoming STT TCP check → port 10300 (set up after P3) + +Alerts: configure ntfy or Pushover for mobile notifications. + +### 10. Reboot Survival + +- Docker Desktop for Mac: set to launch at login +- Docker containers: `restart: unless-stopped` on all +- Ollama: launchd plist (configured in P2) +- Wyoming: launchd plist (configured in P3) +- ComfyUI: launchd plist (configured in P8) + +--- + +## Home Assistant Setup + +After container is running: + +1. Complete onboarding at `http://localhost:8123` +2. Create a long-lived access token: Profile → Long-Lived Access Tokens +3. Write token to `.env.secrets` as `HA_TOKEN` +4. Install HACS (Home Assistant Community Store) — needed for custom integrations +5. 
Enable advanced mode in the user profile

---

## Implementation Steps

- [ ] Install Docker Desktop for Mac, enable at login
- [ ] Create `homeai` Docker network
- [ ] Create `~/server/` directory structure
- [ ] Write compose files for all services
- [ ] Write `.env.secrets.example`
- [ ] Write `Makefile` with up/down/logs targets
- [ ] `make up-all` — bring all services up
- [ ] Home Assistant onboarding — generate HA_TOKEN
- [ ] Write `.env.services`
- [ ] Install Tailscale, connect, and verify all services are reachable over the Tailnet
- [ ] Gitea: create admin account, initialise repos, push initial commits
- [ ] Uptime Kuma: add all monitors, configure alerts
- [ ] Verify all containers restart cleanly after a `docker restart` test
- [ ] Verify all containers survive a Mac Mini reboot

---

## Success Criteria

- [ ] `docker ps` shows all services running after a cold reboot
- [ ] Home Assistant UI reachable at `http://localhost:8123`
- [ ] Gitea accessible, SSH push/pull working
- [ ] Uptime Kuma showing green for all services
- [ ] All services reachable via Tailscale IP from a remote device
- [ ] `.env.services` exists and all URLs are valid

diff --git a/homeai-llm/PLAN.md b/homeai-llm/PLAN.md
new file mode 100644
index 0000000..523a11e
--- /dev/null
+++ b/homeai-llm/PLAN.md
@@ -0,0 +1,202 @@

# P2: homeai-llm — Local LLM Runtime

> Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing

---

## Goal

Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).

---

## Why Native (not Docker)

Ollama must run natively — not in Docker — because:
- Docker on Mac cannot access the Apple Metal GPU (containers run in a Linux VM)
- Native Ollama uses Metal for GPU acceleration, giving 3–5× faster inference
- Ollama's launchd integration keeps it alive across reboots

---

## Deliverables

### 1. 
Ollama Installation

```bash
# Install (Homebrew on Apple Silicon installs binaries under /opt/homebrew/bin)
brew install ollama

# Alternatively, download the macOS app from https://ollama.com/download
# (the curl install.sh one-liner on ollama.com is Linux-only)
```

Ollama runs as a background process. Configure it as a launchd service for reboot survival.

**launchd plist:** `~/Library/LaunchAgents/com.ollama.ollama.plist`

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.ollama</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>
```

Load: `launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist`

### 2. Model Manifest — `ollama-models.txt`

Pinned models pulled to the Mac Mini:

```
# Primary — high quality responses
llama3.3:70b
qwen2.5:72b

# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b

# Code — for n8n/skill writing assistance
qwen2.5-coder:32b

# Embedding — for mem0 semantic search
nomic-embed-text
```

Pull script (`scripts/pull-models.sh`):
```bash
#!/usr/bin/env bash
while IFS= read -r model; do
  [[ "$model" =~ ^#.*$ || -z "$model" ]] && continue
  echo "Pulling $model..."
  ollama pull "$model"
done < ../ollama-models.txt
```

### 3. Open WebUI — Docker

Open WebUI connects to Ollama over the Docker-to-host bridge (`host.docker.internal`):

**`docker/open-webui/docker-compose.yml`:**

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - ./open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3030:8080"
    networks:
      - homeai
    extra_hosts:
      - "host.docker.internal:host-gateway"

networks:
  homeai:
    external: true
```

Port `3030` chosen to avoid conflict with Gitea (3000).

### 4. 
Benchmark Script — `scripts/benchmark.sh`

Measures tokens/sec for each model to inform model selection per task. `ollama run --verbose` prints timing stats (including eval rate in tokens/s) after each response:

```bash
#!/usr/bin/env bash
PROMPT="Tell me a joke about computers."
for model in llama3.3:70b qwen2.5:72b qwen2.5:7b; do
  echo "=== $model ==="
  # stats go to stderr; filter to the lines we care about
  ollama run "$model" "$PROMPT" --verbose --nowordwrap 2>&1 | grep -E "eval rate|total duration"
done
```

Results documented in `scripts/benchmark-results.md`.

### 5. API Verification

```bash
# Check Ollama is running
curl http://localhost:11434/api/tags

# Test OpenAI-compatible endpoint (used by P3, P4)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

### 6. Model Selection Guide

Document in `scripts/benchmark-results.md` after benchmarking:

| Task | Model | Reason |
|---|---|---|
| Main conversation | `llama3.3:70b` | Best quality |
| Quick/real-time tasks | `qwen2.5:7b` | Lowest latency |
| Code generation (skills) | `qwen2.5-coder:32b` | Best code quality |
| Embeddings (mem0) | `nomic-embed-text` | Compact, fast |

---

## Interface Contract

- **Ollama API:** `http://localhost:11434` (native Ollama)
- **OpenAI-compatible API:** `http://localhost:11434/v1` — used by P3, P4, P7
- **Open WebUI:** `http://localhost:3030`

Add to `~/server/.env.services`:
```dotenv
OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
```

---

## Implementation Steps

- [ ] Install Ollama via brew
- [ ] Verify `ollama serve` starts and responds at port 11434
- [ ] Write launchd plist, load it, verify auto-start on reboot
- [ ] Write `ollama-models.txt` with model list
- [ ] Run `scripts/pull-models.sh` — pull all models (allow time for large downloads)
- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- [ ] Deploy Open WebUI via Docker compose
- [ ] Verify Open WebUI can chat with all 
models +- [ ] Add `OLLAMA_URL` and `OPEN_WEBUI_URL` to `.env.services` +- [ ] Add Ollama and Open WebUI monitors to Uptime Kuma + +--- + +## Success Criteria + +- [ ] `curl http://localhost:11434/api/tags` returns all expected models +- [ ] `llama3.3:70b` generates a coherent response in Open WebUI +- [ ] Ollama survives Mac Mini reboot without manual intervention +- [ ] Benchmark results documented — at least one model achieving >10 tok/s +- [ ] Open WebUI accessible at `http://localhost:3030` via Tailscale diff --git a/homeai-visual/PLAN.md b/homeai-visual/PLAN.md new file mode 100644 index 0000000..dfd96f5 --- /dev/null +++ b/homeai-visual/PLAN.md @@ -0,0 +1,322 @@ +# P7: homeai-visual — VTube Studio Visual Layer + +> Phase 5 | Depends on: P4 (OpenClaw skill runner), P5 (character expression map) + +--- + +## Goal + +VTube Studio displays a Live2D model on Mac Mini desktop and mobile. Expressions are driven by the AI pipeline state (thinking, speaking, happy, etc.) via an OpenClaw skill that talks to VTube Studio's WebSocket API. Lip sync follows audio amplitude. + +--- + +## Architecture + +``` +OpenClaw pipeline state + ↓ (during LLM response generation) +vtube_studio.py skill + ↓ WebSocket (port 8001) +VTube Studio (macOS app) + ↓ +Live2D model renders expression + ↓ +Displayed on: + - Mac Mini desktop (primary) + - iPhone/iPad (VTube Studio mobile, same model via Tailscale) +``` + +--- + +## VTube Studio Setup + +### Installation + +1. Download VTube Studio from the Mac App Store +2. Launch, go through initial setup +3. Enable WebSocket API: Settings → WebSocket API → Enable (port 8001) +4. 
Load Live2D model (see Model section below) + +### WebSocket API Authentication + +VTube Studio uses a token-based auth flow: + +```python +import asyncio +import websockets +import json + +async def authenticate(): + async with websockets.connect("ws://localhost:8001") as ws: + # Step 1: request authentication token + await ws.send(json.dumps({ + "apiName": "VTubeStudioPublicAPI", + "apiVersion": "1.0", + "requestID": "auth-req", + "messageType": "AuthenticationTokenRequest", + "data": { + "pluginName": "HomeAI", + "pluginDeveloper": "HomeAI", + "pluginIcon": None + } + })) + response = json.loads(await ws.recv()) + token = response["data"]["authenticationToken"] + # User must click "Allow" in VTube Studio UI + + # Step 2: authenticate with token + await ws.send(json.dumps({ + "apiName": "VTubeStudioPublicAPI", + "apiVersion": "1.0", + "requestID": "auth", + "messageType": "AuthenticationRequest", + "data": { + "pluginName": "HomeAI", + "pluginDeveloper": "HomeAI", + "authenticationToken": token + } + })) + auth_resp = json.loads(await ws.recv()) + print("Authenticated:", auth_resp["data"]["authenticated"]) + return token +``` + +Token is persisted to `~/.openclaw/vtube_token.json`. + +--- + +## `vtube_studio.py` Skill + +Full implementation (replaces the stub from P4). + +File: `homeai-visual/skills/vtube_studio.py` (symlinked to `~/.openclaw/skills/`) + +```python +""" +VTube Studio WebSocket skill for OpenClaw. +Drives Live2D model expressions based on AI pipeline state. 
+""" + +import asyncio +import json +import websockets +from pathlib import Path + +VTUBE_WS_URL = "ws://localhost:8001" +TOKEN_PATH = Path.home() / ".openclaw" / "vtube_token.json" + +class VTubeStudioSkill: + def __init__(self, character_config: dict): + self.expression_map = character_config.get("live2d_expressions", {}) + self.ws_triggers = character_config.get("vtube_ws_triggers", {}) + self.token = self._load_token() + self._ws = None + + def _load_token(self) -> str | None: + if TOKEN_PATH.exists(): + return json.loads(TOKEN_PATH.read_text()).get("token") + return None + + def _save_token(self, token: str): + TOKEN_PATH.write_text(json.dumps({"token": token})) + + async def connect(self): + self._ws = await websockets.connect(VTUBE_WS_URL) + if self.token: + await self._authenticate() + else: + await self._request_new_token() + + async def _authenticate(self): + await self._send({ + "messageType": "AuthenticationRequest", + "data": { + "pluginName": "HomeAI", + "pluginDeveloper": "HomeAI", + "authenticationToken": self.token + } + }) + resp = await self._recv() + if not resp["data"].get("authenticated"): + # Token expired — request a new one + await self._request_new_token() + + async def _request_new_token(self): + await self._send({ + "messageType": "AuthenticationTokenRequest", + "data": { + "pluginName": "HomeAI", + "pluginDeveloper": "HomeAI", + "pluginIcon": None + } + }) + resp = await self._recv() + token = resp["data"]["authenticationToken"] + self._save_token(token) + self.token = token + await self._authenticate() + + async def trigger_expression(self, event: str): + """Trigger a named expression state (idle, thinking, speaking, etc.)""" + hotkey_id = self.expression_map.get(event) + if not hotkey_id: + return + await self._trigger_hotkey(hotkey_id) + + async def _trigger_hotkey(self, hotkey_id: str): + await self._send({ + "messageType": "HotkeyTriggerRequest", + "data": {"hotkeyID": hotkey_id} + }) + await self._recv() + + async def 
set_parameter(self, name: str, value: float):
        """Set a VTube Studio parameter (e.g., mouth open for lip sync)"""
        await self._send({
            "messageType": "InjectParameterDataRequest",
            "data": {
                "parameterValues": [
                    {"id": name, "value": value}
                ]
            }
        })
        await self._recv()

    async def _send(self, payload: dict):
        full = {
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "homeai",
            **payload
        }
        await self._ws.send(json.dumps(full))

    async def _recv(self) -> dict:
        return json.loads(await self._ws.recv())

    async def close(self):
        if self._ws:
            await self._ws.close()


# OpenClaw skill entry point — synchronous wrapper
def trigger_expression(event: str, character_config: dict):
    skill = VTubeStudioSkill(character_config)
    asyncio.run(_run(skill, event))

async def _run(skill, event):
    await skill.connect()
    await skill.trigger_expression(event)
    await skill.close()
```

---

## Lip Sync

### Phase 1: Amplitude-Based (Simple)

During TTS audio playback, sample the audio amplitude and map it to the mouth-open parameter. This must run inside the same event loop that opened the skill's websocket — calling `asyncio.run()` per chunk would bind the connection to an already-closed loop:

```python
import asyncio
import numpy as np
import sounddevice as sd

async def stream_with_lipsync(audio_data: np.ndarray, sample_rate: int, vtube: VTubeStudioSkill):
    # Call from the same event loop that ran vtube.connect()
    chunk_size = 1024
    sd.play(audio_data, sample_rate)  # start non-blocking playback of the full clip
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i + chunk_size]
        amplitude = float(np.abs(chunk).mean()) / 32768.0  # normalise 16-bit PCM
        await vtube.set_parameter("MouthOpen", min(amplitude * 10, 1.0))  # scale to 0–1
        await asyncio.sleep(chunk_size / sample_rate)  # keep pace with playback
    await vtube.set_parameter("MouthOpen", 0.0)  # close mouth after
```

### Phase 2: Phoneme-Based (Future)

Parse TTS phoneme timing from Kokoro/Chatterbox output and drive the mouth per phoneme. More accurate but significantly more complex. Defer until after Phase 5.
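Raw per-chunk amplitude tends to make the mouth flicker. A cheap refinement is a one-pole attack/release envelope so the mouth opens quickly but closes smoothly — a sketch, not part of the planned stack; `smooth_envelope` and its coefficients are illustrative:

```python
def smooth_envelope(amplitudes: list[float], attack: float = 0.6, release: float = 0.15) -> list[float]:
    """One-pole smoothing: the mouth opens fast (attack) and closes slowly (release)."""
    out: list[float] = []
    level = 0.0
    for a in amplitudes:
        coeff = attack if a > level else release  # rising edge vs falling edge
        level += coeff * (a - level)
        out.append(level)
    return out
```

Feed each chunk's normalised amplitude through this before sending it to `MouthOpen`; tune `attack`/`release` by eye against the model.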
+ +--- + +## Live2D Model + +### Options + +| Option | Cost | Effort | Quality | +|---|---|---|---| +| Free models (VTube Studio sample packs) | Free | Low | Generic | +| Purchase from nizima.com or booth.pm | ¥3,000–¥30,000 | Low | High | +| Commission custom model | ¥50,000–¥200,000+ | Low (for you) | Unique | + +**Recommendation:** Start with a purchased model from nizima.com or booth.pm that matches the character's aesthetic. Commission custom later once personality is locked in. + +### Model Setup + +1. Download `.vtube.model3.json` + associated assets +2. Place in `~/Documents/Live2DModels/` (VTube Studio default) +3. Load in VTube Studio: Model tab → Add Model +4. Map hotkeys: VTube Studio → Hotkeys → create one per expression state +5. Record hotkey IDs, update `aria.json` `live2d_expressions` mapping + +--- + +## Expression Hotkey Mapping Workflow + +1. Launch VTube Studio, load model +2. Go to Hotkeys → add hotkeys for each state: idle, listening, thinking, speaking, happy, sad, surprised, error +3. VTube Studio assigns a UUID to each hotkey — copy these +4. Open Character Manager (P5), paste UUIDs into expression mapping UI +5. Export updated `aria.json` +6. Restart OpenClaw — new expression map loaded + +--- + +## Mobile Setup + +1. Install VTube Studio on iPhone/iPad +2. On same Tailscale network, VTube Studio mobile discovers Mac Mini model +3. Mirror mode: mobile shows same model as desktop +4. 
Useful for bedside or kitchen display while Mac Mini desktop is the primary + +--- + +## Directory Layout + +``` +homeai-visual/ +└── skills/ + ├── vtube_studio.py ← full implementation + ├── lipsync.py ← amplitude-based lip sync helper + └── auth.py ← token management utility +``` + +--- + +## Implementation Steps + +- [ ] Install VTube Studio (Mac App Store) +- [ ] Enable WebSocket API on port 8001 +- [ ] Source/purchase a Live2D model +- [ ] Load model in VTube Studio, verify it renders +- [ ] Create hotkeys in VTube Studio for all 8 expression states +- [ ] Write `vtube_studio.py` full implementation +- [ ] Run auth flow — click "Allow" in VTube Studio UI, save token +- [ ] Test `trigger_expression("thinking")` → model shows expression +- [ ] Test all 8 expressions via a simple test script +- [ ] Update `aria.json` with real VTube Studio hotkey IDs +- [ ] Write `lipsync.py` amplitude-based helper +- [ ] Integrate lip sync into TTS dispatch in OpenClaw +- [ ] Symlink `skills/` → `~/.openclaw/skills/` +- [ ] Test full pipeline: voice query → thinking expression → LLM → speaking expression with lip sync +- [ ] Set up VTube Studio on iPhone (optional, do last) + +--- + +## Success Criteria + +- [ ] All 8 expression states trigger correctly via `trigger_expression()` +- [ ] Lip sync is visibly responding to TTS audio (even if imperfect) +- [ ] VTube Studio token survives app restart (token file persists) +- [ ] Expression triggers are fast enough to feel responsive (<100ms from call to render) +- [ ] Model stays loaded and connected after Mac Mini sleep/wake diff --git a/homeai-voice/PLAN.md b/homeai-voice/PLAN.md new file mode 100644 index 0000000..f450c75 --- /dev/null +++ b/homeai-voice/PLAN.md @@ -0,0 +1,247 @@ +# P3: homeai-voice — Speech Pipeline + +> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6 + +--- + +## Goal + +Full end-to-end voice pipeline running on Mac Mini: wake word detection → speech-to-text → (handoff to P4 
agent) → text-to-speech → audio out. Wyoming protocol bridges STT and TTS into Home Assistant.

Test with a desktop USB mic before ESP32 hardware arrives (P6).

---

## Pipeline Architecture

```
[USB Mic / ESP32 satellite]
        ↓
openWakeWord (always-on, local)
        ↓ wake detected
Wyoming Satellite / Audio capture
        ↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
        ↓ transcribed text
Home Assistant Voice Pipeline
        ↓ text
OpenClaw Agent (P4) ← intent + LLM response
        ↓ response text
Wyoming TTS Server (Kokoro)
        ↓ audio
[Speaker / ESP32 satellite]
```

---

## Components

### 1. Whisper.cpp — Speech-to-Text

**Why Whisper.cpp over Python Whisper:**
- Native Apple Silicon build — uses Neural Engine + Metal
- Significantly lower latency than the Python implementation
- Runs as a server process, not one-shot per request

**Installation:**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu)  # compiles with Metal support on macOS

# Download model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en for a faster fallback
bash ./models/download-ggml-model.sh medium.en
```

Models stored at `~/models/whisper/`.

**Wyoming-Whisper adapter:**

Use `wyoming-faster-whisper` or a Wyoming-compatible Whisper.cpp server:

```bash
pip install wyoming-faster-whisper
wyoming-faster-whisper \
  --model large-v3 \
  --language en \
  --uri tcp://0.0.0.0:10300 \
  --data-dir ~/models/whisper \
  --download-dir ~/models/whisper
```

Note: `wyoming-faster-whisper` runs faster-whisper (CTranslate2, CPU-bound), not the Metal-accelerated Whisper.cpp build above. It is the simplest path to a Wyoming endpoint; benchmark both, and only wrap the Whisper.cpp server behind Wyoming if the latency gap matters.

**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`

### 2. 
Kokoro TTS — Primary Text-to-Speech

**Why Kokoro:**
- Very low latency (~200ms for short phrases)
- High-quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)

**Installation:**
```bash
pip install kokoro-onnx
```

**Wyoming-Kokoro adapter:**

```bash
pip install wyoming-kokoro  # community adapter, or write a thin wrapper
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
# af_heart is the default voice; the character config (P5) overrides it
```

**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`

### 3. Chatterbox TTS — Voice Cloning Engine

Used when a character voice clone is active (character config from P5 sets `tts_engine: chatterbox`).

```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts

# Test voice clone (note: ~ is not expanded inside Python strings)
python -c "
from pathlib import Path
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device='mps')
wav = model.generate('Hello, I am your assistant.',
                     audio_prompt_path=str(Path('~/voices/aria.wav').expanduser()))
"
```

Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains for the HA pipeline.

### 4. Qwen3-TTS — MLX Fallback

```bash
pip install mlx mlx-lm
# Pull Qwen3-TTS model via mlx-lm or HuggingFace
```

Used as a fallback if Chatterbox quality is insufficient. Activated via character config `tts_engine: qwen3`.

### 5. openWakeWord — Always-On Detection

Runs continuously, listens for the wake word, triggers the pipeline.

```bash
pip install openwakeword

# Test with default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... 
audio loop
"
```

**Custom wake word (later):**
- Record 30–50 utterances of the character's name
- Train via the openWakeWord training toolkit
- Drop the model file into `~/models/wakeword/`

**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`

On trigger, the wake word service sends an HTTP POST to OpenClaw (P4) or hands off via Wyoming.

### 6. Wyoming Protocol Server

Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.

**HA integration:**
1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host `<Mac Mini IP>`, port `10300`
3. Add TTS: host `<Mac Mini IP>`, port `10301`
4. Create Voice Assistant pipeline in HA using these providers
5. Assign the pipeline to the Assist dashboard and later to ESP32 satellites (P6)

---

## launchd Services

Three launchd plists under `~/Library/LaunchAgents/`:

| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port, triggers internally) |

Templates stored in `scripts/launchd/`.
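Before wiring the HA Wyoming integration, it is worth confirming that the services are actually listening on their ports. A minimal stdlib sketch (service names and ports taken from the table above; `port_open` is an illustrative helper, not part of the planned tooling):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Wyoming services from the launchd table (the wakeword service has no port)
SERVICES = {"wyoming-stt": 10300, "wyoming-tts": 10301}

def check_services(host: str = "localhost") -> dict[str, bool]:
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

The same check doubles as a pre-flight step in `wyoming/test-pipeline.sh` or as an Uptime Kuma-style TCP probe run by hand.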
+ +--- + +## Directory Layout + +``` +homeai-voice/ +├── whisper/ +│ ├── install.sh # clone, compile whisper.cpp, download models +│ └── README.md +├── tts/ +│ ├── install-kokoro.sh +│ ├── install-chatterbox.sh +│ ├── install-qwen3.sh +│ └── test-tts.sh # quick audio playback test +├── wyoming/ +│ ├── install.sh +│ └── test-pipeline.sh # end-to-end text→audio test +└── scripts/ + ├── launchd/ + │ ├── com.homeai.wyoming-stt.plist + │ ├── com.homeai.wyoming-tts.plist + │ └── com.homeai.wakeword.plist + └── load-all-launchd.sh +``` + +--- + +## Interface Contracts + +**Exposes:** +- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA, P6 (ESP32 satellites) +- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA, P6 +- Chatterbox: Python API, invoked directly by P4 skills +- openWakeWord: triggers HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw) + +**Add to `.env.services`:** +```dotenv +WYOMING_STT_URL=tcp://localhost:10300 +WYOMING_TTS_URL=tcp://localhost:10301 +``` + +--- + +## Implementation Steps + +- [ ] Compile Whisper.cpp with Metal support +- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/` +- [ ] Install `wyoming-faster-whisper`, test STT from audio file +- [ ] Install Kokoro, test TTS to audio file +- [ ] Install Wyoming-Kokoro adapter, verify Wyoming protocol works +- [ ] Write launchd plists for STT and TTS services +- [ ] Load plists, verify both services start on reboot +- [ ] Connect HA Wyoming integration — STT port 10300, TTS port 10301 +- [ ] Create HA Voice Assistant pipeline with Whisper STT + Kokoro TTS +- [ ] Test HA Assist from browser: type query → hear spoken response +- [ ] Install openWakeWord, test wake detection with USB mic +- [ ] Write and load openWakeWord launchd plist +- [ ] Install Chatterbox, test voice clone with sample `.wav` +- [ ] Install Qwen3-TTS via MLX (fallback, lower priority) +- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test + +--- + +## Success Criteria + +- [ ] 
`wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response +- [ ] HA Voice Assistant responds to typed query with Kokoro voice +- [ ] openWakeWord detects "hey jarvis" (or chosen wake word) reliably +- [ ] All three launchd services auto-start after reboot +- [ ] STT latency <2s for 5-second utterances with `large-v3` +- [ ] Kokoro TTS latency <300ms for a 10-word sentence
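The two latency criteria above can be checked with a small timing harness. A sketch, where `synthesize`/`transcribe` are stand-ins for whatever entry points the STT/TTS skills end up exposing:

```python
import time
from typing import Callable

def measure_latency_ms(fn: Callable[[], object], runs: int = 3) -> float:
    """Call fn several times; return the best wall-clock latency in milliseconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best

# Usage sketch (hypothetical entry points, thresholds from the success criteria):
#   assert measure_latency_ms(lambda: synthesize("a ten word sentence for this latency test")) < 300
#   assert measure_latency_ms(lambda: transcribe("five-second-utterance.wav")) < 2000
```

Taking the best of several runs discounts first-call model warm-up; log the first-run number separately if cold-start latency matters for the wake-word path.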