Initial project structure and planning docs

Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: Aodhan Collins
Date: 2026-03-04 01:11:37 +00:00
Commit: 38247d7cc4
11 changed files with 3060 additions and 0 deletions

CLAUDE.md (new file, 153 lines)
# CLAUDE.md — Home AI Assistant Project
## Project Overview
A self-hosted, always-on personal AI assistant running on a **Mac Mini M4 Pro (64GB RAM, 1TB SSD)**. The goal is a modular, expandable system that replaces commercial smart home speakers (Google Home etc.) with a locally-run AI that has a defined personality, voice, visual representation, and full smart home integration.
---
## Hardware
| Component | Spec |
|---|---|
| Chip | Apple M4 Pro |
| CPU | 14-core |
| GPU | 20-core |
| Neural Engine | 16-core |
| RAM | 64GB unified memory |
| Storage | 1TB SSD |
| Network | Gigabit Ethernet |
All AI inference runs locally on this machine. No cloud dependency required (cloud APIs optional).
---
## Core Stack
### AI & LLM
- **Ollama** — local LLM runtime (target models: Llama 3.3 70B, Qwen 2.5 72B)
- **Open WebUI** — browser-based chat interface, runs as Docker container
### Image Generation
- **ComfyUI** — primary image generation UI, node-based workflows
- Target models: SDXL, Flux.1, ControlNet
- Runs via Metal (Apple GPU API)
### Speech
- **Whisper.cpp** — speech-to-text, optimised for Apple Silicon/Neural Engine
- **Kokoro TTS** — fast, lightweight text-to-speech (primary, low-latency)
- **Chatterbox TTS** — voice cloning engine (Apple Silicon MPS optimised)
- **Qwen3-TTS** — alternative voice cloning via MLX
- **openWakeWord** — always-on wake word detection
### Smart Home
- **Home Assistant** — smart home control platform (Docker)
- **Wyoming Protocol** — bridges Whisper STT + Kokoro/Piper TTS into Home Assistant
- **Music Assistant** — self-hosted music control, integrates with Home Assistant
- **Snapcast** — multi-room synchronised audio output
### AI Agent / Orchestration
- **OpenClaw** — primary AI agent layer; receives voice commands, calls tools, manages personality
- **n8n** — visual workflow automation (Docker), chains AI actions
- **mem0** — long-term memory layer for the AI character
### Character & Personality
- **Character Manager** (built — see `character-manager.jsx`) — single config UI for personality, prompts, models, Live2D mappings, and notes
- Character config exports to JSON, consumed by OpenClaw system prompt and pipeline
### Visual Representation
- **VTube Studio** — Live2D model display on desktop (macOS) and mobile (iOS/Android)
- VTube Studio WebSocket API used to drive expressions from the AI pipeline
- **LVGL** — simplified animated face on ESP32-S3-BOX-3 units
- Live2D model: to be sourced/commissioned (nizima.com or booth.pm)
### Room Presence (Smart Speaker Replacement)
- **ESP32-S3-BOX-3** units — one per room
- Flashed with **ESPHome**
- Acts as Wyoming Satellite (mic input → Mac Mini → TTS audio back)
- LVGL display shows animated face + status info
- Communicates over local WiFi
### Infrastructure
- **Docker Desktop for Mac** — containerises Home Assistant, Open WebUI, n8n, etc.
- **Tailscale** — secure remote access to all services, no port forwarding
- **Authelia** — 2FA authentication layer for exposed web UIs
- **Portainer** — Docker container management UI
- **Uptime Kuma** — service health monitoring and mobile alerts
- **Gitea** — self-hosted Git server for all project code and configs
- **code-server** — browser-based VS Code for remote development
---
## Voice Pipeline (End-to-End)
```
ESP32-S3-BOX-3 (room)
  → Wake word detected (openWakeWord, runs locally on device or Mac Mini)
  → Audio streamed to Mac Mini via Wyoming Satellite
  → Whisper.cpp transcribes speech to text
  → OpenClaw receives text + context
  → Ollama LLM generates response (with character persona from system prompt)
  → mem0 updates long-term memory
  → Response dispatched:
      → Kokoro/Chatterbox renders TTS audio
      → Audio sent back to ESP32-S3-BOX-3 (spoken response)
      → VTube Studio API triggered (expression + lip sync on desktop/mobile)
      → Home Assistant action called if applicable (lights, music, etc.)
```
---
## Character System
The AI assistant has a defined personality managed via the Character Manager tool.
Key config surfaces:
- **System prompt** — injected into every Ollama request
- **Voice clone reference** — `.wav` file path for Chatterbox/Qwen3-TTS
- **Live2D expression mappings** — idle, speaking, thinking, happy, error states
- **VTube Studio WebSocket triggers** — JSON map of events to expressions
- **Custom prompt rules** — trigger/response overrides for specific contexts
- **mem0** — persistent memory that evolves over time
Character config JSON (exported from Character Manager) is the single source of truth consumed by all pipeline components.
---
## Project Priorities
1. **Foundation** — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma)
2. **LLM** — Ollama running with target models, Open WebUI connected
3. **Voice pipeline** — Whisper → Ollama → Kokoro → Wyoming → Home Assistant
4. **OpenClaw** — installed, onboarded, connected to Ollama and Home Assistant
5. **ESP32-S3-BOX-3** — ESPHome flash, Wyoming Satellite, LVGL face
6. **Character system** — system prompt wired up, mem0 integrated, voice cloned
7. **VTube Studio** — model loaded, WebSocket API bridge written as OpenClaw skill
8. **ComfyUI** — image generation online, character-consistent model workflows
9. **Extended integrations** — n8n workflows, Music Assistant, Snapcast, Gitea, code-server
10. **Polish** — Authelia, Tailscale hardening, mobile companion, iOS widgets
---
## Key Paths & Conventions
- All Docker compose files: `~/server/docker/`
- OpenClaw skills: `~/.openclaw/skills/`
- Character configs: `~/.openclaw/characters/`
- Whisper models: `~/models/whisper/`
- Ollama models: managed by Ollama at `~/.ollama/models/`
- ComfyUI models: `~/ComfyUI/models/`
- Voice reference audio: `~/voices/`
- Gitea repos root: `~/gitea/`
---
## Notes for Planning
- All services should survive a Mac Mini reboot (launchd or Docker restart policies)
- ESP32-S3-BOX-3 units are dumb satellites — all intelligence stays on Mac Mini
- The character JSON schema (from Character Manager) should be treated as a versioned spec; pipeline components read from it, never hardcode personality values
- OpenClaw skills are the primary extension mechanism — new capabilities = new skills
- Prefer local models; cloud API keys (Anthropic, OpenAI) are fallback only
- VTube Studio API bridge should be a standalone OpenClaw skill with clear event interface
- mem0 memory store should be backed up as part of regular Gitea commits

PROJECT_PLAN.md (new file, 371 lines)
# HomeAI — Full Project Plan
> Last updated: 2026-03-04
---
## Overview
This project builds a self-hosted, always-on AI assistant running entirely on a Mac Mini M4 Pro. It is decomposed into **8 sub-projects** that can be developed in parallel where dependencies allow, then bridged via well-defined interfaces.
The guiding principle: each sub-project exposes a clean API/config surface. No project hard-codes knowledge of another's internals.
---
## Sub-Project Map
| ID | Name | Description | Primary Language |
|---|---|---|---|
| P1 | `homeai-infra` | Docker stack, networking, monitoring, secrets | YAML / Shell |
| P2 | `homeai-llm` | Ollama + Open WebUI setup, model management | YAML / Shell |
| P3 | `homeai-voice` | STT, TTS, Wyoming bridge, wake word | Python / Shell |
| P4 | `homeai-agent` | OpenClaw config, skills, n8n workflows, mem0 | Python / JSON |
| P5 | `homeai-character` | Character Manager UI, persona JSON schema, voice clone | React / JSON |
| P6 | `homeai-esp32` | ESPHome firmware, Wyoming Satellite, LVGL face | C++ / YAML |
| P7 | `homeai-visual` | VTube Studio bridge, Live2D expression mapping | Python / JSON |
| P8 | `homeai-images` | ComfyUI workflows, model management, ControlNet | Python / JSON |
All repos live under `~/gitea/homeai/` on the Mac Mini and are mirrored to the self-hosted Gitea instance (set up in P1).
---
## Phase 1 — Foundation (P1 + P2)
**Goal:** Everything containerised, stable, accessible remotely. LLM responsive via browser.
### P1: `homeai-infra`
**Deliverables:**
- [ ] `docker-compose.yml` — master compose file (or per-service files under `~/server/docker/`)
- [ ] Services: Home Assistant, Portainer, Uptime Kuma, Gitea, code-server
- [ ] Tailscale installed on Mac Mini, all services on Tailnet
- [ ] Gitea repos initialised, SSH keys configured
- [ ] Uptime Kuma monitors all service endpoints
- [ ] Docker restart policies: `unless-stopped` on all containers
- [ ] Documented `.env` file pattern (secrets never committed)
**Key decisions:**
- Single `docker-compose.yml` vs per-service compose files — recommend per-service files in `~/server/docker/<service>/` orchestrated by a root `Makefile`
- Tailscale as sole remote access method (no public port forwarding)
- Authelia deferred to Phase 4 polish (internal LAN services don't need 2FA immediately)
**Interface contract:** Exposes service URLs as env vars (e.g. `HA_URL`, `GITEA_URL`) written to `~/server/.env.services` — consumed by all other projects.
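A consumer of that contract only needs a trivial KEY=VALUE parser. A sketch, assuming `.env.services` uses plain `KEY=VALUE` lines with `#` comments (the helper names are hypothetical):

```python
from pathlib import Path

def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

def load_services(path: str = "~/server/.env.services") -> dict:
    """Read the service URL map written by P1."""
    return parse_env_file(Path(path).expanduser().read_text())
```

Keeping the format this simple means the same file also works with `source` in shell scripts and `env_file:` in compose files.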
---
### P2: `homeai-llm`
**Deliverables:**
- [ ] Ollama installed natively on Mac Mini (not Docker — needs Metal GPU access)
- [ ] Models pulled: `llama3.3:70b`, `qwen2.5:72b` (and a fast small model: `qwen2.5:7b` for low-latency tasks)
- [ ] Open WebUI running as Docker container, connected to Ollama
- [ ] Model benchmark script — measures tokens/sec per model
- [ ] `ollama-models.txt` — pinned model manifest for reproducibility
**Key decisions:**
- Ollama runs as a launchd service (`~/Library/LaunchAgents/`) to survive reboots
- Open WebUI exposed only on Tailnet
- API endpoint: `http://localhost:11434` (Ollama default)
**Interface contract:** Ollama OpenAI-compatible API at `http://localhost:11434/v1` — used by P3, P4, P7.
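Consumers of this contract speak the standard OpenAI chat-completions shape. A stdlib-only sketch (model names from this plan; the function names are illustrative):

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"

def build_chat_request(system_prompt: str, user_text: str,
                       model: str = "llama3.3:70b") -> dict:
    """Assemble an OpenAI-style chat payload for Ollama's /v1 endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }

def chat(system_prompt: str, user_text: str,
         model: str = "llama3.3:70b") -> str:
    """POST to /chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(build_chat_request(system_prompt, user_text, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, any consumer (P3, P4, P7) can later be pointed at a cloud fallback by changing only `OLLAMA_BASE` and the model name.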
---
## Phase 2 — Voice Pipeline (P3)
**Goal:** Full end-to-end voice: speak → transcribe → LLM → TTS → hear response. No ESP32 yet — test with a USB mic on Mac Mini.
### P3: `homeai-voice`
**Deliverables:**
- [ ] Whisper.cpp compiled for Apple Silicon, model downloaded (`medium.en` or `large-v3`)
- [ ] Kokoro TTS installed, tested, latency benchmarked
- [ ] Chatterbox TTS installed (MPS optimised build), voice reference `.wav` ready
- [ ] Qwen3-TTS via MLX installed as fallback
- [ ] openWakeWord running on Mac Mini, detecting wake word
- [ ] Wyoming protocol server running — bridges STT+TTS into Home Assistant
- [ ] Home Assistant `voice_assistant` pipeline configured end-to-end
- [ ] Test script: `test_voice_pipeline.sh` — mic in → spoken response out
**Sub-components:**
```
[Mic] → openWakeWord → Wyoming STT (Whisper.cpp) → [text out]
[text in] → Wyoming TTS (Kokoro) → [audio out]
```
**Key decisions:**
- Whisper.cpp runs as a Wyoming STT provider (via `wyoming-faster-whisper` or native Wyoming adapter)
- Kokoro is primary TTS; Chatterbox used when voice cloning is active (P5)
- openWakeWord runs as a launchd service
- Wyoming server port: `10300` (STT), `10301` (TTS) — standard Wyoming ports
**Interface contract:**
- Wyoming STT: `tcp://localhost:10300`
- Wyoming TTS: `tcp://localhost:10301`
- Direct Python API for P4 (agent bypasses Wyoming for non-HA calls)
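A quick liveness probe for these endpoints (the kind of check Uptime Kuma's TCP monitor performs) can be sketched as below. Note this only confirms the socket accepts connections; it does not speak the Wyoming protocol:

```python
import socket

WYOMING_SERVICES = {
    "stt": ("localhost", 10300),
    "tts": ("localhost", 10301),
}

def check_ports(services: dict, timeout: float = 2.0) -> dict:
    """Return {name: bool} for whether each TCP port accepts a connection."""
    results = {}
    for name, (host, port) in services.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            results[name] = False
    return results
```

Running `check_ports(WYOMING_SERVICES)` from a cron or launchd job gives a cheap smoke test before the full `test_voice_pipeline.sh` run.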
---
## Phase 3 — AI Agent & Character (P4 + P5)
**Goal:** OpenClaw receives voice/text input, applies character persona, calls tools, returns rich responses.
### P4: `homeai-agent`
**Deliverables:**
- [ ] OpenClaw installed and configured
- [ ] Connected to Ollama (`llama3.3:70b` as primary model)
- [ ] Connected to Home Assistant (long-lived access token in config)
- [ ] mem0 installed, configured with local storage backend
- [ ] mem0 backup job: daily git commit to Gitea
- [ ] Core skills written:
- `home_assistant.py` — call HA services (lights, switches, scenes)
- `memory.py` — read/write mem0 memories
- `weather.py` — local weather via HA sensor data
- `timer.py` — set timers/reminders
  - `music.py` — stub for Music Assistant (completed in Phase 7)
- [ ] n8n running as Docker container, webhook trigger from OpenClaw
- [ ] Sample n8n workflow: morning briefing (time + weather + calendar)
- [ ] System prompt template: loads character JSON from P5
**Key decisions:**
- OpenClaw config at `~/.openclaw/config.yaml`
- Skills at `~/.openclaw/skills/` — one file per skill, auto-discovered
- System prompt: `~/.openclaw/characters/<active>.json` loaded at startup
- mem0 store: local file backend at `~/.openclaw/memory/` (SQLite)
**Interface contract:**
- OpenClaw exposes a local HTTP API (default port `8080`) — used by P3 (voice pipeline hands off transcribed text here)
- Consumes character JSON from P5
---
### P5: `homeai-character`
**Deliverables:**
- [ ] Character Manager UI (`character-manager.jsx`) — already exists, needs wiring
- [ ] Character JSON schema v1 defined and documented
- [ ] Export produces `~/.openclaw/characters/<name>.json`
- [ ] Fields: name, system_prompt, voice_ref_path, tts_engine, live2d_expressions, vtube_ws_triggers, custom_rules, model_overrides
- [ ] Validation: schema validator script rejects malformed exports
- [ ] Sample character: `aria.json` (default assistant persona)
- [ ] Voice clone: reference `.wav` recorded/sourced, placed at `~/voices/<name>.wav`
**Key decisions:**
- JSON schema is versioned (`"schema_version": 1`) — pipeline components check version before loading
- Character Manager is a local React app (served by Vite dev server or built to static files)
- Single active character at a time; OpenClaw watches the file for changes (hot reload)
**Interface contract:**
- Output: `~/.openclaw/characters/<name>.json` — consumed by P4, P3 (TTS voice selection), P7 (expression mapping)
- Schema published in `homeai-character/schema/character.schema.json`
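Until the full JSON Schema exists, a minimal required-field check catches most malformed exports. The field list comes from the P5 deliverables above; a real validator would check types and nesting against `character.schema.json` (with ajv in the React app, or the `jsonschema` package in Python):

```python
REQUIRED_FIELDS = [
    "schema_version", "name", "system_prompt", "voice_ref_path",
    "tts_engine", "live2d_expressions", "vtube_ws_triggers",
    "custom_rules", "model_overrides",
]

def validate_character(config: dict) -> list:
    """Return a list of problems; an empty list means the export looks
    well-formed at the top level."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS
                if f not in config]
    if config.get("schema_version") != 1:
        problems.append("schema_version must be 1")
    return problems
```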
---
## Phase 4 — Hardware Satellites (P6)
**Goal:** ESP32-S3-BOX-3 units act as room presence nodes — wake word, mic input, audio output, animated face.
### P6: `homeai-esp32`
**Deliverables:**
- [ ] ESPHome config for ESP32-S3-BOX-3 (`esphome/s3-box-living-room.yaml`, etc.)
- [ ] Wyoming Satellite component configured — streams mic audio to Mac Mini Wyoming STT
- [ ] Audio playback: receives TTS audio from Mac Mini, plays via built-in speaker
- [ ] LVGL face: animated idle/speaking/thinking states
- [ ] Wake word: either on-device (microWakeWord via ESPHome) or forwarded to Mac Mini openWakeWord
- [ ] OTA update mechanism configured
- [ ] One unit per room — config templated with room name as variable
**LVGL Face States:**
| State | Animation |
|---|---|
| Idle | Slow blink, gentle sway |
| Listening | Eyes wide, mic indicator |
| Thinking | Eyes narrow, loading dots |
| Speaking | Mouth animation synced to audio |
| Error | Red eyes, shake |
**Key decisions:**
- Wake word on-device preferred (lower latency, no always-on network stream)
- microWakeWord model: `hey_jarvis` or custom trained word
- LVGL animations compiled into ESPHome firmware (no runtime asset loading)
- Each unit has a unique device name for HA entity naming
**Interface contract:**
- Wyoming Satellite → Mac Mini Wyoming STT server (`tcp://<mac-mini-ip>:10300`)
- Receives audio back via Wyoming TTS response
- LVGL state driven by Home Assistant entity state (HA → ESPHome event)
---
## Phase 5 — Visual Layer (P7)
**Goal:** VTube Studio shows Live2D model on desktop/mobile; expressions driven by AI pipeline state.
### P7: `homeai-visual`
**Deliverables:**
- [ ] VTube Studio installed on Mac Mini (macOS app)
- [ ] Live2D model loaded (sourced from nizima.com or booth.pm)
- [ ] VTube Studio WebSocket API enabled (port `8001`)
- [ ] OpenClaw skill: `vtube_studio.py`
- Connects to VTube Studio WebSocket
- Auth token exchange and persistence
- Methods: `trigger_expression(name)`, `trigger_hotkey(name)`, `set_parameter(name, value)`
- [ ] Expression map in character JSON → VTube hotkey IDs
- [ ] Lip sync: driven by audio envelope or TTS phoneme timing
- [ ] Mobile: VTube Studio on iOS/Android connected to same model via Tailscale
**Key decisions:**
- Expression trigger events: `idle`, `speaking`, `thinking`, `happy`, `sad`, `error`
- Lip sync approach: simple amplitude-based (fast) rather than phoneme-based (complex) initially
- Auth token stored at `~/.openclaw/vtube_token.json`
**Interface contract:**
- OpenClaw calls `vtube_studio.trigger_expression(event)` from within response pipeline
- Event names defined in character JSON `live2d_expressions` field
---
## Phase 6 — Image Generation (P8)
**Goal:** ComfyUI online with character-consistent image generation workflows.
### P8: `homeai-images`
**Deliverables:**
- [ ] ComfyUI installed at `~/ComfyUI/`, running via launchd
- [ ] Models downloaded: SDXL base, Flux.1-dev (or schnell), ControlNet (canny, depth)
- [ ] Character LoRA: trained on character reference images for consistent appearance
- [ ] Saved workflows:
- `workflows/portrait.json` — character portrait, controllable expression
- `workflows/scene.json` — character in scene with ControlNet pose
- `workflows/quick.json` — fast draft via Flux.1-schnell
- [ ] OpenClaw skill: `comfyui.py` — submits workflow via ComfyUI REST API, returns image path
- [ ] ComfyUI API port: `8188`
**Interface contract:**
- OpenClaw calls `comfyui.generate(workflow_name, params)` → returns local image path
- ComfyUI REST API: `http://localhost:8188`
---
## Phase 7 — Extended Integrations & Polish
**Deliverables:**
- [ ] Music Assistant — Docker container, integrated with HA, OpenClaw `music.py` skill updated
- [ ] Snapcast — server on Mac Mini, clients on ESP32 units (multi-room sync)
- [ ] Authelia — 2FA in front of all web UIs exposed via Tailscale
- [ ] n8n advanced workflows: daily briefing, calendar reminders, notification routing
- [ ] iOS Shortcuts companion: trigger OpenClaw from iPhone widget
- [ ] Uptime Kuma alerts: pushover/ntfy notifications on service down
- [ ] Backup automation: daily Gitea commits of mem0, character configs, n8n workflows
---
## Dependency Graph
```
P1 (infra) ─────────────────────────┐
P2 (llm) ───────────────────┐       │
P3 (voice) ─────────┐       │       │
P5 (character) ─┐   │       │       │
                ↓   ↓       ↓       ↓
                P4 (agent) ────→ HA
                P6 (esp32)  ← Wyoming
                P7 (visual) ← vtube skill
                P8 (images) ← comfyui skill
```
**Hard dependencies:**
- P4 requires P1 (HA URL), P2 (Ollama), P5 (character JSON)
- P3 requires P2 (LLM), P4 (agent endpoint)
- P6 requires P3 (Wyoming server), P1 (HA)
- P7 requires P4 (OpenClaw skill runner), P5 (expression map)
- P8 requires P4 (OpenClaw skill runner)
**Can be done in parallel:**
- P1 + P5 (infra and character manager are independent)
- P2 + P5 (LLM setup and character UI are independent)
- P7 + P8 (visual and images are both P4 dependents but independent of each other)
---
## Interface Contracts Summary
| Contract | Type | Defined In | Consumed By |
|---|---|---|---|
| `~/server/.env.services` | env file | P1 | All |
| Ollama API `localhost:11434/v1` | HTTP (OpenAI compat) | P2 | P3, P4, P7 |
| Wyoming STT `localhost:10300` | TCP/Wyoming | P3 | P6, HA |
| Wyoming TTS `localhost:10301` | TCP/Wyoming | P3 | P6, HA |
| OpenClaw API `localhost:8080` | HTTP | P4 | P3, P7, P8 |
| Character JSON `~/.openclaw/characters/` | JSON file | P5 | P4, P3, P7 |
| `character.schema.json` v1 | JSON Schema | P5 | P4, P3, P7 |
| VTube Studio WS `localhost:8001` | WebSocket | VTube Studio | P7 |
| ComfyUI API `localhost:8188` | HTTP | ComfyUI | P8 |
| Home Assistant API | HTTP/WS | P1 (HA) | P4, P6 |
---
## Repo Structure (Gitea)
```
~/gitea/homeai/
├── homeai-infra/ # P1
│ ├── docker/ # per-service compose files
│ ├── scripts/ # setup/teardown helpers
│ └── Makefile
├── homeai-llm/ # P2
│ ├── ollama-models.txt
│ └── scripts/
├── homeai-voice/ # P3
│ ├── whisper/
│ ├── tts/
│ ├── wyoming/
│ └── scripts/
├── homeai-agent/ # P4
│ ├── skills/
│ ├── workflows/ # n8n exports
│ └── config/
├── homeai-character/ # P5
│ ├── src/ # React character manager
│ ├── schema/
│ └── characters/ # exported JSONs
├── homeai-esp32/ # P6
│ └── esphome/
├── homeai-visual/ # P7
│ └── skills/
└── homeai-images/ # P8
├── workflows/ # ComfyUI workflow JSONs
└── skills/
```
---
## Suggested Build Order
| Week | Focus | Projects |
|---|---|---|
| 1 | Infrastructure up, LLM running | P1, P2 |
| 2 | Voice pipeline end-to-end (desktop mic test) | P3 |
| 3 | Character Manager wired, OpenClaw connected | P4, P5 |
| 4 | ESP32 firmware, first satellite running | P6 |
| 5 | VTube Studio live, expressions working | P7 |
| 6 | ComfyUI online, character LoRA trained | P8 |
| 7+ | Extended integrations, polish, Authelia | Phase 7 |
---
## Open Questions / Decisions Needed
- [ ] Which OpenClaw version/fork to use? (confirm it supports Ollama natively)
- [ ] Wake word: `hey_jarvis` vs custom trained word — what should the character's name be?
- [ ] Live2D model: commission custom or buy from nizima.com? Budget?
- [ ] Snapcast: output to ESP32 speakers or separate audio hardware per room?
- [ ] n8n: self-hosted Docker vs n8n Cloud (given local-first preference → Docker)
- [ ] Authelia: local user store or LDAP backend? (local store is simpler)
- [ ] mem0: local SQLite or run Qdrant vector DB for better semantic search?

TODO.md (new file, 189 lines)
# HomeAI — Master TODO
> Track progress across all sub-projects. See each sub-project `PLAN.md` for detailed implementation notes.
> Status: `[ ]` pending · `[~]` in progress · `[x]` done
---
## Phase 1 — Foundation
### P1 · homeai-infra
- [ ] Install Docker Desktop for Mac, enable launch at login
- [ ] Create shared `homeai` Docker network
- [ ] Create `~/server/docker/` directory structure
- [ ] Write compose files: Home Assistant, Portainer, Uptime Kuma, Gitea, code-server, n8n
- [ ] Write `.env.secrets.example` and `Makefile`
- [ ] `make up-all` — bring all services up
- [ ] Home Assistant onboarding — generate long-lived access token
- [ ] Write `~/server/.env.services` with all service URLs
- [ ] Install Tailscale, verify all services reachable on Tailnet
- [ ] Gitea: create admin account, initialise all 8 sub-project repos, configure SSH
- [ ] Uptime Kuma: add monitors for all services, configure mobile alerts
- [ ] Verify all containers survive a cold reboot
### P2 · homeai-llm
- [ ] Install Ollama natively via brew
- [ ] Write and load launchd plist (`com.ollama.ollama.plist`)
- [ ] Write `ollama-models.txt` with model manifest
- [ ] Run `scripts/pull-models.sh` — pull all models
- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- [ ] Deploy Open WebUI via Docker compose (port 3030)
- [ ] Verify Open WebUI connected to Ollama, all models available
- [ ] Add Ollama + Open WebUI to Uptime Kuma monitors
- [ ] Add `OLLAMA_URL` and `OPEN_WEBUI_URL` to `.env.services`
---
## Phase 2 — Voice Pipeline
### P3 · homeai-voice
- [ ] Compile Whisper.cpp with Metal support
- [ ] Download Whisper models (`large-v3`, `medium.en`) to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from audio file
- [ ] Install Kokoro TTS, test output to audio file
- [ ] Install Wyoming-Kokoro adapter, verify Wyoming protocol
- [ ] Write + load launchd plists for Wyoming STT (10300) and TTS (10301)
- [ ] Connect Home Assistant Wyoming integration (STT + TTS)
- [ ] Create HA Voice Assistant pipeline
- [ ] Test HA Assist via browser: type query → hear spoken response
- [ ] Install openWakeWord, test wake detection with USB mic
- [ ] Write + load openWakeWord launchd plist
- [ ] Install Chatterbox TTS (MPS build), test with sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback)
- [ ] Write `wyoming/test-pipeline.sh` — end-to-end smoke test
- [ ] Add Wyoming STT/TTS to Uptime Kuma monitors
---
## Phase 3 — Agent & Character
### P5 · homeai-character *(no runtime deps — can start alongside P1)*
- [ ] Define and write `schema/character.schema.json` (v1)
- [ ] Write `characters/aria.json` — default character
- [ ] Set up Vite project in `src/`, install deps
- [ ] Integrate existing `character-manager.jsx` into Vite project
- [ ] Add schema validation on export (ajv)
- [ ] Add expression mapping UI section
- [ ] Add custom rules editor
- [ ] Test full edit → export → validate → load cycle
- [ ] Record or source voice reference audio for Aria (`~/voices/aria.wav`)
- [ ] Pre-process audio with ffmpeg, test with Chatterbox
- [ ] Update `aria.json` with voice clone path if quality is good
- [ ] Write `SchemaValidator.js` as standalone utility
### P4 · homeai-agent
- [ ] Confirm OpenClaw installation method and Ollama compatibility
- [ ] Install OpenClaw, write `~/.openclaw/config.yaml`
- [ ] Verify OpenClaw responds to basic text query via `/chat`
- [ ] Write `skills/home_assistant.py` — test lights on/off via voice
- [ ] Write `skills/memory.py` — test store and recall
- [ ] Write `skills/weather.py` — verify HA weather sensor data
- [ ] Write `skills/timer.py` — test set/fire a timer
- [ ] Write skill stubs: `music.py`, `vtube_studio.py`, `comfyui.py`
- [ ] Set up mem0 with Chroma backend, test semantic recall
- [ ] Write and load memory backup launchd job
- [ ] Symlink `homeai-agent/skills/` → `~/.openclaw/skills/`
- [ ] Build morning briefing n8n workflow
- [ ] Build notification router n8n workflow
- [ ] Verify full voice → agent → HA action flow
- [ ] Add OpenClaw to Uptime Kuma monitors
---
## Phase 4 — Hardware Satellites
### P6 · homeai-esp32
- [ ] Install ESPHome: `pip install esphome`
- [ ] Write `esphome/secrets.yaml` (gitignored)
- [ ] Write `base.yaml`, `voice.yaml`, `display.yaml`, `animations.yaml`
- [ ] Write `s3-box-living-room.yaml` for first unit
- [ ] Flash first unit via USB
- [ ] Verify unit appears in HA device list
- [ ] Assign Wyoming voice pipeline to unit in HA
- [ ] Test full wake → STT → LLM → TTS → audio playback cycle
- [ ] Test LVGL face: idle → listening → thinking → speaking → error
- [ ] Verify OTA firmware update works wirelessly
- [ ] Flash remaining units (bedroom, kitchen, etc.)
- [ ] Document MAC address → room name mapping
---
## Phase 5 — Visual Layer
### P7 · homeai-visual
- [ ] Install VTube Studio (Mac App Store)
- [ ] Enable WebSocket API on port 8001
- [ ] Source/purchase a Live2D model (nizima.com or booth.pm)
- [ ] Load model in VTube Studio
- [ ] Create hotkeys for each expression state (idle, speaking, thinking, happy, sad, error)
- [ ] Write `skills/vtube_studio.py` full implementation
- [ ] Run auth flow — click Allow in VTube Studio, save token
- [ ] Test each expression via test script
- [ ] Update `aria.json` with real VTube Studio hotkey IDs
- [ ] Write `lipsync.py` amplitude-based helper
- [ ] Integrate lip sync into OpenClaw TTS dispatch
- [ ] Symlink `skills/` → `~/.openclaw/skills/`
- [ ] Test full pipeline: voice → thinking expression → speaking with lip sync
- [ ] Set up VTube Studio mobile (iPhone/iPad) on Tailnet
---
## Phase 6 — Image Generation
### P8 · homeai-images
- [ ] Clone ComfyUI to `~/ComfyUI/`, install deps in venv
- [ ] Verify MPS is detected at launch
- [ ] Write and load launchd plist (`com.homeai.comfyui.plist`)
- [ ] Download SDXL base model
- [ ] Download Flux.1-schnell
- [ ] Download ControlNet models (canny, depth)
- [ ] Test generation via ComfyUI web UI (port 8188)
- [ ] Build and export `quick.json` workflow
- [ ] Build and export `portrait.json` workflow
- [ ] Build and export `scene.json` workflow (ControlNet)
- [ ] Build and export `upscale.json` workflow
- [ ] Write `skills/comfyui.py` full implementation
- [ ] Test skill: `comfyui.quick("test prompt")` → image file returned
- [ ] Collect character reference images for LoRA training
- [ ] Train SDXL LoRA with kohya_ss
- [ ] Load LoRA into `portrait.json`, verify character consistency
- [ ] Symlink `skills/` → `~/.openclaw/skills/`
- [ ] Test via OpenClaw: "Generate a portrait of Aria looking happy"
- [ ] Add ComfyUI to Uptime Kuma monitors
---
## Phase 7 — Extended Integrations & Polish
- [ ] Deploy Music Assistant (Docker), integrate with Home Assistant
- [ ] Complete `skills/music.py` in OpenClaw
- [ ] Deploy Snapcast server on Mac Mini
- [ ] Configure Snapcast clients on ESP32 units for multi-room audio
- [ ] Configure Authelia as 2FA layer in front of web UIs
- [ ] Build advanced n8n workflows (calendar reminders, daily briefing v2)
- [ ] Create iOS Shortcuts to trigger OpenClaw from iPhone widget
- [ ] Configure ntfy/Pushover alerts in Uptime Kuma for all services
- [ ] Automate mem0 + character config backup to Gitea (daily)
- [ ] Train custom wake word using character's name
- [ ] Document all service URLs, ports, and credentials in a private Gitea wiki
- [ ] Tailscale ACL hardening — restrict which devices can reach which services
- [ ] Stress test: reboot Mac Mini, verify all services recover in <2 minutes
---
## Open Decisions
- [ ] Confirm character name (determines wake word training)
- [ ] Confirm OpenClaw version/fork and Ollama compatibility
- [ ] Live2D model: purchase off-the-shelf or commission custom?
- [ ] mem0 backend: Chroma (simple) vs Qdrant Docker (better semantic search)?
- [ ] Snapcast output: ESP32 built-in speakers or dedicated audio hardware per room?
- [ ] Authelia user store: local file vs LDAP?

homeai-agent/PLAN.md (new file, 335 lines)
# P4: homeai-agent — AI Agent, Skills & Automation
> Phase 3 | Depends on: P1 (HA), P2 (Ollama), P3 (Wyoming/TTS), P5 (character JSON)
---
## Goal
OpenClaw running as the primary AI agent: receives voice/text input, loads character persona, calls tools (skills), manages memory (mem0), dispatches responses (TTS, HA actions, VTube expressions). n8n handles scheduled/automated workflows.
---
## Architecture
```
Voice input (text from P3 Wyoming STT)
    ↓
OpenClaw API (port 8080)
    ↓  loads character JSON from P5
System prompt construction
    ↓
Ollama LLM (P2) — llama3.3:70b
    ↓  response + tool calls
Skill dispatcher
    ├── home_assistant.py → HA REST API (P1)
    ├── memory.py         → mem0 (local)
    ├── vtube_studio.py   → VTube WS (P7)
    ├── comfyui.py        → ComfyUI API (P8)
    ├── music.py          → Music Assistant (Phase 7)
    └── weather.py        → HA sensor data
    ↓  final response text
TTS dispatch:
    ├── Chatterbox (voice clone, if active)
    └── Kokoro (via Wyoming, fallback)
    ↓
Audio playback to appropriate room
```
---
## OpenClaw Setup
### Installation
```bash
# Confirm OpenClaw supports Ollama — check repo for latest install method
pip install openclaw
# or
git clone https://github.com/<openclaw-repo>/openclaw
pip install -e .
```
**Key question:** Verify OpenClaw's Ollama/OpenAI-compatible backend support before installation. If OpenClaw doesn't support local Ollama natively, use a thin adapter layer pointing its OpenAI endpoint at `http://localhost:11434/v1`.
### Config — `~/.openclaw/config.yaml`
```yaml
version: 1
llm:
  provider: ollama              # or openai-compatible
  base_url: http://localhost:11434/v1
  model: llama3.3:70b
  fast_model: qwen2.5:7b        # used for quick intent classification
character:
  active: aria
  config_dir: ~/.openclaw/characters/
memory:
  provider: mem0
  store_path: ~/.openclaw/memory/
  embedding_model: nomic-embed-text
  embedding_url: http://localhost:11434/v1
api:
  host: 0.0.0.0
  port: 8080
tts:
  primary: chatterbox           # when voice clone active
  fallback: kokoro-wyoming      # Wyoming TTS endpoint
  wyoming_tts_url: tcp://localhost:10301
wake:
  endpoint: /wake               # openWakeWord POSTs here to trigger listening
```
---
## Skills
All skills live in `~/.openclaw/skills/` (symlinked from `homeai-agent/skills/`).
### `home_assistant.py`
Wraps the HA REST API for common smart home actions.
**Functions:**
- `turn_on(entity_id, **kwargs)` — lights, switches, media players
- `turn_off(entity_id)`
- `toggle(entity_id)`
- `set_light(entity_id, brightness=None, color_temp=None, rgb_color=None)`
- `run_scene(scene_id)`
- `get_state(entity_id)` → returns state + attributes
- `list_entities(domain=None)` → returns entity list
Uses `HA_URL` and `HA_TOKEN` from `.env.services`.
### `memory.py`
Wraps mem0 for persistent long-term memory.
**Functions:**
- `remember(text, category=None)` — store a memory
- `recall(query, limit=5)` — semantic search over memories
- `forget(memory_id)` — delete a specific memory
- `list_recent(n=10)` — list most recent memories
mem0 uses `nomic-embed-text` via Ollama for embeddings.
### `weather.py`
Pulls weather data from Home Assistant sensors (local weather station or HA weather integration).
**Functions:**
- `get_current()` → temp, humidity, conditions
- `get_forecast(days=3)` → forecast array
### `timer.py`
Simple timer/reminder management.
**Functions:**
- `set_timer(duration_seconds, label=None)` → fires HA notification/TTS on expiry
- `set_reminder(datetime_str, message)` → schedules future TTS playback
- `list_timers()`
- `cancel_timer(timer_id)`
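A stdlib-only sketch of the timer core. The `on_expiry` callback is injected for testability; the real skill would fire an HA notification or POST the label to the TTS endpoint instead:

```python
import threading
import uuid

_timers: dict = {}

def set_timer(duration_seconds: float, label=None, on_expiry=None) -> str:
    """Start a one-shot timer; on_expiry stands in for the real TTS/HA hook."""
    timer_id = str(uuid.uuid4())

    def _fire():
        _timers.pop(timer_id, None)
        if on_expiry:
            on_expiry(label or "Timer finished")

    t = threading.Timer(duration_seconds, _fire)
    _timers[timer_id] = t
    t.start()
    return timer_id

def list_timers() -> list:
    return list(_timers)

def cancel_timer(timer_id: str) -> bool:
    t = _timers.pop(timer_id, None)
    if t is not None:
        t.cancel()
    return t is not None
```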
### `music.py` (stub — completed in Phase 7)
```python
def play(query: str): ... # "play jazz" → Music Assistant
def pause(): ...
def skip(): ...
def set_volume(level: int): ... # 0-100
```
### `vtube_studio.py` (implemented in P7)
Stub in P4, full implementation in P7:
```python
def trigger_expression(event: str): ... # "thinking", "happy", etc.
def set_parameter(name: str, value: float): ...
```
### `comfyui.py` (implemented in P8)
Stub in P4, full implementation in P8:
```python
def generate(workflow: str, params: dict) -> str: ... # returns image path
```
---
## mem0 — Long-Term Memory
### Setup
```bash
pip install mem0ai
```
### Config
```python
from mem0 import Memory

config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.3:70b",
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "ollama_base_url": "http://localhost:11434",
        },
    },
    "vector_store": {
        "provider": "chroma",
        "config": {
            "collection_name": "homeai_memory",
            "path": "~/.openclaw/memory/chroma",
        },
    },
}

memory = Memory.from_config(config)
```
> **Decision point:** Start with Chroma (local file-based). If semantic recall quality is poor, migrate to Qdrant (Docker container).
### Backup
Daily cron (via launchd) commits mem0 data to Gitea:
```bash
#!/usr/bin/env bash
set -euo pipefail
cd ~/.openclaw/memory
git add .
# Only commit when something actually changed, so the job doesn't fail on quiet days
git diff --cached --quiet || git commit -m "mem0 backup $(date +%Y-%m-%d)"
git push origin main
```
---
## n8n Workflows
n8n runs in Docker (deployed in P1). Workflows exported as JSON and stored in `homeai-agent/workflows/`.
### Starter Workflows
**`morning-briefing.json`**
- Trigger: time-based (e.g., 7:30 AM on weekdays)
- Steps: fetch weather → fetch calendar events → compose briefing → POST to OpenClaw TTS → speak aloud
**`notification-router.json`**
- Trigger: HA webhook (new notification)
- Steps: classify urgency → if high: TTS immediately; if low: queue for next interaction
**`memory-backup.json`**
- Trigger: daily schedule
- Steps: commit mem0 data to Gitea
### n8n ↔ OpenClaw Integration
OpenClaw exposes a webhook endpoint that n8n can call to trigger TTS or run a skill:
```
POST http://localhost:8080/speak
{
"text": "Good morning. It is 7:30 and the weather is...",
"room": "all"
}
```
---
## API Surface (OpenClaw)
Key endpoints consumed by other projects:
| Endpoint | Method | Description |
|---|---|---|
| `/chat` | POST | Send text, get response (+ fires skills) |
| `/wake` | POST | Wake word trigger from openWakeWord |
| `/speak` | POST | TTS only — no LLM, just speak text |
| `/skill/<name>` | POST | Call a specific skill directly |
| `/memory` | GET/POST | Read/write memories |
| `/status` | GET | Health check |
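A minimal client sketch for the POST endpoints. The `text`/`room` field names mirror the `/speak` example earlier in this document; confirm them against the real API before relying on them:

```python
import json
import urllib.request

OPENCLAW_URL = "http://localhost:8080"

def build_payload(text: str, room: str = "all") -> dict:
    # Field names are assumptions taken from the /speak example.
    return {"text": text, "room": room}

def call(endpoint: str, text: str, room: str = "all") -> dict:
    req = urllib.request.Request(
        f"{OPENCLAW_URL}{endpoint}",
        data=json.dumps(build_payload(text, room)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# call("/chat", "Turn on the living room lights")
# call("/speak", "Dinner is ready", room="kitchen")
```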
---
## Directory Layout
```
homeai-agent/
├── skills/
│ ├── home_assistant.py
│ ├── memory.py
│ ├── weather.py
│ ├── timer.py
│ ├── music.py # stub
│ ├── vtube_studio.py # stub
│ └── comfyui.py # stub
├── workflows/
│ ├── morning-briefing.json
│ ├── notification-router.json
│ └── memory-backup.json
└── config/
├── config.yaml.example
└── mem0-config.py
```
---
## Interface Contracts
**Consumes:**
- Ollama API: `http://localhost:11434/v1`
- HA API: `$HA_URL` with `$HA_TOKEN`
- Wyoming TTS: `tcp://localhost:10301`
- Character JSON: `~/.openclaw/characters/<active>.json` (from P5)
**Exposes:**
- OpenClaw HTTP API: `http://localhost:8080` — consumed by P3 (voice), P7 (visual triggers), P8 (image skill)
**Add to `.env.services`:**
```dotenv
OPENCLAW_URL=http://localhost:8080
```
---
## Implementation Steps
- [ ] Confirm OpenClaw installation method and Ollama compatibility
- [ ] Install OpenClaw, write `config.yaml` pointing at Ollama and HA
- [ ] Verify OpenClaw responds to a basic text query via `/chat`
- [ ] Write `home_assistant.py` skill — test lights on/off via voice
- [ ] Write `memory.py` skill — test store and recall
- [ ] Write `weather.py` skill — verify HA weather sensor data
- [ ] Write `timer.py` skill — test set/fire a timer
- [ ] Write skill stubs: `music.py`, `vtube_studio.py`, `comfyui.py`
- [ ] Set up mem0 with Chroma backend, test semantic recall
- [ ] Write and test memory backup launchd job
- [ ] Deploy n8n via Docker (P1 task if not done)
- [ ] Build morning briefing n8n workflow
- [ ] Symlink `homeai-agent/skills/` → `~/.openclaw/skills/`
- [ ] Verify full voice → agent → HA action flow (with P3 pipeline)
---
## Success Criteria
- [ ] "Turn on the living room lights" → lights turn on via HA
- [ ] "Remember that I prefer jazz in the mornings" → mem0 stores it; "What do I like in the mornings?" → recalls it
- [ ] Morning briefing n8n workflow fires on schedule and speaks via TTS
- [ ] OpenClaw `/status` returns healthy
- [ ] OpenClaw survives Mac Mini reboot (launchd or Docker — TBD based on OpenClaw's preferred run method)
File: `homeai-character/PLAN.md`
# P5: homeai-character — Character System & Persona Config
> Phase 3 | No hard runtime dependencies | Consumed by: P3, P4, P7
---
## Goal
A single, authoritative character configuration that defines the AI assistant's personality, voice, visual expressions, and prompt rules. The Character Manager UI (already started as `character-manager.jsx`) provides a friendly editor. The exported JSON is the single source of truth for all pipeline components.
---
## Character JSON Schema v1
File: `schema/character.schema.json`
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "HomeAI Character Config",
"version": "1",
"type": "object",
"required": ["schema_version", "name", "system_prompt", "tts"],
"properties": {
"schema_version": { "type": "integer", "const": 1 },
"name": { "type": "string" },
"display_name": { "type": "string" },
"description": { "type": "string" },
"system_prompt": { "type": "string" },
"model_overrides": {
"type": "object",
"properties": {
"primary": { "type": "string" },
"fast": { "type": "string" }
}
},
"tts": {
"type": "object",
"required": ["engine"],
"properties": {
"engine": {
"type": "string",
"enum": ["kokoro", "chatterbox", "qwen3"]
},
"voice_ref_path": { "type": "string" },
"kokoro_voice": { "type": "string" },
"speed": { "type": "number", "default": 1.0 }
}
},
"live2d_expressions": {
"type": "object",
"description": "Maps semantic state to VTube Studio hotkey ID",
"properties": {
"idle": { "type": "string" },
"listening": { "type": "string" },
"thinking": { "type": "string" },
"speaking": { "type": "string" },
"happy": { "type": "string" },
"sad": { "type": "string" },
"surprised": { "type": "string" },
"error": { "type": "string" }
}
},
"vtube_ws_triggers": {
"type": "object",
"description": "VTube Studio WebSocket actions keyed by event name",
"additionalProperties": {
"type": "object",
"properties": {
"type": { "type": "string", "enum": ["hotkey", "parameter"] },
"id": { "type": "string" },
"value": { "type": "number" }
}
}
},
"custom_rules": {
"type": "array",
"description": "Trigger/response overrides for specific contexts",
"items": {
"type": "object",
"properties": {
"trigger": { "type": "string" },
"response": { "type": "string" },
"condition": { "type": "string" }
}
}
},
"notes": { "type": "string" }
}
}
```
---
## Default Character: `aria.json`
File: `characters/aria.json`
```json
{
"schema_version": 1,
"name": "aria",
"display_name": "Aria",
"description": "Default HomeAI assistant persona",
"system_prompt": "You are Aria, a warm, curious, and helpful AI assistant living in the home. You speak naturally and conversationally — never robotic. You are knowledgeable but never condescending. You remember the people you live with and build on those memories over time. Keep responses concise when controlling smart home devices; be more expressive in casual conversation. Never break character.",
"model_overrides": {
"primary": "llama3.3:70b",
"fast": "qwen2.5:7b"
},
"tts": {
"engine": "kokoro",
"kokoro_voice": "af_heart",
"voice_ref_path": null,
"speed": 1.0
},
"live2d_expressions": {
"idle": "expr_idle",
"listening": "expr_listening",
"thinking": "expr_thinking",
"speaking": "expr_speaking",
"happy": "expr_happy",
"sad": "expr_sad",
"surprised": "expr_surprised",
"error": "expr_error"
},
"vtube_ws_triggers": {
"thinking": { "type": "hotkey", "id": "expr_thinking" },
"speaking": { "type": "hotkey", "id": "expr_speaking" },
"idle": { "type": "hotkey", "id": "expr_idle" }
},
"custom_rules": [
{
"trigger": "good morning",
"response": "Good morning! How did you sleep?",
"condition": "time_of_day == morning"
}
],
"notes": "Default persona. Voice clone to be added once reference audio recorded."
}
```
---
## Character Manager UI
### Status
`character-manager.jsx` already exists — needs:
1. Schema validation before export (reject malformed JSONs)
2. File system integration: save/load from `characters/` directory
3. Live preview of system prompt
4. Expression mapping UI for Live2D states
### Tech Stack
- React + Vite (local dev server, not deployed)
- Tailwind CSS (or minimal CSS)
- Runs at `http://localhost:5173` during editing
### File Structure
```
homeai-character/
├── src/
│ ├── character-manager.jsx ← existing, extend here
│ ├── SchemaValidator.js ← validate against character.schema.json
│ ├── ExpressionMapper.jsx ← UI for Live2D expression mapping
│ └── main.jsx
├── schema/
│ └── character.schema.json
├── characters/
│ ├── aria.json ← default character
│ └── .gitkeep
├── package.json
└── vite.config.js
```
### Character Manager Features
| Feature | Description |
|---|---|
| Basic info | name, display name, description |
| System prompt | Multi-line editor with char count |
| Model overrides | Dropdown: primary + fast model |
| TTS config | Engine picker, voice selector, speed slider, voice ref path |
| Expression mapping | Table: state → VTube hotkey ID |
| VTube WS triggers | JSON editor for advanced triggers |
| Custom rules | Add/edit/delete trigger-response pairs |
| Notes | Free-text notes field |
| Export | Validates schema, writes to `characters/<name>.json` |
| Import | Load existing character JSON for editing |
### Schema Validation
```javascript
import Ajv from 'ajv'
import schema from '../schema/character.schema.json'

const ajv = new Ajv()
const validate = ajv.compile(schema)

export function validateCharacter(config) {
  const valid = validate(config)
  if (!valid) throw new Error(ajv.errorsText(validate.errors))
  return true
}
```
---
## Voice Clone Workflow
1. Record 30–60 seconds of clean speech at `~/voices/<name>-raw.wav`
- Quiet room, consistent mic distance, natural conversational tone
2. Pre-process: `ffmpeg -i raw.wav -ar 22050 -ac 1 aria.wav`
3. Place at `~/voices/aria.wav`
4. Update character JSON: `"voice_ref_path": "~/voices/aria.wav"`, `"engine": "chatterbox"`
5. Test: run Chatterbox with the reference, verify voice quality
6. If unsatisfactory, try Qwen3-TTS as alternative
---
## Pipeline Integration
### How P4 (OpenClaw) loads the character
```python
import json
from pathlib import Path

def load_character(name: str) -> dict:
    path = Path.home() / ".openclaw" / "characters" / f"{name}.json"
    config = json.loads(path.read_text())
    if config.get("schema_version") != 1:
        raise ValueError(f"Unsupported schema version: {config.get('schema_version')}")
    return config

# System prompt injection
character = load_character("aria")
system_prompt = character["system_prompt"]
# Pass to Ollama as the system message
```
OpenClaw hot-reloads the character JSON on file change — no restart required.
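That hot-reload can be as simple as mtime polling. A stdlib sketch (the polling cadence and class shape are illustrative, not OpenClaw's actual mechanism):

```python
import json
from pathlib import Path

class CharacterWatcher:
    """Reload a character JSON whenever its mtime changes."""

    def __init__(self, path):
        self.path = Path(path)
        self._mtime = 0.0
        self.config: dict = {}
        self.poll()  # initial load

    def poll(self) -> bool:
        # Call this from the main loop; returns True when a reload happened.
        mtime = self.path.stat().st_mtime
        if mtime != self._mtime:
            self._mtime = mtime
            self.config = json.loads(self.path.read_text())
            return True
        return False
```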
### How P3 selects TTS engine
```python
character = load_character(active_name)
tts_cfg = character["tts"]

if tts_cfg["engine"] == "chatterbox":
    tts = ChatterboxTTS(voice_ref=tts_cfg["voice_ref_path"])
elif tts_cfg["engine"] == "qwen3":
    tts = Qwen3TTS()
else:  # kokoro (default)
    tts = KokoroWyomingClient(voice=tts_cfg.get("kokoro_voice", "af_heart"))
```
---
## Implementation Steps
- [ ] Define and write `schema/character.schema.json` (v1)
- [ ] Write `characters/aria.json` — default character with placeholder expression IDs
- [ ] Set up Vite project in `src/` (install deps: `npm install`)
- [ ] Integrate existing `character-manager.jsx` into new Vite project
- [ ] Add schema validation on export (`ajv`)
- [ ] Add expression mapping UI section
- [ ] Add custom rules editor
- [ ] Test full edit → export → validate → load cycle
- [ ] Record or source voice reference audio for Aria
- [ ] Pre-process audio and test with Chatterbox
- [ ] Update `aria.json` with voice clone path if quality is good
- [ ] Write `SchemaValidator.js` as standalone utility (used by P4 at runtime too)
- [ ] Document schema in `schema/README.md`
---
## Success Criteria
- [ ] `aria.json` validates against `character.schema.json` without errors
- [ ] Character Manager UI can load, edit, and export `aria.json`
- [ ] OpenClaw loads `aria.json` system prompt and applies it to Ollama requests
- [ ] P3 TTS engine selection correctly follows `tts.engine` field
- [ ] Schema version check in P4 fails gracefully with a clear error message
- [ ] Voice clone sounds natural (if Chatterbox path taken)
File: `homeai-esp32/PLAN.md`
# P6: homeai-esp32 — Room Satellite Hardware
> Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)
---
## Goal
Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, local wake word detection, audio playback, and an LVGL animated face showing assistant state. All intelligence stays on the Mac Mini.
---
## Hardware: ESP32-S3-BOX-3
| Feature | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen |
| Mic | Dual microphone array |
| Speaker | Built-in 1W speaker |
| Connectivity | WiFi 802.11b/g/n, BT 5.0 |
| USB | USB-C (programming + power) |
---
## Architecture Per Unit
```
ESP32-S3-BOX-3
├── microWakeWord (on-device, always listening)
│ └── triggers Wyoming Satellite on wake detection
├── Wyoming Satellite
│ ├── streams mic audio → Mac Mini Wyoming STT (port 10300)
│ └── receives TTS audio ← Mac Mini Wyoming TTS (port 10301)
├── LVGL Display
│ └── animated face, driven by HA entity state
└── ESPHome OTA
└── firmware updates over WiFi
```
---
## ESPHome Configuration
### Base Config Template
`esphome/base.yaml` — shared across all units:
```yaml
esphome:
  name: homeai-${room}
  friendly_name: "HomeAI ${room_display}"

esp32:
  board: esp32-s3-box-3
  framework:
    type: esp-idf   # micro_wake_word requires the ESP-IDF framework

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  ap:
    ssid: "HomeAI Fallback"

api:
  encryption:
    key: !secret api_key

ota:
  - platform: esphome
    password: !secret ota_password

logger:
  level: INFO
```
### Room-Specific Config
`esphome/s3-box-living-room.yaml`:
```yaml
substitutions:
  room: living-room
  room_display: "Living Room"
  mac_mini_ip: "192.168.1.x"   # or Tailscale IP

packages:
  base: !include base.yaml
  voice: !include voice.yaml
  display: !include display.yaml
```
One file per room, only the substitutions change.
### Voice / Wyoming Satellite — `esphome/voice.yaml`
```yaml
microphone:
  - platform: esp_adf
    id: mic

speaker:
  - platform: esp_adf
    id: spk

micro_wake_word:
  models:
    - model: hey_jarvis        # or a custom model path
  on_wake_word_detected:
    - voice_assistant.start:

voice_assistant:
  microphone: mic
  speaker: spk
  noise_suppression_level: 2
  auto_gain: 31dBFS
  volume_multiplier: 2.0
  on_listening:
    - display.page.show: page_listening
    - script.execute: animate_face_listening
  on_stt_vad_end:
    - display.page.show: page_thinking
    - script.execute: animate_face_thinking
  on_tts_start:
    - display.page.show: page_speaking
    - script.execute: animate_face_speaking
  on_end:
    - display.page.show: page_idle
    - script.execute: animate_face_idle
  on_error:
    - display.page.show: page_error
    - script.execute: animate_face_error
```
**Note:** ESPHome's `voice_assistant` component connects to HA, which routes to Wyoming STT/TTS on the Mac Mini. This is the standard ESPHome → HA → Wyoming path.
### LVGL Display — `esphome/display.yaml`
```yaml
display:
  - platform: ili9xxx
    model: ILI9341
    id: lcd
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin: GPIO48

touchscreen:
  - platform: tt21100
    id: touch

lvgl:
  displays:
    - lcd
  touchscreens:
    - touch
  # Face widget — centered on screen
  widgets:
    - obj:
        id: face_container
        width: 320
        height: 240
        bg_color: 0x000000
        widgets:
          # Eyes (two circles)
          - obj:
              id: eye_left
              x: 90
              y: 90
              width: 50
              height: 50
              radius: 25
              bg_color: 0xFFFFFF
          - obj:
              id: eye_right
              x: 180
              y: 90
              width: 50
              height: 50
              radius: 25
              bg_color: 0xFFFFFF
          # Mouth (line/arc)
          - arc:
              id: mouth
              x: 110
              y: 160
              width: 100
              height: 40
              start_angle: 180
              end_angle: 360
              arc_color: 0xFFFFFF
  pages:
    - id: page_idle
    - id: page_listening
    - id: page_thinking
    - id: page_speaking
    - id: page_error
```
### LVGL Face State Animations — `esphome/animations.yaml`
```yaml
script:
  - id: animate_face_idle
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 50            # normal open
      - lvgl.widget.update:
          id: eye_right
          height: 50
      - lvgl.widget.update:
          id: mouth
          arc_color: 0xFFFFFF

  - id: animate_face_listening
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 60            # wider eyes
      - lvgl.widget.update:
          id: eye_right
          height: 60
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00BFFF   # blue tint

  - id: animate_face_thinking
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 20            # squinting
      - lvgl.widget.update:
          id: eye_right
          height: 20

  - id: animate_face_speaking
    then:
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00FF88   # green speaking indicator

  - id: animate_face_error
    then:
      - lvgl.widget.update:
          id: eye_left
          bg_color: 0xFF2200    # red eyes
      - lvgl.widget.update:
          id: eye_right
          bg_color: 0xFF2200
```
> **Note:** True lip-sync animation (mouth moving with audio) is complex on ESP32. Phase 1: static states. Phase 2: amplitude-driven mouth height using speaker volume feedback.
---
## Secrets File
`esphome/secrets.yaml` (gitignored):
```yaml
wifi_ssid: "YourNetwork"
wifi_password: "YourPassword"
api_key: "<32-byte base64 key>"
ota_password: "YourOTAPassword"
```
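The API encryption key is 32 random bytes, base64-encoded. One way to generate it (assuming `openssl`, which ships with macOS):

```shell
# 32 random bytes, base64-encoded: a 44-character key for the api: block
openssl rand -base64 32
```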
---
## Flash & Deployment Workflow
```bash
# Install ESPHome
pip install esphome
# Compile + flash via USB (first time)
esphome run esphome/s3-box-living-room.yaml
# OTA update (subsequent)
esphome upload esphome/s3-box-living-room.yaml --device <device-ip>
# View logs
esphome logs esphome/s3-box-living-room.yaml
```
---
## Home Assistant Integration
After flashing:
1. HA discovers ESP32 automatically via mDNS
2. Add device in HA → Settings → Devices
3. Assign Wyoming voice assistant pipeline to the device
4. Set up room-specific automations (e.g., "Living Room" light control from that satellite)
---
## Directory Layout
```
homeai-esp32/
└── esphome/
├── base.yaml
├── voice.yaml
├── display.yaml
├── animations.yaml
├── s3-box-living-room.yaml
├── s3-box-bedroom.yaml # template, fill in when hardware available
├── s3-box-kitchen.yaml # template
└── secrets.yaml # gitignored
```
---
## Wake Word Decisions
| Option | Latency | Privacy | Effort |
|---|---|---|---|
| `hey_jarvis` (built-in microWakeWord) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| Mac Mini openWakeWord (stream audio) | ~500ms | On Mac | Medium |
**Recommendation:** Start with `hey_jarvis`. Train a custom word (character's name) once character name is finalised.
---
## Implementation Steps
- [ ] Install ESPHome: `pip install esphome`
- [ ] Write `esphome/secrets.yaml` (gitignored)
- [ ] Write `base.yaml`, `voice.yaml`, `display.yaml`, `animations.yaml`
- [ ] Write `s3-box-living-room.yaml` for first unit
- [ ] Flash first unit via USB: `esphome run s3-box-living-room.yaml`
- [ ] Verify unit appears in HA device list
- [ ] Assign Wyoming voice pipeline to unit in HA
- [ ] Test: speak wake word → transcription → LLM response → spoken reply
- [ ] Test: LVGL face cycles through idle → listening → thinking → speaking
- [ ] Verify OTA update works: change LVGL color, deploy wirelessly
- [ ] Write config templates for remaining rooms (bedroom, kitchen)
- [ ] Flash remaining units, verify each works independently
- [ ] Document final MAC address → room name mapping
---
## Success Criteria
- [ ] Wake word "hey jarvis" triggers pipeline reliably from 3m distance
- [ ] STT transcription accuracy >90% for clear speech in quiet room
- [ ] TTS audio plays clearly through ESP32 speaker
- [ ] LVGL face shows correct state for idle / listening / thinking / speaking / error
- [ ] OTA firmware updates work without USB cable
- [ ] Unit reconnects automatically after WiFi drop
- [ ] Unit survives power cycle and resumes normal operation
File: `homeai-images/PLAN.md`
# P8: homeai-images — Image Generation
> Phase 6 | Depends on: P4 (OpenClaw skill runner) | Independent of P6, P7
---
## Goal
ComfyUI running natively on Mac Mini with SDXL and Flux.1 models. A character LoRA trained for consistent appearance. OpenClaw skill exposes image generation as a callable tool. Saved workflows cover the most common use cases.
---
## Why Native (not Docker)
Same reasoning as Ollama: ComfyUI needs Metal GPU acceleration. Docker on Mac can't access the GPU. ComfyUI runs natively as a launchd service.
---
## Installation
```bash
# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI ~/ComfyUI
cd ~/ComfyUI
# Install dependencies (Python 3.11+, venv recommended)
python3 -m venv venv
source venv/bin/activate
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
pip install -r requirements.txt
# Launch
python main.py --listen 0.0.0.0 --port 8188
```
**Note:** Use the PyTorch MPS backend for Apple Silicon:
```python
# ComfyUI auto-detects MPS — no extra config needed.
# Verify from inside the venv that the MPS backend is available:
import torch
print(torch.backends.mps.is_available())  # should print True on Apple Silicon
# Also check ComfyUI startup logs for "Using device: mps"
```
### launchd plist — `com.homeai.comfyui.plist`
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.homeai.comfyui</string>
<key>ProgramArguments</key>
<array>
<string>/Users/<username>/ComfyUI/venv/bin/python</string>
<string>/Users/<username>/ComfyUI/main.py</string>
<string>--listen</string>
<string>0.0.0.0</string>
<string>--port</string>
<string>8188</string>
</array>
<key>WorkingDirectory</key>
<string>/Users/<username>/ComfyUI</string>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/comfyui.log</string>
<key>StandardErrorPath</key>
<string>/tmp/comfyui.err</string>
</dict>
</plist>
```
---
## Model Downloads
### Model Manifest
`~/ComfyUI/models/` structure:
```
checkpoints/
├── sd_xl_base_1.0.safetensors # SDXL base
├── flux1-dev.safetensors # Flux.1-dev (high quality)
└── flux1-schnell.safetensors # Flux.1-schnell (fast drafts)
vae/
├── sdxl_vae.safetensors
└── ae.safetensors # Flux VAE
clip/
├── clip_l.safetensors
└── t5xxl_fp16.safetensors # Flux text encoder
controlnet/
├── controlnet-canny-sdxl.safetensors
└── controlnet-depth-sdxl.safetensors
loras/
└── aria-v1.safetensors # Character LoRA (trained locally)
```
### Download Script — `scripts/download-models.sh`
```bash
#!/usr/bin/env bash
set -euo pipefail

# HuggingFace downloads (requires huggingface_hub)
pip install huggingface_hub

python3 - <<'EOF'
import os
from huggingface_hub import hf_hub_download

downloads = [
    ('stabilityai/stable-diffusion-xl-base-1.0', 'sd_xl_base_1.0.safetensors', 'checkpoints'),
    ('black-forest-labs/FLUX.1-schnell', 'flux1-schnell.safetensors', 'checkpoints'),
]

models_dir = os.path.expanduser('~/ComfyUI/models')
for repo, filename, subdir in downloads:
    hf_hub_download(
        repo_id=repo,
        filename=filename,
        local_dir=os.path.join(models_dir, subdir),
    )
EOF
```
> Flux.1-dev requires accepting HuggingFace license agreement. Download manually if script fails.
---
## Saved Workflows
All workflows stored as ComfyUI JSON in `homeai-images/workflows/`.
### `portrait.json` — Character Portrait
Standard character portrait with expression control.
Key nodes:
- **CheckpointLoader:** SDXL base
- **LoraLoader:** aria character LoRA
- **CLIPTextEncode:** positive prompt includes character description + expression
- **KSampler:** 25 steps, DPM++ 2M, CFG 7
- **VAEDecode → SaveImage**
Positive prompt template:
```
aria, (character lora), 1girl, solo, portrait, looking at viewer,
soft lighting, detailed face, high quality, masterpiece,
<EXPRESSION_PLACEHOLDER>
```
### `scene.json` — Character in Scene with ControlNet
Uses ControlNet depth/canny for pose control.
Key nodes:
- **LoadImage:** input pose reference image
- **ControlNetLoader:** canny or depth model
- **ControlNetApply:** apply to conditioning
- **KSampler** with ControlNet guidance
### `quick.json` — Fast Draft via Flux.1-schnell
Low-step, fast generation for quick previews.
Key nodes:
- **CheckpointLoader:** flux1-schnell
- **KSampler:** 4 steps, Euler, CFG 1 (Flux uses CFG=1)
- Output: 512×512 or 768×768
### `upscale.json` — 2× Upscale
Takes existing image, upscales 2× with detail enhancement.
Key nodes:
- **LoadImage**
- **UpscaleModelLoader:** `4x_NMKD-Siax_200k.pth` (download separately)
- **ImageUpscaleWithModel**
- **KSampler img2img** for detail pass
---
## `comfyui.py` Skill — OpenClaw Integration
Full implementation (replaces stub from P4).
File: `homeai-images/skills/comfyui.py`
```python
"""
ComfyUI image generation skill for OpenClaw.
Submits workflow JSON via ComfyUI REST API and returns generated image path.
"""
import json
import time
import uuid
import requests
from pathlib import Path
COMFYUI_URL = "http://localhost:8188"
WORKFLOWS_DIR = Path(__file__).parent.parent / "workflows"
OUTPUT_DIR = Path.home() / "ComfyUI" / "output"
def generate(workflow_name: str, params: dict = None) -> str:
"""
Submit a named workflow to ComfyUI.
Returns the path of the generated image.
Args:
workflow_name: Name of workflow JSON (without .json extension)
params: Dict of node overrides, e.g. {"positive_prompt": "...", "steps": 20}
Returns:
Absolute path to generated image file
"""
workflow_path = WORKFLOWS_DIR / f"{workflow_name}.json"
if not workflow_path.exists():
raise ValueError(f"Workflow '{workflow_name}' not found at {workflow_path}")
workflow = json.loads(workflow_path.read_text())
# Apply param overrides
if params:
workflow = _apply_params(workflow, params)
# Submit to ComfyUI queue
client_id = str(uuid.uuid4())
prompt_id = _queue_prompt(workflow, client_id)
# Poll for completion
image_path = _wait_for_output(prompt_id, client_id)
return str(image_path)
def _queue_prompt(workflow: dict, client_id: str) -> str:
resp = requests.post(
f"{COMFYUI_URL}/prompt",
json={"prompt": workflow, "client_id": client_id}
)
resp.raise_for_status()
return resp.json()["prompt_id"]
def _wait_for_output(prompt_id: str, client_id: str, timeout: int = 120) -> Path:
start = time.time()
while time.time() - start < timeout:
resp = requests.get(f"{COMFYUI_URL}/history/{prompt_id}")
history = resp.json()
if prompt_id in history:
outputs = history[prompt_id]["outputs"]
for node_output in outputs.values():
if "images" in node_output:
img = node_output["images"][0]
return OUTPUT_DIR / img["subfolder"] / img["filename"]
time.sleep(2)
raise TimeoutError(f"ComfyUI generation timed out after {timeout}s")
def _apply_params(workflow: dict, params: dict) -> dict:
"""
Apply parameter overrides to workflow nodes.
Expects workflow nodes to have a 'title' field for addressing.
e.g., params={"positive_prompt": "new prompt"} updates node titled "positive_prompt"
"""
for node_id, node in workflow.items():
title = node.get("_meta", {}).get("title", "")
if title in params:
node["inputs"]["text"] = params[title]
return workflow
# Convenience wrappers for OpenClaw
def portrait(expression: str = "neutral", extra_prompt: str = "") -> str:
return generate("portrait", {"positive_prompt": f"aria, {expression}, {extra_prompt}"})
def quick(prompt: str) -> str:
return generate("quick", {"positive_prompt": prompt})
def scene(prompt: str, controlnet_image_path: str = None) -> str:
params = {"positive_prompt": prompt}
if controlnet_image_path:
params["controlnet_image"] = controlnet_image_path
return generate("scene", params)
```
---
## Character LoRA Training
A LoRA trains the model to consistently generate the character's appearance.
### Dataset Preparation
1. Collect 20–50 reference images of the character (or commission a character sheet)
2. Consistent style, multiple angles/expressions
3. Resize to 1024×1024, square crop
4. Write captions: `aria, 1girl, solo, <specific description>`
5. Store in `~/lora-training/aria/`
### Training
Use **kohya_ss** or **SimpleTuner** for LoRA training on Apple Silicon:
```bash
# kohya_ss (SDXL LoRA)
git clone https://github.com/bmaltais/kohya_ss
cd kohya_ss
pip install -r requirements.txt
# Training config — key params for MPS
python train_network.py \
--pretrained_model_name_or_path=~/ComfyUI/models/checkpoints/sd_xl_base_1.0.safetensors \
--train_data_dir=~/lora-training/aria \
--output_dir=~/ComfyUI/models/loras \
--output_name=aria-v1 \
--network_module=networks.lora \
--network_dim=32 \
--network_alpha=16 \
--max_train_epochs=10 \
--learning_rate=1e-4
```
> Training on M4 Pro via MPS: expect 1–4 hours for a 20-image dataset at 10 epochs.
---
## Directory Layout
```
homeai-images/
├── workflows/
│ ├── portrait.json
│ ├── scene.json
│ ├── quick.json
│ └── upscale.json
└── skills/
└── comfyui.py
```
---
## Interface Contracts
**Consumes:**
- ComfyUI REST API: `http://localhost:8188`
- Workflows from `homeai-images/workflows/`
- Character LoRA from `~/ComfyUI/models/loras/aria-v1.safetensors`
**Exposes:**
- `comfyui.generate(workflow, params)` → image path — called by P4 OpenClaw
**Add to `.env.services`:**
```dotenv
COMFYUI_URL=http://localhost:8188
```
---
## Implementation Steps
- [ ] Clone ComfyUI to `~/ComfyUI/`, install deps in venv
- [ ] Verify MPS is detected at launch (`Using device: mps` in logs)
- [ ] Write and load launchd plist
- [ ] Download SDXL base model via `scripts/download-models.sh`
- [ ] Download Flux.1-schnell
- [ ] Test basic generation via ComfyUI web UI (browse to port 8188)
- [ ] Build and save `quick.json` workflow in ComfyUI UI, export JSON
- [ ] Build and save `portrait.json` workflow, export JSON
- [ ] Build and save `scene.json` workflow with ControlNet, export JSON
- [ ] Write `skills/comfyui.py` full implementation
- [ ] Test skill: `comfyui.quick("a cat sitting on a couch")` → image file
- [ ] Collect character reference images for LoRA training
- [ ] Train SDXL LoRA with kohya_ss
- [ ] Load LoRA in `portrait.json` workflow, verify character consistency
- [ ] Symlink `skills/` to `~/.openclaw/skills/`
- [ ] Test via OpenClaw: "Generate a portrait of Aria looking happy"
---
## Success Criteria
- [ ] ComfyUI UI accessible at `http://localhost:8188` after reboot
- [ ] `quick.json` workflow generates an image in <30s on M4 Pro
- [ ] `portrait.json` with character LoRA produces consistent character appearance
- [ ] `comfyui.generate("quick", {"positive_prompt": "test"})` returns a valid image path
- [ ] Generated images are saved to `~/ComfyUI/output/`
- [ ] ComfyUI survives Mac Mini reboot via launchd
File: `homeai-infra/PLAN.md`
# P1: homeai-infra — Infrastructure & Foundation
> Phase 1 | No hard dependencies | Must complete before all other projects
---
## Goal
Get the Mac Mini running a stable, self-healing Docker stack accessible over Tailscale. All services should survive a reboot with no manual intervention.
---
## Deliverables
### 1. Directory Layout
```
~/server/
├── docker/
│ ├── home-assistant/
│ │ └── docker-compose.yml
│ ├── open-webui/
│ │ └── docker-compose.yml
│ ├── portainer/
│ │ └── docker-compose.yml
│ ├── uptime-kuma/
│ │ └── docker-compose.yml
│ ├── gitea/
│ │ └── docker-compose.yml
│ ├── n8n/
│ │ └── docker-compose.yml
│ └── code-server/
│ └── docker-compose.yml
├── .env.services ← shared service URLs, written by this project
├── .env.secrets ← secrets, never committed
└── Makefile ← up/down/restart/logs per service
```
### 2. Services to Deploy
| Service | Image | Port | Purpose |
|---|---|---|---|
| Home Assistant | `ghcr.io/home-assistant/home-assistant:stable` | 8123 | Smart home platform |
| Portainer | `portainer/portainer-ce` | 9443 | Docker management UI |
| Uptime Kuma | `louislam/uptime-kuma` | 3001 | Service health monitoring |
| Gitea | `gitea/gitea` | 3000 (HTTP), 2222 (SSH) | Self-hosted Git |
| code-server | `codercom/code-server` | 8080 | Browser VS Code |
| n8n | `n8nio/n8n` | 5678 | Workflow automation |
> Open WebUI deployed in P2 (depends on Ollama being up first).
### 3. Docker Configuration Standards
Each compose file follows this pattern:
```yaml
services:
<service>:
image: <image>
container_name: <service>
restart: unless-stopped
env_file:
- ../../.env.secrets
volumes:
- ./<service>-data:/data
networks:
- homeai
ports:
- "<port>:<port>"
networks:
homeai:
external: true
```
- Shared `homeai` Docker network created once: `docker network create homeai`
- All data volumes stored in service subdirectory (e.g., `home-assistant/ha-data/`)
- Never use `network_mode: host` unless required by service
### 4. `.env.services` — Interface Contract
Written by this project, sourced by all others:
```dotenv
HA_URL=http://localhost:8123
# HA_TOKEN lives in .env.secrets (section 5), not here
PORTAINER_URL=https://localhost:9443
GITEA_URL=http://localhost:3000
N8N_URL=http://localhost:5678
CODE_SERVER_URL=http://localhost:8080
UPTIME_KUMA_URL=http://localhost:3001
```
### 5. `.env.secrets` (template, not committed)
```dotenv
HA_TOKEN=
GITEA_ADMIN_PASSWORD=
CODE_SERVER_PASSWORD=
N8N_ENCRYPTION_KEY=
```
Committed as `.env.secrets.example` with blank values.
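Downstream projects can source both files with a tiny loader (a hypothetical helper, not part of the deliverables — shown to pin down the `KEY=VALUE` format the files use):

```python
from pathlib import Path

def load_env(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```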
### 6. Tailscale Setup
- Install Tailscale on Mac Mini: `brew install tailscale`
- Run `tailscale up --accept-routes`
- All service URLs are LAN-only; Tailscale provides remote access without port forwarding
- No ports opened in router/firewall
### 7. Makefile Targets
```makefile
up-ha: # docker compose -f docker/home-assistant/docker-compose.yml up -d
down-ha:
logs-ha:
up-all: # bring up all services in dependency order
down-all:
restart-all:
status:      # docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```
### 8. Gitea Initialisation
- Admin user created, SSH key added
- Repos created for all 8 sub-projects
- SSH remote added to each local repo
- `.gitignore` templates: exclude `*.env.secrets`, `*-data/`, `__pycache__/`
### 9. Uptime Kuma Monitors
One monitor per service:
- Home Assistant HTTP check → `http://localhost:8123`
- Portainer HTTPS check → `https://localhost:9443`
- Gitea HTTP check → `http://localhost:3000`
- n8n HTTP check → `http://localhost:5678`
- Ollama HTTP check → `http://localhost:11434` (set up after P2)
- Wyoming STT TCP check → port 10300 (set up after P3)
Alerts: configure ntfy or Pushover for mobile notifications.
### 10. Reboot Survival
- Docker Desktop for Mac: set to launch at login
- Docker containers: `restart: unless-stopped` on all
- Ollama: launchd plist (configured in P2)
- Wyoming: launchd plist (configured in P3)
- ComfyUI: launchd plist (configured in P8)
---
## Home Assistant Setup
After container is running:
1. Complete onboarding at `http://localhost:8123`
2. Create a long-lived access token: Profile → Long-Lived Access Tokens
3. Write token to `.env.secrets` as `HA_TOKEN`
4. Install HACS (Home Assistant Community Store) — needed for custom integrations
5. Enable advanced mode in user profile
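With the long-lived token in hand, every downstream project authenticates against HA's REST API with a Bearer header. A minimal sketch (`/api/states` is HA's standard states endpoint; the helper names are ours):

```python
import json
import urllib.request

def ha_request(ha_url: str, token: str, path: str = "/api/states") -> urllib.request.Request:
    """Build an authenticated request against the Home Assistant REST API."""
    return urllib.request.Request(
        ha_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )

def list_states(ha_url: str, token: str) -> list:
    """Fetch all entity states (requires a running HA instance)."""
    with urllib.request.urlopen(ha_request(ha_url, token)) as resp:
        return json.load(resp)
```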
---
## Implementation Steps
- [ ] Install Docker Desktop for Mac, enable at login
- [ ] Create `homeai` Docker network
- [ ] Create `~/server/` directory structure
- [ ] Write compose files for all services
- [ ] Write `.env.secrets.example`
- [ ] Write `Makefile` with up/down/logs targets
- [ ] `make up-all` — bring all services up
- [ ] Home Assistant onboarding — generate HA_TOKEN
- [ ] Write `.env.services`
- [ ] Install Tailscale, connect, and verify all services are reachable over the Tailnet
- [ ] Gitea: create admin account, initialise repos, push initial commits
- [ ] Uptime Kuma: add all monitors, configure alerts
- [ ] Verify all containers restart cleanly after `docker restart` test
- [ ] Verify all containers survive a Mac Mini reboot
---
## Success Criteria
- [ ] `docker ps` shows all services running after a cold reboot
- [ ] Home Assistant UI reachable at `http://localhost:8123`
- [ ] Gitea accessible, SSH push/pull working
- [ ] Uptime Kuma showing green for all services
- [ ] All services reachable via Tailscale IP from a remote device
- [ ] `.env.services` exists and all URLs are valid

---
**File:** `homeai-llm/PLAN.md`
# P2: homeai-llm — Local LLM Runtime
> Phase 1 | Depends on: P1 (infra up) | Blocked by: nothing
---
## Goal
Ollama running natively on Mac Mini with target models available. Open WebUI connected and accessible. LLM API ready for all downstream consumers (P3, P4, P7).
---
## Why Native (not Docker)
Ollama must run natively — not in Docker — because:
- Docker on Mac cannot access Apple Metal GPU (runs in a Linux VM)
- Native Ollama uses Metal for GPU acceleration, giving roughly 3–5× faster inference
- Ollama's launchd integration keeps it alive across reboots
---
## Deliverables
### 1. Ollama Installation
```bash
# Install
brew install ollama
# Or direct install
curl -fsSL https://ollama.com/install.sh | sh
```
Ollama runs as a background process. Configure as a launchd service for reboot survival.
**launchd plist:** `~/Library/LaunchAgents/com.ollama.ollama.plist`
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.ollama.ollama</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/ollama</string>
<string>serve</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/ollama.log</string>
<key>StandardErrorPath</key>
<string>/tmp/ollama.err</string>
</dict>
</plist>
```
Load: `launchctl load ~/Library/LaunchAgents/com.ollama.ollama.plist`
### 2. Model Manifest — `ollama-models.txt`
Pinned models pulled to Mac Mini:
```
# Primary — high quality responses
llama3.3:70b
qwen2.5:72b
# Fast — low-latency tasks (timers, quick queries, TTS pre-processing)
qwen2.5:7b
# Code — for n8n/skill writing assistance
qwen2.5-coder:32b
# Embedding — for mem0 semantic search
nomic-embed-text
```
Pull script (`scripts/pull-models.sh`):
```bash
#!/usr/bin/env bash
set -euo pipefail
# Resolve the manifest relative to this script so it works from any directory
MANIFEST="$(cd "$(dirname "$0")/.." && pwd)/ollama-models.txt"
while IFS= read -r model; do
  # Skip comments and blank lines
  [[ "$model" =~ ^#.*$ || -z "$model" ]] && continue
  echo "Pulling $model..."
  ollama pull "$model"
done < "$MANIFEST"
```
### 3. Open WebUI — Docker
Open WebUI connects to Ollama over the Docker-to-host bridge (`host.docker.internal`):
**`docker/open-webui/docker-compose.yml`:**
```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
volumes:
- ./open-webui-data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://host.docker.internal:11434
ports:
- "3030:8080"
networks:
- homeai
extra_hosts:
- "host.docker.internal:host-gateway"
networks:
homeai:
external: true
```
Port `3030` chosen to avoid conflict with Gitea (3000).
### 4. Benchmark Script — `scripts/benchmark.sh`
Measures tokens/sec for each model to inform model selection per task:
```bash
#!/usr/bin/env bash
PROMPT="Tell me a joke about computers."
for model in llama3.3:70b qwen2.5:72b qwen2.5:7b; do
  echo "=== $model ==="
  # --verbose prints timing stats, including the eval rate in tokens/s
  ollama run "$model" "$PROMPT" --verbose --nowordwrap
done
```
Results documented in `scripts/benchmark-results.md`.
### 5. API Verification
```bash
# Check Ollama is running
curl http://localhost:11434/api/tags
# Test OpenAI-compatible endpoint (used by P3, P4)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:7b",
"messages": [{"role": "user", "content": "Hello"}]
}'
```
### 6. Model Selection Guide
Document in `scripts/benchmark-results.md` after benchmarking:
| Task | Model | Reason |
|---|---|---|
| Main conversation | `llama3.3:70b` | Best quality |
| Quick/real-time tasks | `qwen2.5:7b` | Lowest latency |
| Code generation (skills) | `qwen2.5-coder:32b` | Best code quality |
| Embeddings (mem0) | `nomic-embed-text` | Compact, fast |
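Once benchmarked, the guide above collapses to a simple routing table that P3/P4/P7 can share (the task labels are illustrative; the model names come from `ollama-models.txt`):

```python
# Task labels are our invention; adjust after benchmarking
MODEL_ROUTING = {
    "chat": "llama3.3:70b",
    "realtime": "qwen2.5:7b",
    "code": "qwen2.5-coder:32b",
    "embed": "nomic-embed-text",
}

def pick_model(task: str) -> str:
    """Fall back to the low-latency model for unknown task types."""
    return MODEL_ROUTING.get(task, "qwen2.5:7b")
```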
---
## Interface Contract
- **Ollama API:** `http://localhost:11434` (native Ollama)
- **OpenAI-compatible API:** `http://localhost:11434/v1` — used by P3, P4, P7
- **Open WebUI:** `http://localhost:3030`
Add to `~/server/.env.services`:
```dotenv
OLLAMA_URL=http://localhost:11434
OLLAMA_API_URL=http://localhost:11434/v1
OPEN_WEBUI_URL=http://localhost:3030
```
---
## Implementation Steps
- [ ] Install Ollama via brew
- [ ] Verify `ollama serve` starts and responds at port 11434
- [ ] Write launchd plist, load it, verify auto-start on reboot
- [ ] Write `ollama-models.txt` with model list
- [ ] Run `scripts/pull-models.sh` — pull all models (allow time for large downloads)
- [ ] Run `scripts/benchmark.sh` — record results in `benchmark-results.md`
- [ ] Deploy Open WebUI via Docker compose
- [ ] Verify Open WebUI can chat with all models
- [ ] Add `OLLAMA_URL` and `OPEN_WEBUI_URL` to `.env.services`
- [ ] Add Ollama and Open WebUI monitors to Uptime Kuma
---
## Success Criteria
- [ ] `curl http://localhost:11434/api/tags` returns all expected models
- [ ] `llama3.3:70b` generates a coherent response in Open WebUI
- [ ] Ollama survives Mac Mini reboot without manual intervention
- [ ] Benchmark results documented — at least one model achieving >10 tok/s
- [ ] Open WebUI accessible at `http://localhost:3030` via Tailscale

---
**File:** `homeai-visual/PLAN.md`
# P7: homeai-visual — VTube Studio Visual Layer
> Phase 5 | Depends on: P4 (OpenClaw skill runner), P5 (character expression map)
---
## Goal
VTube Studio displays a Live2D model on Mac Mini desktop and mobile. Expressions are driven by the AI pipeline state (thinking, speaking, happy, etc.) via an OpenClaw skill that talks to VTube Studio's WebSocket API. Lip sync follows audio amplitude.
---
## Architecture
```
OpenClaw pipeline state
↓ (during LLM response generation)
vtube_studio.py skill
↓ WebSocket (port 8001)
VTube Studio (macOS app)
Live2D model renders expression
Displayed on:
- Mac Mini desktop (primary)
- iPhone/iPad (VTube Studio mobile, same model via Tailscale)
```
---
## VTube Studio Setup
### Installation
1. Download VTube Studio from the Mac App Store
2. Launch, go through initial setup
3. Enable WebSocket API: Settings → WebSocket API → Enable (port 8001)
4. Load Live2D model (see Model section below)
### WebSocket API Authentication
VTube Studio uses a token-based auth flow:
```python
import asyncio
import websockets
import json
async def authenticate():
async with websockets.connect("ws://localhost:8001") as ws:
# Step 1: request authentication token
await ws.send(json.dumps({
"apiName": "VTubeStudioPublicAPI",
"apiVersion": "1.0",
"requestID": "auth-req",
"messageType": "AuthenticationTokenRequest",
"data": {
"pluginName": "HomeAI",
"pluginDeveloper": "HomeAI",
"pluginIcon": None
}
}))
response = json.loads(await ws.recv())
token = response["data"]["authenticationToken"]
# User must click "Allow" in VTube Studio UI
# Step 2: authenticate with token
await ws.send(json.dumps({
"apiName": "VTubeStudioPublicAPI",
"apiVersion": "1.0",
"requestID": "auth",
"messageType": "AuthenticationRequest",
"data": {
"pluginName": "HomeAI",
"pluginDeveloper": "HomeAI",
"authenticationToken": token
}
}))
auth_resp = json.loads(await ws.recv())
print("Authenticated:", auth_resp["data"]["authenticated"])
return token
```
Token is persisted to `~/.openclaw/vtube_token.json`.
---
## `vtube_studio.py` Skill
Full implementation (replaces the stub from P4).
File: `homeai-visual/skills/vtube_studio.py` (symlinked to `~/.openclaw/skills/`)
```python
"""
VTube Studio WebSocket skill for OpenClaw.
Drives Live2D model expressions based on AI pipeline state.
"""
import asyncio
import json
import websockets
from pathlib import Path
VTUBE_WS_URL = "ws://localhost:8001"
TOKEN_PATH = Path.home() / ".openclaw" / "vtube_token.json"
class VTubeStudioSkill:
def __init__(self, character_config: dict):
self.expression_map = character_config.get("live2d_expressions", {})
self.ws_triggers = character_config.get("vtube_ws_triggers", {})
self.token = self._load_token()
self._ws = None
def _load_token(self) -> str | None:
if TOKEN_PATH.exists():
return json.loads(TOKEN_PATH.read_text()).get("token")
return None
def _save_token(self, token: str):
TOKEN_PATH.write_text(json.dumps({"token": token}))
async def connect(self):
self._ws = await websockets.connect(VTUBE_WS_URL)
if self.token:
await self._authenticate()
else:
await self._request_new_token()
async def _authenticate(self):
await self._send({
"messageType": "AuthenticationRequest",
"data": {
"pluginName": "HomeAI",
"pluginDeveloper": "HomeAI",
"authenticationToken": self.token
}
})
resp = await self._recv()
if not resp["data"].get("authenticated"):
# Token expired — request a new one
await self._request_new_token()
async def _request_new_token(self):
await self._send({
"messageType": "AuthenticationTokenRequest",
"data": {
"pluginName": "HomeAI",
"pluginDeveloper": "HomeAI",
"pluginIcon": None
}
})
resp = await self._recv()
token = resp["data"]["authenticationToken"]
self._save_token(token)
self.token = token
await self._authenticate()
async def trigger_expression(self, event: str):
"""Trigger a named expression state (idle, thinking, speaking, etc.)"""
hotkey_id = self.expression_map.get(event)
if not hotkey_id:
return
await self._trigger_hotkey(hotkey_id)
async def _trigger_hotkey(self, hotkey_id: str):
await self._send({
"messageType": "HotkeyTriggerRequest",
"data": {"hotkeyID": hotkey_id}
})
await self._recv()
async def set_parameter(self, name: str, value: float):
"""Set a VTube Studio parameter (e.g., mouth open for lip sync)"""
await self._send({
"messageType": "InjectParameterDataRequest",
"data": {
"parameterValues": [
{"id": name, "value": value}
]
}
})
await self._recv()
async def _send(self, payload: dict):
full = {
"apiName": "VTubeStudioPublicAPI",
"apiVersion": "1.0",
"requestID": "homeai",
**payload
}
await self._ws.send(json.dumps(full))
async def _recv(self) -> dict:
return json.loads(await self._ws.recv())
async def close(self):
if self._ws:
await self._ws.close()
# OpenClaw skill entry point — synchronous wrapper
def trigger_expression(event: str, character_config: dict):
skill = VTubeStudioSkill(character_config)
asyncio.run(_run(skill, event))
async def _run(skill, event):
await skill.connect()
await skill.trigger_expression(event)
await skill.close()
```
---
## Lip Sync
### Phase 1: Amplitude-Based (Simple)
During TTS audio playback, sample audio amplitude and map to mouth open parameter:
```python
import numpy as np
import sounddevice as sd
async def stream_with_lipsync(audio_data: np.ndarray, sample_rate: int, vtube: VTubeStudioSkill):
    """Play TTS audio chunk-by-chunk while driving the MouthOpen parameter."""
    chunk_size = 1024
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i + chunk_size]
        amplitude = float(np.abs(chunk).mean()) / 32768.0  # normalise 16-bit PCM
        mouth_value = min(amplitude * 10, 1.0)  # scale to 0–1
        # Reuse the skill's existing connection — a fresh asyncio.run() per chunk
        # would spin up a new event loop and break the websocket
        await vtube.set_parameter("MouthOpen", mouth_value)
        sd.play(chunk, sample_rate, blocking=True)  # blocks the loop; fine for phase 1
    await vtube.set_parameter("MouthOpen", 0.0)  # close mouth after playback
```
### Phase 2: Phoneme-Based (Future)
Parse TTS phoneme timing from Kokoro/Chatterbox output and drive expression per phoneme. More accurate but significantly more complex. Defer to after Phase 5.
---
## Live2D Model
### Options
| Option | Cost | Effort | Quality |
|---|---|---|---|
| Free models (VTube Studio sample packs) | Free | Low | Generic |
| Purchase from nizima.com or booth.pm | ¥3,000–¥30,000 | Low | High |
| Commission custom model | ¥50,000–¥200,000+ | Low (for you) | Unique |
**Recommendation:** Start with a purchased model from nizima.com or booth.pm that matches the character's aesthetic. Commission custom later once personality is locked in.
### Model Setup
1. Download `.vtube.model3.json` + associated assets
2. Place in `~/Documents/Live2DModels/` (VTube Studio default)
3. Load in VTube Studio: Model tab → Add Model
4. Map hotkeys: VTube Studio → Hotkeys → create one per expression state
5. Record hotkey IDs, update `aria.json` `live2d_expressions` mapping
---
## Expression Hotkey Mapping Workflow
1. Launch VTube Studio, load model
2. Go to Hotkeys → add hotkeys for each state: idle, listening, thinking, speaking, happy, sad, surprised, error
3. VTube Studio assigns a UUID to each hotkey — copy these
4. Open Character Manager (P5), paste UUIDs into expression mapping UI
5. Export updated `aria.json`
6. Restart OpenClaw — new expression map loaded
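Step 3's UUID copying can also be scripted: `HotkeysInCurrentModelRequest` is the documented VTube Studio message for listing hotkeys, and its response carries `availableHotkeys` entries with `name` and `hotkeyID`. A sketch of the two pure halves (sending over the authenticated websocket from `VTubeStudioSkill` is omitted):

```python
import json

def hotkey_list_request(request_id: str = "list-hotkeys") -> str:
    """Build the JSON payload asking VTube Studio for the current model's hotkeys."""
    return json.dumps({
        "apiName": "VTubeStudioPublicAPI",
        "apiVersion": "1.0",
        "requestID": request_id,
        "messageType": "HotkeysInCurrentModelRequest",
        "data": {},
    })

def extract_hotkey_ids(response: dict) -> dict[str, str]:
    """Map hotkey name -> hotkeyID from a HotkeysInCurrentModelResponse."""
    return {hk["name"]: hk["hotkeyID"] for hk in response["data"]["availableHotkeys"]}
```

The resulting name → UUID map can be pasted straight into `aria.json`'s `live2d_expressions`.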
---
## Mobile Setup
1. Install VTube Studio on iPhone/iPad
2. On the same Tailscale network, VTube Studio mobile discovers the Mac Mini model
3. Mirror mode: the mobile app shows the same model as the desktop
4. Useful as a bedside or kitchen display while the Mac Mini desktop remains primary
---
## Directory Layout
```
homeai-visual/
└── skills/
├── vtube_studio.py ← full implementation
├── lipsync.py ← amplitude-based lip sync helper
└── auth.py ← token management utility
```
---
## Implementation Steps
- [ ] Install VTube Studio (Mac App Store)
- [ ] Enable WebSocket API on port 8001
- [ ] Source/purchase a Live2D model
- [ ] Load model in VTube Studio, verify it renders
- [ ] Create hotkeys in VTube Studio for all 8 expression states
- [ ] Write `vtube_studio.py` full implementation
- [ ] Run auth flow — click "Allow" in VTube Studio UI, save token
- [ ] Test `trigger_expression("thinking")` → model shows expression
- [ ] Test all 8 expressions via a simple test script
- [ ] Update `aria.json` with real VTube Studio hotkey IDs
- [ ] Write `lipsync.py` amplitude-based helper
- [ ] Integrate lip sync into TTS dispatch in OpenClaw
- [ ] Symlink `skills/` → `~/.openclaw/skills/`
- [ ] Test full pipeline: voice query → thinking expression → LLM → speaking expression with lip sync
- [ ] Set up VTube Studio on iPhone (optional, do last)
---
## Success Criteria
- [ ] All 8 expression states trigger correctly via `trigger_expression()`
- [ ] Lip sync is visibly responding to TTS audio (even if imperfect)
- [ ] VTube Studio token survives app restart (token file persists)
- [ ] Expression triggers are fast enough to feel responsive (<100ms from call to render)
- [ ] Model stays loaded and connected after Mac Mini sleep/wake

---
**File:** `homeai-voice/PLAN.md`
# P3: homeai-voice — Speech Pipeline
> Phase 2 | Depends on: P1 (HA running), P2 (Ollama running) | Consumed by: P4, P6
---
## Goal
Full end-to-end voice pipeline running on Mac Mini: wake word detection → speech-to-text → (handoff to P4 agent) → text-to-speech → audio out. Wyoming protocol bridges STT and TTS into Home Assistant.
Test with a desktop USB mic before ESP32 hardware arrives (P6).
---
## Pipeline Architecture
```
[USB Mic / ESP32 satellite]
openWakeWord (always-on, local)
↓ wake detected
Wyoming Satellite / Audio capture
↓ raw audio stream
Wyoming STT Server (Whisper.cpp)
↓ transcribed text
Home Assistant Voice Pipeline
↓ text
OpenClaw Agent (P4) ← intent + LLM response
↓ response text
Wyoming TTS Server (Kokoro)
↓ audio
[Speaker / ESP32 satellite]
```
---
## Components
### 1. Whisper.cpp — Speech-to-Text
**Why Whisper.cpp over Python Whisper:**
- Native Apple Silicon build — uses Neural Engine + Metal
- Significantly lower latency than Python implementation
- Runs as a server process, not one-shot per request
**Installation:**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j$(sysctl -n hw.logicalcpu) # compiles with Metal support on macOS
# Download model
bash ./models/download-ggml-model.sh large-v3
# Also grab medium.en for faster fallback
bash ./models/download-ggml-model.sh medium.en
```
Models stored at `~/models/whisper/`.
**Wyoming-Whisper adapter:**
Use `wyoming-faster-whisper` (note: this uses the faster-whisper/CTranslate2 backend, not the whisper.cpp binary compiled above) or a Wyoming-compatible wrapper around the Whisper.cpp server:
```bash
pip install wyoming-faster-whisper
wyoming-faster-whisper \
--model large-v3 \
--language en \
--uri tcp://0.0.0.0:10300 \
--data-dir ~/models/whisper \
--download-dir ~/models/whisper
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-stt.plist`
### 2. Kokoro TTS — Primary Text-to-Speech
**Why Kokoro:**
- Very low latency (~200ms for short phrases)
- High quality voice output
- Runs efficiently on Apple Silicon
- No GPU required (MPS optional)
**Installation:**
```bash
pip install kokoro-onnx
```
**Wyoming-Kokoro adapter:**
```bash
pip install wyoming-kokoro # community adapter, or write thin wrapper
# af_heart is the default voice; overridden by character config at runtime
# (an inline comment after a trailing backslash would break the continuation)
wyoming-kokoro \
  --uri tcp://0.0.0.0:10301 \
  --voice af_heart \
  --speed 1.0
```
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wyoming-tts.plist`
### 3. Chatterbox TTS — Voice Cloning Engine
Used when a character voice clone is active (character config from P5 sets `tts_engine: chatterbox`).
```bash
# Install Chatterbox (MPS-optimised for Apple Silicon)
pip install chatterbox-tts
# Test voice clone
python - <<'EOF'
import os
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device='mps')
# ~ is not expanded automatically inside Python — expand it explicitly
wav = model.generate(
    'Hello, I am your assistant.',
    audio_prompt_path=os.path.expanduser('~/voices/aria.wav'),
)
EOF
```
Chatterbox is invoked directly by the OpenClaw TTS skill (P4), bypassing Wyoming when voice cloning is needed. Wyoming (Kokoro) remains for HA pipeline.
### 4. Qwen3-TTS — MLX Fallback
```bash
pip install mlx mlx-lm
# Pull Qwen3-TTS model via mlx-lm or HuggingFace
```
Used as a fallback if Chatterbox quality is insufficient. Activated via character config `tts_engine: qwen3`.
### 5. openWakeWord — Always-On Detection
Runs continuously, listens for wake word, triggers pipeline.
```bash
pip install openwakeword
# Test with default "hey_jarvis" model
python -c "
import openwakeword
model = openwakeword.Model(wakeword_models=['hey_jarvis'])
# ... audio loop
"
```
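The elided audio loop above might look like the following sketch. Per openWakeWord's usage, `Model.predict` takes a chunk of 16 kHz int16 samples and returns a score per wake-word model; the threshold and the frame iterable are our assumptions, and mic capture (e.g. via `sounddevice`) is factored out so the loop itself is testable:

```python
import numpy as np

def first_detection(model, frames, threshold: float = 0.5):
    """Return the index of the first frame whose wake-word score crosses threshold.

    `frames` is any iterable of 16 kHz int16 numpy chunks (e.g. from a mic stream);
    `model` is an openwakeword.Model (or anything with the same predict() shape).
    """
    for i, frame in enumerate(frames):
        scores = model.predict(frame)  # e.g. {"hey_jarvis": 0.03}
        if any(score >= threshold for score in scores.values()):
            return i
    return None
```

On detection, the real loop would fire the HTTP POST handoff to OpenClaw described under Interface Contracts.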
**Custom wake word (later):**
- Record 30–50 utterances of the character's name
- Train via openWakeWord training toolkit
- Drop model file into `~/models/wakeword/`
**launchd plist:** `~/Library/LaunchAgents/com.homeai.wakeword.plist`
Wake word trigger → sends HTTP POST to OpenClaw (P4) or Wyoming handoff.
### 6. Wyoming Protocol Server
Wyoming is Home Assistant's standard for local STT/TTS. Both Whisper and Kokoro run as Wyoming services so HA can use them directly.
**HA integration:**
1. Home Assistant → Settings → Add Integration → Wyoming Protocol
2. Add STT: host `<mac-mini-ip>`, port `10300`
3. Add TTS: host `<mac-mini-ip>`, port `10301`
4. Create Voice Assistant pipeline in HA using these providers
5. Assign pipeline to Assist dashboard and later to ESP32 satellites (P6)
---
## launchd Services
Three launchd plists under `~/Library/LaunchAgents/`:
| Plist | Service | Port |
|---|---|---|
| `com.homeai.wyoming-stt.plist` | Whisper.cpp Wyoming | 10300 |
| `com.homeai.wyoming-tts.plist` | Kokoro Wyoming | 10301 |
| `com.homeai.wakeword.plist` | openWakeWord | (no port, triggers internally) |
Templates stored in `scripts/launchd/`.
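Before loading, the templates can be sanity-checked with the stdlib `plistlib` (a hypothetical helper; it only checks for the launchd keys the services above rely on):

```python
import plistlib

REQUIRED_KEYS = {"Label", "ProgramArguments", "RunAtLoad", "KeepAlive"}

def validate_plist(xml_bytes: bytes) -> list[str]:
    """Return the sorted list of required launchd keys missing from a plist template."""
    plist = plistlib.loads(xml_bytes)
    return sorted(REQUIRED_KEYS - plist.keys())
```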
---
## Directory Layout
```
homeai-voice/
├── whisper/
│ ├── install.sh # clone, compile whisper.cpp, download models
│ └── README.md
├── tts/
│ ├── install-kokoro.sh
│ ├── install-chatterbox.sh
│ ├── install-qwen3.sh
│ └── test-tts.sh # quick audio playback test
├── wyoming/
│ ├── install.sh
│ └── test-pipeline.sh # end-to-end text→audio test
└── scripts/
├── launchd/
│ ├── com.homeai.wyoming-stt.plist
│ ├── com.homeai.wyoming-tts.plist
│ └── com.homeai.wakeword.plist
└── load-all-launchd.sh
```
---
## Interface Contracts
**Exposes:**
- Wyoming STT: `tcp://0.0.0.0:10300` — consumed by HA, P6 (ESP32 satellites)
- Wyoming TTS: `tcp://0.0.0.0:10301` — consumed by HA, P6
- Chatterbox: Python API, invoked directly by P4 skills
- openWakeWord: triggers HTTP POST to `http://localhost:8080/wake` (P4 OpenClaw; note this clashes with code-server's 8080 from P1 — one of the two must be remapped)
**Add to `.env.services`:**
```dotenv
WYOMING_STT_URL=tcp://localhost:10300
WYOMING_TTS_URL=tcp://localhost:10301
```
---
## Implementation Steps
- [ ] Compile Whisper.cpp with Metal support
- [ ] Download `large-v3` and `medium.en` Whisper models to `~/models/whisper/`
- [ ] Install `wyoming-faster-whisper`, test STT from audio file
- [ ] Install Kokoro, test TTS to audio file
- [ ] Install Wyoming-Kokoro adapter, verify Wyoming protocol works
- [ ] Write launchd plists for STT and TTS services
- [ ] Load plists, verify both services start on reboot
- [ ] Connect HA Wyoming integration — STT port 10300, TTS port 10301
- [ ] Create HA Voice Assistant pipeline with Whisper STT + Kokoro TTS
- [ ] Test HA Assist from browser: type query → hear spoken response
- [ ] Install openWakeWord, test wake detection with USB mic
- [ ] Write and load openWakeWord launchd plist
- [ ] Install Chatterbox, test voice clone with sample `.wav`
- [ ] Install Qwen3-TTS via MLX (fallback, lower priority)
- [ ] Write `wyoming/test-pipeline.sh` — full end-to-end smoke test
---
## Success Criteria
- [ ] `wyoming/test-pipeline.sh` passes: audio file → transcribed text → spoken response
- [ ] HA Voice Assistant responds to typed query with Kokoro voice
- [ ] openWakeWord detects "hey jarvis" (or chosen wake word) reliably
- [ ] All three launchd services auto-start after reboot
- [ ] STT latency <2s for 5-second utterances with `large-v3`
- [ ] Kokoro TTS latency <300ms for a 10-word sentence