# CLAUDE.md — Home AI Assistant Project
## Project Overview
A self-hosted, always-on personal AI assistant running on a Mac Mini M4 Pro (64GB RAM, 1TB SSD). The goal is a modular, expandable system that replaces commercial smart home speakers (Google Home etc.) with a locally-run AI that has a defined personality, voice, visual representation, and full smart home integration.
## Hardware
| Component | Spec |
|---|---|
| Chip | Apple M4 Pro |
| CPU | 14-core |
| GPU | 20-core |
| Neural Engine | 16-core |
| RAM | 64GB unified memory |
| Storage | 1TB SSD |
| Network | Gigabit Ethernet |
All AI inference runs locally on this machine. No cloud dependency required (cloud APIs optional).
## Core Stack
### AI & LLM
- Ollama — local LLM runtime (target models: Llama 3.3 70B, Qwen 2.5 72B)
- Open WebUI — browser-based chat interface, runs as Docker container
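As an illustrative sketch of how pipeline components might call the local LLM, the snippet below targets Ollama's `/api/chat` endpoint on its default port 11434; the model tag and prompts are placeholders:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default HTTP endpoint

def build_chat_request(model: str, system_prompt: str, user_text: str) -> dict:
    """Build a non-streaming request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "stream": False,  # one complete response instead of token chunks
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }

def ask(model: str, system_prompt: str, user_text: str) -> str:
    """POST the request to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, system_prompt, user_text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

The same request shape works from OpenClaw skills or n8n HTTP nodes, which keeps all LLM access going through one local endpoint.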
### Image Generation
- ComfyUI — primary image generation UI, node-based workflows
- Target models: SDXL, Flux.1, ControlNet
- Runs via Metal (Apple GPU API)
### Speech
- Whisper.cpp — speech-to-text, optimised for Apple Silicon/Neural Engine
- Kokoro TTS — fast, lightweight text-to-speech (primary, low-latency)
- Chatterbox TTS — voice cloning engine (Apple Silicon MPS optimised)
- Qwen3-TTS — alternative voice cloning via MLX
- openWakeWord — always-on wake word detection
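As a sketch of how the transcription step might be driven, the snippet below shells out to a whisper.cpp CLI build. The binary and model paths are assumptions, and the flags (`-m` for model, `-f` for input WAV, `-nt` for no timestamps) should be verified against the whisper.cpp build in use:

```python
import subprocess
from pathlib import Path

# Assumed locations; adjust to the actual whisper.cpp checkout and model file.
WHISPER_BIN = Path("~/whisper.cpp/build/bin/whisper-cli").expanduser()
MODEL = Path("~/models/whisper/ggml-large-v3-turbo.bin").expanduser()

def build_transcribe_cmd(wav_path: str) -> list[str]:
    """Assemble the whisper.cpp command line: -m selects the model, -f the
    input WAV, -nt suppresses timestamps so stdout is just transcript text."""
    return [str(WHISPER_BIN), "-m", str(MODEL), "-f", wav_path, "-nt"]

def transcribe(wav_path: str) -> str:
    """Run whisper.cpp on a 16 kHz mono WAV and return the transcript."""
    result = subprocess.run(build_transcribe_cmd(wav_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```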
### Smart Home
- Home Assistant — smart home control platform (Docker)
- Wyoming Protocol — bridges Whisper STT + Kokoro/Piper TTS into Home Assistant
- Music Assistant — self-hosted music control, integrates with Home Assistant
- Snapcast — multi-room synchronised audio output
### AI Agent / Orchestration
- OpenClaw — primary AI agent layer; receives voice commands, calls tools, manages personality
- n8n — visual workflow automation (Docker), chains AI actions
- mem0 — long-term memory layer for the AI character
### Character & Personality
- Character Manager (built — see `character-manager.jsx`) — single config UI for personality, prompts, models, Live2D mappings, and notes
- Character config exports to JSON, consumed by the OpenClaw system prompt and pipeline
### Visual Representation
- VTube Studio — Live2D model display on desktop (macOS) and mobile (iOS/Android)
- VTube Studio WebSocket API used to drive expressions from the AI pipeline
- LVGL — simplified animated face on ESP32-S3-BOX-3 units
- Live2D model: to be sourced/commissioned (nizima.com or booth.pm)
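VTube Studio's WebSocket API (default `ws://localhost:8001`) exchanges JSON envelopes with `apiName`/`apiVersion`/`requestID`/`messageType` fields, and expressions are typically fired via hotkey requests after an authentication handshake. A sketch of building such a request; the hotkey ID is a placeholder and the auth step is omitted:

```python
import json
import uuid

def vts_request(message_type: str, data: dict) -> str:
    """Wrap a payload in VTube Studio's public-API envelope."""
    return json.dumps({
        "apiName": "VTubeStudioPublicAPI",
        "apiVersion": "1.0",
        "requestID": str(uuid.uuid4()),  # any unique ID; echoed in the response
        "messageType": message_type,
        "data": data,
    })

def trigger_expression(hotkey_id: str) -> str:
    """Build a HotkeyTriggerRequest; hotkeys are how expressions are
    fired remotely (hotkey IDs come from a HotkeyRequest listing)."""
    return vts_request("HotkeyTriggerRequest", {"hotkeyID": hotkey_id})
```

Sending these strings over a WebSocket client (e.g. the `websockets` package) is all the OpenClaw bridge skill would need to do per event.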
### Room Presence (Smart Speaker Replacement)
- ESP32-S3-BOX-3 units — one per room
- Flashed with ESPHome
- Acts as Wyoming Satellite (mic input → Mac Mini → TTS audio back)
- LVGL display shows animated face + status info
- Communicates over local WiFi
### Infrastructure
- Docker Desktop for Mac — containerises Home Assistant, Open WebUI, n8n, etc.
- Tailscale — secure remote access to all services, no port forwarding
- Authelia — 2FA authentication layer for exposed web UIs
- Portainer — Docker container management UI
- Uptime Kuma — service health monitoring and mobile alerts
- Gitea — self-hosted Git server for all project code and configs
- code-server — browser-based VS Code for remote development
## Voice Pipeline (End-to-End)
```
ESP32-S3-BOX-3 (room)
  → Wake word detected (openWakeWord, runs locally on device or Mac Mini)
  → Audio streamed to Mac Mini via Wyoming Satellite
  → Whisper.cpp transcribes speech to text
  → OpenClaw receives text + context
  → Ollama LLM generates response (with character persona from system prompt)
  → mem0 updates long-term memory
  → Response dispatched:
      → Kokoro/Chatterbox renders TTS audio
      → Audio sent back to ESP32-S3-BOX-3 (spoken response)
      → VTube Studio API triggered (expression + lip sync on desktop/mobile)
      → Home Assistant action called if applicable (lights, music, etc.)
```
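The hand-offs above can be sketched as a thin orchestration layer. This is an illustrative Python shape, not the actual OpenClaw implementation; each stage is injected so the real services (whisper.cpp, Ollama, Kokoro/Chatterbox, mem0) can be swapped for stubs in tests:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceTurn:
    """Glue for one voice interaction; each stage is a pluggable callable."""
    transcribe: Callable[[bytes], str]    # satellite audio -> text (STT)
    generate: Callable[[str], str]        # text + persona -> reply (LLM)
    synthesize: Callable[[str], bytes]    # reply -> TTS audio for playback
    remember: Callable[[str, str], None]  # store (user text, reply) in memory

    def handle(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        reply = self.generate(text)
        self.remember(text, reply)        # memory update before dispatch
        return self.synthesize(reply)     # audio goes back to the satellite
```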
## Character System
The AI assistant has a defined personality managed via the Character Manager tool.
Key config surfaces:
- System prompt — injected into every Ollama request
- Voice clone reference — `.wav` file path for Chatterbox/Qwen3-TTS
- Live2D expression mappings — idle, speaking, thinking, happy, error states
- VTube Studio WebSocket triggers — JSON map of events to expressions
- Custom prompt rules — trigger/response overrides for specific contexts
- mem0 — persistent memory that evolves over time
Character config JSON (exported from Character Manager) is the single source of truth consumed by all pipeline components.
## Project Priorities
1. Foundation — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma)
2. LLM — Ollama running with target models, Open WebUI connected
3. Voice pipeline — Whisper → Ollama → Kokoro → Wyoming → Home Assistant
4. OpenClaw — installed, onboarded, connected to Ollama and Home Assistant
5. ESP32-S3-BOX-3 — ESPHome flash, Wyoming Satellite, LVGL face
6. Character system — system prompt wired up, mem0 integrated, voice cloned
7. VTube Studio — model loaded, WebSocket API bridge written as OpenClaw skill
8. ComfyUI — image generation online, character-consistent model workflows
9. Extended integrations — n8n workflows, Music Assistant, Snapcast, Gitea, code-server
10. Polish — Authelia, Tailscale hardening, mobile companion, iOS widgets
## Key Paths & Conventions
- All Docker compose files: `~/server/docker/`
- OpenClaw skills: `~/.openclaw/skills/`
- Character configs: `~/.openclaw/characters/`
- Whisper models: `~/models/whisper/`
- Ollama models: managed by Ollama at `~/.ollama/models/`
- ComfyUI models: `~/ComfyUI/models/`
- Voice reference audio: `~/voices/`
- Gitea repos root: `~/gitea/`
## Notes for Planning
- All services should survive a Mac Mini reboot (launchd or Docker restart policies)
- ESP32-S3-BOX-3 units are dumb satellites — all intelligence stays on Mac Mini
- The character JSON schema (from Character Manager) should be treated as a versioned spec; pipeline components read from it, never hardcode personality values
- OpenClaw skills are the primary extension mechanism — new capabilities = new skills
- Prefer local models; cloud API keys (Anthropic, OpenAI) are fallback only
- VTube Studio API bridge should be a standalone OpenClaw skill with clear event interface
- mem0 memory store should be backed up as part of regular Gitea commits
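The mem0 backup note could be implemented as a small script run on a schedule. This is a sketch: the store and repo paths are assumptions to be pointed at the real mem0 data directory and a Gitea clone, and the commit step tolerates the no-change case:

```python
import shutil
import subprocess
from pathlib import Path

def sync_store(store_dir: Path, repo_dir: Path) -> Path:
    """Mirror the mem0 store into the backup repo's working tree."""
    dest = repo_dir / "mem0-backup"
    if dest.exists():
        shutil.rmtree(dest)  # full refresh; Git history keeps old versions
    shutil.copytree(store_dir, dest)
    return dest

def commit_backup(repo_dir: Path) -> None:
    """Stage and commit the mirrored store; a no-change commit exits
    non-zero, which is fine, so check=False there."""
    subprocess.run(["git", "-C", str(repo_dir), "add", "mem0-backup"], check=True)
    subprocess.run(["git", "-C", str(repo_dir), "commit", "-m", "chore: mem0 backup"],
                   check=False)
```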