Character schema v2: background, dialogue_style, appearance, skills, gaze_presets with automatic v1→v2 migration. LLM-assisted character creation via Character MCP server. Two-tier memory system (personal per-character + general shared) with budget-based injection into LLM system prompt. Per-character TTS voice routing via state file — Wyoming TTS server reads active config to route between Kokoro (local) and ElevenLabs (cloud PCM 24kHz). Dashboard: memories page, conversation history, character profile on cards, auto-TTS engine selection from character config. Also includes VTube Studio expression bridge and ComfyUI API guide. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# CLAUDE.md — Home AI Assistant Project

## Project Overview

A self-hosted, always-on personal AI assistant running on a **Mac Mini M4 Pro (64GB RAM, 1TB SSD)**. The goal is a modular, expandable system that replaces commercial smart home speakers (Google Home etc.) with a locally-run AI that has a defined personality, voice, visual representation, and full smart home integration.

---

## Hardware

| Component | Spec |
|---|---|
| Chip | Apple M4 Pro |
| CPU | 14-core |
| GPU | 20-core |
| Neural Engine | 16-core |
| RAM | 64GB unified memory |
| Storage | 1TB SSD |
| Network | Gigabit Ethernet |

All AI inference runs locally on this machine. No cloud dependency required (cloud APIs optional).

---

## Core Stack

### AI & LLM
- **Ollama** — local LLM runtime (target models: Llama 3.3 70B, Qwen 2.5 72B)
- **Model keep-warm daemon** — `preload-models.sh` runs as a loop, checks every 5 min, re-pins evicted models with `keep_alive=-1`. Keeps `qwen2.5:7b` (small/fast) and `$HOMEAI_MEDIUM_MODEL` (default: `qwen3.5:35b-a3b`) always loaded in VRAM. Medium model is configurable via env var for per-persona model assignment.
- **Open WebUI** — browser-based chat interface, runs as Docker container
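
The keep-warm check described above can be sketched in Python. This is an illustration only, not the contents of `preload-models.sh`: the function names and the `OLLAMA_URL` constant are assumptions, while `/api/ps`, `/api/generate`, and `keep_alive` are part of Ollama's HTTP API.

```python
import json
import os
import urllib.request

# Assumption: Ollama listens on its default local port.
OLLAMA_URL = "http://localhost:11434"

# Model names taken from the text above; the medium model honours the env var.
SMALL_MODEL = "qwen2.5:7b"
MEDIUM_MODEL = os.environ.get("HOMEAI_MEDIUM_MODEL", "qwen3.5:35b-a3b")


def loaded_models(ps_response: dict) -> set:
    """Extract model names from a parsed /api/ps response."""
    return {m["name"] for m in ps_response.get("models", [])}


def models_to_repin(ps_response: dict, wanted: list) -> list:
    """Return the wanted models that are not currently loaded, in order."""
    loaded = loaded_models(ps_response)
    return [name for name in wanted if name not in loaded]


def fetch_ps() -> dict:
    """Ask Ollama which models are currently loaded."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/ps", timeout=10) as resp:
        return json.load(resp)


def repin(model: str) -> None:
    """Load a model without a prompt and keep it resident indefinitely."""
    body = json.dumps({"model": model, "keep_alive": -1}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=300).read()


def check_once() -> list:
    """One pass of the loop: re-pin anything that was evicted."""
    missing = models_to_repin(fetch_ps(), [SMALL_MODEL, MEDIUM_MODEL])
    for model in missing:
        repin(model)
    return missing
```

A daemon would call `check_once()` and then sleep for 300 seconds in a loop. Keeping the comparison in `models_to_repin` separate from the network calls makes it testable without a running Ollama instance.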

### Image Generation
- **ComfyUI** — primary image generation UI, node-based workflows
- Target models: SDXL, Flux.1, ControlNet
- Runs via Metal (Apple GPU API)

### Speech
- **Whisper.cpp** — speech-to-text, optimised for Apple Silicon/Neural Engine
- **Kokoro TTS** — fast, lightweight text-to-speech (primary, low-latency, local)
- **ElevenLabs TTS** — cloud voice cloning/synthesis (per-character voice ID, routed via state file)
- **Chatterbox TTS** — voice cloning engine (Apple Silicon MPS optimised)
- **Qwen3-TTS** — alternative voice cloning via MLX
- **openWakeWord** — always-on wake word detection

### Smart Home
- **Home Assistant** — smart home control platform (Docker)
- **Wyoming Protocol** — bridges Whisper STT + Kokoro/Piper TTS into Home Assistant
- **Music Assistant** — self-hosted music control, integrates with Home Assistant
- **Snapcast** — multi-room synchronised audio output

### AI Agent / Orchestration
- **OpenClaw** — primary AI agent layer; receives voice commands, calls tools, manages personality
- **n8n** — visual workflow automation (Docker), chains AI actions
- **Character Memory System** — two-tier JSON-based memories (personal per-character + general shared), injected into LLM system prompt with budget truncation

### Character & Personality
- **Character Schema v2** — JSON spec with background, dialogue_style, appearance, skills, gaze_presets (v1 auto-migrated)
- **HomeAI Dashboard** — unified web app: character editor, chat, memory manager, service dashboard
- **Character MCP Server** — LLM-assisted character creation via Fandom wiki/Wikipedia lookup (Docker)
- Character config stored as JSON files in `~/homeai-data/characters/`, consumed by bridge for system prompt construction

### Visual Representation
- **VTube Studio** — Live2D model display on desktop (macOS) and mobile (iOS/Android)
- VTube Studio WebSocket API used to drive expressions from the AI pipeline
- **LVGL** — simplified animated face on ESP32-S3-BOX-3 units
- Live2D model: to be sourced/commissioned (nizima.com or booth.pm)

### Room Presence (Smart Speaker Replacement)
- **ESP32-S3-BOX-3** units — one per room
- Flashed with **ESPHome**
- Acts as Wyoming Satellite (mic input → Mac Mini → TTS audio back)
- LVGL display shows animated face + status info
- Communicates over local WiFi

### Infrastructure
- **Docker Desktop for Mac** — containerises Home Assistant, Open WebUI, n8n, etc.
- **Tailscale** — secure remote access to all services, no port forwarding
- **Authelia** — 2FA authentication layer for exposed web UIs
- **Portainer** — Docker container management UI
- **Uptime Kuma** — service health monitoring and mobile alerts
- **Gitea** — self-hosted Git server for all project code and configs
- **code-server** — browser-based VS Code for remote development

---

## Voice Pipeline (End-to-End)

```
ESP32-S3-BOX-3 (room)
  → Wake word detected (openWakeWord, runs locally on device or Mac Mini)
  → Audio streamed to Mac Mini via Wyoming Satellite
  → Whisper MLX transcribes speech to text
  → HA conversation agent → OpenClaw HTTP Bridge
  → Bridge resolves character (satellite_id → character mapping)
  → Bridge builds system prompt (profile + memories) and writes TTS config to state file
  → OpenClaw CLI → Ollama LLM generates response
  → Response dispatched:
      → Wyoming TTS reads state file → routes to Kokoro (local) or ElevenLabs (cloud)
      → Audio sent back to ESP32-S3-BOX-3 (spoken response)
      → VTube Studio API triggered (expression + lip sync on desktop/mobile)
      → Home Assistant action called if applicable (lights, music, etc.)
```
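
The "Bridge resolves character" step could look roughly like the sketch below. The layout of `satellite-map.json` is not specified in this document, so the `satellites` and `default` keys, the example IDs, and both function names are hypothetical.

```python
import json
from pathlib import Path

# Path taken from "Key Paths & Conventions"; the file layout is assumed.
SATELLITE_MAP_PATH = Path.home() / "homeai-data" / "satellite-map.json"

# Hypothetical fallback when neither the satellite nor a default is mapped.
FALLBACK_CHARACTER = "default"


def load_satellite_map(path: Path = SATELLITE_MAP_PATH) -> dict:
    """Read the satellite map, returning an empty map if the file is missing."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def resolve_character(satellite_id, satellite_map: dict) -> str:
    """Map a satellite ID to a character ID, falling back to the map default.

    Assumed layout:
        {"default": "aria", "satellites": {"kitchen": "chef"}}
    """
    satellites = satellite_map.get("satellites", {})
    default = satellite_map.get("default", FALLBACK_CHARACTER)
    return satellites.get(satellite_id, default)
```

Requests that carry no satellite ID, such as dashboard chat, fall through to the default character in this sketch.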

### Timeout Strategy

The HTTP bridge checks Ollama `/api/ps` before each request to determine if the LLM is already loaded:

| Layer | Warm (model loaded) | Cold (model loading) |
|---|---|---|
| HA conversation component | 200s | 200s |
| OpenClaw HTTP bridge | 60s | 180s |
| OpenClaw agent | 60s | 60s |

The keep-warm daemon ensures models stay loaded, so cold starts should be rare (only after Ollama restarts or VRAM pressure).
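
As a rough illustration of the warm/cold decision, the sketch below picks per-layer timeouts from a parsed `/api/ps` response. The timeout values come from the table above; the layer keys and function names are assumptions, not the bridge's actual code.

```python
# (warm, cold) timeouts in seconds, copied from the table above.
TIMEOUTS = {
    "ha_conversation": (200, 200),
    "bridge": (60, 180),
    "agent": (60, 60),
}


def is_model_loaded(ps_response: dict, model: str) -> bool:
    """True if the model appears in Ollama's /api/ps listing."""
    return any(m.get("name") == model for m in ps_response.get("models", []))


def pick_timeouts(ps_response: dict, model: str) -> dict:
    """Return the timeout for each layer, warm or cold as appropriate."""
    index = 0 if is_model_loaded(ps_response, model) else 1
    return {layer: pair[index] for layer, pair in TIMEOUTS.items()}


# Example: the bridge waits longer only when the model still has to load.
warm = pick_timeouts({"models": [{"name": "qwen2.5:7b"}]}, "qwen2.5:7b")
cold = pick_timeouts({"models": []}, "qwen2.5:7b")
```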

---

## Character System

The AI assistant has a defined personality managed via the HomeAI Dashboard (character editor + memory manager).

### Character Schema v2

Each character is a JSON file in `~/homeai-data/characters/` with:
- **System prompt** — core personality, injected into every LLM request
- **Profile fields** — background, appearance, dialogue_style, skills array
- **TTS config** — engine (kokoro/elevenlabs), kokoro_voice, elevenlabs_voice_id, elevenlabs_model, speed
- **GAZE presets** — array of `{preset, trigger}` for image generation styles
- **Custom prompt rules** — trigger/response overrides for specific contexts
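
A v1 to v2 migration for the fields listed above might be sketched as follows. The profile field names come from this document; the `version` and `system_prompt` keys, the empty defaults, and the function name are assumptions, and the real migration may map fields differently.

```python
import copy

# v2 profile fields named in this document, with assumed empty defaults.
V2_DEFAULTS = {
    "background": "",
    "dialogue_style": "",
    "appearance": "",
    "skills": [],
    "gaze_presets": [],  # list of {"preset": ..., "trigger": ...}
}


def migrate_v1_to_v2(character: dict) -> dict:
    """Return a v2 character, filling in any v2 fields a v1 file lacks.

    Assumes a hypothetical top-level "version" key; files without it are
    treated as v1. The input dict is never modified.
    """
    if character.get("version", 1) >= 2:
        return character
    migrated = copy.deepcopy(character)
    for key, default in V2_DEFAULTS.items():
        # Deep-copy defaults so characters never share one list object.
        migrated.setdefault(key, copy.deepcopy(default))
    migrated["version"] = 2
    return migrated
```

Running the migration on load keeps older files usable without rewriting them on disk until the next save.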

### Memory System

Two-tier memory stored as JSON in `~/homeai-data/memories/`:
- **Personal memories** (`personal/{character_id}.json`) — per-character, about user interactions
- **General memories** (`general.json`) — shared operational knowledge (tool usage, device info, routines)

Memories are injected into the system prompt by the bridge with budget truncation (personal: 4000 chars, general: 3000 chars, newest first).
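
Budget truncation with newest-first ordering can be sketched as below. The budgets are the ones stated above; the memory entry shape (`content` plus a sortable `created_at` string), the section headings, and the function names are assumptions for illustration.

```python
PERSONAL_BUDGET = 4000  # characters, from the text above
GENERAL_BUDGET = 3000   # characters, from the text above


def select_memories(memories: list, budget: int) -> list:
    """Pick memories newest first until the character budget is used up.

    Assumes each entry looks like {"content": str, "created_at": str}, with
    timestamps in one ISO 8601 format so string order matches time order.
    """
    newest_first = sorted(memories, key=lambda m: m["created_at"], reverse=True)
    selected, used = [], 0
    for memory in newest_first:
        line = f"- {memory['content']}"
        cost = len(line) + 1  # +1 for the newline that joins the lines
        if used + cost > budget:
            break  # stop at the first overflow to keep strict recency order
        selected.append(line)
        used += cost
    return selected


def build_memory_block(personal: list, general: list) -> str:
    """Format both tiers as text to append to the system prompt."""
    sections = []
    personal_lines = select_memories(personal, PERSONAL_BUDGET)
    if personal_lines:
        sections.append("Personal memories:\n" + "\n".join(personal_lines))
    general_lines = select_memories(general, GENERAL_BUDGET)
    if general_lines:
        sections.append("General memories:\n" + "\n".join(general_lines))
    return "\n\n".join(sections)
```

Stopping at the first entry that overflows, rather than skipping it and trying older ones, is a design choice in this sketch: it guarantees the prompt never contains an older memory while omitting a newer one.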

### TTS Voice Routing

The bridge writes the active character's TTS config to `~/homeai-data/active-tts-voice.json` before each request. The Wyoming TTS server reads this state file to determine which engine/voice to use:
- **Kokoro** — local, fast, uses `kokoro_voice` field (e.g., `af_heart`)
- **ElevenLabs** — cloud, uses `elevenlabs_voice_id` + `elevenlabs_model`, returns PCM 24kHz

This works for both the ESP32/HA pipeline and dashboard chat.
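
The state-file handoff might be sketched as follows. The file path and field names come from this document; the function names, the atomic-rename write, and the fallback to Kokoro when the file is missing or incomplete are assumptions rather than confirmed behaviour.

```python
import json
import os
import tempfile
from pathlib import Path

STATE_PATH = Path.home() / "homeai-data" / "active-tts-voice.json"


def write_active_tts(tts_config: dict, path: Path = STATE_PATH) -> None:
    """Bridge side: write the config atomically so readers never see half a file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(tts_config, handle)
    os.replace(tmp_name, path)  # atomic rename within the same directory


def select_engine(path: Path = STATE_PATH) -> tuple:
    """TTS server side: return (engine, settings) for the next utterance."""
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        config = {}
    if config.get("engine") == "elevenlabs" and config.get("elevenlabs_voice_id"):
        return "elevenlabs", {
            "voice_id": config["elevenlabs_voice_id"],
            "model": config.get("elevenlabs_model"),
        }
    # Assumed fallback: use the local engine whenever cloud settings are absent.
    return "kokoro", {
        "voice": config.get("kokoro_voice", "af_heart"),
        "speed": config.get("speed", 1.0),
    }
```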

---

## Project Priorities

1. **Foundation** — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma) ✅
2. **LLM** — Ollama running with target models, Open WebUI connected ✅
3. **Voice pipeline** — Whisper → Ollama → Kokoro → Wyoming → Home Assistant ✅
4. **OpenClaw** — installed, onboarded, connected to Ollama and Home Assistant ✅
5. **ESP32-S3-BOX-3** — ESPHome flash, Wyoming Satellite, display faces ✅
6. **Character system** — schema v2, dashboard editor, memory system, per-character TTS routing ✅
7. **Animated visual** — PNG/GIF character visual for the web assistant (initial visual layer)
8. **Android app** — companion app for mobile access to the assistant
9. **ComfyUI** — image generation online, character-consistent model workflows
10. **Extended integrations** — n8n workflows, Music Assistant, Snapcast, Gitea, code-server
11. **Polish** — Authelia, Tailscale hardening, iOS widgets

### Stretch Goals
- **Live2D / VTube Studio** — full Live2D model with WebSocket API bridge (requires learning Live2D tooling)

---

## Key Paths & Conventions

- All Docker compose files: `~/server/docker/`
- OpenClaw skills: `~/.openclaw/skills/`
- Character configs: `~/homeai-data/characters/`
- Character memories: `~/homeai-data/memories/`
- Conversation history: `~/homeai-data/conversations/`
- Active TTS state: `~/homeai-data/active-tts-voice.json`
- Satellite → character map: `~/homeai-data/satellite-map.json`
- Whisper models: `~/models/whisper/`
- Ollama models: managed by Ollama at `~/.ollama/models/`
- ComfyUI models: `~/ComfyUI/models/`
- Voice reference audio: `~/voices/`
- Gitea repos root: `~/gitea/`

---

## Notes for Planning

- All services should survive a Mac Mini reboot (launchd or Docker restart policies)
- ESP32-S3-BOX-3 units are dumb satellites — all intelligence stays on Mac Mini
- The character JSON schema (from Character Manager) should be treated as a versioned spec; pipeline components read from it, never hardcode personality values
- OpenClaw skills are the primary extension mechanism — new capabilities = new skills
- Prefer local models; cloud API keys (Anthropic, OpenAI) are fallback only
- VTube Studio API bridge should be a standalone OpenClaw skill with clear event interface
- mem0 memory store should be backed up as part of regular Gitea commits