feat: character system v2 — schema upgrade, memory system, per-character TTS routing

Character schema v2: background, dialogue_style, appearance, skills, gaze_presets with automatic v1→v2 migration. LLM-assisted character creation via Character MCP server.

Two-tier memory system (personal per-character + general shared) with budget-based injection into LLM system prompt.

Per-character TTS voice routing via state file — Wyoming TTS server reads active config to route between Kokoro (local) and ElevenLabs (cloud PCM 24kHz).

Dashboard: memories page, conversation history, character profile on cards, auto-TTS engine selection from character config.

Also includes VTube Studio expression bridge and ComfyUI API guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 19:15:46 +00:00
parent 1e52c002c2
commit 60eb89ea42
39 changed files with 3846 additions and 409 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -26,6 +26,7 @@ All AI inference runs locally on this machine. No cloud dependency required (clo

 ### AI & LLM
 - **Ollama** — local LLM runtime (target models: Llama 3.3 70B, Qwen 2.5 72B)
+- **Model keep-warm daemon** — `preload-models.sh` runs as a loop, checks every 5 min, re-pins evicted models with `keep_alive=-1`. Keeps `qwen2.5:7b` (small/fast) and `$HOMEAI_MEDIUM_MODEL` (default: `qwen3.5:35b-a3b`) always loaded in VRAM. Medium model is configurable via env var for per-persona model assignment.
 - **Open WebUI** — browser-based chat interface, runs as Docker container
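
The keep-warm check described above can be sketched in Python. This is only an illustration of the logic, not the contents of `preload-models.sh` (which is a shell script): the helper names, the default Ollama address, and the exact name matching are assumptions. The `/api/ps` and `/api/generate` endpoints and the `keep_alive` field are part of Ollama's HTTP API.

```python
import json
import os
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address


def wanted_models(env=os.environ):
    """Models to keep pinned: the small model plus the configurable medium one."""
    return ["qwen2.5:7b", env.get("HOMEAI_MEDIUM_MODEL", "qwen3.5:35b-a3b")]


def missing_models(wanted, ps_response):
    """Return the wanted models that are absent from an /api/ps response body."""
    loaded = {m.get("name") for m in ps_response.get("models", [])}
    return [name for name in wanted if name not in loaded]


def fetch_loaded(base_url=OLLAMA_URL):
    """Ask Ollama which models are currently loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return json.load(resp)


def pin_model(name, base_url=OLLAMA_URL):
    """Load a model with keep_alive=-1 so Ollama does not unload it on idle."""
    body = json.dumps({"model": name, "keep_alive": -1}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        resp.read()
```

A daemon loop would call `missing_models(wanted_models(), fetch_loaded())` every five minutes and pass each result to `pin_model`.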

 ### Image Generation
@@ -35,7 +36,8 @@ All AI inference runs locally on this machine. No cloud dependency required (clo

 ### Speech
 - **Whisper.cpp** — speech-to-text, optimised for Apple Silicon/Neural Engine
-- **Kokoro TTS** — fast, lightweight text-to-speech (primary, low-latency)
+- **Kokoro TTS** — fast, lightweight text-to-speech (primary, low-latency, local)
+- **ElevenLabs TTS** — cloud voice cloning/synthesis (per-character voice ID, routed via state file)
 - **Chatterbox TTS** — voice cloning engine (Apple Silicon MPS optimised)
 - **Qwen3-TTS** — alternative voice cloning via MLX
 - **openWakeWord** — always-on wake word detection
@@ -49,11 +51,13 @@ All AI inference runs locally on this machine. No cloud dependency required (clo
 ### AI Agent / Orchestration
 - **OpenClaw** — primary AI agent layer; receives voice commands, calls tools, manages personality
 - **n8n** — visual workflow automation (Docker), chains AI actions
-- **mem0** — long-term memory layer for the AI character
+- **Character Memory System** — two-tier JSON-based memories (personal per-character + general shared), injected into LLM system prompt with budget truncation

 ### Character & Personality
-- **Character Manager** (built — see `character-manager.jsx`) — single config UI for personality, prompts, models, Live2D mappings, and notes
-- Character config exports to JSON, consumed by OpenClaw system prompt and pipeline
+- **Character Schema v2** — JSON spec with background, dialogue_style, appearance, skills, gaze_presets (v1 auto-migrated)
+- **HomeAI Dashboard** — unified web app: character editor, chat, memory manager, service dashboard
+- **Character MCP Server** — LLM-assisted character creation via Fandom wiki/Wikipedia lookup (Docker)
+- Character config stored as JSON files in `~/homeai-data/characters/`, consumed by bridge for system prompt construction

 ### Visual Representation
 - **VTube Studio** — Live2D model display on desktop (macOS) and mobile (iOS/Android)
@@ -85,47 +89,79 @@ All AI inference runs locally on this machine. No cloud dependency required (clo
 ESP32-S3-BOX-3 (room)
  → Wake word detected (openWakeWord, runs locally on device or Mac Mini)
  → Audio streamed to Mac Mini via Wyoming Satellite
-  → Whisper.cpp transcribes speech to text
-  → OpenClaw receives text + context
-  → Ollama LLM generates response (with character persona from system prompt)
-  → mem0 updates long-term memory
+  → Whisper MLX transcribes speech to text
+  → HA conversation agent → OpenClaw HTTP Bridge
+  → Bridge resolves character (satellite_id → character mapping)
+  → Bridge builds system prompt (profile + memories) and writes TTS config to state file
+  → OpenClaw CLI → Ollama LLM generates response
  → Response dispatched:
-      → Kokoro/Chatterbox renders TTS audio
+      → Wyoming TTS reads state file → routes to Kokoro (local) or ElevenLabs (cloud)
      → Audio sent back to ESP32-S3-BOX-3 (spoken response)
      → VTube Studio API triggered (expression + lip sync on desktop/mobile)
      → Home Assistant action called if applicable (lights, music, etc.)
 ```

+### Timeout Strategy
+
+The HTTP bridge checks Ollama `/api/ps` before each request to determine if the LLM is already loaded:
+
+| Layer | Warm (model loaded) | Cold (model loading) |
+|---|---|---|
+| HA conversation component | 200s | 200s |
+| OpenClaw HTTP bridge | 60s | 180s |
+| OpenClaw agent | 60s | 60s |
+
+The keep-warm daemon ensures models stay loaded, so cold starts should be rare (only after Ollama restarts or VRAM pressure).
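
As a rough sketch of the bridge row in the table above, the timeout choice reduces to a lookup against the `/api/ps` response. The function name and the decision to treat a failed check as a cold start are assumptions, not something the diff confirms.

```python
WARM_TIMEOUT_S = 60   # bridge timeout when the model is already loaded
COLD_TIMEOUT_S = 180  # bridge timeout when Ollama still has to load the model


def bridge_timeout(model, ps_response):
    """Pick the bridge timeout from a parsed Ollama /api/ps response.

    ps_response is the decoded JSON body, or None if the check itself failed.
    A failed check is treated as a cold start (an assumption of this sketch).
    """
    if not ps_response:
        return COLD_TIMEOUT_S
    loaded = {m.get("name") for m in ps_response.get("models", [])}
    return WARM_TIMEOUT_S if model in loaded else COLD_TIMEOUT_S
```

Both bridge values stay below the 200s Home Assistant limit, so the bridge times out before the conversation component does.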
+
 ---

 ## Character System

-The AI assistant has a defined personality managed via the Character Manager tool.
+The AI assistant has a defined personality managed via the HomeAI Dashboard (character editor + memory manager).

-Key config surfaces:
-- **System prompt** — injected into every Ollama request
-- **Voice clone reference** — `.wav` file path for Chatterbox/Qwen3-TTS
-- **Live2D expression mappings** — idle, speaking, thinking, happy, error states
-- **VTube Studio WebSocket triggers** — JSON map of events to expressions
+### Character Schema v2
+
+Each character is a JSON file in `~/homeai-data/characters/` with:
+- **System prompt** — core personality, injected into every LLM request
+- **Profile fields** — background, appearance, dialogue_style, skills array
+- **TTS config** — engine (kokoro/elevenlabs), kokoro_voice, elevenlabs_voice_id, elevenlabs_model, speed
+- **GAZE presets** — array of `{preset, trigger}` for image generation styles
 - **Custom prompt rules** — trigger/response overrides for specific contexts
-- **mem0** — persistent memory that evolves over time
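
A minimal sketch of what an automatic v1→v2 migration could look like, assuming v1 files simply lack the new profile fields. The default values and the `schema_version` key are hypothetical; the field names come from the schema description above.

```python
import copy

# Fields introduced by schema v2, with assumed empty defaults.
V2_DEFAULTS = {
    "background": "",
    "dialogue_style": "",
    "appearance": "",
    "skills": [],
    "gaze_presets": [],
}


def migrate_to_v2(character):
    """Return a v2 copy of a character dict without mutating the input."""
    migrated = copy.deepcopy(character)
    for field, default in V2_DEFAULTS.items():
        # Existing values win; only missing fields receive a default.
        migrated.setdefault(field, copy.deepcopy(default))
    migrated["schema_version"] = 2  # hypothetical version marker
    return migrated
```

Because existing values are never overwritten, running the function on a file that is already v2 leaves it unchanged.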

-Character config JSON (exported from Character Manager) is the single source of truth consumed by all pipeline components.
+### Memory System
+
+Two-tier memory stored as JSON in `~/homeai-data/memories/`:
+- **Personal memories** (`personal/{character_id}.json`) — per-character, about user interactions
+- **General memories** (`general.json`) — shared operational knowledge (tool usage, device info, routines)
+
+Memories are injected into the system prompt by the bridge with budget truncation (personal: 4000 chars, general: 3000 chars, newest first).
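
The budget truncation can be illustrated as follows. The memory record shape (`content` and `created_at` keys) is an assumption, as is the choice to drop whole memories rather than cut one mid-sentence; only the budgets and the newest-first order come from the description above.

```python
PERSONAL_BUDGET = 4000  # characters of personal memories
GENERAL_BUDGET = 3000   # characters of general memories


def select_memories(memories, budget):
    """Return memory texts, newest first, whose total length fits the budget."""
    newest_first = sorted(memories, key=lambda m: m["created_at"], reverse=True)
    chosen, used = [], 0
    for mem in newest_first:
        cost = len(mem["content"]) + 1  # +1 for the newline that joins entries
        if used + cost > budget:
            break  # everything older is dropped as well
        chosen.append(mem["content"])
        used += cost
    return chosen


def build_memory_block(personal, general):
    """Join both tiers into one text block for the system prompt."""
    lines = select_memories(personal, PERSONAL_BUDGET)
    lines += select_memories(general, GENERAL_BUDGET)
    return "\n".join(lines)
```

Sorting on `created_at` works here only if timestamps are stored in a sortable form such as ISO 8601 strings.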
+
+### TTS Voice Routing
+
+The bridge writes the active character's TTS config to `~/homeai-data/active-tts-voice.json` before each request. The Wyoming TTS server reads this state file to determine which engine/voice to use:
+- **Kokoro** — local, fast, uses `kokoro_voice` field (e.g., `af_heart`)
+- **ElevenLabs** — cloud, uses `elevenlabs_voice_id` + `elevenlabs_model`, returns PCM 24kHz
+
+This works for both the ESP32/HA pipeline and dashboard chat.
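
The state-file handoff might look roughly like this. The atomic write and the fallback to Kokoro when the file is missing or malformed are assumptions of this sketch; the field names and the `af_heart` example voice come from the description above.

```python
import json
import os
import tempfile
from pathlib import Path

STATE_FILE = Path.home() / "homeai-data" / "active-tts-voice.json"


def write_tts_state(tts_config, path=STATE_FILE):
    """Bridge side: replace the state file atomically to avoid partial reads."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(tts_config, fh)
    os.replace(tmp, path)  # atomic rename within the same directory


def read_tts_route(path=STATE_FILE):
    """TTS server side: return (engine, voice), falling back to Kokoro."""
    try:
        cfg = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        cfg = {}
    if not isinstance(cfg, dict):
        cfg = {}
    if cfg.get("engine") == "elevenlabs" and cfg.get("elevenlabs_voice_id"):
        return "elevenlabs", cfg["elevenlabs_voice_id"]
    return "kokoro", cfg.get("kokoro_voice", "af_heart")
```

Writing to a temporary file and renaming it means the TTS server sees either the previous config or the new one, never a half-written file.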

 ---

 ## Project Priorities

-1. **Foundation** — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma)
-2. **LLM** — Ollama running with target models, Open WebUI connected
-3. **Voice pipeline** — Whisper → Ollama → Kokoro → Wyoming → Home Assistant
-4. **OpenClaw** — installed, onboarded, connected to Ollama and Home Assistant
-5. **ESP32-S3-BOX-3** — ESPHome flash, Wyoming Satellite, LVGL face
-6. **Character system** — system prompt wired up, mem0 integrated, voice cloned
-7. **VTube Studio** — model loaded, WebSocket API bridge written as OpenClaw skill
-8. **ComfyUI** — image generation online, character-consistent model workflows
-9. **Extended integrations** — n8n workflows, Music Assistant, Snapcast, Gitea, code-server
-10. **Polish** — Authelia, Tailscale hardening, mobile companion, iOS widgets
+1. **Foundation** — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma) ✅
+2. **LLM** — Ollama running with target models, Open WebUI connected ✅
+3. **Voice pipeline** — Whisper → Ollama → Kokoro → Wyoming → Home Assistant ✅
+4. **OpenClaw** — installed, onboarded, connected to Ollama and Home Assistant ✅
+5. **ESP32-S3-BOX-3** — ESPHome flash, Wyoming Satellite, display faces ✅
+6. **Character system** — schema v2, dashboard editor, memory system, per-character TTS routing ✅
+7. **Animated visual** — PNG/GIF character visual for the web assistant (initial visual layer)
+8. **Android app** — companion app for mobile access to the assistant
+9. **ComfyUI** — image generation online, character-consistent model workflows
+10. **Extended integrations** — n8n workflows, Music Assistant, Snapcast, Gitea, code-server
+11. **Polish** — Authelia, Tailscale hardening, iOS widgets
+
+### Stretch Goals
+- **Live2D / VTube Studio** — full Live2D model with WebSocket API bridge (requires learning Live2D tooling)

 ---

@@ -133,7 +169,11 @@ Character config JSON (exported from Character Manager) is the single source of

 - All Docker compose files: `~/server/docker/`
 - OpenClaw skills: `~/.openclaw/skills/`
-- Character configs: `~/.openclaw/characters/`
+- Character configs: `~/homeai-data/characters/`
+- Character memories: `~/homeai-data/memories/`
+- Conversation history: `~/homeai-data/conversations/`
+- Active TTS state: `~/homeai-data/active-tts-voice.json`
+- Satellite → character map: `~/homeai-data/satellite-map.json`
 - Whisper models: `~/models/whisper/`
 - Ollama models: managed by Ollama at `~/.ollama/models/`
 - ComfyUI models: `~/ComfyUI/models/`
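
The satellite → character lookup listed above could be resolved as in this sketch. The file layout is not shown in the diff, so the flat `{satellite_id: character_id}` object and the optional `"default"` key are hypothetical.

```python
import json
from pathlib import Path


def load_satellite_map(path):
    """Read the mapping file; an unreadable or malformed file yields an empty map."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def resolve_character(satellite_id, mapping, fallback=None):
    """Return the character for a satellite, else the map's default, else fallback."""
    if satellite_id in mapping:
        return mapping[satellite_id]
    return mapping.get("default", fallback)
```

For example, `resolve_character("kitchen", {"kitchen": "aria"})` returns `"aria"`, while an unknown satellite returns the `"default"` entry if the map has one.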