homeai/homeai-visual/PLAN.md
Aodhan Collins 38247d7cc4 Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00


P7: homeai-visual — VTube Studio Visual Layer

Phase 5 | Depends on: P4 (OpenClaw skill runner), P5 (character expression map)


Goal

VTube Studio displays a Live2D model on Mac Mini desktop and mobile. Expressions are driven by the AI pipeline state (thinking, speaking, happy, etc.) via an OpenClaw skill that talks to VTube Studio's WebSocket API. Lip sync follows audio amplitude.


Architecture

OpenClaw pipeline state
        ↓ (during LLM response generation)
vtube_studio.py skill
        ↓ WebSocket (port 8001)
VTube Studio (macOS app)
        ↓
Live2D model renders expression
        ↓
Displayed on:
  - Mac Mini desktop (primary)
  - iPhone/iPad (VTube Studio mobile, same model via Tailscale)

VTube Studio Setup

Installation

  1. Download VTube Studio from the Mac App Store
  2. Launch, go through initial setup
  3. Enable WebSocket API: Settings → WebSocket API → Enable (port 8001)
  4. Load Live2D model (see Model section below)
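Before wiring up the skill, it's worth sanity-checking that the WebSocket API is actually listening on port 8001. A minimal sketch using only the standard library (the helper name is ours, not part of any API):

```python
import socket

def vtube_ws_port_open(host: str = "localhost", port: int = 8001,
                       timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on the
    VTube Studio WebSocket API port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("VTube Studio API reachable:", vtube_ws_port_open())
```

This only proves the port is open, not that the API is enabled; the auth flow below is the real end-to-end check.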

WebSocket API Authentication

VTube Studio uses a token-based auth flow:

import asyncio
import websockets
import json

async def authenticate():
    async with websockets.connect("ws://localhost:8001") as ws:
        # Step 1: request authentication token
        await ws.send(json.dumps({
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "auth-req",
            "messageType": "AuthenticationTokenRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "pluginIcon": None
            }
        }))
        # VTube Studio shows an "Allow" dialog; the response containing
        # the token only arrives once the user clicks Allow.
        response = json.loads(await ws.recv())
        token = response["data"]["authenticationToken"]

        # Step 2: authenticate with token
        await ws.send(json.dumps({
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "auth",
            "messageType": "AuthenticationRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "authenticationToken": token
            }
        }))
        auth_resp = json.loads(await ws.recv())
        print("Authenticated:", auth_resp["data"]["authenticated"])
        return token

Token is persisted to ~/.openclaw/vtube_token.json.


vtube_studio.py Skill

Full implementation (replaces the stub from P4).

File: homeai-visual/skills/vtube_studio.py (symlinked to ~/.openclaw/skills/)

"""
VTube Studio WebSocket skill for OpenClaw.
Drives Live2D model expressions based on AI pipeline state.
"""

import asyncio
import json
import websockets
from pathlib import Path

VTUBE_WS_URL = "ws://localhost:8001"
TOKEN_PATH = Path.home() / ".openclaw" / "vtube_token.json"

class VTubeStudioSkill:
    def __init__(self, character_config: dict):
        self.expression_map = character_config.get("live2d_expressions", {})
        self.ws_triggers = character_config.get("vtube_ws_triggers", {})
        self.token = self._load_token()
        self._ws = None

    def _load_token(self) -> str | None:
        if TOKEN_PATH.exists():
            return json.loads(TOKEN_PATH.read_text()).get("token")
        return None

    def _save_token(self, token: str):
        TOKEN_PATH.write_text(json.dumps({"token": token}))

    async def connect(self):
        self._ws = await websockets.connect(VTUBE_WS_URL)
        if self.token:
            await self._authenticate()
        else:
            await self._request_new_token()

    async def _authenticate(self):
        await self._send({
            "messageType": "AuthenticationRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "authenticationToken": self.token
            }
        })
        resp = await self._recv()
        if not resp["data"].get("authenticated"):
            # Token expired — request a new one
            await self._request_new_token()

    async def _request_new_token(self):
        await self._send({
            "messageType": "AuthenticationTokenRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "pluginIcon": None
            }
        })
        resp = await self._recv()
        token = resp["data"]["authenticationToken"]
        self._save_token(token)
        self.token = token
        await self._authenticate()

    async def trigger_expression(self, event: str):
        """Trigger a named expression state (idle, thinking, speaking, etc.)"""
        hotkey_id = self.expression_map.get(event)
        if not hotkey_id:
            return
        await self._trigger_hotkey(hotkey_id)

    async def _trigger_hotkey(self, hotkey_id: str):
        await self._send({
            "messageType": "HotkeyTriggerRequest",
            "data": {"hotkeyID": hotkey_id}
        })
        await self._recv()

    async def set_parameter(self, name: str, value: float):
        """Set a VTube Studio parameter (e.g., mouth open for lip sync)"""
        await self._send({
            "messageType": "InjectParameterDataRequest",
            "data": {
                "parameterValues": [
                    {"id": name, "value": value}
                ]
            }
        })
        await self._recv()

    async def _send(self, payload: dict):
        full = {
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "homeai",
            **payload
        }
        await self._ws.send(json.dumps(full))

    async def _recv(self) -> dict:
        return json.loads(await self._ws.recv())

    async def close(self):
        if self._ws:
            await self._ws.close()


# OpenClaw skill entry point — synchronous wrapper
def trigger_expression(event: str, character_config: dict):
    skill = VTubeStudioSkill(character_config)
    asyncio.run(_run(skill, event))

async def _run(skill, event):
    await skill.connect()
    await skill.trigger_expression(event)
    await skill.close()
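A one-off smoke test for the skill might look like this (the hotkey UUIDs in the config are placeholders; real ones come from the VTube Studio Hotkeys UI, as described in the Model section below):

```python
import asyncio

# Hypothetical character config. The UUIDs are placeholders standing in
# for real hotkey IDs copied out of VTube Studio.
DEMO_CONFIG = {
    "live2d_expressions": {
        "thinking": "00000000-0000-0000-0000-000000000001",
        "speaking": "00000000-0000-0000-0000-000000000002",
    }
}

async def demo():
    skill = VTubeStudioSkill(DEMO_CONFIG)  # class defined above
    await skill.connect()                  # first run pops the "Allow" dialog
    await skill.trigger_expression("thinking")
    await asyncio.sleep(2)                 # watch the model change
    await skill.trigger_expression("speaking")
    await skill.close()

# Run with VTube Studio open: asyncio.run(demo())
```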

Lip Sync

Phase 1: Amplitude-Based (Simple)

During TTS audio playback, sample audio amplitude and map to mouth open parameter:

import asyncio
import numpy as np
import sounddevice as sd

async def stream_with_lipsync(audio_data: np.ndarray, sample_rate: int,
                              vtube: VTubeStudioSkill):
    """Play 16-bit PCM TTS audio in chunks, driving MouthOpen from amplitude."""
    chunk_size = 1024
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i + chunk_size]
        amplitude = float(np.abs(chunk).mean()) / 32768.0  # normalise 16-bit PCM
        mouth_value = min(amplitude * 10, 1.0)  # boost quiet speech, clamp to [0, 1]
        await vtube.set_parameter("MouthOpen", mouth_value)
        sd.play(chunk, sample_rate, blocking=True)
    await vtube.set_parameter("MouthOpen", 0.0)  # close mouth after playback

The helper is async so it can reuse the skill's open WebSocket; calling asyncio.run() per chunk would create a fresh event loop (and lose the connection) every 1024 samples.

Phase 2: Phoneme-Based (Future)

Parse TTS phoneme timing from Kokoro/Chatterbox output and drive expression per phoneme. More accurate but significantly more complex. Defer to after Phase 5.


Live2D Model

Options

| Option | Cost | Effort | Quality |
|---|---|---|---|
| Free models (VTube Studio sample packs) | Free | Low | Generic |
| Purchase from nizima.com or booth.pm | ¥3,000–¥30,000 | Low | High |
| Commission custom model | ¥50,000–¥200,000+ | Low (for you) | Unique |

Recommendation: Start with a purchased model from nizima.com or booth.pm that matches the character's aesthetic. Commission custom later once personality is locked in.

Model Setup

  1. Download .vtube.model3.json + associated assets
  2. Place in ~/Documents/Live2DModels/ (VTube Studio default)
  3. Load in VTube Studio: Model tab → Add Model
  4. Map hotkeys: VTube Studio → Hotkeys → create one per expression state
  5. Record hotkey IDs, update aria.json live2d_expressions mapping
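
The resulting mapping in aria.json might look like this (the live2d_expressions field name matches what the skill reads; the UUID values are placeholders to be replaced with real hotkey IDs):

```json
{
  "live2d_expressions": {
    "idle": "<hotkey-uuid>",
    "listening": "<hotkey-uuid>",
    "thinking": "<hotkey-uuid>",
    "speaking": "<hotkey-uuid>",
    "happy": "<hotkey-uuid>",
    "sad": "<hotkey-uuid>",
    "surprised": "<hotkey-uuid>",
    "error": "<hotkey-uuid>"
  }
}
```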

Expression Hotkey Mapping Workflow

  1. Launch VTube Studio, load model
  2. Go to Hotkeys → add hotkeys for each state: idle, listening, thinking, speaking, happy, sad, surprised, error
  3. VTube Studio assigns a UUID to each hotkey — copy these
  4. Open Character Manager (P5), paste UUIDs into expression mapping UI
  5. Export updated aria.json
  6. Restart OpenClaw — new expression map loaded
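
Copying UUIDs out of the UI by hand is error-prone. The public API also exposes a HotkeysInCurrentModelRequest whose response lists each hotkey's name and ID; a sketch of a helper that reuses the skill's _send/_recv (an assumption-laden convenience, not a finished tool):

```python
def parse_hotkeys(resp: dict) -> dict:
    """Map hotkey name -> hotkeyID from a HotkeysInCurrentModelRequest response."""
    return {hk["name"]: hk["hotkeyID"] for hk in resp["data"]["availableHotkeys"]}

async def list_hotkeys(skill) -> dict:
    """Fetch the loaded model's hotkeys via a connected, authenticated
    VTubeStudioSkill instance."""
    await skill._send({"messageType": "HotkeysInCurrentModelRequest", "data": {}})
    return parse_hotkeys(await skill._recv())
```

Naming each hotkey after its expression state in VTube Studio makes the returned dict drop straight into the live2d_expressions mapping.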

Mobile Setup

  1. Install VTube Studio on iPhone/iPad
  2. On same Tailscale network, VTube Studio mobile discovers Mac Mini model
  3. Mirror mode: mobile shows same model as desktop
  4. Useful as a bedside or kitchen display while the Mac Mini desktop remains the primary

Directory Layout

homeai-visual/
└── skills/
    ├── vtube_studio.py      ← full implementation
    ├── lipsync.py           ← amplitude-based lip sync helper
    └── auth.py              ← token management utility

Implementation Steps

  • Install VTube Studio (Mac App Store)
  • Enable WebSocket API on port 8001
  • Source/purchase a Live2D model
  • Load model in VTube Studio, verify it renders
  • Create hotkeys in VTube Studio for all 8 expression states
  • Write vtube_studio.py full implementation
  • Run auth flow — click "Allow" in VTube Studio UI, save token
  • Test trigger_expression("thinking") → model shows expression
  • Test all 8 expressions via a simple test script
  • Update aria.json with real VTube Studio hotkey IDs
  • Write lipsync.py amplitude-based helper
  • Integrate lip sync into TTS dispatch in OpenClaw
  • Symlink skills/ → ~/.openclaw/skills/
  • Test full pipeline: voice query → thinking expression → LLM → speaking expression with lip sync
  • Set up VTube Studio on iPhone (optional, do last)
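
The "test all 8 expressions" step can be sketched as a small cycle script; the timings it collects feed directly into the latency success criterion below (the 1.5 s pause is arbitrary, just long enough to see each expression render):

```python
import asyncio
import time

EXPRESSION_STATES = [
    "idle", "listening", "thinking", "speaking",
    "happy", "sad", "surprised", "error",
]

async def cycle_expressions(skill) -> dict:
    """Trigger every expression state once on a connected VTubeStudioSkill,
    returning per-state round-trip latency in milliseconds."""
    timings = {}
    for state in EXPRESSION_STATES:
        start = time.perf_counter()
        await skill.trigger_expression(state)
        timings[state] = (time.perf_counter() - start) * 1000.0
        await asyncio.sleep(1.5)  # let the expression render before the next one
    return timings
```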

Success Criteria

  • All 8 expression states trigger correctly via trigger_expression()
  • Lip sync is visibly responding to TTS audio (even if imperfect)
  • VTube Studio token survives app restart (token file persists)
  • Expression triggers are fast enough to feel responsive (<100ms from call to render)
  • Model stays loaded and connected after Mac Mini sleep/wake