homeai/homeai-esp32/PLAN.md

# P6: homeai-esp32 — Room Satellite Hardware

> Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)

---

## Goal

Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, local wake word detection, audio playback, and an LVGL animated face showing assistant state. All intelligence stays on the Mac Mini.

---

## Hardware: ESP32-S3-BOX-3

| Feature | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen |
| Mic | Dual microphone array |
| Speaker | Built-in 1W speaker |
| Connectivity | WiFi 802.11b/g/n (2.4 GHz), Bluetooth 5 (LE) |
| USB | USB-C (programming + power) |

---

## Architecture Per Unit

```
ESP32-S3-BOX-3
├── microWakeWord (on-device, always listening)
│   └── starts the voice_assistant pipeline on wake detection
├── voice_assistant (ESPHome native API → Home Assistant)
│   ├── streams mic audio → HA Assist pipeline → Mac Mini Wyoming STT (port 10300)
│   └── receives TTS audio ← HA Assist pipeline ← Mac Mini Wyoming TTS (port 10301)
├── LVGL Display
│   └── animated face, driven by HA entity state
└── ESPHome OTA
    └── firmware updates over WiFi
```

---

## ESPHome Configuration

### Base Config Template

`esphome/base.yaml` — shared across all units:

```yaml
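esphome:
  name: homeai-${room}
  friendly_name: "HomeAI ${room_display}"

# Platform and board live under `esp32:` in current ESPHome releases
esp32:
  board: esp32s3box
  flash_size: 16MB
  framework:
    type: esp-idf

# The BOX-3 module carries octal PSRAM, needed for audio and LVGL buffers
psram:
  mode: octal
  speed: 80MHz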

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  ap:
    ssid: "HomeAI Fallback"

api:
  encryption:
    key: !secret api_key

ota:
  - platform: esphome
    password: !secret ota_password

logger:
  level: INFO
```

### Room-Specific Config

`esphome/s3-box-living-room.yaml`:

```yaml
substitutions:
  room: living-room
  room_display: "Living Room"
  mac_mini_ip: "192.168.1.x"    # or Tailscale IP

packages:
  base: !include base.yaml
  voice: !include voice.yaml
  display: !include display.yaml
  animations: !include animations.yaml   # scripts referenced by voice.yaml
```

One file per room; only the substitutions change.
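
As a sketch of how a second unit would look, assuming the same package files and a hypothetical `bedroom` room slug, `esphome/s3-box-bedroom.yaml` could be:

```yaml
substitutions:
  room: bedroom                 # hypothetical slug; becomes the hostname homeai-bedroom
  room_display: "Bedroom"       # shown in HA as "HomeAI Bedroom"
  mac_mini_ip: "192.168.1.x"    # placeholder, same as the living-room file

packages:
  base: !include base.yaml
  voice: !include voice.yaml
  display: !include display.yaml
  animations: !include animations.yaml   # face scripts called from voice.yaml
```

Keeping everything else in the shared packages means a fix to `voice.yaml` or `display.yaml` reaches every room on the next OTA update.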

### Voice / Wyoming Satellite — `esphome/voice.yaml`

```yaml
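# Note: `esp_adf` is not part of mainline ESPHome. It must be pulled in via
# `external_components`, or these two blocks swapped for `i2s_audio` platforms.
microphone:
  - platform: esp_adf
    id: mic

speaker:
  - platform: esp_adf
    id: spk

micro_wake_word:
  models:
    - model: hey_jarvis        # or a custom model manifest
  on_wake_word_detected:
    - voice_assistant.start:
        wake_word: !lambda return wake_word;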

voice_assistant:
  microphone: mic
  speaker: spk
  noise_suppression_level: 2
  auto_gain: 31dBFS
  volume_multiplier: 2.0

  # Pages are defined under `lvgl:`, so they are switched with lvgl.page.show
  on_listening:
    - lvgl.page.show: page_listening
    - script.execute: animate_face_listening

  on_stt_vad_end:
    - lvgl.page.show: page_thinking
    - script.execute: animate_face_thinking

  on_tts_start:
    - lvgl.page.show: page_speaking
    - script.execute: animate_face_speaking

  on_end:
    - lvgl.page.show: page_idle
    - script.execute: animate_face_idle

  on_error:
    - lvgl.page.show: page_error
    - script.execute: animate_face_error
```

**Note:** ESPHome's `voice_assistant` component connects to HA, which routes to Wyoming STT/TTS on the Mac Mini. This is the standard ESPHome → HA → Wyoming path.

### LVGL Display — `esphome/display.yaml`

```yaml
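# Bus pins below are assumptions based on common S3-BOX-3 configs; verify them.
spi:
  clk_pin: GPIO7
  mosi_pin: GPIO6

i2c:
  sda: GPIO8
  scl: GPIO18

display:
  - platform: ili9xxx
    model: S3BOX               # 320×240 preset for the S3-BOX family
    id: lcd
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin: GPIO48
    auto_clear_enabled: false  # required when LVGL drives the display
    update_interval: never     # LVGL schedules its own redraws

touchscreen:
  - platform: gt911            # BOX-3 touch controller (tt21100 is the original BOX)
    id: touch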

lvgl:
  displays:
    - lcd
  touchscreens:
    - touch

  # Face widget on the top layer, so it stays visible whichever page is shown
  top_layer:
    widgets:
      - obj:
          id: face_container
          width: 320
          height: 240
          bg_color: 0x000000
          widgets:            # ESPHome nests child widgets under `widgets`
            # Eyes (two circles)
            - obj:
                id: eye_left
                x: 90
                y: 90
                width: 50
                height: 50
                radius: 25
                bg_color: 0xFFFFFF
            - obj:
                id: eye_right
                x: 180
                y: 90
                width: 50
                height: 50
                radius: 25
                bg_color: 0xFFFFFF
            # Mouth (arc)
            - arc:
                id: mouth
                x: 110
                y: 160
                width: 100
                height: 40
                start_angle: 180
                end_angle: 360
                arc_color: 0xFFFFFF

  pages:
    - id: page_idle
    - id: page_listening
    - id: page_thinking
    - id: page_speaking
    - id: page_error
```

### LVGL Face State Animations — `esphome/animations.yaml`

```yaml
# ESPHome's generic LVGL action is `lvgl.widget.update`.
script:
  - id: animate_face_idle
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 50          # normal open
          bg_color: 0xFFFFFF  # clear any error tint
      - lvgl.widget.update:
          id: eye_right
          height: 50
          bg_color: 0xFFFFFF
      - lvgl.widget.update:
          id: mouth
          arc_color: 0xFFFFFF

  - id: animate_face_listening
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 60     # wider eyes
      - lvgl.widget.update:
          id: eye_right
          height: 60
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00BFFF  # blue tint

  - id: animate_face_thinking
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 20     # squinting
      - lvgl.widget.update:
          id: eye_right
          height: 20

  - id: animate_face_speaking
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 50     # reopen after the thinking squint
      - lvgl.widget.update:
          id: eye_right
          height: 50
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00FF88  # green speaking indicator

  - id: animate_face_error
    then:
      - lvgl.widget.update:
          id: eye_left
          bg_color: 0xFF2200  # red eyes
      - lvgl.widget.update:
          id: eye_right
          bg_color: 0xFF2200

> **Note:** True lip-sync animation (mouth moving with audio) is complex on ESP32. Phase 1: static states. Phase 2: amplitude-driven mouth height using speaker volume feedback.

---

## Secrets File

`esphome/secrets.yaml` (gitignored):

```yaml
wifi_ssid: "YourNetwork"
wifi_password: "YourPassword"
api_key: "<32-byte base64 key>"
ota_password: "YourOTAPassword"
```

---

## Flash & Deployment Workflow

```bash
# Install ESPHome
pip install esphome

# Compile + flash via USB (first time)
esphome run esphome/s3-box-living-room.yaml

# OTA update (subsequent)
esphome upload esphome/s3-box-living-room.yaml --device <device-ip>

# View logs
esphome logs esphome/s3-box-living-room.yaml
```

---

## Home Assistant Integration

After flashing:
1. HA discovers ESP32 automatically via mDNS
2. Add the device in HA → Settings → Devices & Services (ESPHome integration)
3. Assign Wyoming voice assistant pipeline to the device
4. Set up room-specific automations (e.g., "Living Room" light control from that satellite)
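
A room-specific automation from step 4 could start from a sentence trigger in Home Assistant. This is a minimal sketch for HA's `automations.yaml`, not ESPHome config; the phrase "movie time" and the `living_room` area ID are illustrative assumptions. As written it fires whichever satellite hears the phrase, so restricting it to one unit would need an extra condition.

```yaml
- alias: "Living room: movie time"
  trigger:
    - platform: conversation      # fires when Assist hears the sentence
      command:
        - "movie time"            # hypothetical phrase
  action:
    - service: light.turn_off
      target:
        area_id: living_room      # hypothetical area ID
```

Built-in intents such as "turn on the lights" already use the area assigned to the satellite's device, so an automation like this is only needed for custom phrases.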

---

## Directory Layout

```
homeai-esp32/
└── esphome/
    ├── base.yaml
    ├── voice.yaml
    ├── display.yaml
    ├── animations.yaml
    ├── s3-box-living-room.yaml
    ├── s3-box-bedroom.yaml       # template, fill in when hardware available
    ├── s3-box-kitchen.yaml       # template
    └── secrets.yaml              # gitignored
```

---

## Wake Word Decisions

| Option | Latency | Privacy | Effort |
|---|---|---|---|
| `hey_jarvis` (built-in microWakeWord) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| Mac Mini openWakeWord (stream audio) | ~500ms | On Mac | Medium |

**Recommendation:** Start with `hey_jarvis`. Train a custom word (character's name) once character name is finalised.
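
Switching to a custom word later should only touch the `micro_wake_word` block in `voice.yaml`. A sketch, assuming an ESPHome release that accepts a `models:` list and a trained model published as a JSON manifest; the URL is a hypothetical placeholder:

```yaml
micro_wake_word:
  models:
    # Hypothetical URL: point this at the manifest of the trained model
    - model: https://example.com/wake-words/character_name.json
  on_wake_word_detected:
    - voice_assistant.start:
```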

---

## Implementation Steps

- [ ] Install ESPHome: `pip install esphome`
- [ ] Write `esphome/secrets.yaml` (gitignored)
- [ ] Write `base.yaml`, `voice.yaml`, `display.yaml`, `animations.yaml`
- [ ] Write `s3-box-living-room.yaml` for first unit
- [ ] Flash first unit via USB: `esphome run esphome/s3-box-living-room.yaml`
- [ ] Verify unit appears in HA device list
- [ ] Assign Wyoming voice pipeline to unit in HA
- [ ] Test: speak wake word → transcription → LLM response → spoken reply
- [ ] Test: LVGL face cycles through idle → listening → thinking → speaking
- [ ] Verify OTA update works: change LVGL color, deploy wirelessly
- [ ] Write config templates for remaining rooms (bedroom, kitchen)
- [ ] Flash remaining units, verify each works independently
- [ ] Document final MAC address → room name mapping
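
For the MAC address mapping in the last step, each unit can report its own address to HA instead of it being read off a label. A small sketch using ESPHome's `wifi_info` text sensor, assuming it is added to `base.yaml`; the entity names are illustrative:

```yaml
text_sensor:
  - platform: wifi_info
    mac_address:
      name: "MAC Address"   # appears in HA under the device "HomeAI <room>"
    ip_address:
      name: "IP Address"    # handy for `esphome upload --device <device-ip>`
```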

---

## Success Criteria

- [ ] Wake word "hey jarvis" triggers pipeline reliably from 3m distance
- [ ] STT transcription accuracy >90% for clear speech in quiet room
- [ ] TTS audio plays clearly through ESP32 speaker
- [ ] LVGL face shows correct state for idle / listening / thinking / speaking / error
- [ ] OTA firmware updates work without USB cable
- [ ] Unit reconnects automatically after WiFi drop
- [ ] Unit survives power cycle and resumes normal operation
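
The reconnect and power-cycle criteria are easier to check if each unit reports a few diagnostics. A minimal sketch, assuming these keys are merged into `base.yaml`; the entity names are illustrative and the timeout shown is ESPHome's default:

```yaml
# Additions to base.yaml (merge the wifi keys into the existing wifi: block)
wifi:
  power_save_mode: none   # keep the radio awake for lower audio latency
  reboot_timeout: 15min   # reboot if WiFi stays down this long

sensor:
  - platform: wifi_signal
    name: "WiFi Signal"
    update_interval: 60s
  - platform: uptime
    name: "Uptime"        # restarts from zero after every reboot or power cycle

button:
  - platform: restart
    name: "Restart"       # remote reboot from HA while testing
```

In HA, a drop in the uptime history marks a reboot, while a gap in the WiFi signal history without an uptime reset suggests a reconnect that did not need one.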