homeai/homeai-esp32/PLAN.md
Aodhan Collins 38247d7cc4 Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 01:11:37 +00:00


P6: homeai-esp32 — Room Satellite Hardware

Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)


Goal

Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, local wake word detection, audio playback, and an LVGL animated face showing assistant state. All intelligence stays on the Mac Mini.


Hardware: ESP32-S3-BOX-3

| Feature | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa LX7, 240 MHz) |
| RAM | 512 KB SRAM + 16 MB PSRAM |
| Flash | 16 MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen |
| Mic | Dual microphone array |
| Speaker | Built-in 1 W speaker |
| Connectivity | WiFi 802.11b/g/n, BT 5.0 |
| USB | USB-C (programming + power) |

Architecture Per Unit

ESP32-S3-BOX-3
├── microWakeWord (on-device, always listening)
│   └── starts the voice_assistant pipeline on wake detection
├── voice_assistant (ESPHome native API → Home Assistant)
│   ├── streams mic audio → HA → Mac Mini Wyoming STT (port 10300)
│   └── receives TTS audio ← HA ← Mac Mini Wyoming TTS (port 10301)
├── LVGL Display
│   └── animated face, driven by voice_assistant state triggers
└── ESPHome OTA
    └── firmware updates over WiFi

ESPHome Configuration

Base Config Template

esphome/base.yaml — shared across all units:

esphome:
  name: homeai-${room}
  friendly_name: "HomeAI ${room_display}"

# Current ESPHome releases take the target from a dedicated esp32: block
esp32:
  board: esp32s3box
  framework:
    type: esp-idf

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  ap:
    ssid: "HomeAI Fallback"

api:
  encryption:
    key: !secret api_key

ota:
  - platform: esphome        # required by ESPHome 2024.6 and later
    password: !secret ota_password

logger:
  level: INFO

Room-Specific Config

esphome/s3-box-living-room.yaml:

substitutions:
  room: living-room
  room_display: "Living Room"
  mac_mini_ip: "192.168.1.x"    # or Tailscale IP

packages:
  base: !include base.yaml
  voice: !include voice.yaml
  display: !include display.yaml
  animations: !include animations.yaml   # defines the animate_face_* scripts used by voice.yaml

One file per room, only the substitutions change.
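As an illustration, a second room file might differ from the living-room file only in its substitutions. This is a sketch: the room name follows the s3-box-bedroom.yaml template listed in the directory layout, and the IP stays a placeholder.

```yaml
# esphome/s3-box-bedroom.yaml (sketch; values are placeholders)
substitutions:
  room: bedroom                  # yields the device name homeai-bedroom
  room_display: "Bedroom"        # yields the friendly name "HomeAI Bedroom"
  mac_mini_ip: "192.168.1.x"     # same placeholder as the living-room file

# The packages: block is copied unchanged from s3-box-living-room.yaml.
```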

Voice / Wyoming Satellite — esphome/voice.yaml

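# Note: esp_adf is not part of ESPHome core; it must be pulled in
# through external_components before these platforms will validate.
microphone:
  - platform: esp_adf
    id: mic

speaker:
  - platform: esp_adf
    id: spk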

micro_wake_word:
  models:
    - model: hey_jarvis        # or custom model path
  on_wake_word_detected:
    - voice_assistant.start:

voice_assistant:
  microphone: mic
  speaker: spk
  noise_suppression_level: 2
  auto_gain: 31dBFS
  volume_multiplier: 2.0

  # LVGL pages are switched with lvgl.page.show (display.page.show only
  # applies to pages of the plain display component).
  on_listening:
    - lvgl.page.show: page_listening
    - script.execute: animate_face_listening

  on_stt_vad_end:
    - lvgl.page.show: page_thinking
    - script.execute: animate_face_thinking

  on_tts_start:
    - lvgl.page.show: page_speaking
    - script.execute: animate_face_speaking

  on_end:
    - lvgl.page.show: page_idle
    - script.execute: animate_face_idle

  on_error:
    - lvgl.page.show: page_error
    - script.execute: animate_face_error

Note: ESPHome's voice_assistant component connects to HA, which routes to Wyoming STT/TTS on the Mac Mini. This is the standard ESPHome → HA → Wyoming path.
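Depending on the ESPHome release, microWakeWord may need to be started explicitly once Home Assistant is connected and re-armed after each voice session. A minimal sketch of that hand-off, assuming these triggers are merged into the existing voice_assistant block (the on_end steps would be appended to the ones already there):

```yaml
# Sketch only: additions to voice.yaml, not a complete file.
voice_assistant:
  on_client_connected:
    - micro_wake_word.start:       # begin listening once HA is reachable
  on_client_disconnected:
    - micro_wake_word.stop:        # no pipeline available, stop detecting
  on_end:
    - wait_until:
        not:
          voice_assistant.is_running:
    - micro_wake_word.start:       # re-arm detection after each session
```

Tying wake word detection to the API connection avoids triggering a pipeline that has nowhere to send audio.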

LVGL Display — esphome/display.yaml

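# Assumes the spi: and i2c: buses are declared elsewhere (e.g. in base.yaml).
display:
  - platform: ili9xxx
    model: S3BOX               # BOX panel preset rather than generic ILI9341
    id: lcd
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin: GPIO48
    auto_clear_enabled: false  # LVGL owns the framebuffer
    update_interval: never

touchscreen:
  - platform: gt911            # BOX-3 touch controller; tt21100 is the original S3-BOX
    id: touch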

lvgl:
  displays:
    - lcd
  touchscreens:
    - touch

  # Face widget, centered on screen. It lives on top_layer so it stays visible
  # on every page (top-level widgets cannot be combined with pages).
  top_layer:
    widgets:
      - obj:
          id: face_container
          width: 320
          height: 240
          bg_color: 0x000000
          widgets:               # child widgets are nested under widgets:
            # Eyes (two circles)
            - obj:
                id: eye_left
                x: 90
                y: 90
                width: 50
                height: 50
                radius: 25
                bg_color: 0xFFFFFF
            - obj:
                id: eye_right
                x: 180
                y: 90
                width: 50
                height: 50
                radius: 25
                bg_color: 0xFFFFFF
            # Mouth (line/arc)
            - arc:
                id: mouth
                x: 110
                y: 160
                width: 100
                height: 40
                start_angle: 180
                end_angle: 360
                arc_color: 0xFFFFFF

  pages:
    - id: page_idle
    - id: page_listening
    - id: page_thinking
    - id: page_speaking
    - id: page_error

LVGL Face State Animations — esphome/animations.yaml

# Widget properties are changed with lvgl.widget.update.
script:
  - id: animate_face_idle
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 50           # normal open
          bg_color: 0xFFFFFF   # reset after an error state
      - lvgl.widget.update:
          id: eye_right
          height: 50
          bg_color: 0xFFFFFF
      - lvgl.widget.update:
          id: mouth
          arc_color: 0xFFFFFF

  - id: animate_face_listening
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 60     # wider eyes
      - lvgl.widget.update:
          id: eye_right
          height: 60
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00BFFF  # blue tint

  - id: animate_face_thinking
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 20     # squinting
      - lvgl.widget.update:
          id: eye_right
          height: 20

  - id: animate_face_speaking
    then:
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00FF88  # green speaking indicator

  - id: animate_face_error
    then:
      - lvgl.widget.update:
          id: eye_left
          bg_color: 0xFF2200  # red eyes
      - lvgl.widget.update:
          id: eye_right
          bg_color: 0xFF2200

Note: True lip-sync animation (mouth moving with audio) is complex on ESP32. Phase 1: static states. Phase 2: amplitude-driven mouth height using speaker volume feedback.
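Before amplitude data is available, an interim step could be a fixed-rate "talking" loop. The sketch below is not amplitude-driven: it simply alternates the mouth height while a voice session is running. The script name animate_mouth_talking is hypothetical, and the timings are arbitrary.

```yaml
# Sketch: interim talking animation (fixed rate, no audio feedback).
script:
  - id: animate_mouth_talking      # hypothetical script name
    mode: restart
    then:
      - while:
          condition:
            voice_assistant.is_running:
          then:
            - lvgl.widget.update:
                id: mouth
                height: 20         # mouth partly closed
            - delay: 150ms
            - lvgl.widget.update:
                id: mouth
                height: 40         # back to the configured height
            - delay: 150ms
```

It could be started with script.execute from on_tts_start and stopped with script.stop from on_end. The loop also exits on its own once the session finishes, leaving the mouth at its original height.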


Secrets File

esphome/secrets.yaml (gitignored):

wifi_ssid: "YourNetwork"
wifi_password: "YourPassword"
api_key: "<32-byte base64 key>"
ota_password: "YourOTAPassword"

Flash & Deployment Workflow

# Install ESPHome
pip install esphome

# Compile + flash via USB (first time)
esphome run esphome/s3-box-living-room.yaml

# OTA update (subsequent)
esphome upload esphome/s3-box-living-room.yaml --device <device-ip>

# View logs
esphome logs esphome/s3-box-living-room.yaml

Home Assistant Integration

After flashing:

  1. HA discovers the ESP32 automatically via mDNS
  2. Add the device in HA under Settings → Devices & Services
  3. Assign the Assist pipeline that uses the Wyoming STT/TTS services to the device
  4. Set up room-specific automations (e.g., "Living Room" light control from that satellite)
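A room-specific automation could look like the sketch below, which lowers a media player's volume while the living-room satellite is listening. Both entity IDs are hypothetical, and it assumes the satellite is exposed to HA as an assist_satellite entity; the real IDs depend on how HA names the device.

```yaml
# Home Assistant automation (sketch; entity IDs are hypothetical)
automation:
  - alias: "Living Room: duck TV volume while satellite listens"
    trigger:
      - platform: state
        entity_id: assist_satellite.homeai_living_room   # hypothetical
        to: "listening"
    action:
      - service: media_player.volume_set
        target:
          entity_id: media_player.living_room_tv         # hypothetical
        data:
          volume_level: 0.1
```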

Directory Layout

homeai-esp32/
└── esphome/
    ├── base.yaml
    ├── voice.yaml
    ├── display.yaml
    ├── animations.yaml
    ├── s3-box-living-room.yaml
    ├── s3-box-bedroom.yaml       # template, fill in when hardware available
    ├── s3-box-kitchen.yaml       # template
    └── secrets.yaml              # gitignored

Wake Word Decisions

| Option | Latency | Privacy | Effort |
|---|---|---|---|
| hey_jarvis (built-in microWakeWord) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High (requires 50+ recordings) |
| Mac Mini openWakeWord (stream audio) | ~500ms | On Mac | Medium |

Recommendation: Start with hey_jarvis. Train a custom word (character's name) once character name is finalised.
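Switching to a custom word later should only require pointing micro_wake_word at the trained model's JSON manifest. A sketch, assuming the manifest is hosted at a reachable URL (the URL below is a placeholder, not a real model):

```yaml
# Sketch: replace the built-in model with a custom one.
micro_wake_word:
  models:
    - model: https://example.com/wake-words/character_name.json   # hypothetical URL
  on_wake_word_detected:
    - voice_assistant.start:
```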


Implementation Steps

  • Install ESPHome: pip install esphome
  • Write esphome/secrets.yaml (gitignored)
  • Write base.yaml, voice.yaml, display.yaml, animations.yaml
  • Write s3-box-living-room.yaml for first unit
  • Flash first unit via USB: esphome run esphome/s3-box-living-room.yaml
  • Verify unit appears in HA device list
  • Assign Wyoming voice pipeline to unit in HA
  • Test: speak wake word → transcription → LLM response → spoken reply
  • Test: LVGL face cycles through idle → listening → thinking → speaking
  • Verify OTA update works: change LVGL color, deploy wirelessly
  • Write config templates for remaining rooms (bedroom, kitchen)
  • Flash remaining units, verify each works independently
  • Document final MAC address → room name mapping

Success Criteria

  • Wake word "hey jarvis" triggers pipeline reliably from 3m distance
  • STT transcription accuracy >90% for clear speech in quiet room
  • TTS audio plays clearly through ESP32 speaker
  • LVGL face shows correct state for idle / listening / thinking / speaking / error
  • OTA firmware updates work without USB cable
  • Unit reconnects automatically after WiFi drop
  • Unit survives power cycle and resumes normal operation
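The reconnect criterion can be made visible on the device itself. A minimal sketch, assuming these keys are merged into the wifi: block in base.yaml and that the animate_face_* scripts from animations.yaml are included:

```yaml
# Sketch: surface WiFi drops on the face and recover automatically.
wifi:
  reboot_timeout: 15min            # reboot if WiFi stays down this long (ESPHome default)
  on_disconnect:
    - script.execute: animate_face_error
  on_connect:
    - script.execute: animate_face_idle
```

This makes the WiFi-drop test easy to observe: the eyes turn red when the access point goes away and return to idle once the unit reconnects.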