Aodhan Collins c4cecbd8dc feat: ESP32-S3-BOX-3 room satellite — ESPHome config, OTA deploy, placeholder faces

Living room unit fully working: on-device wake word (hey_jarvis), voice pipeline
via HA (Wyoming STT → OpenClaw → Wyoming TTS), static PNG display states, OTA
updates. Includes deploy.sh for quick OTA with custom image support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-13 20:48:03 +00:00


P6: homeai-esp32 — Room Satellite Hardware

Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)

Goal

Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, on-device wake word detection, audio playback, and a display showing assistant state via static PNG face illustrations. All intelligence stays on the Mac Mini.

Hardware: ESP32-S3-BOX-3

Feature	Spec
SoC	ESP32-S3 (dual-core Xtensa, 240MHz)
RAM	512KB SRAM + 16MB PSRAM
Flash	16MB
Display	2.4" IPS LCD, 320×240, touchscreen (ILI9xxx, model S3BOX)
Audio ADC	ES7210 (dual mic array, 16kHz 16-bit)
Audio DAC	ES8311 (speaker output, 48kHz 16-bit)
Speaker	Built-in 1W
Connectivity	Wi-Fi 802.11 b/g/n (2.4 GHz only), Bluetooth 5 (LE)
USB	USB-C (programming + power, native USB Serial/JTAG)

Architecture Per Unit

ESP32-S3-BOX-3
├── micro_wake_word (on-device, always listening)
│   └── "hey_jarvis" — triggers voice_assistant on wake detection
├── voice_assistant (ESPHome component)
│   ├── connects to Home Assistant via ESPHome API
│   ├── HA routes audio → Mac Mini Wyoming STT (10.0.0.101:10300)
│   ├── HA routes text → OpenClaw conversation agent (10.0.0.101:8081)
│   └── HA routes response → Mac Mini Wyoming TTS (10.0.0.101:10301)
├── Display (ili9xxx, model S3BOX, 320×240)
│   └── static PNG faces per state (idle, listening, thinking, replying, error)
└── ESPHome OTA
    └── firmware updates over WiFi

Pin Map (ESP32-S3-BOX-3)

Function	Pin(s)	Notes
I2S LRCLK	GPIO45	strapping pin — warning ignored
I2S BCLK	GPIO17
I2S MCLK	GPIO2
I2S DIN (mic)	GPIO16	ES7210 ADC input
I2S DOUT (speaker)	GPIO15	ES8311 DAC output
Speaker enable	GPIO46	strapping pin — warning ignored
I2C SCL	GPIO18	audio codec control bus
I2C SDA	GPIO8	audio codec control bus
SPI CLK (display)	GPIO7
SPI MOSI (display)	GPIO6
Display CS	GPIO5
Display DC	GPIO4
Display Reset	GPIO48	inverted
Backlight	GPIO47	LEDC PWM
Left top button	GPIO0	strapping pin — mute toggle / factory reset
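
The non-audio rows of the pin map could translate into ESPHome entries roughly as follows. This is a sketch rather than the project's actual config: the ids (backlight_output, screen_backlight, speaker_enable, left_top_button) and entity names are assumptions, and the mute and factory-reset automations for the button are left out.

```yaml
# Sketch only: ids and entity names are illustrative, not taken from the repo.
output:
  - platform: ledc                     # LEDC PWM channel for the backlight
    pin: GPIO47
    id: backlight_output

light:
  - platform: monochromatic
    id: screen_backlight
    name: Screen
    output: backlight_output
    restore_mode: RESTORE_DEFAULT_ON

switch:
  - platform: gpio
    id: speaker_enable
    name: Speaker Enable
    pin:
      number: GPIO46
      ignore_strapping_warning: true   # strapping pin, warning suppressed
    restore_mode: RESTORE_DEFAULT_ON

binary_sensor:
  - platform: gpio
    id: left_top_button
    internal: true
    pin:
      number: GPIO0
      ignore_strapping_warning: true   # strapping pin, warning suppressed
      mode: INPUT_PULLUP
      inverted: true                   # assumed active-low button
```

The ignore_strapping_warning option is what the "warning ignored" notes in the table refer to: it silences ESPHome's validation warning without changing how the pin behaves at boot.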

ESPHome Configuration

Platform & Framework

esp32:
  board: esp32s3box
  flash_size: 16MB
  cpu_frequency: 240MHz
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"

psram:
  mode: octal
  speed: 80MHz

Audio Stack

Uses i2s_audio platform with external ADC/DAC codec chips:

Microphone: ES7210 ADC via I2S, 16kHz 16-bit mono
Speaker: ES8311 DAC via I2S, 48kHz 16-bit mono (left channel)
Media player: wraps speaker with volume control (min 50%, max 85%)
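
As an illustration, the audio stack described above might be wired up like this. Pin numbers, sample rates, and volume limits come from this plan; every id (i2s_audio_bus, es7210_adc, es8311_dac, box_mic, box_speaker, speaker_media_player) and the FLAC announcement format are assumptions modelled on typical S3-BOX-3 configs, so check them against the real homeai-living-room.yaml.

```yaml
# Sketch only: ids are illustrative.
i2c:                                   # control bus for both codec chips
  scl: GPIO18
  sda: GPIO8

i2s_audio:
  - id: i2s_audio_bus
    i2s_lrclk_pin:
      number: GPIO45
      ignore_strapping_warning: true
    i2s_bclk_pin: GPIO17
    i2s_mclk_pin: GPIO2

audio_adc:
  - platform: es7210
    id: es7210_adc
    bits_per_sample: 16bit
    sample_rate: 16000

audio_dac:
  - platform: es8311
    id: es8311_dac
    bits_per_sample: 16bit
    sample_rate: 48000

microphone:
  - platform: i2s_audio
    id: box_mic
    i2s_din_pin: GPIO16
    adc_type: external                 # samples come from the ES7210
    sample_rate: 16000
    bits_per_sample: 16bit

speaker:
  - platform: i2s_audio
    id: box_speaker
    i2s_dout_pin: GPIO15
    dac_type: external                 # samples go to the ES8311
    sample_rate: 48000
    bits_per_sample: 16bit
    channel: left
    audio_dac: es8311_dac

media_player:
  - platform: speaker
    id: speaker_media_player
    name: Media Player
    volume_min: 0.5                    # 50% floor
    volume_max: 0.85                   # 85% cap to avoid distortion
    announcement_pipeline:
      speaker: box_speaker
      format: FLAC                     # assumed format
      sample_rate: 48000
      num_channels: 1
```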

Wake Word

Wake word detection runs on-device using the micro_wake_word component with the hey_jarvis model. A select entity can optionally switch the unit to Home Assistant streaming wake word detection instead.
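
A minimal sketch of the on-device wake word block, assuming a microphone with id box_mic and a voice_assistant component are defined elsewhere in the same file (both ids, and mww, are illustrative):

```yaml
# Sketch only: assumes box_mic and a voice_assistant block exist elsewhere.
micro_wake_word:
  id: mww
  microphone: box_mic
  models:
    - model: hey_jarvis
  on_wake_word_detected:
    # Hand the detected phrase to the voice pipeline.
    - voice_assistant.start:
        wake_word: !lambda return wake_word;
```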

Display

ili9xxx platform with model S3BOX. Uses update_interval: never — display updates are triggered by scripts on voice assistant state changes. Static 320×240 PNG images for each state are compiled into firmware.
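
The display setup could look something like the following. Only two of the states are shown, and the ids (s3_box_lcd, face_idle, face_listening, idle_page, listening_page, show_listening) are assumptions. Image paths are given relative to the YAML file in esphome/.

```yaml
# Sketch only: ids are illustrative; remaining states follow the same pattern.
spi:
  clk_pin: GPIO7
  mosi_pin: GPIO6

image:
  - file: illustrations/idle.png
    id: face_idle
    type: RGB565
    resize: 320x240
  - file: illustrations/listening.png
    id: face_listening
    type: RGB565
    resize: 320x240

display:
  - platform: ili9xxx
    id: s3_box_lcd
    model: S3BOX
    invert_colors: false
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin:
      number: GPIO48
      inverted: true
    update_interval: never             # redraw only when asked to
    pages:
      - id: idle_page
        lambda: |-
          it.image(0, 0, id(face_idle));
      - id: listening_page
        lambda: |-
          it.image(0, 0, id(face_listening));

script:
  - id: show_listening
    then:
      - display.page.show: listening_page
      - component.update: s3_box_lcd   # explicit redraw
```

Because update_interval is never, switching pages alone does not repaint the screen; the explicit component.update call is what pushes the new image to the panel.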

Voice Assistant

ESPHome's voice_assistant component connects to HA via the ESPHome native API (not directly to Wyoming). HA orchestrates the pipeline:

Audio → Wyoming STT (Mac Mini) → text
Text → OpenClaw conversation agent → response
Response → Wyoming TTS (Mac Mini) → audio back to ESP32
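
In config terms, the device side of this pipeline might be sketched as below. The ids (va, box_mic, speaker_media_player, mww, s3_box_lcd and the page ids) and the secret name api_encryption_key are assumptions. The sketch is simplified: it returns to the idle page on on_end without waiting for playback to finish.

```yaml
# Sketch only: ids and the secret name are illustrative.
api:
  encryption:
    key: !secret api_encryption_key    # HA connects over the native API

voice_assistant:
  id: va
  microphone: box_mic
  media_player: speaker_media_player
  micro_wake_word: mww
  on_listening:                        # audio is streaming to HA for STT
    - display.page.show: listening_page
    - component.update: s3_box_lcd
  on_stt_vad_end:                      # speech ended, waiting on the agent
    - display.page.show: thinking_page
    - component.update: s3_box_lcd
  on_tts_start:                        # response text is ready
    - display.page.show: replying_page
    - component.update: s3_box_lcd
  on_end:
    - display.page.show: idle_page
    - component.update: s3_box_lcd
  on_error:
    - display.page.show: error_page
    - component.update: s3_box_lcd
```

Note that nothing here names the Wyoming or OpenClaw endpoints. Those are chosen in Home Assistant when the voice pipeline is assigned to the device.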

Directory Layout

homeai-esp32/
├── PLAN.md
├── setup.sh                          # env check + flash/ota/logs/validate commands
└── esphome/
    ├── secrets.yaml                  # gitignored — WiFi + API key
    ├── homeai-living-room.yaml       # first unit (full config)
    ├── homeai-bedroom.yaml           # future: copy + change substitutions
    ├── homeai-kitchen.yaml           # future: copy + change substitutions
    └── illustrations/                # 320×240 PNG face images
        ├── idle.png
        ├── loading.png
        ├── listening.png
        ├── thinking.png
        ├── replying.png
        ├── error.png
        └── timer_finished.png
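
For the "copy + change substitutions" approach, the top of a per-room file might look like this. The substitution names, the device names, and the secret keys (wifi_ssid, wifi_password, api_encryption_key) are assumptions; only the file names come from the layout above.

```yaml
# esphome/homeai-bedroom.yaml (hypothetical header for a future unit)
substitutions:
  name: homeai-bedroom                 # only these two values change per room
  friendly_name: HomeAI Bedroom

esphome:
  name: ${name}
  friendly_name: ${friendly_name}

wifi:
  ssid: !secret wifi_ssid              # values live in esphome/secrets.yaml
  password: !secret wifi_password

api:
  encryption:
    key: !secret api_encryption_key

ota:
  - platform: esphome                  # enables firmware updates over WiFi
```

Keeping credentials behind !secret means the room files can be committed while secrets.yaml stays gitignored.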

ESPHome Environment

# Dedicated venv (Python 3.12) — do NOT share with voice/whisper venvs
~/homeai-esphome-env/bin/esphome version  # ESPHome 2026.2.4+

# Quick commands
cd ~/gitea/homeai/homeai-esp32
~/homeai-esphome-env/bin/esphome run esphome/homeai-living-room.yaml     # compile + flash
~/homeai-esphome-env/bin/esphome logs esphome/homeai-living-room.yaml    # stream logs

# Or use the setup script
./setup.sh flash    # compile + USB flash
./setup.sh ota      # compile + OTA update
./setup.sh logs     # stream device logs
./setup.sh validate # check YAML without compiling

Wake Word Options

Option	Latency	Privacy	Effort
`hey_jarvis` (built-in micro_wake_word)	~200ms	On-device	Zero
Custom word (trained model)	~200ms	On-device	High — requires 50+ recordings
HA streaming wake word	~500ms	On Mac Mini	Medium — stream all audio

Current: hey_jarvis on-device. Once the character's name is finalised, train a custom wake word for it.
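
Switching between the on-device and HA streaming options at runtime could be done with a template select along these lines. This is a simplified sketch loosely based on common S3-BOX-3 configs, not the project's confirmed implementation: it assumes components with ids va (voice_assistant) and mww (micro_wake_word), and it skips the mute and in-progress checks a production config would likely need.

```yaml
# Sketch only: assumes ids va and mww exist; no mute or busy-state handling.
select:
  - platform: template
    id: wake_word_engine_location
    name: Wake word engine location
    entity_category: config
    optimistic: true
    restore_value: true
    options:
      - On device
      - In Home Assistant
    initial_option: On device
    on_value:
      - if:
          condition:
            lambda: return x == "In Home Assistant";
          then:
            # Stop local detection and stream audio to HA continuously.
            - micro_wake_word.stop:
            - delay: 500ms
            - lambda: id(va).set_use_wake_word(true);
            - voice_assistant.start_continuous:
          else:
            # Stop streaming and go back to local detection.
            - lambda: id(va).set_use_wake_word(false);
            - voice_assistant.stop:
            - delay: 500ms
            - micro_wake_word.start:
```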

Implementation Steps

Install ESPHome in ~/homeai-esphome-env (Python 3.12)
Write esphome/secrets.yaml (gitignored)
Write homeai-living-room.yaml (based on official S3-BOX-3 reference config)
Generate placeholder face illustrations (7 PNGs, 320×240)
Write setup.sh with flash/ota/logs/validate commands
Write deploy.sh with OTA deploy, image management, multi-unit support
Flash first unit via USB (living room)
Verify unit appears in HA device list
Assign Wyoming voice pipeline to unit in HA
Test: speak wake word → transcription → LLM response → spoken reply
Test: display cycles through idle → listening → thinking → replying
Verify OTA update works: change config, deploy wirelessly
Write config templates for remaining rooms (bedroom, kitchen)
Flash remaining units, verify each works independently
Document final MAC address → room name mapping

Success Criteria

Wake word "hey jarvis" triggers pipeline reliably from 3m distance
STT transcription accuracy >90% for clear speech in quiet room
TTS audio plays clearly through ESP32 speaker
Display shows correct state for idle / listening / thinking / replying / error / muted
OTA firmware updates work without USB cable
Unit reconnects automatically after WiFi drop
Unit survives power cycle and resumes normal operation

Known Constraints

Memory: voice_assistant + micro_wake_word + display together run close to the memory limit. Do NOT add Bluetooth or LVGL widgets; the extra memory use will cause crashes.
WiFi: 2.4GHz only. 5GHz networks are not supported.
Speaker: 1W built-in. Volume capped at 85% to avoid distortion.
Display: Static PNGs are compiled into the firmware. Changing an image means rebuilding and reflashing via OTA (~1-2 min).
First compile: Downloads ESP-IDF toolchain (~500MB), takes 5-10 minutes. Incremental builds are 1-2 minutes.
