Aodhan Collins 1e52c002c2 feat: Raspberry Pi 5 kitchen satellite — Wyoming voice satellite with ReSpeaker pHAT
Add full Pi 5 satellite setup with ReSpeaker 2-Mics pHAT for kitchen
voice control via Wyoming protocol. Includes satellite_wrapper.py that
monkey-patches WakeStreamingSatellite to fix four compounding bugs:

- TTS echo suppression: mutes wake word detection while speaker plays
- Server writer race fix: checks _writer before streaming, re-arms on None
- Streaming timeout: auto-recovers after 30s if pipeline hangs
- Error recovery: resets streaming state on server Error events

Also includes Pi 5 hardware workarounds (wm8960 overlay, stereo-only
audio wrappers, ALSA mixer calibration) and deploy.sh with fast
iteration commands (--push-wrapper, --test-logs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 20:09:47 +00:00


P6: homeai-esp32 — Room Satellite Hardware

Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)


Goal

Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, on-device wake word detection, audio playback, and a display showing assistant state via static PNG face illustrations. All intelligence stays on the Mac Mini.


Hardware: ESP32-S3-BOX-3

| Feature | Spec |
| --- | --- |
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen (ILI9xxx, model S3BOX) |
| Audio ADC | ES7210 (dual mic array, 16kHz 16-bit) |
| Audio DAC | ES8311 (speaker output, 48kHz 16-bit) |
| Speaker | Built-in 1W |
| Connectivity | WiFi 802.11b/g/n (2.4GHz only), BT 5.0 |
| USB | USB-C (programming + power, native USB JTAG serial) |

Architecture Per Unit

```
ESP32-S3-BOX-3
├── micro_wake_word (on-device, always listening)
│   └── "hey_jarvis" — triggers voice_assistant on wake detection
├── voice_assistant (ESPHome component)
│   ├── connects to Home Assistant via ESPHome API
│   ├── HA routes audio → Mac Mini Wyoming STT (10.0.0.101:10300)
│   ├── HA routes text → OpenClaw conversation agent (10.0.0.101:8081)
│   └── HA routes response → Mac Mini Wyoming TTS (10.0.0.101:10301)
├── Display (ili9xxx, model S3BOX, 320×240)
│   └── static PNG faces per state (idle, listening, thinking, replying, error)
└── ESPHome OTA
    └── firmware updates over WiFi
```

Pin Map (ESP32-S3-BOX-3)

| Function | Pin(s) | Notes |
| --- | --- | --- |
| I2S LRCLK | GPIO45 | strapping pin — warning ignored |
| I2S BCLK | GPIO17 | |
| I2S MCLK | GPIO2 | |
| I2S DIN (mic) | GPIO16 | ES7210 ADC input |
| I2S DOUT (speaker) | GPIO15 | ES8311 DAC output |
| Speaker enable | GPIO46 | strapping pin — warning ignored |
| I2C SCL | GPIO18 | audio codec control bus |
| I2C SDA | GPIO8 | audio codec control bus |
| SPI CLK (display) | GPIO7 | |
| SPI MOSI (display) | GPIO6 | |
| Display CS | GPIO5 | |
| Display DC | GPIO4 | |
| Display Reset | GPIO48 | inverted |
| Backlight | GPIO47 | LEDC PWM |
| Left top button | GPIO0 | strapping pin — mute toggle / factory reset |
| Sensor dock I2C SCL | GPIO40 | sensor bus (AHT-30, AT581x radar) |
| Sensor dock I2C SDA | GPIO41 | sensor bus (AHT-30, AT581x radar) |
| Radar presence output | GPIO21 | AT581x digital detection pin |
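As a sketch, the pins above would map onto ESPHome bus config roughly as follows (the bus IDs are illustrative, not from this plan; pin numbers and strapping notes come from the table):

```yaml
# Illustrative only: bus IDs are invented; pin numbers copied from the pin map.
i2c:
  - id: bus_codec          # ES7210/ES8311 control bus
    scl: GPIO18
    sda: GPIO8
  - id: bus_sensor         # sensor dock bus (AHT-30, AT581x)
    scl: GPIO40
    sda: GPIO41
    frequency: 100kHz

i2s_audio:
  - id: i2s_shared
    i2s_lrclk_pin:
      number: GPIO45
      ignore_strapping_warning: true   # strapping pin, per the note above
    i2s_bclk_pin: GPIO17
    i2s_mclk_pin: GPIO2
```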

ESPHome Configuration

Platform & Framework

```yaml
esp32:
  board: esp32s3box
  flash_size: 16MB
  cpu_frequency: 240MHz
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"

psram:
  mode: octal
  speed: 80MHz
```

Audio Stack

Uses i2s_audio platform with external ADC/DAC codec chips:

  • Microphone: ES7210 ADC via I2S, 16kHz 16-bit mono
  • Speaker: ES8311 DAC via I2S, 48kHz 16-bit mono (left channel)
  • Media player: wraps speaker with volume control (min 50%, max 85%)
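Sketched in ESPHome YAML, assuming the standard S3-BOX-3 reference layout (component IDs here are invented, and codec setup blocks are omitted):

```yaml
# Sketch, not the actual config: IDs invented; pins and rates from this plan.
microphone:
  - platform: i2s_audio
    id: box_mic
    adc_type: external       # ES7210 codec does the analog-to-digital conversion
    i2s_din_pin: GPIO16
    sample_rate: 16000
    bits_per_sample: 16bit
    pdm: false

speaker:
  - platform: i2s_audio
    id: box_speaker
    dac_type: external       # ES8311 codec
    i2s_dout_pin: GPIO15
    sample_rate: 48000
    channel: left

media_player:
  - platform: speaker
    name: Media Player
    announcement_pipeline:
      speaker: box_speaker
    volume_min: 50%
    volume_max: 85%
```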

Wake Word

On-device micro_wake_word component with hey_jarvis model. Can optionally be switched to Home Assistant streaming wake word via a selector entity.
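A minimal block matching that description might look like this (a sketch; the action wiring is an assumption, not taken from the plan):

```yaml
# Sketch: on-device wake word kicking off the voice assistant pipeline.
micro_wake_word:
  models:
    - model: hey_jarvis
  on_wake_word_detected:
    - voice_assistant.start:
```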

Display

ili9xxx platform with model S3BOX. Uses update_interval: never — display updates are triggered by scripts on voice assistant state changes. Static 320×240 PNG images for each state are compiled into firmware. No text overlays — voice-only interaction.

Screen auto-dims after a configurable idle timeout (default 1 min, adjustable 1–60 min via HA entity). Wakes on voice activity or radar presence detection.
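The redraw-on-state-change pattern could be sketched as below: a global holds the current face, scripts change it and request one redraw. All names are assumptions, and the SPI bus plus most of the images are omitted for brevity.

```yaml
# Sketch of the state-driven display; IDs are illustrative, not from the plan.
globals:
  - id: face_state
    type: int
    initial_value: '0'       # 0 = idle, 1 = listening, ...

image:
  - file: esphome/illustrations/idle.png
    id: face_idle
    type: RGB565
  - file: esphome/illustrations/listening.png
    id: face_listening
    type: RGB565

display:
  - platform: ili9xxx
    model: S3BOX
    id: box_display
    cs_pin: GPIO5
    dc_pin: GPIO4
    update_interval: never   # redraw only when a script requests it
    lambda: |-
      if (id(face_state) == 1) it.image(0, 0, id(face_listening));
      else it.image(0, 0, id(face_idle));

script:
  - id: show_listening
    then:
      - globals.set:
          id: face_state
          value: '1'
      - component.update: box_display
```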

Sensor Dock (ESP32-S3-BOX-3-SENSOR)

Optional accessory dock connected via secondary I2C bus (GPIO40/41, 100kHz):

  • AHT-30 (temp/humidity) — aht10 component with variant AHT20, 30s update interval
  • AT581x mmWave radar — presence detection via GPIO21, I2C for settings config
  • Radar RF switch — toggle radar on/off from HA
  • Radar configured on boot: sensing_distance=600, trigger_keep=5s, hw_frontend_reset=true
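Pulled together, the dock bullets above might translate to something like the following (a sketch: entity names are invented, and the `at581x` settings wiring is an assumption based on the boot values listed in this plan):

```yaml
# Sketch: sensor dock on the secondary I2C bus; names are illustrative.
i2c:
  - id: bus_sensor
    scl: GPIO40
    sda: GPIO41
    frequency: 100kHz

sensor:
  - platform: aht10
    variant: AHT20           # AHT-30 reports like an AHT20
    i2c_id: bus_sensor
    temperature:
      name: Temperature
    humidity:
      name: Humidity
    update_interval: 30s

at581x:
  id: radar
  i2c_id: bus_sensor

binary_sensor:
  - platform: gpio
    pin: GPIO21              # AT581x digital detection output
    name: Presence
    device_class: occupancy

esphome:
  on_boot:                   # boot-time radar settings from this plan
    - at581x.settings:
        id: radar
        hw_frontend_reset: true
        sensing_distance: 600
        trigger_keep: 5s
```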

Voice Assistant

ESPHome's voice_assistant component connects to HA via the ESPHome native API (not directly to Wyoming). HA orchestrates the pipeline:

  1. Audio → Wyoming STT (Mac Mini) → text
  2. Text → OpenClaw conversation agent → response
  3. Response → Wyoming TTS (Mac Mini) → audio back to ESP32
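On the ESP32 side this reduces to a fairly small block; the sketch below assumes display scripts and component IDs like those used elsewhere in this plan (all IDs are invented):

```yaml
# Sketch: voice_assistant wiring; script/component IDs are illustrative.
voice_assistant:
  id: va
  microphone: box_mic
  media_player: box_media_player
  use_wake_word: false       # wake word is handled on-device by micro_wake_word
  on_listening:
    - script.execute: show_listening
  on_stt_end:
    - script.execute: show_thinking
  on_tts_start:
    - script.execute: show_replying
  on_error:
    - script.execute: show_error
  on_end:
    - script.execute: show_idle
```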

Directory Layout

```
homeai-esp32/
├── PLAN.md
├── setup.sh                          # env check + flash/ota/logs commands
└── esphome/
    ├── secrets.yaml                  # gitignored — WiFi + API key
    ├── homeai-living-room.yaml       # first unit (full config)
    ├── homeai-bedroom.yaml           # future: copy + change substitutions
    ├── homeai-kitchen.yaml           # future: copy + change substitutions
    └── illustrations/                # 320×240 PNG face images
        ├── idle.png
        ├── loading.png
        ├── listening.png
        ├── thinking.png
        ├── replying.png
        ├── error.png
        └── timer_finished.png
```

ESPHome Environment

```sh
# Dedicated venv (Python 3.12) — do NOT share with voice/whisper venvs
~/homeai-esphome-env/bin/esphome version  # ESPHome 2026.2.4+

# Quick commands
cd ~/gitea/homeai/homeai-esp32
~/homeai-esphome-env/bin/esphome run esphome/homeai-living-room.yaml     # compile + flash
~/homeai-esphome-env/bin/esphome logs esphome/homeai-living-room.yaml    # stream logs

# Or use the setup script
./setup.sh flash    # compile + USB flash
./setup.sh ota      # compile + OTA update
./setup.sh logs     # stream device logs
./setup.sh validate # check YAML without compiling
```

Wake Word Options

| Option | Latency | Privacy | Effort |
| --- | --- | --- | --- |
| hey_jarvis (built-in micro_wake_word) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| HA streaming wake word | ~500ms | On Mac Mini | Medium — stream all audio |

Current: hey_jarvis on-device. Train a custom word (character's name) once finalised.


Implementation Steps

  • Install ESPHome in ~/homeai-esphome-env (Python 3.12)
  • Write esphome/secrets.yaml (gitignored)
  • Write homeai-living-room.yaml (based on official S3-BOX-3 reference config)
  • Generate placeholder face illustrations (7 PNGs, 320×240)
  • Write setup.sh with flash/ota/logs/validate commands
  • Write deploy.sh with OTA deploy, image management, multi-unit support
  • Flash first unit via USB (living room)
  • Verify unit appears in HA device list
  • Assign Wyoming voice pipeline to unit in HA
  • Test: speak wake word → transcription → LLM response → spoken reply
  • Test: display cycles through idle → listening → thinking → replying
  • Verify OTA update works: change config, deploy wirelessly
  • Write config templates for remaining rooms (bedroom, kitchen)
  • Flash remaining units, verify each works independently
  • Document final MAC address → room name mapping

Success Criteria

  • Wake word "hey jarvis" triggers pipeline reliably from 3m distance
  • STT transcription accuracy >90% for clear speech in quiet room
  • TTS audio plays clearly through ESP32 speaker
  • Display shows correct state for idle / listening / thinking / replying / error / muted
  • OTA firmware updates work without USB cable
  • Unit reconnects automatically after WiFi drop
  • Unit survives power cycle and resumes normal operation

Known Constraints

  • Memory: voice_assistant + micro_wake_word + display + sensor dock is near the limit. Do NOT add Bluetooth or LVGL widgets — they will cause crashes.
  • WiFi: 2.4GHz only. 5GHz networks are not supported.
  • Speaker: 1W built-in. Volume capped at 85% to avoid distortion.
  • Display: Static PNGs compiled into firmware. To change images, reflash via OTA (~1-2 min).
  • First compile: Downloads ESP-IDF toolchain (~500MB), takes 5-10 minutes. Incremental builds are 1-2 minutes.