# P6: homeai-esp32 — Room Satellite Hardware

> Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)

---

## Goal

Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, on-device wake word detection, audio playback, and a display showing assistant state via static PNG face illustrations. All intelligence stays on the Mac Mini.

---

## Hardware: ESP32-S3-BOX-3

| Feature | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen (ILI9xxx, model S3BOX) |
| Audio ADC | ES7210 (dual mic array, 16kHz 16-bit) |
| Audio DAC | ES8311 (speaker output, 48kHz 16-bit) |
| Speaker | Built-in 1W |
| Connectivity | WiFi 802.11b/g/n (2.4GHz only), BT 5.0 |
| USB | USB-C (programming + power, native USB JTAG serial) |

---

## Architecture Per Unit

```
ESP32-S3-BOX-3
├── micro_wake_word (on-device, always listening)
│   └── "hey_jarvis" — triggers voice_assistant on wake detection
├── voice_assistant (ESPHome component)
│   ├── connects to Home Assistant via ESPHome API
│   ├── HA routes audio → Mac Mini Wyoming STT (10.0.0.101:10300)
│   ├── HA routes text → OpenClaw conversation agent (10.0.0.101:8081)
│   └── HA routes response → Mac Mini Wyoming TTS (10.0.0.101:10301)
├── Display (ili9xxx, model S3BOX, 320×240)
│   └── static PNG faces per state (idle, listening, thinking, replying, error)
└── ESPHome OTA
    └── firmware updates over WiFi
```

---

## Pin Map (ESP32-S3-BOX-3)

| Function | Pin(s) | Notes |
|---|---|---|
| I2S LRCLK | GPIO45 | strapping pin — warning ignored |
| I2S BCLK | GPIO17 | |
| I2S MCLK | GPIO2 | |
| I2S DIN (mic) | GPIO16 | ES7210 ADC input |
| I2S DOUT (speaker) | GPIO15 | ES8311 DAC output |
| Speaker enable | GPIO46 | strapping pin — warning ignored |
| I2C SCL | GPIO18 | audio codec control bus |
| I2C SDA | GPIO8 | audio codec control bus |
| SPI CLK (display) | GPIO7 | |
| SPI MOSI (display) | GPIO6 | |
| Display CS | GPIO5 | |
| Display DC | GPIO4 | |
| Display Reset | GPIO48 | inverted |
| Backlight | GPIO47 | LEDC PWM |
| Left top button | GPIO0 | strapping pin — mute toggle / factory reset |
| Sensor dock I2C SCL | GPIO40 | sensor bus (AHT-30, AT581x radar) |
| Sensor dock I2C SDA | GPIO41 | sensor bus (AHT-30, AT581x radar) |
| Radar presence output | GPIO21 | AT581x digital detection pin |

---

## ESPHome Configuration

### Platform & Framework

```yaml
esp32:
  board: esp32s3box
  flash_size: 16MB
  cpu_frequency: 240MHz
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"

psram:
  mode: octal
  speed: 80MHz
```

### Audio Stack

Uses the `i2s_audio` platform with external ADC/DAC codec chips:

- **Microphone**: ES7210 ADC via I2S, 16kHz 16-bit mono
- **Speaker**: ES8311 DAC via I2S, 48kHz 16-bit mono (left channel)
- **Media player**: wraps the speaker with volume control (min 50%, max 85%)

### Wake Word

On-device `micro_wake_word` component with the `hey_jarvis` model. Can optionally be switched to Home Assistant streaming wake word via a selector entity.

### Display

`ili9xxx` platform with model `S3BOX`. Uses `update_interval: never` — display updates are triggered by scripts on voice assistant state changes. Static 320×240 PNG images for each state are compiled into firmware. No text overlays — voice-only interaction.

Screen auto-dims after a configurable idle timeout (default 1 min, adjustable 1–60 min via HA entity). Wakes on voice activity or radar presence detection.
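The wake word and display behaviour described above can be sketched as one ESPHome fragment. This is a sketch only, not the project's actual config: the ids (`box_mic`, `box_display`, `show_face`, the `face_*` images) and the exact trigger-to-state mapping are assumptions, and the `image:` definitions for the PNGs are omitted.

```yaml
# Sketch only — ids are placeholders; image: entries for face_* omitted.
# Pattern: update_interval: never, with a parameterised script that sets
# a global state and forces a redraw on voice assistant state changes.
globals:
  - id: face_state
    type: int
    initial_value: "0"   # 0 = idle

script:
  - id: show_face
    parameters:
      state: int
    then:
      - globals.set:
          id: face_state
          value: !lambda "return state;"
      - component.update: box_display

micro_wake_word:
  models:
    - model: hey_jarvis
  on_wake_word_detected:
    - voice_assistant.start:

voice_assistant:
  microphone: box_mic
  on_listening:
    - script.execute: { id: show_face, state: 1 }   # listening
  on_tts_start:
    - script.execute: { id: show_face, state: 2 }   # replying
  on_end:
    - script.execute: { id: show_face, state: 0 }   # back to idle

display:
  - platform: ili9xxx
    id: box_display
    model: S3BOX
    update_interval: never
    lambda: |-
      switch (id(face_state)) {
        case 1: it.image(0, 0, id(face_listening)); break;
        case 2: it.image(0, 0, id(face_replying));  break;
        default: it.image(0, 0, id(face_idle));
      }
```

Keeping the state in a global and redrawing only via `component.update` is what lets the display stay at `update_interval: never` without burning CPU on idle refreshes.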
### Sensor Dock (ESP32-S3-BOX-3-SENSOR)

Optional accessory dock connected via a secondary I2C bus (GPIO40/41, 100kHz):

- **AHT-30** (temp/humidity) — `aht10` component with variant AHT20, 30s update interval
- **AT581x mmWave radar** — presence detection via GPIO21, I2C for settings config
- **Radar RF switch** — toggle radar on/off from HA
- Radar configured on boot: sensing_distance=600, trigger_keep=5s, hw_frontend_reset=true

### Voice Assistant

ESPHome's `voice_assistant` component connects to HA via the ESPHome native API (not directly to Wyoming). HA orchestrates the pipeline:

1. Audio → Wyoming STT (Mac Mini) → text
2. Text → OpenClaw conversation agent → response
3. Response → Wyoming TTS (Mac Mini) → audio back to ESP32

---

## Directory Layout

```
homeai-esp32/
├── PLAN.md
├── setup.sh                      # env check + flash/ota/logs commands
└── esphome/
    ├── secrets.yaml              # gitignored — WiFi + API key
    ├── homeai-living-room.yaml   # first unit (full config)
    ├── homeai-bedroom.yaml       # future: copy + change substitutions
    ├── homeai-kitchen.yaml       # future: copy + change substitutions
    └── illustrations/            # 320×240 PNG face images
        ├── idle.png
        ├── loading.png
        ├── listening.png
        ├── thinking.png
        ├── replying.png
        ├── error.png
        └── timer_finished.png
```

---

## ESPHome Environment

```bash
# Dedicated venv (Python 3.12) — do NOT share with voice/whisper venvs
~/homeai-esphome-env/bin/esphome version   # ESPHome 2026.2.4+

# Quick commands
cd ~/gitea/homeai/homeai-esp32
~/homeai-esphome-env/bin/esphome run esphome/homeai-living-room.yaml    # compile + flash
~/homeai-esphome-env/bin/esphome logs esphome/homeai-living-room.yaml   # stream logs

# Or use the setup script
./setup.sh flash      # compile + USB flash
./setup.sh ota        # compile + OTA update
./setup.sh logs       # stream device logs
./setup.sh validate   # check YAML without compiling
```

---

## Wake Word Options

| Option | Latency | Privacy | Effort |
|---|---|---|---|
| `hey_jarvis` (built-in micro_wake_word) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| HA streaming wake word | ~500ms | On Mac Mini | Medium — stream all audio |

**Current**: `hey_jarvis` on-device. Train a custom word (character's name) once finalised.

---

## Implementation Steps

- [x] Install ESPHome in `~/homeai-esphome-env` (Python 3.12)
- [x] Write `esphome/secrets.yaml` (gitignored)
- [x] Write `homeai-living-room.yaml` (based on official S3-BOX-3 reference config)
- [x] Generate placeholder face illustrations (7 PNGs, 320×240)
- [x] Write `setup.sh` with flash/ota/logs/validate commands
- [x] Write `deploy.sh` with OTA deploy, image management, multi-unit support
- [x] Flash first unit via USB (living room)
- [x] Verify unit appears in HA device list
- [x] Assign Wyoming voice pipeline to unit in HA
- [x] Test: speak wake word → transcription → LLM response → spoken reply
- [x] Test: display cycles through idle → listening → thinking → replying
- [x] Verify OTA update works: change config, deploy wirelessly
- [ ] Write config templates for remaining rooms (bedroom, kitchen)
- [ ] Flash remaining units, verify each works independently
- [ ] Document final MAC address → room name mapping

---

## Success Criteria

- [ ] Wake word "hey jarvis" triggers pipeline reliably from 3m distance
- [ ] STT transcription accuracy >90% for clear speech in a quiet room
- [ ] TTS audio plays clearly through the ESP32 speaker
- [ ] Display shows the correct state for idle / listening / thinking / replying / error / muted
- [ ] OTA firmware updates work without a USB cable
- [ ] Unit reconnects automatically after a WiFi drop
- [ ] Unit survives a power cycle and resumes normal operation

---

## Known Constraints

- **Memory**: voice_assistant + micro_wake_word + display + sensor dock is near the limit. Do NOT add Bluetooth or LVGL widgets — they will cause crashes.
- **WiFi**: 2.4GHz only. 5GHz networks are not supported.
- **Speaker**: 1W built-in. Volume capped at 85% to avoid distortion.
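The "copy + change substitutions" step for the remaining rooms could look like the fragment below. This is a sketch under assumptions: the substitution names (`name`, `friendly_name`) are placeholders for however the living-room config is actually parameterised.

```yaml
# Sketch: top of a hypothetical homeai-bedroom.yaml. Everything below the
# substitutions block is copied unchanged from homeai-living-room.yaml,
# assuming that file references ${name} / ${friendly_name} throughout.
substitutions:
  name: homeai-bedroom
  friendly_name: HomeAI Bedroom

esphome:
  name: ${name}
  friendly_name: ${friendly_name}
```

Keeping every room-specific value in `substitutions` means the per-room files differ only in that one block, which keeps multi-unit diffs reviewable.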
- **Display**: Static PNGs compiled into firmware. To change images, reflash via OTA (~1-2 min).
- **First compile**: Downloads the ESP-IDF toolchain (~500MB), takes 5-10 minutes. Incremental builds take 1-2 minutes.