
feat: Sprout Huddles — voice conversations with AI agents#299

Open
tlongwell-block wants to merge 36 commits into main from feat/huddles-plan

Conversation

Collaborator

@tlongwell-block tlongwell-block commented Apr 11, 2026

Summary

Adds real-time voice huddles to Sprout. Humans talk via LiveKit WebRTC; each desktop GUI locally transcribes speech and posts it to an ephemeral Nostr text channel where agents read, respond, and get read aloud. Agents never touch audio — they see only text. Multiple humans can join the same huddle.

Human speaks → AudioWorklet → Rust STT pipeline → text → relay → agent
Agent responds → text → Rust TTS pipeline → speakers → human hears
LiveKit SFU handles human-to-human audio mixing automatically

Push-to-talk (Ctrl+Space) is the default voice input mode. Voice activity detection (VAD) with barge-in remains as a switchable option.

What's New

Kokoro TTS (replaced Supertonic)

  • Kokoro-82M ONNX — Apache-2.0, 54 voices, 9 languages, single ONNX session (was 4 with Supertonic)
  • GPL-free G2P — three-tier dictionary lookup (misaki 178K words + CMUdict 126K words + morphological suffix rules), no espeak-ng
  • CoreML acceleration — automatic fallback to CPU when CoreML can't handle quantized ops
  • Sentence-by-sentence lookahead — BATCH_SIZE=1 for ~200ms TTFA with zero-gap playback
  • Runtime model download — model_q8f16.onnx (86MB), voice (510KB), lexicons (10MB), all SHA-256 verified

Multi-Human Huddles

  • Join existing huddles: HuddleIndicator detects active huddles via dedicated kind:48100-48103 subscription, shows green glow headphone icon with participant count badge
  • Active speaker highlighting — green ring on speaking participants via LiveKit events
  • Auto-end lifecycle — last human leaving auto-ends the huddle (archives ephemeral channel)
  • Creator-only end: end_huddle requires is_creator (or explicit force flag for crash recovery). UI hides "End all" for non-creators.
  • Safe joiner cleanup — joiner's mic failure only calls leave_huddle, never end_huddle
  • Reconnect handling: isReconnecting state shown in HuddleBar
  • Accessibility — ARIA live region, animate-pulse on PTT dot, aria-label/role on avatars

Crossfire Fixes (from multi-model review: Opus + Codex + Gemini)

  • C1: end_huddle creator check + UI guard
  • C2: tts_starting sentinel prevents TOCTOU race in pipeline creation
  • C3: tts_active flag set after first player.append, not before synthesis
  • I1: AudioWorklet setMode() isolates VAD from PTT events
  • I4: voiceInputModeRef prevents stale closure in connectAndSetupMedia
  • Bug: tokio::spawn → tauri::async_runtime::spawn (panic during Tauri setup)
  • Bug: earshot VAD clamp (panic on out-of-range samples after resampling)
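
The C2 fix above describes a sentinel that closes a check-then-create race. A minimal sketch of that idea with an atomic compare-and-exchange follows; the function names `try_claim_start` and `release_start` are hypothetical, and the real `tts_starting` field may be wired differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Try to claim the right to create the TTS pipeline.
/// Exactly one caller gets `true` until `release_start` is called,
/// so two concurrent callers cannot both pass a "not started yet" check.
pub fn try_claim_start(tts_starting: &AtomicBool) -> bool {
    tts_starting
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

/// Clear the sentinel once pipeline creation has finished or failed.
pub fn release_start(tts_starting: &AtomicBool) {
    tts_starting.store(false, Ordering::Release);
}
```

The check and the set happen in one atomic step, which is what removes the window between "is a pipeline starting?" and "mark it as starting".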

Relay

  • HuddleService wired into AppState — enabled by default with dev credentials
  • POST /api/huddles/{channel_id}/token — LiveKit token endpoint with auth + scope + membership checks
  • Huddle lifecycle events (kind 48100–48103, 48106) added to ingest allowlist + channel routing

SDK

  • 4 huddle lifecycle event builders with 8 tests

Desktop — Rust (desktop/src-tauri/src/huddle/)

  • kokoro.rs — Kokoro-82M ONNX TTS: single ort session, CoreML with CPU fallback, three-tier G2P (misaki + CMUdict + morphological rules), ARPAbet→IPA conversion, contraction handling
  • mod.rs — HuddleManager state machine, 16 Tauri commands, VoiceInputMode (PTT/VAD), session generation guard, tts_starting sentinel, auto-end on last human leave
  • stt.rs — STT pipeline: rubato resample → earshot VAD (clamped) → sherpa-onnx Moonshine. PTT gating, barge-in with 320ms debounce
  • tts.rs — TTS pipeline: Kokoro sentence-by-sentence lookahead → rodio playback. Cancel flag for PTT/barge-in, tts_active set after first append
  • models.rs — Background model download (Moonshine + Kokoro + CMUdict), SHA-256 verification, atomic swap install
  • preprocessing.rs — Text cleanup for TTS: strip markdown/code/URLs, numbers→words, unified split_sentences. 18 unit tests
  • agents.rs — Agent enrollment, voice-mode guidelines (kind:48106)

Desktop — WebView (desktop/src/features/huddle/)

  • HuddleContext — React context with startHuddle, joinHuddle, shared connectAndSetupMedia, voiceInputModeRef, fail-closed agent TTS filtering
  • HuddleIndicator — Subscribes to huddle lifecycle events, causal reconstruction, single icon (start/join)
  • HuddleBar — PTT/VAD indicator + toggle, active speaker rings, creator-only "End all"
  • AudioWorklet — PTT gating with setMode() for VAD/PTT isolation

Push-to-Talk Design

Ctrl+Space (global shortcut, works when app not focused)

Pressed:  ptt_active=true, cancel TTS, emit to frontend
Released: 200ms delay, flush STT, emit release

STT: PTT mode = vad AND ptt_active (entire key-hold = one utterance)
     VAD mode = vad only (continuous, barge-in enabled)
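
The gating rule above can be sketched as a pure function. `VoiceInputMode` and its two variants come from this PR; the function name `frame_is_speech` is a hypothetical helper, not necessarily what stt.rs uses.

```rust
/// Voice input modes named in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VoiceInputMode {
    PushToTalk,
    VoiceActivity,
}

/// Decide whether the current audio frame counts as speech.
/// `vad_speech` is the VAD verdict for the frame; `ptt_active` is the key state.
pub fn frame_is_speech(mode: VoiceInputMode, vad_speech: bool, ptt_active: bool) -> bool {
    match mode {
        // PTT mode: the VAD verdict only counts while the key is held.
        VoiceInputMode::PushToTalk => vad_speech && ptt_active,
        // VAD mode: the key state is ignored.
        VoiceInputMode::VoiceActivity => vad_speech,
    }
}
```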

New Dependencies

| Crate/Package | Version | License | Purpose |
|---|---|---|---|
| sherpa-onnx | 1.12 | Apache-2.0 | STT (Moonshine) |
| ort | =2.0.0-rc.12 | MIT/Apache-2.0 | ONNX Runtime (Kokoro TTS) + CoreML |
| ndarray | 0.17 | MIT/Apache-2.0 | Tensor operations |
| earshot | 1.0 | MIT/Apache-2.0 | Pure Rust VAD |
| rubato | 2.0 | MIT | Audio resampling |
| rodio | 0.22 | MIT/Apache-2.0 | Audio playback |
| livekit-client | 2.18.1 | Apache-2.0 | LiveKit JS SDK |

Model licenses: Kokoro (Apache-2.0), misaki dicts (Apache-2.0), CMUdict (BSD), Moonshine (MIT).
License audit: Zero GPL dependencies.

Testing

  • 128 desktop tests pass, SDK + relay tests pass
  • TypeScript typecheck clean, biome clean, clippy clean
  • All 7 pre-push hooks green
  • Live-tested: huddle with agent TTS/STT, PTT flow, Kokoro G2P, CoreML fallback

Follow-up

  • Delete supertonic.rs (dead code, zero references)
  • Multi-human live testing with 2+ desktop instances
  • Voice selection dropdown (54 Kokoro voices available)
  • Investigate CoreML-compatible model variant (fp16 instead of q8f16)
  • Parakeet STT evaluation (blocked on ort/sherpa-onnx version conflict)
  • Full AEC via webrtc-audio-processing for laptop speakers
  • Push-based huddle state events (replace polling)

Humans can voice chat via LiveKit WebRTC. Ephemeral text channel is
created for the huddle transcript. No STT/TTS yet (Phase 2+).

Relay:
- Wire HuddleService into AppState (optional, env-var gated)
- POST /api/huddles/{channel_id}/token endpoint with auth + scope +
  channel access checks
- Verify huddle kinds 48100-48105 stored/fanned-out (no changes needed)

SDK:
- 4 huddle lifecycle event builders (48100-48103) with tests

Desktop Rust:
- HuddleManager state machine (Idle → Creating → Active → Leaving)
- 5 Tauri commands: start/join/leave/end_huddle, get_huddle_state
- push_audio_pcm stub (Phase 1 placeholder for Phase 2 STT)
- 4 huddle event builders in events.rs
- Bulletproof rollback: all error paths reset to Idle, orphaned
  channels archived, HUDDLE_STARTED emitted only after token success
- NSMicrophoneUsageDescription in Info.plist

Desktop WebView:
- LiveKit JS SDK integration (livekit-client 2.18.1)
- AudioWorklet mic tap with 100ms PCM batching → Rust via InvokeBody::Raw
- HuddleBar floating UI (mute, leave, participant list)
- Resource cleanup on all failure paths

Crossfired: 3 rounds, Opus + Codex, both APPROVE 10/10 on final round.
Phase 0 spikes validated: AudioWorklet IPC, sherpa-onnx compilation,
LiveKit JS in WKWebView (all green).

Human speech is transcribed and posted to the ephemeral channel.
Speak in huddle → text appears in channel → agents can see it.

STT Pipeline (stt.rs):
- PCM f32 48kHz from AudioWorklet → rubato resample to 16kHz mono
  → earshot VAD → sherpa-onnx Moonshine → transcribed text
- Dedicated std::thread (CPU-bound, not async)
- Bounded audio queue (sync_channel, 50 slots, try_send drops on backpressure)
- Shutdown flag + Drop impl joins worker thread cleanly
- Final flush: buffered speech transcribed on shutdown
- Merged decoder layout (v2) matches Moonshine tiny int8 model
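
The bounded queue with drop-on-backpressure described above can be sketched with the standard library alone. `Offer`, `audio_queue`, and `offer_batch` are hypothetical names; the 50-slot capacity from the commit is passed as a parameter.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Outcome of offering one PCM batch to the worker queue.
#[derive(Debug, PartialEq, Eq)]
pub enum Offer {
    Queued,
    DroppedFull,
    WorkerGone,
}

/// Create a bounded queue; the commit above uses 50 slots.
pub fn audio_queue(slots: usize) -> (SyncSender<Vec<f32>>, Receiver<Vec<f32>>) {
    sync_channel(slots)
}

/// Non-blocking send: a full queue drops the batch instead of stalling the caller.
pub fn offer_batch(tx: &SyncSender<Vec<f32>>, batch: Vec<f32>) -> Offer {
    match tx.try_send(batch) {
        Ok(()) => Offer::Queued,
        Err(TrySendError::Full(_)) => Offer::DroppedFull,
        Err(TrySendError::Disconnected(_)) => Offer::WorkerGone,
    }
}
```

Dropping on a full queue keeps the audio producer from blocking when the CPU-bound worker falls behind, at the cost of losing that batch.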

Model Download Manager (models.rs):
- Background download of Moonshine tiny (~26MB) from sherpa-onnx releases
- Atomic extraction: temp dir → verify → rename-backup swap (old model preserved on failure)
- Platform-gated: #[cfg(unix)] tar extraction, clear error on non-Unix
- Race-safe: status set to Downloading before spawn
- OnceLock singleton, no unwrap() on mutex (poison recovery)

Pipeline Integration (mod.rs):
- push_audio_pcm feeds SttPipeline when active
- Transcribed text posted as kind:9 with agent p-tags (read at post time, not snapshot)
- Auto-start on huddle join/start if models ready (is_moonshine_ready)
- Pipeline shutdown called before state clear on leave/end
- recv_timeout replaces busy polling

New dependencies: sherpa-onnx 1.12, earshot 1.0, rubato 2.0, audioadapter-buffers 3.0

Crossfired: Codex 3/10 → 8/10 → 10/10 APPROVE after 3 fix rounds.

Agents participate in huddles via text. They hear all human speech,
respond when relevant, and get interrupted by new speech.

Agent Enrollment (agents.rs):
- add_agent_to_huddle: dual channel add (ephemeral + parent)
- Ephemeral add is required; parent add is best-effort
- Returns structured AgentAddResult with parent_error detail
- Only successfully enrolled agents get p-tagged on transcriptions
- Voice-mode guidelines posted as kind:9 [System] message (not kind:40099)

HuddleManager Integration (mod.rs):
- add_agent_to_huddle Tauri command with active phase check
- start_huddle tracks successful_agents (failed adds not enrolled)
- Voice-mode system message posted after HUDDLE_STARTED
- Agent pubkeys stored in Arc<Mutex<Vec<String>>> for live p-tag reads

Agent Add UI:
- '+' button on HuddleBar opens AddAgentDialog
- Dialog fetches list_managed_agents, filters to running agents only
- Structured error handling: hard failures shown as red, parent_error as amber warning
- Dialog stays open on warning so user can see the message

Design decisions:
- No separate ACP process spawn needed — existing managed agent auto-subscribes via kind:9000 membership notification
- ACP system prompt injection deferred to post-MVP (requires SubscriptionRule changes)
- Client does NOT mint kind:40099 (relay-signed only) — uses kind:9 instead

Crossfired: Codex 4/10 → 8/10 → 10/10 APPROVE after 3 rounds.

Agent responses are read aloud. Full end-to-end conversation loop:
speak → transcribe → agent responds → read aloud → human interrupts.

TTS Pipeline (tts.rs):
- sherpa-onnx Kokoro + rodio playback on dedicated thread
- Bounded text queue (8 slots, try_send drops on backpressure)
- Cancel flag for barge-in (clears queue + stops rodio Player)
- Shutdown checks in playback loop (cancel || shutdown)
- tts_active AtomicBool shared with STT for echo gating
- speak_agent_message Tauri command for WebView to feed agent text

Text Preprocessing (preprocessing.rs):
- Strip fenced code blocks (both ``` and ~~~) → 'code block omitted'
- Strip inline code, URLs → 'link omitted'
- Smart underscore handling (emphasis stripped, snake_case preserved)
- Numbers → words (0-999), times → spoken ('nine oh five')
- Emoji stripped, whitespace collapsed
- 10 unit tests
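
As an illustration of the numbers-to-words and spoken-time steps listed above, here is a minimal sketch. The hyphenation ("forty-two"), the "o'clock" form, and the helper names `two_digits` and `spoken_time` are assumptions; only `int_to_words` and the "nine oh five" example appear in this PR.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
];

/// Words for 0..=99.
fn two_digits(n: usize) -> String {
    if n < 20 {
        ONES[n].to_string()
    } else if n % 10 == 0 {
        TENS[n / 10].to_string()
    } else {
        format!("{}-{}", TENS[n / 10], ONES[n % 10])
    }
}

/// Words for 0..=999; returns None outside that range.
pub fn int_to_words(n: u32) -> Option<String> {
    if n > 999 {
        return None;
    }
    let n = n as usize;
    let (hundreds, rest) = (n / 100, n % 100);
    let mut parts = Vec::new();
    if hundreds > 0 {
        parts.push(format!("{} hundred", ONES[hundreds]));
    }
    if rest > 0 || hundreds == 0 {
        parts.push(two_digits(rest));
    }
    Some(parts.join(" "))
}

/// Spoken form of a clock time; single-digit minutes are read as "oh five".
pub fn spoken_time(hour: u32, minute: u32) -> Option<String> {
    if hour > 23 || minute > 59 {
        return None;
    }
    let h = int_to_words(hour)?;
    Some(match minute {
        0 => format!("{} o'clock", h),
        1..=9 => format!("{} oh {}", h, ONES[minute as usize]),
        _ => format!("{} {}", h, two_digits(minute as usize)),
    })
}
```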

Barge-In + STT Gating (stt.rs):
- VAD speech onset during TTS → set tts_cancel flag → immediate TTS stop
- speech_buf NOT accumulated during TTS (echo prevention)
- 200ms cooldown after TTS stops before STT re-enables
- tts_cancel shared between STT and TTS pipelines

Model Download (models.rs):
- Kokoro int8 model download (~92MB) with atomic extraction
- Same verify-swap pattern as Moonshine

UI (HuddleBar.tsx):
- TTS toggle button (Volume2/VolumeX icons)

New dependency: rodio 0.22

Crossfired: Codex 3/10 → 10/10 APPROVE after 2 rounds.

Adds a Headphones icon button next to the workflow Zap button in
ChannelMembersBar. Clicking starts a huddle for the current channel
via the start_huddle Tauri command.

Eliminates the espeak-ng GPL-3.0 dependency from the TTS pipeline.
sherpa-onnx bundled espeak-ng for Kokoro phonemization; the kokoro-tts
crate uses cmudict-fast (Apache-2.0) instead — zero GPL exposure.

Changes:
- tts.rs: swap sherpa-onnx Kokoro → kokoro-tts KokoroTts API
  - Async API driven by single-threaded tokio Runtime in worker thread
  - Voice::AfHeart(1.0) for American English female voice
  - Fixed 24kHz sample rate output
- models.rs: download from HuggingFace (individual files, not tar.bz2)
  - kokoro-v0_19.onnx + voices.bin only (no espeak-ng-data)
  - is_kokoro_ready() uses is_file() for stricter validation
- Cargo.toml: add kokoro-tts + pin ort=2.0.0-rc.11, keep sherpa-onnx for STT

ort pinned to rc.11 because kokoro-tts 0.3.2 is incompatible with rc.12
(generic Error<R> breaking change). No vendoring needed.

License audit: cargo tree shows ZERO GPL deps in the resolved graph.

…HuddleBar

Connect the headphones button to the full huddle pipeline:
  Button click → Rust start_huddle → LiveKit connect → AudioWorklet → HuddleBar

New file:
- HuddleContext.tsx: React context managing the huddle lifecycle with
  concurrency guards (busyRef), operation tokens (tokenRef) for
  start/leave race protection, and rustActiveRef for Rust state tracking.
  Every cleanup step is independently try/caught for resilience.

Modified:
- AppShell.tsx: Wrap with HuddleProvider, render HuddleBar (+5 lines)
- ChannelMembersBar.tsx: Use useHuddle() instead of raw invoke
- HuddleBar.tsx: Pull localAudioTrack + leaveHuddle from context,
  preserve active state on transient poll errors
- audioWorklet.ts: Clear onmessage before disconnect to prevent
  stale PCM sends
- index.ts: Export HuddleProvider and useHuddle

Crossfire reviewed: Opus 9/10 APPROVE, Codex 8/10 (3 rounds).

The relay's ingest pipeline rejected huddle lifecycle events as
'restricted: unknown event kind'. Add kinds 48100-48103 to:
- required_scope_for_kind: maps to Scope::ChannelsWrite
- requires_h_channel_scope: routes events to parent channel via h-tag

Without this, start_huddle fails after creating the ephemeral channel
because the HUDDLE_STARTED event (kind 48100) is rejected, triggering
rollback and returning an error to the frontend.

start_huddle was setting participants = member_pubkeys (always []),
never including the current user's pubkey. add_agent_to_huddle only
pushed to agent_pubkeys (for p-tags), not participants (for UI).

Now:
- start_huddle inserts the user's own pubkey at position 0
- add_agent_to_huddle also appends to participants

In dev mode, getUserMedia is silently denied because the terminal app
needs microphone permission in System Settings. For production builds,
macOS 13+ requires an Entitlements.plist with the audio-input entitlement.

- Create Entitlements.plist with com.apple.security.device.audio-input
- Wire it into tauri.conf.json bundle.macOS.entitlements
- Info.plist already had NSMicrophoneUsageDescription (no change needed)

Dev mode workaround: grant mic permission to Terminal.app in
System Settings → Privacy & Security → Microphone.

Voice models (Moonshine STT, Kokoro TTS) are now auto-downloaded on
first huddle start/join. Downloads are idempotent — no-op if already
on disk or in progress. First huddle runs voice-only; subsequent
huddles get STT+TTS once download completes.

HuddleBar now shows a voice activity indicator:
- Green pulsing dot: mic is live and picking up audio
- Gray dot: mic connected but silent
- 'no mic' text: LiveKit/mic connection failed

Also adds diagnostic console logs to connectToHuddle for tracing
the mic → LiveKit → publish flow.

Moonshine archive now contains split decoder layout (v1 int8):
  preprocess.onnx, encode.int8.onnx, cached_decode.int8.onnx,
  uncached_decode.int8.onnx — not the merged v2 layout.

Kokoro model files moved from hexgrad/Kokoro-82M (404) to
thewh1teagle/kokoro-onnx/releases/download/model-files/.

- Update MOONSHINE_EXPECTED_FILES to match actual archive contents
- Update OfflineMoonshineModelConfig to use split decoder fields
- Update KOKORO_MODEL_URL and KOKORO_VOICES_URL to working URLs

Agents now see which channel the huddle is attached to:
'This huddle is attached to channel <uuid> — that's the main channel.'

Changed VOICE_MODE_GUIDELINES from a const to voice_mode_guidelines(parent_channel_id).

The previous kokoro-v0_19.onnx + voices.bin from thewh1teagle/kokoro-onnx
used a ZIP/numpy voices format incompatible with the kokoro-tts 0.3 crate
(which expects bincode). The hexgrad/Kokoro-82M HuggingFace URLs also 404'd.

Switch to the crate author's own releases (mzdk100/kokoro V1.0):
- kokoro-v1.0.int8.onnx (92MB, quantized — smaller than v0.19's 325MB)
- voices.bin (bincode format, compatible with kokoro-tts 0.3)

…messages

Two fixes to complete the TTS pipeline:

Rust (speak_agent_message):
- Now async — lazily starts the TTS pipeline if models finished
  downloading after the huddle began (first huddle race condition).

JS (HuddleContext):
- Store ephemeralChannelId + selfPubkey after start_huddle
- Subscribe to ephemeral channel via relayClient
- Filter for kind:9 messages from non-self pubkeys (skip [System])
- Call invoke('speak_agent_message') for each agent message
- Clean up subscription on leave/unmount

TTS now splits text into sentences before synthesis. Kokoro handles
short chunks better — fewer spelling artifacts from CMU dictionary
fallback, better prosody, and allows barge-in between sentences.

Also fixes TTS subscription:
- Skip historical messages (created_at < subscription time)
- Skip empty/whitespace-only content
- Remove debug console.log statements

Lookahead: pre-synthesize sentence N+1 while sentence N plays via
rodio (separate audio thread). Eliminates inter-sentence gaps when
synthesis is faster than playback.

Sentence splitting: don't split on '1.' '2.' etc (digit + period).
Only split on period/exclamation/question preceded by a letter.
Removed semicolon as a split point (too aggressive).

This should fix the first-word spelling issue (fragments too short
for Kokoro's phonemizer) and the gap between sentences.
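
The splitting rule described above (terminator preceded by a letter, no split after a digit, no split on semicolons) can be sketched as follows. The extra condition that the terminator must be followed by whitespace or end of input is an assumption added here so that tokens like `main.rs` stay intact; the real split_sentences may differ.

```rust
/// Split text into sentences on '.', '!' or '?' when the terminator
/// follows a letter. Digits before a period (as in "1.") and semicolons
/// never end a sentence.
pub fn split_sentences(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        current.push(c);
        let is_terminator = matches!(c, '.' | '!' | '?');
        let after_letter = i > 0 && chars[i - 1].is_alphabetic();
        // Assumption: only split when whitespace or end of input follows.
        let at_boundary = chars.get(i + 1).map_or(true, |next| next.is_whitespace());
        if is_terminator && after_letter && at_boundary {
            let sentence = current.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that had no terminator.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}
```

With this rule a list marker such as "1." stays attached to the sentence that follows it, so the synthesizer never receives a bare one-token fragment.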
…espeak)

Supertonic TTS: 66M params, flow-matching architecture, MIT license,
Unicode tokenizer (no espeak/GPL), 10 voices (F1-F5, M1-M5).
167× real-time factor on M4 Pro — synthesis faster than playback.

New files:
- supertonic.rs: adapted from official helper.rs (637 lines), ort rc.11
  compatible, 4 ONNX sessions (duration/encoder/estimator/vocoder)

Changed:
- tts.rs: Supertonic engine replaces Kokoro, sentence-split lookahead
  synthesis (synth N+1 while rodio plays N), new_with_voice() for
  per-agent voice selection, F1 default voice
- models.rs: Supertonic model downloads from HuggingFace (~253MB total),
  7 files (4 ONNX + config + tokenizer + F1 voice style)
- Cargo.toml: removed kokoro-tts, added ndarray 0.17 + rayon,
  unicode-normalization, regex, rand, rand_distr. ort gets ndarray feature.
- mod.rs: kokoro_* → supertonic_* throughout
7 files × 89 = 623 overflows u8 (max 255). Use u32 arithmetic.
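
The arithmetic can be checked directly. This sketch does not reproduce the original expression in models.rs; `product_u8` and `product_u32` are hypothetical helpers that only show why widening before the multiply fixes the overflow.

```rust
/// Multiply in u8: returns None when the product does not fit (max 255).
pub fn product_u8(files: u8, per_file: u8) -> Option<u8> {
    files.checked_mul(per_file)
}

/// Widen to u32 before multiplying, so 7 * 89 = 623 is representable.
pub fn product_u32(files: u8, per_file: u8) -> u32 {
    u32::from(files) * u32::from(per_file)
}
```

In a debug build the unchecked u8 multiply panics; in a release build it wraps to 623 mod 256 = 111, which is silently wrong.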
The tts.json config shows ae.sample_rate = 44100. Playing 44.1kHz audio
at 24kHz makes it sound slow and low-pitched.
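
To make the effect concrete, the stretch and pitch shift from a sample-rate mismatch can be computed directly. This is a back-of-the-envelope sketch with hypothetical function names, not code from the PR.

```rust
/// Factor by which playback is stretched when audio produced at `source_hz`
/// is played at `playback_hz` without resampling.
pub fn duration_factor(source_hz: u32, playback_hz: u32) -> f64 {
    f64::from(source_hz) / f64::from(playback_hz)
}

/// Resulting pitch shift in semitones (negative means lower).
pub fn pitch_shift_semitones(source_hz: u32, playback_hz: u32) -> f64 {
    12.0 * (f64::from(playback_hz) / f64::from(source_hz)).log2()
}
```

For 44.1kHz audio played at 24kHz, playback takes about 1.84 times as long and sounds roughly ten and a half semitones lower.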
Tier 1:
1. Word-skipping fix (.max(1) on latent_lengths) — prevents dropping
   monosyllabic words like 'a', 'the', 'hello' (supertonic PR#33)
2. Single persistent rodio Player — eliminates OS audio device setup
   gap between sentences (was creating new Player per sentence)
3. Batch 3 sentences per synth call — model sees more context, better
   prosody across sentence boundaries
4. Volume boost 2.5× with clamp — Supertonic output is quiet

Tier 2:
5. 8ms fade in/out at chunk boundaries — eliminates clicks/pops
6. Inter-sentence silence 0.15s (was 0.3s default) — less robotic
7. Pre-buffer via batching — all batches synthesized and queued before
   playback wait, rodio plays them gaplessly
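
Items 4 and 5 above (volume boost with clamp, short fades at chunk boundaries) can be sketched on a plain sample buffer. The linear ramp and the function name `boost_and_fade` are assumptions; the commit only states the gain, the clamp, and the 8ms fade length.

```rust
/// Apply gain with clamping, then a linear fade-in and fade-out of `fade_ms`
/// at both ends of the chunk to avoid clicks at chunk boundaries.
pub fn boost_and_fade(samples: &mut [f32], sample_rate: u32, gain: f32, fade_ms: u32) {
    // Boost, keeping every sample inside the valid [-1.0, 1.0] range.
    for s in samples.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
    // Fade length in samples, never more than half the chunk.
    let fade_len = ((u64::from(sample_rate) * u64::from(fade_ms)) / 1000) as usize;
    let fade_len = fade_len.min(samples.len() / 2);
    let n = samples.len();
    for i in 0..fade_len {
        let ramp = i as f32 / fade_len as f32;
        samples[i] *= ramp; // fade in from silence
        samples[n - 1 - i] *= ramp; // fade out to silence
    }
}
```

At 44.1kHz an 8ms fade covers 352 samples, so a call such as `boost_and_fade(&mut chunk, 44_100, 2.5, 8)` matches the values named in the list.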
Comprehensive quality pass on the huddles implementation, driven by
iterative crossfire review (codex CLI + opus subagents, 11 rounds).

## Correctness
- Fix barge-in startup order: TTS starts before STT so tts_cancel is available
- Shared tts_cancel in HuddleState: survives pipeline restarts and TTS toggle
- Pipeline replacement leak: shutdown old STT pipeline before replacing
- Participant state: only successfully enrolled agents shown
- Sentence batching: join with space, not period-space (fixes garbled TTS)
- Agent prompt re-delivery: guidelines re-posted when agents join mid-huddle
- TTS duplicate-start prevention: re-check before storing new pipeline
- Two-phase activation: Connected → Active lifecycle with confirm_huddle_active
- Agent enrollment with role=bot: relay membership API correctly identifies agents

## Voice Quality
- Silence threshold: 300ms → 450ms (reduces sentence fragmentation)
- Barge-in debounce: 5 consecutive VAD frames (~80ms) required during TTS
- STT state reset: clean segment state across TTS transitions and cooldown
- STT hot-start: auto-starts when models finish downloading mid-huddle

## Agent Identity (authoritative, fail-closed)
- Relay membership API: fetch_agent_pubkeys_from_relay with role=bot filter
- get_huddle_agent_pubkeys Tauri command for frontend
- Backend periodic refresh in check_pipeline_hotstart (every 5s)
- Frontend periodic refresh (every 10s) with fail-closed semantics
- Result-based error propagation: fetch failures keep TTS mute
- Joiner hydration: join_huddle fetches agent list from relay

## Frontend
- Startup atomicity: ephemeralChannelId set only after full setup
- EOSE-based replay boundary with timestamp belt-and-suspenders
- Agent-only TTS filter: only bot-role pubkeys spoken, fail-closed
- AudioWorklet fire-and-forget: no main-thread backpressure
- Cleanup consolidation: single cleanupFailedStart helper
- leaveHuddle returns boolean: bar stays visible if backend cleanup fails
- HuddleBar respects Connected phase in poll-failure fallback

## Performance and Quality
- LazyLock regex in supertonic.rs: 12 patterns compiled once
- tokio::fs for all async file ops in model downloads
- Model-shape validation: bounds-check dims before indexing
- Async transcription task: tokio::sync::mpsc, no Tokio thread blocking
- Dead code removed, stale comments fixed, rustfmt + biome applied

Multi-round crossfire review (Opus + Codex CLI, 8 full review passes,
17 parallel worker delegates) identified and fixed 25+ issues across
the huddle voice pipeline.

Safety & correctness:
- Session generation guard: Arc<AtomicU64> on HuddleState prevents
  stale transcription tasks from posting kind:9 after leave/end
- Speech buffer capped at 30s (prevents OOM in noisy environments)
- Raw IPC payload bounded at 100KB per batch
- TTS input bounded at 2000 chars with Unicode-safe truncation
- LiveKit token/URL hidden from polling via #[serde(skip)]
- Model downloads: SHA-256 verification with pinned hashes, Rust-native
  tar+bzip2 extraction with pre-validation (path traversal, symlinks),
  streaming to disk, size limits, version manifest for cache invalidation
- Pubkey format validation (64 hex chars) at Tauri boundary
- Max 20 agents per huddle enforced on both start and incremental add
- UUID validation on all huddle event builders
- PCM alignment rejection (not just warning) on non-4-byte-aligned input
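
Three of the input checks listed above are easy to get subtly wrong, so here is a minimal sketch of each. The signatures are assumptions (only `validate_pubkey_hex` is named in this PR, and its real signature is not shown); limits such as 2000 chars and 100KB are passed in as parameters.

```rust
/// 64 hex characters. Checking bytes avoids slicing a &str at a
/// non-character boundary, which would panic on multi-byte UTF-8 input.
pub fn validate_pubkey_hex(s: &str) -> bool {
    s.len() == 64 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Truncate to at most `max_chars` characters without splitting a code point.
pub fn truncate_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}

/// Decode little-endian f32 PCM, rejecting oversized or misaligned payloads.
pub fn decode_pcm_f32le(bytes: &[u8], max_bytes: usize) -> Result<Vec<f32>, String> {
    if bytes.len() > max_bytes {
        return Err(format!("payload too large: {} bytes", bytes.len()));
    }
    if bytes.len() % 4 != 0 {
        return Err(format!("payload not 4-byte aligned: {} bytes", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```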

Architecture & DRY:
- teardown_huddle() helper eliminates duplicated leave/end shutdown code
- start_stt_pipeline delegates to maybe_start_stt_pipeline
- split_sentences consolidated in preprocessing.rs (deleted from tts.rs
  and supertonic.rs, net -85 lines)
- int_to_words extended to 0-999,999
- Agent refresh throttled to 15s with success-gated timestamp

Lifecycle:
- Creator enforcement: is_creator field on HuddleState, end_huddle
  rejects non-creators, HuddleBar shows End/Leave conditionally
- Failed startup cleanup calls end_huddle (not leave_huddle) to prevent
  orphaned ephemeral channels
- Participant state hydrated from relay immediately on start/join

UX & AX:
- Agent prompt tightened: silent when not addressed, no dot responses,
  no repeat after interruption
- TTS subscription buffers pre-EOSE events, replays live ones (fixes
  silent drop of fast agent responses)
- AddAgentDialog filters already-added agents
- Participant display shows 'In huddle' instead of misleading count
- Room label shows 'Huddle' instead of raw LiveKit room name
- TTS backpressure logged (Rust + frontend)
- AudioWorklet IPC wrapped behind invokeRawBinary() abstraction
- ASCII lifecycle docs added to HuddleContext, audioWorklet, livekit
- supertonic.rs header fixed (44.1kHz not 24kHz)
- worklet.js documents intentional partial-buffer drop on disconnect
- ParticipantList NaN-safe hue derivation with gray fallback
- tts.rs expect() calls replaced with match+break

18 files changed, +984 -370

6-round crossfire review (Opus 9/10 APPROVE, Codex 9/10 APPROVE).
Net -21 lines across 9 source files.

Rust backend:
- Unify fetch_agent_pubkeys + fetch_all_member_pubkeys → fetch_channel_members(role_filter)
- Add AppState::huddle() convenience (replaces 25+ lock().map_err() calls)
- spawn_transcription_task uses post_event_raw (checks HTTP status, shared auth)
- maybe_start_tts_pipeline returns Result<bool, String> (surfaces silent failures)
- Guidelines use kind:48106 instead of fragile [System] prefix on kind:9
- Remove duplicate guidelines re-post on add_agent_to_huddle (EOSE replay suffices)
- Extract post_connect_setup helper from start/join huddle (prevents drift)
- Bump session_generation on STT pipeline replacement (prevents stale transcripts)
- Consolidate triple lock in check_pipeline_hotstart → single acquisition
- Extract drain_until_shutdown<T> to mod.rs as pub(super) (shared by stt + tts)
- Tighten voice-mode prompt: 15 words, no filler, no apologies, no meta-responses
- Document TtsPipeline::cancel() as intentional future API surface

Frontend:
- Extract disconnectMedia() helper (leaveHuddle/endHuddle no longer duplicate cleanup)
- TTS subscription uses subscribeToChannelLive (since: now, no historical backlog)
- Remove EOSE buffering/replay — unnecessary with live-only subscription
- Narrow subscription to KIND_STREAM_MESSAGE only (less wire traffic)
- Add subscribeToChannelLive() to RelayClient (limit:1000 for reconnect safety)

Crossfire review (opus×2 + codex) identified 15+ issues. All fixed:

Safety & correctness:
- Fix validate_pubkey_hex panic on multi-byte UTF-8 input
- Fix pipeline init failure wedge (is_finished + dead pipeline detection)
- Fix guidelines delivery race (post kind:48106 before adding agents)
- Fix teardown blocking mutex during thread join
- Fix endHuddle swallowing failures (returns boolean)
- Fix livekit.ts mic track leak on disconnect failure (try/finally)
- Fix tts_active staying true during cancel
- Filter single-char TTS responses (prevents speaking 'period')

DRY extraction:
- models.rs: 826→690 lines via ModelSlot + shared helpers
  (download_file, fetch_url, fresh_temp_dir, verify_and_install)
- events.rs: 4 huddle event builders → shared build_huddle_event
- tts.rs: 4 cancel/shutdown patterns → handle_cancel_or_shutdown

UX improvements:
- ParticipantList resolves pubkeys to display names (useUsersBatchQuery)
- ProfileAvatar with hex-prefix HexAvatar fallback
- Voice guidelines prompt tightened for fast LLMs

Cleanup:
- Delete dead voice_style_path(), TtsPipeline::cancel()
- Fix stale comments, document LE endianness assumption

Four targeted improvements from crossfire review (codex 9/10, opus 8/10):

1. post_connect_setup fault boundary: model download and member hydration
   failures no longer tear down a working huddle. The call stays up in
   degraded mode (no STT/TTS) instead of failing entirely.

2. Remove duplicate preprocessing from supertonic.rs: emoji stripping and
   whitespace collapsing already handled by preprocessing.rs. Deleted two
   stale LazyLock<Regex> statics (RE_EMOJI, RE_WHITESPACE).

3. Feature-gate dead code: webhook.rs and session.rs gated behind
   #[cfg(feature = "webhook")]. hmac/sha2/hex deps made optional.
   3 tests run by default, 8 with --features webhook.

4. Model download progress in HuddleBar: shows 'Voice models: STT 42%,
   TTS 78%' while downloading, disappears when ready. Serde decoder
   correctly handles both string and object enum variants.

Also: documented WHY dual membership polling (Rust + React) is intentional —
Rust preserves stale list on failure (STT p-tags), React clears on failure
(TTS authorization must fail-closed). Different safety requirements.

…pp launch

1. Add kind 48106 (huddle guidelines) to the relay's ingest allowlist.
   The range 48100..=48103 was allowed but 48106 was missing, causing
   'restricted: unknown event kind' when posting voice-mode guidelines.
   Widened to 48100..=48106.

2. Trigger background voice model downloads at app launch (in Tauri setup
   hook). Models are ~303 MB total (50 MB Moonshine STT + 253 MB Supertonic
   TTS). Downloads are async, idempotent, SHA-256 verified, and no-op if
   already cached. First huddle no longer has a cold-start download wait.

…bled

Fixes discovered during live huddle testing with agents:

**Relay:**
- Add kind 48106 (huddle guidelines) to requires_h_channel_scope so
  guidelines are routed to the ephemeral channel. Agents now see the
  voice-mode prompt via their subscription.

**TTS (tts.rs):**
- Prime audio output with 100ms silent buffer at worker startup. On macOS,
  CoreAudio initializes lazily — without priming, the first Player races
  against device startup and player.empty() returns true prematurely,
  truncating the first TTS message after a few words.

**STT (stt.rs):**
- Disable barge-in (VAD-based TTS cancellation). Without acoustic echo
  cancellation, any ambient noise — keyboard, fan, breathing — triggers
  false barge-in and kills TTS mid-sentence. Push-to-talk will replace
  this; echo-gating (skip accumulation during TTS) is preserved.
- Reduce TTS cooldown 200ms → 50ms. The old value ate the first word
  when the user spoke immediately after the agent finished.
- Reduce silence flush threshold 28 → 19 frames (450ms → 300ms).
  Snappier transcript delivery without splitting mid-word pauses.

Huddles are now enabled out of the box with dev defaults:
  LIVEKIT_URL=ws://localhost:7880
  LIVEKIT_API_KEY=devkey
  LIVEKIT_API_SECRET=secret

Set SPROUT_HUDDLES_DISABLED=true to disable. Override the LIVEKIT_*
env vars for production deployments.

Add push-to-talk (PTT) via global Ctrl+Space shortcut as the default
voice input mode for huddles. Voice activity detection (VAD) remains
as a switchable option with barge-in enabled.

Rust backend:
- VoiceInputMode enum (PushToTalk default, VoiceActivity)
- ptt_active: Arc<AtomicBool> shared with STT pipeline
- Global shortcut handler with 200ms release delay and generation
  counter to prevent press→release→press race condition
- PTT press cancels TTS immediately (only when TTS is active)
- STT gating: is_speech ANDed with ptt_active flag — natural flush
  on release via silence accumulation
- PTT mode accumulates entire key-hold as one utterance (no mid-
  sentence splits on pauses); flush only on release edge
- Barge-in re-enabled for VAD mode with 320ms debounce
- Mode switch mid-huddle restarts STT pipeline
- Disable WebView background throttling (tauri.conf.json) so
  AudioWorklet keeps processing when window loses focus
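
The press→release→press race mentioned above can be illustrated with a generation counter: a delayed release only takes effect if no new press arrived in the meantime. `PttState` and its methods are hypothetical; the sketch leaves out the 200ms timer itself, and the load-then-store in `finish_release` is not atomic as a pair, which a real implementation may want to tighten.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Push-to-talk key state plus a counter bumped on every press.
pub struct PttState {
    pub active: AtomicBool,
    generation: AtomicU64,
}

impl PttState {
    pub fn new() -> Self {
        PttState {
            active: AtomicBool::new(false),
            generation: AtomicU64::new(0),
        }
    }

    /// Key pressed: invalidate any pending release, then mark active.
    pub fn press(&self) {
        self.generation.fetch_add(1, Ordering::SeqCst);
        self.active.store(true, Ordering::SeqCst);
    }

    /// Key released: take a ticket to redeem after the release delay.
    pub fn begin_release(&self) -> u64 {
        self.generation.load(Ordering::SeqCst)
    }

    /// Called once the delay has elapsed. Deactivates only if no press
    /// happened since the ticket was taken; returns whether it did.
    pub fn finish_release(&self, ticket: u64) -> bool {
        if self.generation.load(Ordering::SeqCst) == ticket {
            self.active.store(false, Ordering::SeqCst);
            true
        } else {
            false
        }
    }
}
```

Without the ticket check, a release timer started by the first key-up would switch the mic off in the middle of the second key-hold.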

Frontend:
- worklet.js: PTT gating via transmitting flag + port.onmessage
- audioWorklet.ts: Tauri ptt-state event → worklet forwarding
- HuddleContext: pttActive/voiceInputMode/setVoiceInputMode state
- HuddleBar: PTT/VAD indicator, mode toggle, green ring on transmit

Also fixes two review items from crossfire:
- Preserve session_generation across error-path state resets
- set_tts_enabled takes pipeline out of lock before shutdown

Enable multiple humans to join the same huddle simultaneously.

Backend (mod.rs):
- join_huddle: add human to ephemeral channel (hard-fail unless
  already member), full rollback pattern on any failure
- leave_huddle: auto-end when last human departs — emits
  HUDDLE_ENDED + archives ephemeral channel, avoids relay
  'cannot remove last owner' error by skipping build_leave
- end_huddle: any participant can end (creator disconnect recovery)
- Extract reset_preserving_generation() DRY helper (3 call sites)
- Add count_human_members() for last-human detection

Frontend context (HuddleContext.tsx):
- Add joinHuddle() to context API
- Extract connectAndSetupMedia() shared helper (start + join)
- Add activeSpeakers + isReconnecting state from LiveKit events
- cleanupFailedStart(isCreator): joiners use leave_huddle only,
  preventing a failed join from ending everyone's huddle

Frontend livekit.ts:
- HuddleRoomCallbacks: ActiveSpeakersChanged, Disconnected,
  Reconnecting, Reconnected room event listeners
- getUserMedia with echoCancellation + noiseSuppression

Frontend UI:
- HuddleIndicator (new): subscribes to kind:48100-48103 via
  dedicated subscribeToHuddleEvents (limit 100), event-id dedup
  with batch reconstruction, causal sort, fail-closed validation,
  infers huddle from JOIN/LEFT even without START event. Renders
  start button when idle, green glow join button when active.
  Single icon replaces dual headphone buttons.
- HuddleBar: reconnecting indicator, animate-pulse PTT (WCAG),
  ARIA live region (output element), Leave + End-all buttons for
  all participants
- ParticipantList: activeSpeakers green ring, aria-label/role
- ChannelMembersBar: single HuddleIndicator replaces dual icons,
  invalidateQueries on start/join for immediate sidebar refresh

Relay client (relayClientSession.ts):
- subscribeToHuddleEvents(): kinds 48100-48103 only, limit 100

File size overrides bumped for mod.rs (1450), HuddleContext (630),
relayClientSession (830) to accommodate multi-human additions.

TTS Engine Replacement (Supertonic → Kokoro):
- New kokoro.rs: Kokoro-82M ONNX TTS engine, 100% GPL-free
  - Single ONNX session via ort crate (was 4 sessions with Supertonic)
  - CoreML acceleration with automatic CPU fallback
  - 24kHz output (was 44.1kHz) — rodio resamples transparently
  - 54 voices across 9 languages (was 10 voices, 5 languages)
- G2P: three-tier dictionary lookup, no espeak-ng dependency
  - Tier 1: misaki gold+silver dicts (178K words, Apache-2.0)
  - Tier 2: CMUdict (126K words, BSD) with ARPAbet→IPA conversion
  - Tier 3: morphological suffix rules (-s/-ed/-ing) from misaki
  - Fallback (after the three lookup tiers): letter-by-letter spelling for OOV words
  - Contraction handling (don't, it's, we've, etc.)
- Model download: model_q8f16.onnx (86MB), af_heart.bin voice (510KB),
  us_gold.json + us_silver.json (6MB), cmudict.dict (3.6MB)
  All SHA-256 verified, runtime download via existing model manager
- tts.rs: sentence-by-sentence lookahead pipeline (BATCH_SIZE=1)
  for 200ms TTFA with zero-gap playback
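
The tiered G2P lookup described above can be sketched with two hash maps and a suffix-stripping retry. Everything here is illustrative: `Lexicon`, `G2pSource`, and the suffix phonemes are assumptions (real suffix rules depend on the stem's final sound), and the toy maps stand in for the misaki and CMUdict data.

```rust
use std::collections::HashMap;

/// Which tier produced the pronunciation.
#[derive(Debug, PartialEq, Eq)]
pub enum G2pSource {
    Primary,
    Cmudict,
    SuffixRule,
    Spelled,
}

pub struct Lexicon {
    /// Stands in for the misaki gold+silver dictionaries.
    pub primary: HashMap<String, String>,
    /// Stands in for CMUdict entries already converted to IPA.
    pub cmudict: HashMap<String, String>,
}

impl Lexicon {
    /// Tiers 1 and 2: exact dictionary hits, primary first.
    fn lookup_exact(&self, word: &str) -> Option<(String, G2pSource)> {
        if let Some(p) = self.primary.get(word) {
            return Some((p.clone(), G2pSource::Primary));
        }
        self.cmudict.get(word).map(|p| (p.clone(), G2pSource::Cmudict))
    }

    pub fn phonemize(&self, word: &str) -> (String, G2pSource) {
        let word = word.to_lowercase();
        if let Some(hit) = self.lookup_exact(&word) {
            return hit;
        }
        // Tier 3: strip a known suffix and retry on the stem.
        // The appended phonemes are placeholders, not real voicing rules.
        for (suffix, tail) in [("ing", "ɪŋ"), ("ed", "d"), ("s", "z")] {
            if let Some(stem) = word.strip_suffix(suffix) {
                if let Some((p, _)) = self.lookup_exact(stem) {
                    return (format!("{}{}", p, tail), G2pSource::SuffixRule);
                }
            }
        }
        // Fallback: spell the word letter by letter.
        let spelled: Vec<String> = word.chars().map(|c| c.to_string()).collect();
        (spelled.join(" "), G2pSource::Spelled)
    }
}
```

Ordering the tiers from most to least trusted means a curated entry always wins over a derived one, and the spelling fallback only fires for words no tier can resolve.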

Crossfire Review Fixes:
- C1: end_huddle now requires is_creator (or explicit force flag)
  UI hides 'End all' button for non-creators
- C2: tts_starting sentinel prevents TOCTOU race in pipeline creation
- C3: tts_active flag set after first player.append, not before synthesis
- I1: audioWorklet setMode() isolates VAD from PTT events
- I4: voiceInputModeRef prevents stale closure in connectAndSetupMedia

Bug Fixes:
- Fix tokio::spawn panic during Tauri setup (use tauri::async_runtime::spawn)
- Fix earshot VAD panic on out-of-range samples (clamp after resampling)

Dependencies:
- ort: rc.11 → rc.12 + coreml feature (zero binary size cost on macOS)
- Removed: rand, rand_distr, unicode-normalization (Supertonic-only)

Licenses: Kokoro model (Apache-2.0), misaki dicts (Apache-2.0),
CMUdict (BSD), ort (MIT), ONNX Runtime (MIT). No GPL anywhere.