feat: Sprout Huddles — voice conversations with AI agents by tlongwell-block · Pull Request #299 · block/sprout

tlongwell-block · 2026-04-11T18:22:34Z

Summary

Adds real-time voice huddles to Sprout. Humans talk via LiveKit WebRTC; each desktop GUI locally transcribes speech and posts it to an ephemeral Nostr text channel where agents read, respond, and get read aloud. Agents never touch audio — they see only text. Multiple humans can join the same huddle.

Human speaks → AudioWorklet → Rust STT pipeline → text → relay → agent
Agent responds → text → Rust TTS pipeline → speakers → human hears
LiveKit SFU handles human-to-human audio mixing automatically

Push-to-talk (Ctrl+Space) is the default voice input mode. Voice activity detection (VAD) with barge-in remains as a switchable option.

What's New

Kokoro TTS (replaced Supertonic)

Kokoro-82M ONNX — Apache-2.0, 54 voices, 9 languages, single ONNX session (was 4 with Supertonic)
GPL-free G2P — three-tier dictionary lookup (misaki 178K words + CMUdict 126K words + morphological suffix rules), no espeak-ng
CoreML acceleration — automatic fallback to CPU when CoreML can't handle quantized ops
Sentence-by-sentence lookahead — BATCH_SIZE=1 for ~200ms TTFA with zero-gap playback
Runtime model download — model_q8f16.onnx (86MB), voice (510KB), lexicons (10MB), all SHA-256 verified

Multi-Human Huddles

Join existing huddles — HuddleIndicator detects active huddles via dedicated kind:48100-48103 subscription, shows green glow headphone icon with participant count badge
Active speaker highlighting — green ring on speaking participants via LiveKit events
Auto-end lifecycle — last human leaving auto-ends the huddle (archives ephemeral channel)
Creator-only end — end_huddle requires is_creator (or explicit force flag for crash recovery). UI hides "End all" for non-creators.
Safe joiner cleanup — joiner's mic failure only calls leave_huddle, never end_huddle
Reconnect handling — isReconnecting state shown in HuddleBar
Accessibility — ARIA live region, animate-pulse on PTT dot, aria-label/role on avatars

Crossfire Fixes (from multi-model review: Opus + Codex + Gemini)

C1: end_huddle creator check + UI guard
C2: tts_starting sentinel prevents TOCTOU race in pipeline creation
C3: tts_active flag set after first player.append, not before synthesis
I1: AudioWorklet setMode() isolates VAD from PTT events
I4: voiceInputModeRef prevents stale closure in connectAndSetupMedia
Bug: tokio::spawn → tauri::async_runtime::spawn (panic during Tauri setup)
Bug: earshot VAD clamp (panic on out-of-range samples after resampling)

Relay

HuddleService wired into AppState — enabled by default with dev credentials
POST /api/huddles/{channel_id}/token — LiveKit token endpoint with auth + scope + membership checks
Huddle lifecycle events (kind 48100–48103, 48106) added to ingest allowlist + channel routing

SDK

4 huddle lifecycle event builders with 8 tests

Desktop — Rust (`desktop/src-tauri/src/huddle/`)

kokoro.rs — Kokoro-82M ONNX TTS: single ort session, CoreML with CPU fallback, three-tier G2P (misaki + CMUdict + morphological rules), ARPAbet→IPA conversion, contraction handling
mod.rs — HuddleManager state machine, 16 Tauri commands, VoiceInputMode (PTT/VAD), session generation guard, tts_starting sentinel, auto-end on last human leave
stt.rs — STT pipeline: rubato resample → earshot VAD (clamped) → sherpa-onnx Moonshine. PTT gating, barge-in with 320ms debounce
tts.rs — TTS pipeline: Kokoro sentence-by-sentence lookahead → rodio playback. Cancel flag for PTT/barge-in, tts_active set after first append
models.rs — Background model download (Moonshine + Kokoro + CMUdict), SHA-256 verification, atomic swap install
preprocessing.rs — Text cleanup for TTS: strip markdown/code/URLs, numbers→words, unified split_sentences. 18 unit tests
agents.rs — Agent enrollment, voice-mode guidelines (kind:48106)

Desktop — WebView (`desktop/src/features/huddle/`)

HuddleContext — React context with startHuddle, joinHuddle, shared connectAndSetupMedia, voiceInputModeRef, fail-closed agent TTS filtering
HuddleIndicator — Subscribes to huddle lifecycle events, causal reconstruction, single icon (start/join)
HuddleBar — PTT/VAD indicator + toggle, active speaker rings, creator-only "End all"
AudioWorklet — PTT gating with setMode() for VAD/PTT isolation

Push-to-Talk Design

Ctrl+Space (global shortcut, works when app not focused)

Pressed:  ptt_active=true, cancel TTS, emit to frontend
Released: 200ms delay, flush STT, emit release

STT: PTT mode = vad AND ptt_active (entire key-hold = one utterance)
     VAD mode = vad only (continuous, barge-in enabled)
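
The gating rule above can be written as a small predicate. This is a minimal sketch, assuming the STT worker sees a per-frame VAD decision and a PTT flag; the function name `accepts_speech` is hypothetical and not taken from the PR.

```rust
/// Voice input modes named in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VoiceInputMode {
    PushToTalk,
    VoiceActivity,
}

/// Hypothetical helper: decide whether a frame counts as speech for STT.
/// PTT mode requires both the VAD decision and a held key;
/// VAD mode relies on the VAD decision alone.
pub fn accepts_speech(mode: VoiceInputMode, vad_is_speech: bool, ptt_active: bool) -> bool {
    match mode {
        VoiceInputMode::PushToTalk => vad_is_speech && ptt_active,
        VoiceInputMode::VoiceActivity => vad_is_speech,
    }
}
```

In PTT mode a frame with speech but no held key is ignored, which is what keeps background talk out of the transcript.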

New Dependencies

Crate/Package	Version	License	Purpose
`sherpa-onnx`	1.12	Apache-2.0	STT (Moonshine)
`ort`	=2.0.0-rc.12	MIT/Apache-2.0	ONNX Runtime (Kokoro TTS) + CoreML
`ndarray`	0.17	MIT/Apache-2.0	Tensor operations
`earshot`	1.0	MIT/Apache-2.0	Pure Rust VAD
`rubato`	2.0	MIT	Audio resampling
`rodio`	0.22	MIT/Apache-2.0	Audio playback
`livekit-client`	2.18.1	Apache-2.0	LiveKit JS SDK

Model licenses: Kokoro (Apache-2.0), misaki dicts (Apache-2.0), CMUdict (BSD), Moonshine (MIT).
License audit: Zero GPL dependencies.

Testing

128 desktop tests pass, SDK + relay tests pass
TypeScript typecheck clean, biome clean, clippy clean
All 7 pre-push hooks green
Live-tested: huddle with agent TTS/STT, PTT flow, Kokoro G2P, CoreML fallback

Follow-up

Delete supertonic.rs (dead code, zero references)
Multi-human live testing with 2+ desktop instances
Voice selection dropdown (54 Kokoro voices available)
Investigate CoreML-compatible model variant (fp16 instead of q8f16)
Parakeet STT evaluation (blocked on ort/sherpa-onnx version conflict)
Full AEC via webrtc-audio-processing for laptop speakers
Push-based huddle state events (replace polling)

Humans can voice chat via LiveKit WebRTC. Ephemeral text channel is created for the huddle transcript. No STT/TTS yet (Phase 2+).
Relay:
- Wire HuddleService into AppState (optional, env-var gated)
- POST /api/huddles/{channel_id}/token endpoint with auth + scope + channel access checks
- Verify huddle kinds 48100-48105 stored/fanned-out (no changes needed)
SDK:
- 4 huddle lifecycle event builders (48100-48103) with tests
Desktop Rust:
- HuddleManager state machine (Idle → Creating → Active → Leaving)
- 5 Tauri commands: start/join/leave/end_huddle, get_huddle_state
- push_audio_pcm stub (Phase 1 placeholder for Phase 2 STT)
- 4 huddle event builders in events.rs
- Bulletproof rollback: all error paths reset to Idle, orphaned channels archived, HUDDLE_STARTED emitted only after token success
- NSMicrophoneUsageDescription in Info.plist
Desktop WebView:
- LiveKit JS SDK integration (livekit-client 2.18.1)
- AudioWorklet mic tap with 100ms PCM batching → Rust via InvokeBody::Raw
- HuddleBar floating UI (mute, leave, participant list)
- Resource cleanup on all failure paths
Crossfired: 3 rounds, Opus + Codex, both APPROVE 10/10 on final round.
Phase 0 spikes validated: AudioWorklet IPC, sherpa-onnx compilation, LiveKit JS in WKWebView — all green.

Human speech is transcribed and posted to the ephemeral channel. Speak in huddle → text appears in channel → agents can see it.
STT Pipeline (stt.rs):
- PCM f32 48kHz from AudioWorklet → rubato resample to 16kHz mono → earshot VAD → sherpa-onnx Moonshine → transcribed text
- Dedicated std::thread (CPU-bound, not async)
- Bounded audio queue (sync_channel, 50 slots, try_send drops on backpressure)
- Shutdown flag + Drop impl joins worker thread cleanly
- Final flush: buffered speech transcribed on shutdown
- Merged decoder layout (v2) matches Moonshine tiny int8 model
Model Download Manager (models.rs):
- Background download of Moonshine tiny (~26MB) from sherpa-onnx releases
- Atomic extraction: temp dir → verify → rename-backup swap (old model preserved on failure)
- Platform-gated: #[cfg(unix)] tar extraction, clear error on non-Unix
- Race-safe: status set to Downloading before spawn
- OnceLock singleton, no unwrap() on mutex (poison recovery)
Pipeline Integration (mod.rs):
- push_audio_pcm feeds SttPipeline when active
- Transcribed text posted as kind:9 with agent p-tags (read at post time, not snapshot)
- Auto-start on huddle join/start if models ready (is_moonshine_ready)
- Pipeline shutdown called before state clear on leave/end
- recv_timeout replaces busy polling
New dependencies: sherpa-onnx 1.12, earshot 1.0, rubato 2.0, audioadapter-buffers 3.0
Crossfired: Codex 3/10 → 8/10 → 10/10 APPROVE after 3 fix rounds.
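
The bounded audio queue with drop-on-backpressure can be illustrated with the standard library alone. This is a minimal sketch, assuming PCM batches are passed as `Vec<f32>`; the `AudioQueue` wrapper and its drop counter are hypothetical and only show how `sync_channel` and `try_send` combine.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical wrapper around a bounded channel of PCM batches.
pub struct AudioQueue {
    tx: SyncSender<Vec<f32>>,
    dropped: u64,
}

impl AudioQueue {
    /// Create a queue with `slots` buffered batches (the PR uses 50).
    pub fn new(slots: usize) -> (Self, Receiver<Vec<f32>>) {
        let (tx, rx) = sync_channel(slots);
        (Self { tx, dropped: 0 }, rx)
    }

    /// Never blocks the caller: a full queue drops the batch and counts it.
    pub fn push(&mut self, batch: Vec<f32>) -> bool {
        match self.tx.try_send(batch) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                self.dropped += 1;
                false
            }
            // The worker thread has gone away; nothing left to feed.
            Err(TrySendError::Disconnected(_)) => false,
        }
    }

    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

Dropping instead of blocking keeps the IPC handler responsive when the STT worker falls behind, at the cost of losing some audio under load.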

Agents participate in huddles via text. They hear all human speech, respond when relevant, and get interrupted by new speech.
Agent Enrollment (agents.rs):
- add_agent_to_huddle: dual channel add (ephemeral + parent)
- Ephemeral add is required; parent add is best-effort
- Returns structured AgentAddResult with parent_error detail
- Only successfully enrolled agents get p-tagged on transcriptions
- Voice-mode guidelines posted as kind:9 [System] message (not kind:40099)
HuddleManager Integration (mod.rs):
- add_agent_to_huddle Tauri command with active phase check
- start_huddle tracks successful_agents (failed adds not enrolled)
- Voice-mode system message posted after HUDDLE_STARTED
- Agent pubkeys stored in Arc<Mutex<Vec<String>>> for live p-tag reads
Agent Add UI:
- '+' button on HuddleBar opens AddAgentDialog
- Dialog fetches list_managed_agents, filters to running agents only
- Structured error handling: hard failures shown as red, parent_error as amber warning
- Dialog stays open on warning so user can see the message
Design decisions:
- No separate ACP process spawn needed — existing managed agent auto-subscribes via kind:9000 membership notification
- ACP system prompt injection deferred to post-MVP (requires SubscriptionRule changes)
- Client does NOT mint kind:40099 (relay-signed only) — uses kind:9 instead
Crossfired: Codex 4/10 → 8/10 → 10/10 APPROVE after 3 rounds.

Agent responses are read aloud. Full end-to-end conversation loop: speak → transcribe → agent responds → read aloud → human interrupts.
TTS Pipeline (tts.rs):
- sherpa-onnx Kokoro + rodio playback on dedicated thread
- Bounded text queue (8 slots, try_send drops on backpressure)
- Cancel flag for barge-in (clears queue + stops rodio Player)
- Shutdown checks in playback loop (cancel || shutdown)
- tts_active AtomicBool shared with STT for echo gating
- speak_agent_message Tauri command for WebView to feed agent text
Text Preprocessing (preprocessing.rs):
- Strip fenced code blocks (both ``` and ~~~) → 'code block omitted'
- Strip inline code, URLs → 'link omitted'
- Smart underscore handling (emphasis stripped, snake_case preserved)
- Numbers → words (0-999), times → spoken ('nine oh five')
- Emoji stripped, whitespace collapsed
- 10 unit tests
Barge-In + STT Gating (stt.rs):
- VAD speech onset during TTS → set tts_cancel flag → immediate TTS stop
- speech_buf NOT accumulated during TTS (echo prevention)
- 200ms cooldown after TTS stops before STT re-enables
- tts_cancel shared between STT and TTS pipelines
Model Download (models.rs):
- Kokoro int8 model download (~92MB) with atomic extraction
- Same verify-swap pattern as Moonshine
UI (HuddleBar.tsx):
- TTS toggle button (Volume2/VolumeX icons)
New dependency: rodio 0.22
Crossfired: Codex 3/10 → 10/10 APPROVE after 2 rounds.

Adds a Headphones icon button next to the workflow Zap button in ChannelMembersBar. Clicking starts a huddle for the current channel via the start_huddle Tauri command.

Eliminates the espeak-ng GPL-3.0 dependency from the TTS pipeline. sherpa-onnx bundled espeak-ng for Kokoro phonemization; the kokoro-tts crate uses cmudict-fast (Apache-2.0) instead — zero GPL exposure.
Changes:
- tts.rs: swap sherpa-onnx Kokoro → kokoro-tts KokoroTts API
  - Async API driven by single-threaded tokio Runtime in worker thread
  - Voice::AfHeart(1.0) for American English female voice
  - Fixed 24kHz sample rate output
- models.rs: download from HuggingFace (individual files, not tar.bz2)
  - kokoro-v0_19.onnx + voices.bin only (no espeak-ng-data)
  - is_kokoro_ready() uses is_file() for stricter validation
- Cargo.toml: add kokoro-tts + pin ort=2.0.0-rc.11, keep sherpa-onnx for STT
ort pinned to rc.11 because kokoro-tts 0.3.2 is incompatible with rc.12 (generic Error<R> breaking change). No vendoring needed.
License audit: cargo tree shows ZERO GPL deps in the resolved graph.

…HuddleBar
Connect the headphones button to the full huddle pipeline:
Button click → Rust start_huddle → LiveKit connect → AudioWorklet → HuddleBar
New file:
- HuddleContext.tsx: React context managing the huddle lifecycle with concurrency guards (busyRef), operation tokens (tokenRef) for start/leave race protection, and rustActiveRef for Rust state tracking. Every cleanup step is independently try/caught for resilience.
Modified:
- AppShell.tsx: Wrap with HuddleProvider, render HuddleBar (+5 lines)
- ChannelMembersBar.tsx: Use useHuddle() instead of raw invoke
- HuddleBar.tsx: Pull localAudioTrack + leaveHuddle from context, preserve active state on transient poll errors
- audioWorklet.ts: Clear onmessage before disconnect to prevent stale PCM sends
- index.ts: Export HuddleProvider and useHuddle
Crossfire reviewed: Opus 9/10 APPROVE, Codex 8/10 (3 rounds).

The relay's ingest pipeline rejected huddle lifecycle events as 'restricted: unknown event kind'. Add kinds 48100-48103 to:
- required_scope_for_kind: maps to Scope::ChannelsWrite
- requires_h_channel_scope: routes events to parent channel via h-tag
Without this, start_huddle fails after creating the ephemeral channel because the HUDDLE_STARTED event (kind 48100) is rejected, triggering rollback and returning an error to the frontend.

start_huddle was setting participants = member_pubkeys (always []), never including the current user's pubkey. add_agent_to_huddle only pushed to agent_pubkeys (for p-tags), not participants (for UI).
Now:
- start_huddle inserts the user's own pubkey at position 0
- add_agent_to_huddle also appends to participants

In dev mode, getUserMedia is silently denied because the terminal app needs microphone permission in System Settings. For production builds, macOS 13+ requires an Entitlements.plist with the audio-input entitlement.
- Create Entitlements.plist with com.apple.security.device.audio-input
- Wire it into tauri.conf.json bundle.macOS.entitlements
- Info.plist already had NSMicrophoneUsageDescription (no change needed)
Dev mode workaround: grant mic permission to Terminal.app in System Settings → Privacy & Security → Microphone.

Voice models (Moonshine STT, Kokoro TTS) are now auto-downloaded on first huddle start/join. Downloads are idempotent — no-op if already on disk or in progress. First huddle runs voice-only; subsequent huddles get STT+TTS once download completes.
HuddleBar now shows a voice activity indicator:
- Green pulsing dot: mic is live and picking up audio
- Gray dot: mic connected but silent
- 'no mic' text: LiveKit/mic connection failed
Also adds diagnostic console logs to connectToHuddle for tracing the mic → LiveKit → publish flow.

Moonshine archive now contains split decoder layout (v1 int8): preprocess.onnx, encode.int8.onnx, cached_decode.int8.onnx, uncached_decode.int8.onnx — not the merged v2 layout. Kokoro model files moved from hexgrad/Kokoro-82M (404) to thewh1teagle/kokoro-onnx/releases/download/model-files/.
- Update MOONSHINE_EXPECTED_FILES to match actual archive contents
- Update OfflineMoonshineModelConfig to use split decoder fields
- Update KOKORO_MODEL_URL and KOKORO_VOICES_URL to working URLs

Agents now see which channel the huddle is attached to: 'This huddle is attached to channel <uuid> — that's the main channel.' Changed VOICE_MODE_GUIDELINES from a const to voice_mode_guidelines(parent_channel_id).

The previous kokoro-v0_19.onnx + voices.bin from thewh1teagle/kokoro-onnx used a ZIP/numpy voices format incompatible with the kokoro-tts 0.3 crate (which expects bincode). The hexgrad/Kokoro-82M HuggingFace URLs also 404'd. Switch to the crate author's own releases (mzdk100/kokoro V1.0):
- kokoro-v1.0.int8.onnx (92MB, quantized — smaller than v0.19's 325MB)
- voices.bin (bincode format, compatible with kokoro-tts 0.3)

…messages
Two fixes to complete the TTS pipeline:
Rust (speak_agent_message):
- Now async — lazily starts the TTS pipeline if models finished downloading after the huddle began (first huddle race condition).
JS (HuddleContext):
- Store ephemeralChannelId + selfPubkey after start_huddle
- Subscribe to ephemeral channel via relayClient
- Filter for kind:9 messages from non-self pubkeys (skip [System])
- Call invoke('speak_agent_message') for each agent message
- Clean up subscription on leave/unmount

TTS now splits text into sentences before synthesis. Kokoro handles short chunks better — fewer spelling artifacts from CMU dictionary fallback, better prosody, and allows barge-in between sentences.
Also fixes TTS subscription:
- Skip historical messages (created_at < subscription time)
- Skip empty/whitespace-only content
- Remove debug console.log statements

Lookahead: pre-synthesize sentence N+1 while sentence N plays via rodio (separate audio thread). Eliminates inter-sentence gaps when synthesis is faster than playback. Sentence splitting: don't split on '1.' '2.' etc (digit + period). Only split on period/exclamation/question preceded by a letter. Removed semicolon as a split point (too aggressive). This should fix the first-word spelling issue (fragments too short for Kokoro's phonemizer) and the gap between sentences.
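
The splitting rules described above can be sketched as follows. This is an illustration under stated assumptions, not the code in preprocessing.rs: it additionally requires the terminator to be followed by whitespace or end of input, which the commit message does not state.

```rust
/// Sketch of the splitting rules: break after '.', '!' or '?' only when the
/// preceding character is a letter, so "1." and "2." stay attached.
/// Semicolons are never split points.
/// Assumption: the terminator must also be followed by whitespace or end of text.
pub fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    let mut prev: Option<char> = None;
    let mut chars = text.chars().peekable();

    while let Some(c) = chars.next() {
        current.push(c);
        let is_terminator = matches!(c, '.' | '!' | '?');
        let after_letter = prev.map_or(false, |p| p.is_alphabetic());
        let at_boundary = chars.peek().map_or(true, |n| n.is_whitespace());
        if is_terminator && after_letter && at_boundary {
            let sentence = current.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
            current.clear();
        }
        prev = Some(c);
    }

    // Keep any trailing text that did not end with a terminator.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}
```

With this rule, "Step 1. Open the app." stays a single chunk, so the synthesizer never receives a fragment as short as "Step 1.".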

…espeak)
Supertonic TTS: 66M params, flow-matching architecture, MIT license, Unicode tokenizer (no espeak/GPL), 10 voices (F1-F5, M1-M5). 167× real-time factor on M4 Pro — synthesis faster than playback.
New files:
- supertonic.rs: adapted from official helper.rs (637 lines), ort rc.11 compatible, 4 ONNX sessions (duration/encoder/estimator/vocoder)
Changed:
- tts.rs: Supertonic engine replaces Kokoro, sentence-split lookahead synthesis (synth N+1 while rodio plays N), new_with_voice() for per-agent voice selection, F1 default voice
- models.rs: Supertonic model downloads from HuggingFace (~253MB total), 7 files (4 ONNX + config + tokenizer + F1 voice style)
- Cargo.toml: removed kokoro-tts, added ndarray 0.17 + rayon, unicode-normalization, regex, rand, rand_distr. ort gets ndarray feature.
- mod.rs: kokoro_* → supertonic_* throughout

7 files × 89 = 623 overflows u8 (max 255). Use u32 arithmetic.
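
The overflow is easy to reproduce in isolation. A minimal sketch, assuming both factors were stored as `u8`; the commit does not say what the 89 represents, and the function names are hypothetical.

```rust
/// Multiplying in u8 fails for 7 × 89 because 623 exceeds 255.
/// `checked_mul` reports the overflow as `None` instead of panicking or wrapping.
pub fn product_u8(files: u8, per_file: u8) -> Option<u8> {
    files.checked_mul(per_file)
}

/// Widening both operands to u32 before multiplying avoids the overflow.
pub fn product_u32(files: u8, per_file: u8) -> u32 {
    u32::from(files) * u32::from(per_file)
}
```

In a debug build the plain `u8` multiplication would panic, and in a release build it would wrap silently, so widening first is the safer default.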

The tts.json config shows ae.sample_rate = 44100. Playing 44.1kHz audio at 24kHz makes it sound slow and low-pitched.
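
The size of the effect follows from the ratio of the two rates. A small sketch of the arithmetic, with hypothetical helper names:

```rust
/// Relative playback speed when audio produced at `source_hz`
/// is played back as if it were `device_hz`.
pub fn playback_speed(source_hz: u32, device_hz: u32) -> f64 {
    f64::from(device_hz) / f64::from(source_hz)
}

/// Pitch shift in semitones for a given speed factor (12 semitones per octave).
pub fn pitch_shift_semitones(speed: f64) -> f64 {
    12.0 * speed.log2()
}
```

For 44,100 Hz audio played at 24,000 Hz the speed factor is about 0.54, which corresponds to a drop of roughly 10.5 semitones.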

Tier 1:
1. Word-skipping fix (.max(1) on latent_lengths) — prevents dropping monosyllabic words like 'a', 'the', 'hello' (supertonic PR#33)
2. Single persistent rodio Player — eliminates OS audio device setup gap between sentences (was creating new Player per sentence)
3. Batch 3 sentences per synth call — model sees more context, better prosody across sentence boundaries
4. Volume boost 2.5× with clamp — Supertonic output is quiet
Tier 2:
5. 8ms fade in/out at chunk boundaries — eliminates clicks/pops
6. Inter-sentence silence 0.15s (was 0.3s default) — less robotic
7. Pre-buffer via batching — all batches synthesized and queued before playback wait, rodio plays them gaplessly
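
Items 4 and 5 are simple per-sample operations. A minimal sketch, assuming mono f32 PCM in the range [-1.0, 1.0] and a linear fade; the function names and the choice of a linear ramp are assumptions, not taken from tts.rs.

```rust
/// Multiply every sample by `gain` and clamp to the valid range
/// so a boost such as 2.5× cannot exceed full scale.
pub fn apply_gain(samples: &mut [f32], gain: f32) {
    for s in samples.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}

/// Apply a linear fade-in at the start and fade-out at the end of a chunk.
/// The fade length is capped at half the chunk so the two ramps never overlap.
pub fn apply_fade(samples: &mut [f32], sample_rate: u32, fade_ms: u32) {
    let wanted = (u64::from(sample_rate) * u64::from(fade_ms) / 1000) as usize;
    let fade_len = wanted.min(samples.len() / 2);
    if fade_len == 0 {
        return;
    }
    let n = samples.len();
    for i in 0..fade_len {
        let ramp = i as f32 / fade_len as f32;
        samples[i] *= ramp;
        samples[n - 1 - i] *= ramp;
    }
}
```

At 44.1 kHz an 8 ms fade covers 352 samples, short enough to be inaudible as a volume change but long enough to remove the step discontinuity that causes clicks.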

Comprehensive quality pass on the huddles implementation, driven by iterative crossfire review (codex CLI + opus subagents, 11 rounds).
## Correctness
- Fix barge-in startup order: TTS starts before STT so tts_cancel is available
- Shared tts_cancel in HuddleState: survives pipeline restarts and TTS toggle
- Pipeline replacement leak: shutdown old STT pipeline before replacing
- Participant state: only successfully enrolled agents shown
- Sentence batching: join with space, not period-space (fixes garbled TTS)
- Agent prompt re-delivery: guidelines re-posted when agents join mid-huddle
- TTS duplicate-start prevention: re-check before storing new pipeline
- Two-phase activation: Connected → Active lifecycle with confirm_huddle_active
- Agent enrollment with role=bot: relay membership API correctly identifies agents
## Voice Quality
- Silence threshold: 300ms → 450ms (reduces sentence fragmentation)
- Barge-in debounce: 5 consecutive VAD frames (~80ms) required during TTS
- STT state reset: clean segment state across TTS transitions and cooldown
- STT hot-start: auto-starts when models finish downloading mid-huddle
## Agent Identity (authoritative, fail-closed)
- Relay membership API: fetch_agent_pubkeys_from_relay with role=bot filter
- get_huddle_agent_pubkeys Tauri command for frontend
- Backend periodic refresh in check_pipeline_hotstart (every 5s)
- Frontend periodic refresh (every 10s) with fail-closed semantics
- Result-based error propagation: fetch failures keep TTS mute
- Joiner hydration: join_huddle fetches agent list from relay
## Frontend
- Startup atomicity: ephemeralChannelId set only after full setup
- EOSE-based replay boundary with timestamp belt-and-suspenders
- Agent-only TTS filter: only bot-role pubkeys spoken, fail-closed
- AudioWorklet fire-and-forget: no main-thread backpressure
- Cleanup consolidation: single cleanupFailedStart helper
- leaveHuddle returns boolean: bar stays visible if backend cleanup fails
- HuddleBar respects Connected phase in poll-failure fallback
## Performance and Quality
- LazyLock regex in supertonic.rs: 12 patterns compiled once
- tokio::fs for all async file ops in model downloads
- Model-shape validation: bounds-check dims before indexing
- Async transcription task: tokio::sync::mpsc, no Tokio thread blocking
- Dead code removed, stale comments fixed, rustfmt + biome applied

Multi-round crossfire review (Opus + Codex CLI, 8 full review passes, 17 parallel worker delegates) identified and fixed 25+ issues across the huddle voice pipeline.
Safety & correctness:
- Session generation guard: Arc<AtomicU64> on HuddleState prevents stale transcription tasks from posting kind:9 after leave/end
- Speech buffer capped at 30s (prevents OOM in noisy environments)
- Raw IPC payload bounded at 100KB per batch
- TTS input bounded at 2000 chars with Unicode-safe truncation
- LiveKit token/URL hidden from polling via #[serde(skip)]
- Model downloads: SHA-256 verification with pinned hashes, Rust-native tar+bzip2 extraction with pre-validation (path traversal, symlinks), streaming to disk, size limits, version manifest for cache invalidation
- Pubkey format validation (64 hex chars) at Tauri boundary
- Max 20 agents per huddle enforced on both start and incremental add
- UUID validation on all huddle event builders
- PCM alignment rejection (not just warning) on non-4-byte-aligned input
Architecture & DRY:
- teardown_huddle() helper eliminates duplicated leave/end shutdown code
- start_stt_pipeline delegates to maybe_start_stt_pipeline
- split_sentences consolidated in preprocessing.rs (deleted from tts.rs and supertonic.rs, net -85 lines)
- int_to_words extended to 0-999,999
- Agent refresh throttled to 15s with success-gated timestamp
Lifecycle:
- Creator enforcement: is_creator field on HuddleState, end_huddle rejects non-creators, HuddleBar shows End/Leave conditionally
- Failed startup cleanup calls end_huddle (not leave_huddle) to prevent orphaned ephemeral channels
- Participant state hydrated from relay immediately on start/join
UX & AX:
- Agent prompt tightened: silent when not addressed, no dot responses, no repeat after interruption
- TTS subscription buffers pre-EOSE events, replays live ones (fixes silent drop of fast agent responses)
- AddAgentDialog filters already-added agents
- Participant display shows 'In huddle' instead of misleading count
- Room label shows 'Huddle' instead of raw LiveKit room name
- TTS backpressure logged (Rust + frontend)
- AudioWorklet IPC wrapped behind invokeRawBinary() abstraction
- ASCII lifecycle docs added to HuddleContext, audioWorklet, livekit
- supertonic.rs header fixed (44.1kHz not 24kHz)
- worklet.js documents intentional partial-buffer drop on disconnect
- ParticipantList NaN-safe hue derivation with gray fallback
- tts.rs expect() calls replaced with match+break
18 files changed, +984 -370
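
The session generation guard can be sketched with a shared counter. This is an illustration assuming each transcription task captures the generation when it starts; `SessionGeneration`, `post_if_current`, and the `outbox` vector standing in for the relay are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared counter that is bumped whenever a huddle session ends or restarts.
#[derive(Clone, Default)]
pub struct SessionGeneration(Arc<AtomicU64>);

impl SessionGeneration {
    pub fn current(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }

    /// Invalidate every task that captured an earlier generation.
    pub fn bump(&self) -> u64 {
        self.0.fetch_add(1, Ordering::SeqCst) + 1
    }
}

/// Post `text` only if the session that produced it is still the live one.
/// `outbox` stands in for the relay in this sketch.
pub fn post_if_current(
    session: &SessionGeneration,
    captured: u64,
    text: &str,
    outbox: &mut Vec<String>,
) -> bool {
    if session.current() != captured {
        return false; // stale task from a session that already ended
    }
    outbox.push(text.to_string());
    true
}
```

A task that finishes transcribing after leave/end sees a different generation and discards its result instead of posting into a channel the user has already left.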

6-round crossfire review (Opus 9/10 APPROVE, Codex 9/10 APPROVE). Net -21 lines across 9 source files.
Rust backend:
- Unify fetch_agent_pubkeys + fetch_all_member_pubkeys → fetch_channel_members(role_filter)
- Add AppState::huddle() convenience (replaces 25+ lock().map_err() calls)
- spawn_transcription_task uses post_event_raw (checks HTTP status, shared auth)
- maybe_start_tts_pipeline returns Result<bool, String> (surfaces silent failures)
- Guidelines use kind:48106 instead of fragile [System] prefix on kind:9
- Remove duplicate guidelines re-post on add_agent_to_huddle (EOSE replay suffices)
- Extract post_connect_setup helper from start/join huddle (prevents drift)
- Bump session_generation on STT pipeline replacement (prevents stale transcripts)
- Consolidate triple lock in check_pipeline_hotstart → single acquisition
- Extract drain_until_shutdown<T> to mod.rs as pub(super) (shared by stt + tts)
- Tighten voice-mode prompt: 15 words, no filler, no apologies, no meta-responses
- Document TtsPipeline::cancel() as intentional future API surface
Frontend:
- Extract disconnectMedia() helper (leaveHuddle/endHuddle no longer duplicate cleanup)
- TTS subscription uses subscribeToChannelLive (since: now, no historical backlog)
- Remove EOSE buffering/replay — unnecessary with live-only subscription
- Narrow subscription to KIND_STREAM_MESSAGE only (less wire traffic)
- Add subscribeToChannelLive() to RelayClient (limit:1000 for reconnect safety)

Crossfire review (opus×2 + codex) identified 15+ issues. All fixed:
Safety & correctness:
- Fix validate_pubkey_hex panic on multi-byte UTF-8 input
- Fix pipeline init failure wedge (is_finished + dead pipeline detection)
- Fix guidelines delivery race (post kind:48106 before adding agents)
- Fix teardown blocking mutex during thread join
- Fix endHuddle swallowing failures (returns boolean)
- Fix livekit.ts mic track leak on disconnect failure (try/finally)
- Fix tts_active staying true during cancel
- Filter single-char TTS responses (prevents speaking 'period')
DRY extraction:
- models.rs: 826→690 lines via ModelSlot + shared helpers (download_file, fetch_url, fresh_temp_dir, verify_and_install)
- events.rs: 4 huddle event builders → shared build_huddle_event
- tts.rs: 4 cancel/shutdown patterns → handle_cancel_or_shutdown
UX improvements:
- ParticipantList resolves pubkeys to display names (useUsersBatchQuery)
- ProfileAvatar with hex-prefix HexAvatar fallback
- Voice guidelines prompt tightened for fast LLMs
Cleanup:
- Delete dead voice_style_path(), TtsPipeline::cancel()
- Fix stale comments, document LE endianness assumption
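
A pubkey check that cannot panic on multi-byte input works on bytes and never slices the string. This is a minimal sketch of that approach, not the function in the PR; accepting uppercase hex digits is an assumption, and the commit does not say which operation originally panicked.

```rust
/// Validate a pubkey as exactly 64 ASCII hex characters.
/// `len()` counts bytes, and the check iterates over bytes, so no string
/// slicing at a possibly invalid char boundary ever happens.
pub fn validate_pubkey_hex(pubkey: &str) -> Result<(), String> {
    if pubkey.len() != 64 {
        return Err(format!("expected 64 hex chars, got {} bytes", pubkey.len()));
    }
    if !pubkey.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err("pubkey contains non-hex characters".to_string());
    }
    Ok(())
}
```

A string of 32 two-byte characters has a byte length of 64 and passes the first check, but it is rejected by the second one without any indexing.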

Four targeted improvements from crossfire review (codex 9/10, opus 8/10):
1. post_connect_setup fault boundary: model download and member hydration failures no longer tear down a working huddle. The call stays up in degraded mode (no STT/TTS) instead of failing entirely.
2. Remove duplicate preprocessing from supertonic.rs: emoji stripping and whitespace collapsing already handled by preprocessing.rs. Deleted two stale LazyLock<Regex> statics (RE_EMOJI, RE_WHITESPACE).
3. Feature-gate dead code: webhook.rs and session.rs gated behind #[cfg(feature = "webhook")]. hmac/sha2/hex deps made optional. 3 tests run by default, 8 with --features webhook.
4. Model download progress in HuddleBar: shows 'Voice models: STT 42%, TTS 78%' while downloading, disappears when ready. Serde decoder correctly handles both string and object enum variants.
Also: documented WHY dual membership polling (Rust + React) is intentional — Rust preserves stale list on failure (STT p-tags), React clears on failure (TTS authorization must fail-closed). Different safety requirements.

…pp launch
1. Add kind 48106 (huddle guidelines) to the relay's ingest allowlist. The range 48100..=48103 was allowed but 48106 was missing, causing 'restricted: unknown event kind' when posting voice-mode guidelines. Widened to 48100..=48106.
2. Trigger background voice model downloads at app launch (in Tauri setup hook). Models are ~303 MB total (50 MB Moonshine STT + 253 MB Supertonic TTS). Downloads are async, idempotent, SHA-256 verified, and no-op if already cached. First huddle no longer has a cold-start download wait.

…bled
Fixes discovered during live huddle testing with agents:
**Relay:**
- Add kind 48106 (huddle guidelines) to requires_h_channel_scope so guidelines are routed to the ephemeral channel. Agents now see the voice-mode prompt via their subscription.
**TTS (tts.rs):**
- Prime audio output with 100ms silent buffer at worker startup. On macOS, CoreAudio initializes lazily — without priming, the first Player races against device startup and player.empty() returns true prematurely, truncating the first TTS message after a few words.
**STT (stt.rs):**
- Disable barge-in (VAD-based TTS cancellation). Without acoustic echo cancellation, any ambient noise — keyboard, fan, breathing — triggers false barge-in and kills TTS mid-sentence. Push-to-talk will replace this; echo-gating (skip accumulation during TTS) is preserved.
- Reduce TTS cooldown 200ms → 50ms. The old value ate the first word when the user spoke immediately after the agent finished.
- Reduce silence flush threshold 28 → 19 frames (450ms → 300ms). Snappier transcript delivery without splitting mid-word pauses.

Huddles are now enabled out of the box with dev defaults:
  LIVEKIT_URL=ws://localhost:7880
  LIVEKIT_API_KEY=devkey
  LIVEKIT_API_SECRET=secret
Set SPROUT_HUDDLES_DISABLED=true to disable. Override the LIVEKIT_* env vars for production deployments.

Add push-to-talk (PTT) via global Ctrl+Space shortcut as the default voice input mode for huddles. Voice activity detection (VAD) remains as a switchable option with barge-in enabled.
Rust backend:
- VoiceInputMode enum (PushToTalk default, VoiceActivity)
- ptt_active: Arc<AtomicBool> shared with STT pipeline
- Global shortcut handler with 200ms release delay and generation counter to prevent press→release→press race condition
- PTT press cancels TTS immediately (only when TTS is active)
- STT gating: is_speech ANDed with ptt_active flag — natural flush on release via silence accumulation
- PTT mode accumulates entire key-hold as one utterance (no mid-sentence splits on pauses); flush only on release edge
- Barge-in re-enabled for VAD mode with 320ms debounce
- Mode switch mid-huddle restarts STT pipeline
- Disable WebView background throttling (tauri.conf.json) so AudioWorklet keeps processing when window loses focus
Frontend:
- worklet.js: PTT gating via transmitting flag + port.onmessage
- audioWorklet.ts: Tauri ptt-state event → worklet forwarding
- HuddleContext: pttActive/voiceInputMode/setVoiceInputMode state
- HuddleBar: PTT/VAD indicator, mode toggle, green ring on transmit
Also fixes two review items from crossfire:
- Preserve session_generation across error-path state resets
- set_tts_enabled takes pipeline out of lock before shutdown
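
The press→release→press race and the generation counter that guards against it can be sketched without timers. This illustration uses a `Mutex` for simplicity, whereas the PR shares `ptt_active` as an `Arc<AtomicBool>`; `PttState` and its methods are hypothetical names, and the 200ms delay is left to the caller.

```rust
use std::sync::{Mutex, MutexGuard};

#[derive(Default)]
struct Inner {
    active: bool,
    generation: u64,
}

/// Hypothetical PTT state with a generation counter.
#[derive(Default)]
pub struct PttState {
    inner: Mutex<Inner>,
}

impl PttState {
    fn lock(&self) -> MutexGuard<'_, Inner> {
        // Recover from a poisoned lock instead of panicking.
        self.inner.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
    }

    /// Key pressed: every press starts a new generation.
    pub fn press(&self) {
        let mut state = self.lock();
        state.generation += 1;
        state.active = true;
    }

    /// Key released: capture the generation for the delayed release task.
    pub fn release_token(&self) -> u64 {
        self.lock().generation
    }

    /// Runs after the release delay. A newer press makes the token stale,
    /// so the delayed task must not switch PTT off.
    pub fn finish_release(&self, token: u64) -> bool {
        let mut state = self.lock();
        if state.generation != token {
            return false;
        }
        state.active = false;
        true
    }

    pub fn is_active(&self) -> bool {
        self.lock().active
    }
}
```

If the user re-presses within the release delay, the earlier delayed task finds a newer generation and leaves `active` set, so the second key-hold is not cut off.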

Enable multiple humans to join the same huddle simultaneously.
Backend (mod.rs):
- join_huddle: add human to ephemeral channel (hard-fail unless already member), full rollback pattern on any failure
- leave_huddle: auto-end when last human departs — emits HUDDLE_ENDED + archives ephemeral channel, avoids relay 'cannot remove last owner' error by skipping build_leave
- end_huddle: any participant can end (creator disconnect recovery)
- Extract reset_preserving_generation() DRY helper (3 call sites)
- Add count_human_members() for last-human detection
Frontend context (HuddleContext.tsx):
- Add joinHuddle() to context API
- Extract connectAndSetupMedia() shared helper (start + join)
- Add activeSpeakers + isReconnecting state from LiveKit events
- cleanupFailedStart(isCreator): joiners use leave_huddle only, preventing a failed join from ending everyone's huddle
Frontend livekit.ts:
- HuddleRoomCallbacks: ActiveSpeakersChanged, Disconnected, Reconnecting, Reconnected room event listeners
- getUserMedia with echoCancellation + noiseSuppression
Frontend UI:
- HuddleIndicator (new): subscribes to kind:48100-48103 via dedicated subscribeToHuddleEvents (limit 100), event-id dedup with batch reconstruction, causal sort, fail-closed validation, infers huddle from JOIN/LEFT even without START event. Renders start button when idle, green glow join button when active. Single icon replaces dual headphone buttons.
- HuddleBar: reconnecting indicator, animate-pulse PTT (WCAG), ARIA live region (output element), Leave + End-all buttons for all participants
- ParticipantList: activeSpeakers green ring, aria-label/role
- ChannelMembersBar: single HuddleIndicator replaces dual icons, invalidateQueries on start/join for immediate sidebar refresh
Relay client (relayClientSession.ts):
- subscribeToHuddleEvents(): kinds 48100-48103 only, limit 100
File size overrides bumped for mod.rs (1450), HuddleContext (630), relayClientSession (830) to accommodate multi-human additions.

TTS Engine Replacement (Supertonic → Kokoro):
- New kokoro.rs: Kokoro-82M ONNX TTS engine, 100% GPL-free
- Single ONNX session via ort crate (was 4 sessions with Supertonic)
- CoreML acceleration with automatic CPU fallback
- 24kHz output (was 44.1kHz); rodio resamples transparently
- 54 voices across 9 languages (was 10 voices, 5 languages)
- G2P: three-tier dictionary lookup, no espeak-ng dependency
  - Tier 1: misaki gold+silver dicts (178K words, Apache-2.0)
  - Tier 2: CMUdict (126K words, BSD) with ARPAbet→IPA conversion
  - Tier 3: morphological suffix rules (-s/-ed/-ing) from misaki
  - Tier 4: letter-by-letter spelling for OOV words
- Contraction handling (don't, it's, we've, etc.)
- Model download: model_q8f16.onnx (86MB), af_heart.bin voice (510KB), us_gold.json + us_silver.json (6MB), cmudict.dict (3.6MB). All SHA-256 verified, runtime download via existing model manager
- tts.rs: sentence-by-sentence lookahead pipeline (BATCH_SIZE=1) for 200ms TTFA with zero-gap playback

Crossfire Review Fixes:
- C1: end_huddle now requires is_creator (or explicit force flag); UI hides 'End all' button for non-creators
- C2: tts_starting sentinel prevents TOCTOU race in pipeline creation
- C3: tts_active flag set after first player.append, not before synthesis
- I1: audioWorklet setMode() isolates VAD from PTT events
- I4: voiceInputModeRef prevents stale closure in connectAndSetupMedia

Bug Fixes:
- Fix tokio::spawn panic during Tauri setup (use tauri::async_runtime::spawn)
- Fix earshot VAD panic on out-of-range samples (clamp after resampling)

Dependencies:
- ort: rc.11 → rc.12 + coreml feature (zero binary size cost on macOS)
- Removed: rand, rand_distr, unicode-normalization (Supertonic-only)

Licenses: Kokoro model (Apache-2.0), misaki dicts (Apache-2.0), CMUdict (BSD), ort (MIT), ONNX Runtime (MIT). No GPL anywhere.
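The tiered G2P fallback listed above can be sketched roughly as below. This is an illustration under stated assumptions, not the kokoro.rs implementation: `Lexicon`, `Tier`, and `phonemize` are hypothetical names, the CMUdict map is assumed to hold entries already converted from ARPAbet, the suffix phonemes are fixed placeholders (real rules would choose the variant from the stem's final sound), and the spelling tier just emits the letters themselves as a stand-in for letter-name pronunciations.

```rust
use std::collections::HashMap;

/// Which tier produced the pronunciation (illustrative).
#[derive(Debug, PartialEq)]
pub enum Tier {
    Misaki,
    CmuDict,
    Suffix,
    Spelled,
}

/// Hypothetical in-memory lexicons: word -> phoneme string.
pub struct Lexicon {
    pub misaki: HashMap<String, String>,
    /// Assumed to be already converted from ARPAbet.
    pub cmudict: HashMap<String, String>,
}

impl Lexicon {
    /// Tiers 1 and 2: direct dictionary hits, misaki first.
    fn dict_lookup(&self, word: &str) -> Option<(Tier, String)> {
        if let Some(p) = self.misaki.get(word) {
            return Some((Tier::Misaki, p.clone()));
        }
        if let Some(p) = self.cmudict.get(word) {
            return Some((Tier::CmuDict, p.clone()));
        }
        None
    }

    pub fn phonemize(&self, word: &str) -> (Tier, String) {
        let word = word.to_lowercase();
        if let Some(hit) = self.dict_lookup(&word) {
            return hit;
        }
        // Tier 3: strip a known suffix, look up the stem, and append
        // placeholder suffix phonemes.
        for (suffix, phones) in [("ing", "ɪŋ"), ("ed", "d"), ("s", "z")] {
            if let Some(stem) = word.strip_suffix(suffix) {
                if let Some((_, p)) = self.dict_lookup(stem) {
                    return (Tier::Suffix, format!("{}{}", p, phones));
                }
            }
        }
        // Tier 4: out-of-vocabulary, so spell the word letter by letter.
        let letters: Vec<String> = word
            .chars()
            .filter(|c| c.is_alphabetic())
            .map(|c| c.to_uppercase().to_string())
            .collect();
        (Tier::Spelled, letters.join(" "))
    }
}
```

Ordering the tiers from most to least curated means a word only reaches the rule-based or spelled-out path when neither dictionary knows it.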

tlongwell-block added 5 commits April 11, 2026 12:14

style: format all huddle files (biome + rustfmt)

4504c12

tlongwell-block requested a review from wesbillman as a code owner April 11, 2026 18:22

tlongwell-block added 2 commits April 11, 2026 14:35

feat(huddles): add headphones button to channel header bar

55a725e

Adds a Headphones icon button next to the workflow Zap button in ChannelMembersBar. Clicking starts a huddle for the current channel via the start_huddle Tauri command.

tlongwell-block force-pushed the feat/huddles-plan branch from d960a09 to 05d8c8a Compare April 11, 2026 19:52

tlongwell-block added 21 commits April 11, 2026 16:22

feat(huddles): include parent channel ID in voice-mode guidelines

1dfb733

Agents now see which channel the huddle is attached to: 'This huddle is attached to channel <uuid> — that's the main channel.' Changed VOICE_MODE_GUIDELINES from a const to voice_mode_guidelines(parent_channel_id).
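A constant cannot interpolate a runtime value, which is presumably why the const became a function. A minimal sketch of that shape follows; `BASE_GUIDELINES` and its text are hypothetical, the return type is assumed, and the attached-channel sentence is paraphrased rather than copied from the real prompt.

```rust
/// Hypothetical base text; the real guideline wording is not shown in the PR.
const BASE_GUIDELINES: &str = "You are in a voice huddle. Keep replies short.";

/// Build the guidelines for one huddle, naming its parent channel.
pub fn voice_mode_guidelines(parent_channel_id: &str) -> String {
    format!(
        "{}\nThis huddle is attached to channel {}, which is the main channel.",
        BASE_GUIDELINES, parent_channel_id
    )
}
```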

fix(huddles): u8 overflow in Supertonic download progress calculation

609e787

7 files × 89 = 623 overflows u8 (max 255). Use u32 arithmetic.
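The overflow is easy to reproduce in isolation. The sketch below assumes the progress formula scales the number of completed files onto a 0..=89 range; that formula and the name `download_progress` are guesses for illustration, since the commit message gives only the 7 × 89 product.

```rust
/// True when the product from the commit message does not fit in a u8.
pub fn product_overflows_u8() -> bool {
    7u8.checked_mul(89).is_none() // 623 > 255
}

/// Hypothetical progress helper: widen to u32 before multiplying, then
/// narrow once the value is known to fit.
pub fn download_progress(files_done: u8, total_files: u8) -> u8 {
    if total_files == 0 {
        return 0;
    }
    let scaled = u32::from(files_done) * 89 / u32::from(total_files);
    scaled.min(89) as u8
}
```

Widening before the multiplication keeps the intermediate 623 representable, and the final value is clamped to 89 before the narrowing cast.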

fix(huddles): Supertonic sample rate is 44100, not 24000

3961f4f

The tts.json config shows ae.sample_rate = 44100. Playing 44.1kHz audio at 24kHz makes it sound slow and low-pitched.
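The size of the effect follows directly from the two rates. As a rough check, with hypothetical helper names, the functions below compute how long a buffer lasts at a given playback rate and how far the pitch moves when the playback rate differs from the source rate.

```rust
/// Seconds of audio heard when `num_samples` are played at `playback_rate_hz`.
pub fn playback_seconds(num_samples: usize, playback_rate_hz: u32) -> f64 {
    num_samples as f64 / f64::from(playback_rate_hz)
}

/// Pitch change in semitones when audio recorded at `source_rate_hz` is
/// played at `playback_rate_hz` (negative means lower pitch).
pub fn pitch_shift_semitones(source_rate_hz: u32, playback_rate_hz: u32) -> f64 {
    12.0 * (f64::from(playback_rate_hz) / f64::from(source_rate_hz)).log2()
}
```

One second of 44.1kHz audio (44,100 samples) played at 24kHz lasts 44100 / 24000 = 1.8375 seconds, and the pitch drops by roughly ten and a half semitones.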

chore: update Cargo.lock for Supertonic deps (ndarray 0.17, rand, regex)

210f72f

chore: remove debug console.log from livekit.ts

261cf79

style: rustfmt + biome format all huddle files

a8034ff

tlongwell-block added 8 commits April 12, 2026 22:05

tlongwell-block wants to merge 36 commits into main from feat/huddles-plan
