feat: Sprout Huddles — voice conversations with AI agents #299
Open
tlongwell-block wants to merge 36 commits into main
Humans can voice chat via LiveKit WebRTC. Ephemeral text channel is
created for the huddle transcript. No STT/TTS yet (Phase 2+).
Relay:
- Wire HuddleService into AppState (optional, env-var gated)
- POST /api/huddles/{channel_id}/token endpoint with auth + scope +
channel access checks
- Verify huddle kinds 48100-48105 stored/fanned-out (no changes needed)
SDK:
- 4 huddle lifecycle event builders (48100-48103) with tests
Desktop Rust:
- HuddleManager state machine (Idle → Creating → Active → Leaving)
- 5 Tauri commands: start/join/leave/end_huddle, get_huddle_state
- push_audio_pcm stub (Phase 1 placeholder for Phase 2 STT)
- 4 huddle event builders in events.rs
- Bulletproof rollback: all error paths reset to Idle, orphaned
channels archived, HUDDLE_STARTED emitted only after token success
- NSMicrophoneUsageDescription in Info.plist
Desktop WebView:
- LiveKit JS SDK integration (livekit-client 2.18.1)
- AudioWorklet mic tap with 100ms PCM batching → Rust via InvokeBody::Raw
- HuddleBar floating UI (mute, leave, participant list)
- Resource cleanup on all failure paths
Crossfired: 3 rounds, Opus + Codex, both APPROVE 10/10 on final round.
Phase 0 spikes validated: AudioWorklet IPC, sherpa-onnx compilation,
LiveKit JS in WKWebView — all green.
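The state machine and rollback rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the phase names come from the commit message, while the `start` and `leave` methods and the `setup` closure (standing in for channel creation plus the token request) are hypothetical.

```rust
/// Phases named in the commit message; the transition rules are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Idle,
    Creating,
    Active,
    Leaving,
}

#[derive(Debug)]
pub struct HuddleManager {
    phase: Phase,
}

impl HuddleManager {
    pub fn new() -> Self {
        Self { phase: Phase::Idle }
    }

    pub fn phase(&self) -> Phase {
        self.phase
    }

    /// Attempts a start. `setup` is a hypothetical stand-in for creating the
    /// ephemeral channel and fetching the token; any error resets to `Idle`.
    pub fn start<F>(&mut self, setup: F) -> Result<(), String>
    where
        F: FnOnce() -> Result<(), String>,
    {
        if self.phase != Phase::Idle {
            return Err(format!("cannot start from {:?}", self.phase));
        }
        self.phase = Phase::Creating;
        match setup() {
            Ok(()) => {
                self.phase = Phase::Active;
                Ok(())
            }
            Err(e) => {
                // Rollback: every error path lands back in Idle.
                self.phase = Phase::Idle;
                Err(e)
            }
        }
    }

    /// Leaves an active huddle, passing through `Leaving` on the way to `Idle`.
    pub fn leave(&mut self) -> Result<(), String> {
        if self.phase != Phase::Active {
            return Err(format!("cannot leave from {:?}", self.phase));
        }
        self.phase = Phase::Leaving;
        // Teardown work (disconnect, archive) would run here.
        self.phase = Phase::Idle;
        Ok(())
    }
}
```

Funnelling every failure through one reset keeps a failed start from leaving the manager wedged in `Creating`.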
Human speech is transcribed and posted to the ephemeral channel. Speak in huddle → text appears in channel → agents can see it.
STT Pipeline (stt.rs):
- PCM f32 48kHz from AudioWorklet → rubato resample to 16kHz mono → earshot VAD → sherpa-onnx Moonshine → transcribed text
- Dedicated std::thread (CPU-bound, not async)
- Bounded audio queue (sync_channel, 50 slots, try_send drops on backpressure)
- Shutdown flag + Drop impl joins worker thread cleanly
- Final flush: buffered speech transcribed on shutdown
- Merged decoder layout (v2) matches Moonshine tiny int8 model
Model Download Manager (models.rs):
- Background download of Moonshine tiny (~26MB) from sherpa-onnx releases
- Atomic extraction: temp dir → verify → rename-backup swap (old model preserved on failure)
- Platform-gated: #[cfg(unix)] tar extraction, clear error on non-Unix
- Race-safe: status set to Downloading before spawn
- OnceLock singleton, no unwrap() on mutex (poison recovery)
Pipeline Integration (mod.rs):
- push_audio_pcm feeds SttPipeline when active
- Transcribed text posted as kind:9 with agent p-tags (read at post time, not snapshot)
- Auto-start on huddle join/start if models ready (is_moonshine_ready)
- Pipeline shutdown called before state clear on leave/end
- recv_timeout replaces busy polling
New dependencies: sherpa-onnx 1.12, earshot 1.0, rubato 2.0, audioadapter-buffers 3.0
Crossfired: Codex 3/10 → 8/10 → 10/10 APPROVE after 3 fix rounds.
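The bounded-queue behaviour (sync_channel with try_send dropping on backpressure) might look roughly like this. It is a sketch under assumptions: `AudioQueue`, `push`, and the drop counter are hypothetical names, and only the standard library is used.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical wrapper around the producer side of a bounded PCM queue.
pub struct AudioQueue {
    tx: SyncSender<Vec<f32>>,
    dropped: u64,
}

impl AudioQueue {
    /// `slots` is the queue capacity (the commit message uses 50).
    pub fn new(slots: usize) -> (Self, Receiver<Vec<f32>>) {
        let (tx, rx) = sync_channel(slots);
        (Self { tx, dropped: 0 }, rx)
    }

    /// Never blocks: returns true if the batch was queued, false if it was
    /// dropped because the queue was full or the worker has gone away.
    pub fn push(&mut self, batch: Vec<f32>) -> bool {
        match self.tx.try_send(batch) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                self.dropped += 1;
                false
            }
            Err(TrySendError::Disconnected(_)) => false,
        }
    }

    /// Number of batches dropped due to backpressure.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

Dropping on a full queue trades a gap in the transcript for a producer that never stalls, which seems the right trade when the producer is a real-time audio callback.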
Agents participate in huddles via text. They hear all human speech, respond when relevant, and get interrupted by new speech.
Agent Enrollment (agents.rs):
- add_agent_to_huddle: dual channel add (ephemeral + parent)
- Ephemeral add is required; parent add is best-effort
- Returns structured AgentAddResult with parent_error detail
- Only successfully enrolled agents get p-tagged on transcriptions
- Voice-mode guidelines posted as kind:9 [System] message (not kind:40099)
HuddleManager Integration (mod.rs):
- add_agent_to_huddle Tauri command with active phase check
- start_huddle tracks successful_agents (failed adds not enrolled)
- Voice-mode system message posted after HUDDLE_STARTED
- Agent pubkeys stored in Arc<Mutex<Vec<String>>> for live p-tag reads
Agent Add UI:
- '+' button on HuddleBar opens AddAgentDialog
- Dialog fetches list_managed_agents, filters to running agents only
- Structured error handling: hard failures shown as red, parent_error as amber warning
- Dialog stays open on warning so user can see the message
Design decisions:
- No separate ACP process spawn needed — existing managed agent auto-subscribes via kind:9000 membership notification
- ACP system prompt injection deferred to post-MVP (requires SubscriptionRule changes)
- Client does NOT mint kind:40099 (relay-signed only) — uses kind:9 instead
Crossfired: Codex 4/10 → 8/10 → 10/10 APPROVE after 3 rounds.
Agent responses are read aloud. Full end-to-end conversation loop:
speak → transcribe → agent responds → read aloud → human interrupts.
TTS Pipeline (tts.rs):
- sherpa-onnx Kokoro + rodio playback on dedicated thread
- Bounded text queue (8 slots, try_send drops on backpressure)
- Cancel flag for barge-in (clears queue + stops rodio Player)
- Shutdown checks in playback loop (cancel || shutdown)
- tts_active AtomicBool shared with STT for echo gating
- speak_agent_message Tauri command for WebView to feed agent text
Text Preprocessing (preprocessing.rs):
- Strip fenced code blocks (both ``` and ~~~) → 'code block omitted'
- Strip inline code, URLs → 'link omitted'
- Smart underscore handling (emphasis stripped, snake_case preserved)
- Numbers → words (0-999), times → spoken ('nine oh five')
- Emoji stripped, whitespace collapsed
- 10 unit tests
Barge-In + STT Gating (stt.rs):
- VAD speech onset during TTS → set tts_cancel flag → immediate TTS stop
- speech_buf NOT accumulated during TTS (echo prevention)
- 200ms cooldown after TTS stops before STT re-enables
- tts_cancel shared between STT and TTS pipelines
Model Download (models.rs):
- Kokoro int8 model download (~92MB) with atomic extraction
- Same verify-swap pattern as Moonshine
UI (HuddleBar.tsx):
- TTS toggle button (Volume2/VolumeX icons)
New dependency: rodio 0.22
Crossfired: Codex 3/10 → 10/10 APPROVE after 2 rounds.
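As an illustration of the "Numbers → words (0-999)" preprocessing step, here is a minimal sketch. The function name `int_to_words` appears later in this PR, but the signature and the exact spoken forms (for example the hyphen in "forty-two") are assumptions.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
];

/// Spells out 0..=999; returns None for anything larger.
pub fn int_to_words(n: u16) -> Option<String> {
    if n > 999 {
        return None;
    }
    let n = n as usize;
    let mut parts: Vec<String> = Vec::new();
    if n >= 100 {
        parts.push(format!("{} hundred", ONES[n / 100]));
    }
    let rest = n % 100;
    if rest >= 20 {
        let tens = TENS[rest / 10];
        if rest % 10 == 0 {
            parts.push(tens.to_string());
        } else {
            parts.push(format!("{}-{}", tens, ONES[rest % 10]));
        }
    } else if rest > 0 || n == 0 {
        // Covers 1..=19 and the special case of plain zero.
        parts.push(ONES[rest].to_string());
    }
    Some(parts.join(" "))
}
```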
Adds a Headphones icon button next to the workflow Zap button in ChannelMembersBar. Clicking starts a huddle for the current channel via the start_huddle Tauri command.
Eliminates the espeak-ng GPL-3.0 dependency from the TTS pipeline. sherpa-onnx bundled espeak-ng for Kokoro phonemization; the kokoro-tts crate uses cmudict-fast (Apache-2.0) instead — zero GPL exposure.
Changes:
- tts.rs: swap sherpa-onnx Kokoro → kokoro-tts KokoroTts API
  - Async API driven by single-threaded tokio Runtime in worker thread
  - Voice::AfHeart(1.0) for American English female voice
  - Fixed 24kHz sample rate output
- models.rs: download from HuggingFace (individual files, not tar.bz2)
  - kokoro-v0_19.onnx + voices.bin only (no espeak-ng-data)
  - is_kokoro_ready() uses is_file() for stricter validation
- Cargo.toml: add kokoro-tts + pin ort=2.0.0-rc.11, keep sherpa-onnx for STT
ort pinned to rc.11 because kokoro-tts 0.3.2 is incompatible with rc.12 (generic Error<R> breaking change). No vendoring needed.
License audit: cargo tree shows ZERO GPL deps in the resolved graph.
…HuddleBar
Connect the headphones button to the full huddle pipeline:
Button click → Rust start_huddle → LiveKit connect → AudioWorklet → HuddleBar
New file:
- HuddleContext.tsx: React context managing the huddle lifecycle with concurrency guards (busyRef), operation tokens (tokenRef) for start/leave race protection, and rustActiveRef for Rust state tracking. Every cleanup step is independently try/caught for resilience.
Modified:
- AppShell.tsx: Wrap with HuddleProvider, render HuddleBar (+5 lines)
- ChannelMembersBar.tsx: Use useHuddle() instead of raw invoke
- HuddleBar.tsx: Pull localAudioTrack + leaveHuddle from context, preserve active state on transient poll errors
- audioWorklet.ts: Clear onmessage before disconnect to prevent stale PCM sends
- index.ts: Export HuddleProvider and useHuddle
Crossfire reviewed: Opus 9/10 APPROVE, Codex 8/10 (3 rounds).
The relay's ingest pipeline rejected huddle lifecycle events as 'restricted: unknown event kind'. Add kinds 48100-48103 to:
- required_scope_for_kind: maps to Scope::ChannelsWrite
- requires_h_channel_scope: routes events to parent channel via h-tag
Without this, start_huddle fails after creating the ephemeral channel because the HUDDLE_STARTED event (kind 48100) is rejected, triggering rollback and returning an error to the frontend.
start_huddle was setting participants = member_pubkeys (always []), never including the current user's pubkey. add_agent_to_huddle only pushed to agent_pubkeys (for p-tags), not participants (for UI).
Now:
- start_huddle inserts the user's own pubkey at position 0
- add_agent_to_huddle also appends to participants
In dev mode, getUserMedia is silently denied because the terminal app needs microphone permission in System Settings. For production builds, macOS 13+ requires an Entitlements.plist with the audio-input entitlement.
- Create Entitlements.plist with com.apple.security.device.audio-input
- Wire it into tauri.conf.json bundle.macOS.entitlements
- Info.plist already had NSMicrophoneUsageDescription (no change needed)
Dev mode workaround: grant mic permission to Terminal.app in System Settings → Privacy & Security → Microphone.
Voice models (Moonshine STT, Kokoro TTS) are now auto-downloaded on first huddle start/join. Downloads are idempotent — no-op if already on disk or in progress. First huddle runs voice-only; subsequent huddles get STT+TTS once download completes.
HuddleBar now shows a voice activity indicator:
- Green pulsing dot: mic is live and picking up audio
- Gray dot: mic connected but silent
- 'no mic' text: LiveKit/mic connection failed
Also adds diagnostic console logs to connectToHuddle for tracing the mic → LiveKit → publish flow.
Moonshine archive now contains split decoder layout (v1 int8): preprocess.onnx, encode.int8.onnx, cached_decode.int8.onnx, uncached_decode.int8.onnx — not the merged v2 layout.
Kokoro model files moved from hexgrad/Kokoro-82M (404) to thewh1teagle/kokoro-onnx/releases/download/model-files/.
- Update MOONSHINE_EXPECTED_FILES to match actual archive contents
- Update OfflineMoonshineModelConfig to use split decoder fields
- Update KOKORO_MODEL_URL and KOKORO_VOICES_URL to working URLs
Agents now see which channel the huddle is attached to: 'This huddle is attached to channel <uuid> — that's the main channel.' Changed VOICE_MODE_GUIDELINES from a const to voice_mode_guidelines(parent_channel_id).
The previous kokoro-v0_19.onnx + voices.bin from thewh1teagle/kokoro-onnx used a ZIP/numpy voices format incompatible with the kokoro-tts 0.3 crate (which expects bincode). The hexgrad/Kokoro-82M HuggingFace URLs also 404'd.
Switch to the crate author's own releases (mzdk100/kokoro V1.0):
- kokoro-v1.0.int8.onnx (92MB, quantized — smaller than v0.19's 325MB)
- voices.bin (bincode format, compatible with kokoro-tts 0.3)
…messages
Two fixes to complete the TTS pipeline:
Rust (speak_agent_message):
- Now async — lazily starts the TTS pipeline if models finished
downloading after the huddle began (first huddle race condition).
JS (HuddleContext):
- Store ephemeralChannelId + selfPubkey after start_huddle
- Subscribe to ephemeral channel via relayClient
- Filter for kind:9 messages from non-self pubkeys (skip [System])
- Call invoke('speak_agent_message') for each agent message
- Clean up subscription on leave/unmount
TTS now splits text into sentences before synthesis. Kokoro handles short chunks better — fewer spelling artifacts from CMU dictionary fallback, better prosody, and allows barge-in between sentences.
Also fixes TTS subscription:
- Skip historical messages (created_at < subscription time)
- Skip empty/whitespace-only content
- Remove debug console.log statements
Lookahead: pre-synthesize sentence N+1 while sentence N plays via rodio (separate audio thread). Eliminates inter-sentence gaps when synthesis is faster than playback. Sentence splitting: don't split on '1.' '2.' etc (digit + period). Only split on period/exclamation/question preceded by a letter. Removed semicolon as a split point (too aggressive). This should fix the first-word spelling issue (fragments too short for Kokoro's phonemizer) and the gap between sentences.
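The splitting rule described here (split only on '.', '!' or '?' preceded by a letter, never on a digit plus period, never on a semicolon) could be sketched as below. The extra condition that the terminator be followed by whitespace or end of input is an assumption added for the sketch, and the body is not the PR's actual `split_sentences`.

```rust
/// Splits text into sentences for TTS.
/// A terminator ('.', '!', '?') ends a sentence only when it follows a
/// letter and (assumption) is followed by whitespace or end of input.
pub fn split_sentences(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let mut current = String::new();

    for (i, &c) in chars.iter().enumerate() {
        current.push(c);
        let is_terminator = matches!(c, '.' | '!' | '?');
        // "1." and "2." fail this check, so list markers stay attached.
        let after_letter = i > 0 && chars[i - 1].is_alphabetic();
        let at_boundary = chars.get(i + 1).map_or(true, |next| next.is_whitespace());
        if is_terminator && after_letter && at_boundary {
            let sentence = current.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
            current.clear();
        }
    }

    // Whatever remains without a terminator is emitted as a final chunk.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}
```

For the input "Do this: 1. Open the app. 2. Click start! Done; now wait." this yields three chunks, with each list number kept attached to the sentence that follows it.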
…espeak)
Supertonic TTS: 66M params, flow-matching architecture, MIT license, Unicode tokenizer (no espeak/GPL), 10 voices (F1-F5, M1-M5). 167× real-time factor on M4 Pro — synthesis faster than playback.
New files:
- supertonic.rs: adapted from official helper.rs (637 lines), ort rc.11 compatible, 4 ONNX sessions (duration/encoder/estimator/vocoder)
Changed:
- tts.rs: Supertonic engine replaces Kokoro, sentence-split lookahead synthesis (synth N+1 while rodio plays N), new_with_voice() for per-agent voice selection, F1 default voice
- models.rs: Supertonic model downloads from HuggingFace (~253MB total), 7 files (4 ONNX + config + tokenizer + F1 voice style)
- Cargo.toml: removed kokoro-tts, added ndarray 0.17 + rayon, unicode-normalization, regex, rand, rand_distr. ort gets ndarray feature.
- mod.rs: kokoro_* → supertonic_* throughout
7 files × 89 = 623 overflows u8 (max 255). Use u32 arithmetic.
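The overflow is easy to reproduce in isolation. The sketch below uses hypothetical function names and does not claim to know what the factor 89 represents in the download code; it only shows why the product needs a wider type.

```rust
/// The buggy shape: the product is computed in u8, so 7 * 89 cannot fit.
/// `checked_mul` makes the overflow visible as `None` instead of a panic
/// (debug builds) or a silent wrap (release builds).
pub fn product_u8(files: u8, factor: u8) -> Option<u8> {
    files.checked_mul(factor)
}

/// The fix: widen both operands before multiplying.
pub fn product_u32(files: u8, factor: u8) -> u32 {
    u32::from(files) * u32::from(factor)
}
```

With plain `*` on `u8`, a release build would wrap to 623 mod 256 = 111, which is the kind of value that makes a progress readout look plausible but wrong.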
The tts.json config shows ae.sample_rate = 44100. Playing 44.1kHz audio at 24kHz makes it sound slow and low-pitched.
Tier 1:
1. Word-skipping fix (.max(1) on latent_lengths) — prevents dropping short words like 'a', 'the', 'hello' (supertonic PR#33)
2. Single persistent rodio Player — eliminates OS audio device setup gap between sentences (was creating new Player per sentence)
3. Batch 3 sentences per synth call — model sees more context, better prosody across sentence boundaries
4. Volume boost 2.5× with clamp — Supertonic output is quiet
Tier 2:
5. 8ms fade in/out at chunk boundaries — eliminates clicks/pops
6. Inter-sentence silence 0.15s (was 0.3s default) — less robotic
7. Pre-buffer via batching — all batches synthesized and queued before playback wait, rodio plays them gaplessly
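Items 4 and 5 (gain with clamp, short fades at chunk edges) can be sketched as a single pass over a sample buffer. The function name and the choice of a linear ramp are assumptions; the 2.5× gain and 8ms fade come from the commit message.

```rust
/// Applies a gain with hard clamping to [-1.0, 1.0], then a linear
/// fade-in and fade-out of `fade_ms` milliseconds at the buffer edges.
pub fn apply_gain_and_fade(samples: &mut [f32], gain: f32, sample_rate: u32, fade_ms: u32) {
    // Boost and clamp so loud peaks cannot exceed full scale.
    for s in samples.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0);
    }

    // Fade length in samples, capped at half the buffer so the two ramps
    // never overlap on very short chunks.
    let fade_len = ((sample_rate as u64 * fade_ms as u64) / 1000) as usize;
    let fade_len = fade_len.min(samples.len() / 2);
    let n = samples.len();

    for i in 0..fade_len {
        let ramp = i as f32 / fade_len as f32;
        samples[i] *= ramp; // fade in from silence
        samples[n - 1 - i] *= ramp; // fade out to silence
    }
}
```

Starting and ending each chunk at zero amplitude avoids the discontinuity that produces an audible click when chunks are queued back to back.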
Comprehensive quality pass on the huddles implementation, driven by iterative crossfire review (codex CLI + opus subagents, 11 rounds).
## Correctness
- Fix barge-in startup order: TTS starts before STT so tts_cancel is available
- Shared tts_cancel in HuddleState: survives pipeline restarts and TTS toggle
- Pipeline replacement leak: shutdown old STT pipeline before replacing
- Participant state: only successfully enrolled agents shown
- Sentence batching: join with space, not period-space (fixes garbled TTS)
- Agent prompt re-delivery: guidelines re-posted when agents join mid-huddle
- TTS duplicate-start prevention: re-check before storing new pipeline
- Two-phase activation: Connected → Active lifecycle with confirm_huddle_active
- Agent enrollment with role=bot: relay membership API correctly identifies agents
## Voice Quality
- Silence threshold: 300ms → 450ms (reduces sentence fragmentation)
- Barge-in debounce: 5 consecutive VAD frames (~80ms) required during TTS
- STT state reset: clean segment state across TTS transitions and cooldown
- STT hot-start: auto-starts when models finish downloading mid-huddle
## Agent Identity (authoritative, fail-closed)
- Relay membership API: fetch_agent_pubkeys_from_relay with role=bot filter
- get_huddle_agent_pubkeys Tauri command for frontend
- Backend periodic refresh in check_pipeline_hotstart (every 5s)
- Frontend periodic refresh (every 10s) with fail-closed semantics
- Result-based error propagation: fetch failures keep TTS muted
- Joiner hydration: join_huddle fetches agent list from relay
## Frontend
- Startup atomicity: ephemeralChannelId set only after full setup
- EOSE-based replay boundary with timestamp belt-and-suspenders
- Agent-only TTS filter: only bot-role pubkeys spoken, fail-closed
- AudioWorklet fire-and-forget: no main-thread backpressure
- Cleanup consolidation: single cleanupFailedStart helper
- leaveHuddle returns boolean: bar stays visible if backend cleanup fails
- HuddleBar respects Connected phase in poll-failure fallback
## Performance and Quality
- LazyLock regex in supertonic.rs: 12 patterns compiled once
- tokio::fs for all async file ops in model downloads
- Model-shape validation: bounds-check dims before indexing
- Async transcription task: tokio::sync::mpsc, no Tokio thread blocking
- Dead code removed, stale comments fixed, rustfmt + biome applied
Multi-round crossfire review (Opus + Codex CLI, 8 full review passes, 17 parallel worker delegates) identified and fixed 25+ issues across the huddle voice pipeline.
Safety & correctness:
- Session generation guard: Arc<AtomicU64> on HuddleState prevents stale transcription tasks from posting kind:9 after leave/end
- Speech buffer capped at 30s (prevents OOM in noisy environments)
- Raw IPC payload bounded at 100KB per batch
- TTS input bounded at 2000 chars with Unicode-safe truncation
- LiveKit token/URL hidden from polling via #[serde(skip)]
- Model downloads: SHA-256 verification with pinned hashes, Rust-native tar+bzip2 extraction with pre-validation (path traversal, symlinks), streaming to disk, size limits, version manifest for cache invalidation
- Pubkey format validation (64 hex chars) at Tauri boundary
- Max 20 agents per huddle enforced on both start and incremental add
- UUID validation on all huddle event builders
- PCM alignment rejection (not just warning) on non-4-byte-aligned input
Architecture & DRY:
- teardown_huddle() helper eliminates duplicated leave/end shutdown code
- start_stt_pipeline delegates to maybe_start_stt_pipeline
- split_sentences consolidated in preprocessing.rs (deleted from tts.rs and supertonic.rs, net -85 lines)
- int_to_words extended to 0-999,999
- Agent refresh throttled to 15s with success-gated timestamp
Lifecycle:
- Creator enforcement: is_creator field on HuddleState, end_huddle rejects non-creators, HuddleBar shows End/Leave conditionally
- Failed startup cleanup calls end_huddle (not leave_huddle) to prevent orphaned ephemeral channels
- Participant state hydrated from relay immediately on start/join
UX & AX:
- Agent prompt tightened: silent when not addressed, no dot responses, no repeat after interruption
- TTS subscription buffers pre-EOSE events, replays live ones (fixes silent drop of fast agent responses)
- AddAgentDialog filters already-added agents
- Participant display shows 'In huddle' instead of misleading count
- Room label shows 'Huddle' instead of raw LiveKit room name
- TTS backpressure logged (Rust + frontend)
- AudioWorklet IPC wrapped behind invokeRawBinary() abstraction
- ASCII lifecycle docs added to HuddleContext, audioWorklet, livekit
- supertonic.rs header fixed (44.1kHz not 24kHz)
- worklet.js documents intentional partial-buffer drop on disconnect
- ParticipantList NaN-safe hue derivation with gray fallback
- tts.rs expect() calls replaced with match+break
18 files changed, +984 -370
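The session generation guard can be illustrated with a shared counter that background tasks compare against before posting. This is a sketch of the general technique rather than the PR's code: `SessionGeneration`, `SessionToken`, and `post_if_current` are hypothetical names, and the `Vec<String>` outbox stands in for posting a kind:9 event.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared counter that is bumped whenever the huddle session ends.
#[derive(Clone)]
pub struct SessionGeneration(Arc<AtomicU64>);

/// Snapshot handed to a background task when it is spawned.
pub struct SessionToken {
    shared: Arc<AtomicU64>,
    seen: u64,
}

impl SessionGeneration {
    pub fn new() -> Self {
        Self(Arc::new(AtomicU64::new(0)))
    }

    /// Called on leave/end: invalidates every token issued so far.
    pub fn bump(&self) -> u64 {
        self.0.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Captures the current generation for a new task.
    pub fn token(&self) -> SessionToken {
        SessionToken {
            shared: Arc::clone(&self.0),
            seen: self.0.load(Ordering::SeqCst),
        }
    }
}

impl SessionToken {
    /// True while the session that issued this token is still live.
    pub fn is_current(&self) -> bool {
        self.shared.load(Ordering::SeqCst) == self.seen
    }
}

/// Posts only if the token is still current; stale transcripts are dropped.
pub fn post_if_current(token: &SessionToken, text: &str, outbox: &mut Vec<String>) -> bool {
    if token.is_current() {
        outbox.push(text.to_string());
        true
    } else {
        false
    }
}
```

The check costs one atomic load per post and needs no coordination with the task itself, so a slow transcription that finishes after teardown simply finds its token stale.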
6-round crossfire review (Opus 9/10 APPROVE, Codex 9/10 APPROVE). Net -21 lines across 9 source files.
Rust backend:
- Unify fetch_agent_pubkeys + fetch_all_member_pubkeys → fetch_channel_members(role_filter)
- Add AppState::huddle() convenience (replaces 25+ lock().map_err() calls)
- spawn_transcription_task uses post_event_raw (checks HTTP status, shared auth)
- maybe_start_tts_pipeline returns Result<bool, String> (surfaces silent failures)
- Guidelines use kind:48106 instead of fragile [System] prefix on kind:9
- Remove duplicate guidelines re-post on add_agent_to_huddle (EOSE replay suffices)
- Extract post_connect_setup helper from start/join huddle (prevents drift)
- Bump session_generation on STT pipeline replacement (prevents stale transcripts)
- Consolidate triple lock in check_pipeline_hotstart → single acquisition
- Extract drain_until_shutdown<T> to mod.rs as pub(super) (shared by stt + tts)
- Tighten voice-mode prompt: 15 words, no filler, no apologies, no meta-responses
- Document TtsPipeline::cancel() as intentional future API surface
Frontend:
- Extract disconnectMedia() helper (leaveHuddle/endHuddle no longer duplicate cleanup)
- TTS subscription uses subscribeToChannelLive (since: now, no historical backlog)
- Remove EOSE buffering/replay — unnecessary with live-only subscription
- Narrow subscription to KIND_STREAM_MESSAGE only (less wire traffic)
- Add subscribeToChannelLive() to RelayClient (limit:1000 for reconnect safety)
Crossfire review (opus×2 + codex) identified 15+ issues. All fixed:
Safety & correctness:
- Fix validate_pubkey_hex panic on multi-byte UTF-8 input
- Fix pipeline init failure wedge (is_finished + dead pipeline detection)
- Fix guidelines delivery race (post kind:48106 before adding agents)
- Fix teardown blocking mutex during thread join
- Fix endHuddle swallowing failures (returns boolean)
- Fix livekit.ts mic track leak on disconnect failure (try/finally)
- Fix tts_active staying true during cancel
- Filter single-char TTS responses (prevents speaking 'period')
DRY extraction:
- models.rs: 826→690 lines via ModelSlot + shared helpers (download_file, fetch_url, fresh_temp_dir, verify_and_install)
- events.rs: 4 huddle event builders → shared build_huddle_event
- tts.rs: 4 cancel/shutdown patterns → handle_cancel_or_shutdown
UX improvements:
- ParticipantList resolves pubkeys to display names (useUsersBatchQuery)
- ProfileAvatar with hex-prefix HexAvatar fallback
- Voice guidelines prompt tightened for fast LLMs
Cleanup:
- Delete dead voice_style_path(), TtsPipeline::cancel()
- Fix stale comments, document LE endianness assumption
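One panic-free way to validate a 64-hex-character pubkey is to work on bytes rather than slicing the string, since slicing a `&str` at a non-character boundary panics. The sketch below shows that approach; whether the original panic came from slicing is an assumption, and the body here is illustrative rather than the PR's `validate_pubkey_hex`.

```rust
/// Accepts exactly 64 ASCII hex digits. Works on bytes, so multi-byte
/// UTF-8 input is rejected cleanly instead of risking a slice panic.
/// Accepting uppercase digits is an assumption of this sketch.
pub fn validate_pubkey_hex(input: &str) -> Result<(), String> {
    let bytes = input.as_bytes();
    if bytes.len() != 64 {
        return Err(format!("expected 64 hex chars, got {} bytes", bytes.len()));
    }
    if !bytes.iter().all(|b| b.is_ascii_hexdigit()) {
        return Err("pubkey contains non-hex characters".to_string());
    }
    Ok(())
}
```

A string of 32 two-byte characters is 64 bytes long, so it passes the length check and is then rejected by the hex check without any indexing into the string.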
Four targeted improvements from crossfire review (codex 9/10, opus 8/10):
1. post_connect_setup fault boundary: model download and member hydration failures no longer tear down a working huddle. The call stays up in degraded mode (no STT/TTS) instead of failing entirely.
2. Remove duplicate preprocessing from supertonic.rs: emoji stripping and whitespace collapsing already handled by preprocessing.rs. Deleted two stale LazyLock<Regex> statics (RE_EMOJI, RE_WHITESPACE).
3. Feature-gate dead code: webhook.rs and session.rs gated behind #[cfg(feature = "webhook")]. hmac/sha2/hex deps made optional. 3 tests run by default, 8 with --features webhook.
4. Model download progress in HuddleBar: shows 'Voice models: STT 42%, TTS 78%' while downloading, disappears when ready. Serde decoder correctly handles both string and object enum variants.
Also: documented WHY dual membership polling (Rust + React) is intentional — Rust preserves stale list on failure (STT p-tags), React clears on failure (TTS authorization must fail-closed). Different safety requirements.
…pp launch
1. Add kind 48106 (huddle guidelines) to the relay's ingest allowlist. The range 48100..=48103 was allowed but 48106 was missing, causing 'restricted: unknown event kind' when posting voice-mode guidelines. Widened to 48100..=48106.
2. Trigger background voice model downloads at app launch (in Tauri setup hook). Models are ~303 MB total (50 MB Moonshine STT + 253 MB Supertonic TTS). Downloads are async, idempotent, SHA-256 verified, and no-op if already cached. First huddle no longer has a cold-start download wait.
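The widened allowlist amounts to a range match on the event kind. A minimal sketch, assuming a `Scope` enum with a `ChannelsWrite` variant as named in an earlier commit; the real relay function handles many other kinds, and the assumption that 48104-48106 map to the same scope as 48100-48103 is not confirmed by the source.

```rust
/// Hypothetical stand-in for the relay's scope type.
#[derive(Debug, PartialEq, Eq)]
pub enum Scope {
    ChannelsWrite,
}

/// Returns the scope required to publish a huddle kind, or None when the
/// kind is outside the huddle range (treated as unknown in this sketch).
pub fn required_scope_for_kind(kind: u32) -> Option<Scope> {
    match kind {
        // Previously 48100..=48103, which left 48106 rejected.
        48100..=48106 => Some(Scope::ChannelsWrite),
        _ => None,
    }
}
```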
…bled
Fixes discovered during live huddle testing with agents:
**Relay:**
- Add kind 48106 (huddle guidelines) to requires_h_channel_scope so guidelines are routed to the ephemeral channel. Agents now see the voice-mode prompt via their subscription.
**TTS (tts.rs):**
- Prime audio output with 100ms silent buffer at worker startup. On macOS, CoreAudio initializes lazily — without priming, the first Player races against device startup and player.empty() returns true prematurely, truncating the first TTS message after a few words.
**STT (stt.rs):**
- Disable barge-in (VAD-based TTS cancellation). Without acoustic echo cancellation, any ambient noise — keyboard, fan, breathing — triggers false barge-in and kills TTS mid-sentence. Push-to-talk will replace this; echo-gating (skip accumulation during TTS) is preserved.
- Reduce TTS cooldown 200ms → 50ms. The old value ate the first word when the user spoke immediately after the agent finished.
- Reduce silence flush threshold 28 → 19 frames (450ms → 300ms). Snappier transcript delivery without splitting mid-word pauses.
Huddles are now enabled out of the box with dev defaults:
LIVEKIT_URL=ws://localhost:7880
LIVEKIT_API_KEY=devkey
LIVEKIT_API_SECRET=secret
Set SPROUT_HUDDLES_DISABLED=true to disable. Override the LIVEKIT_* env vars for production deployments.
Add push-to-talk (PTT) via global Ctrl+Space shortcut as the default voice input mode for huddles. Voice activity detection (VAD) remains as a switchable option with barge-in enabled.
Rust backend:
- VoiceInputMode enum (PushToTalk default, VoiceActivity)
- ptt_active: Arc<AtomicBool> shared with STT pipeline
- Global shortcut handler with 200ms release delay and generation counter to prevent press→release→press race condition
- PTT press cancels TTS immediately (only when TTS is active)
- STT gating: is_speech ANDed with ptt_active flag — natural flush on release via silence accumulation
- PTT mode accumulates entire key-hold as one utterance (no mid-sentence splits on pauses); flush only on release edge
- Barge-in re-enabled for VAD mode with 320ms debounce
- Mode switch mid-huddle restarts STT pipeline
- Disable WebView background throttling (tauri.conf.json) so AudioWorklet keeps processing when window loses focus
Frontend:
- worklet.js: PTT gating via transmitting flag + port.onmessage
- audioWorklet.ts: Tauri ptt-state event → worklet forwarding
- HuddleContext: pttActive/voiceInputMode/setVoiceInputMode state
- HuddleBar: PTT/VAD indicator, mode toggle, green ring on transmit
Also fixes two review items from crossfire:
- Preserve session_generation across error-path state resets
- set_tts_enabled takes pipeline out of lock before shutdown
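The release delay with a generation counter can be sketched without any timers: a key-up records the current generation, and the delayed release only takes effect if no press has happened since. The names below are hypothetical, and the sketch uses a `Mutex` for simplicity where the PR shares an `Arc<AtomicBool>`.

```rust
use std::sync::Mutex;

struct Inner {
    active: bool,
    generation: u64,
}

/// Hypothetical PTT state guarding against press → release → press races.
pub struct PttState {
    inner: Mutex<Inner>,
}

impl PttState {
    pub fn new() -> Self {
        Self { inner: Mutex::new(Inner { active: false, generation: 0 }) }
    }

    /// Key down: start transmitting and invalidate any pending release.
    pub fn press(&self) {
        let mut s = self.inner.lock().unwrap();
        s.generation += 1;
        s.active = true;
    }

    /// Key up: returns a ticket that the delayed release must present.
    pub fn release_requested(&self) -> u64 {
        self.inner.lock().unwrap().generation
    }

    /// Runs after the release delay (200ms in the commit message).
    /// Deactivates only if no new press has happened since the ticket.
    pub fn finish_release(&self, ticket: u64) -> bool {
        let mut s = self.inner.lock().unwrap();
        if s.generation == ticket {
            s.active = false;
            true
        } else {
            false
        }
    }

    pub fn is_active(&self) -> bool {
        self.inner.lock().unwrap().active
    }
}

/// STT gate: VAD speech counts only while the key is held.
pub fn gated_speech(is_speech: bool, ptt: &PttState) -> bool {
    is_speech && ptt.is_active()
}
```

If the user taps the key again within the delay window, the stale ticket no longer matches and the pending release is ignored, so transmission is not cut off mid-hold.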
Enable multiple humans to join the same huddle simultaneously.
Backend (mod.rs):
- join_huddle: add human to ephemeral channel (hard-fail unless already member), full rollback pattern on any failure
- leave_huddle: auto-end when last human departs — emits HUDDLE_ENDED + archives ephemeral channel, avoids relay 'cannot remove last owner' error by skipping build_leave
- end_huddle: any participant can end (creator disconnect recovery)
- Extract reset_preserving_generation() DRY helper (3 call sites)
- Add count_human_members() for last-human detection
Frontend context (HuddleContext.tsx):
- Add joinHuddle() to context API
- Extract connectAndSetupMedia() shared helper (start + join)
- Add activeSpeakers + isReconnecting state from LiveKit events
- cleanupFailedStart(isCreator): joiners use leave_huddle only, preventing a failed join from ending everyone's huddle
Frontend livekit.ts:
- HuddleRoomCallbacks: ActiveSpeakersChanged, Disconnected, Reconnecting, Reconnected room event listeners
- getUserMedia with echoCancellation + noiseSuppression
Frontend UI:
- HuddleIndicator (new): subscribes to kind:48100-48103 via dedicated subscribeToHuddleEvents (limit 100), event-id dedup with batch reconstruction, causal sort, fail-closed validation, infers huddle from JOIN/LEFT even without START event. Renders start button when idle, green glow join button when active. Single icon replaces dual headphone buttons.
- HuddleBar: reconnecting indicator, animate-pulse PTT (WCAG), ARIA live region (output element), Leave + End-all buttons for all participants
- ParticipantList: activeSpeakers green ring, aria-label/role
- ChannelMembersBar: single HuddleIndicator replaces dual icons, invalidateQueries on start/join for immediate sidebar refresh
Relay client (relayClientSession.ts):
- subscribeToHuddleEvents(): kinds 48100-48103 only, limit 100
File size overrides bumped for mod.rs (1450), HuddleContext (630), relayClientSession (830) to accommodate multi-human additions.
TTS Engine Replacement (Supertonic → Kokoro):
- New kokoro.rs: Kokoro-82M ONNX TTS engine, 100% GPL-free
- Single ONNX session via ort crate (was 4 sessions with Supertonic)
- CoreML acceleration with automatic CPU fallback
- 24kHz output (was 44.1kHz) — rodio resamples transparently
- 54 voices across 9 languages (was 10 voices, 5 languages)
- G2P: three-tier lookup plus a spelling fallback, no espeak-ng dependency
  - Tier 1: misaki gold+silver dicts (178K words, Apache-2.0)
  - Tier 2: CMUdict (126K words, BSD) with ARPAbet→IPA conversion
  - Tier 3: morphological suffix rules (-s/-ed/-ing) from misaki
  - Fallback: letter-by-letter spelling for OOV words
  - Contraction handling (don't, it's, we've, etc.)
- Model download: model_q8f16.onnx (86MB), af_heart.bin voice (510KB), us_gold.json + us_silver.json (6MB), cmudict.dict (3.6MB). All SHA-256 verified, runtime download via existing model manager
- tts.rs: sentence-by-sentence lookahead pipeline (BATCH_SIZE=1) for 200ms TTFA with zero-gap playback
Crossfire Review Fixes:
- C1: end_huddle now requires is_creator (or explicit force flag). UI hides 'End all' button for non-creators
- C2: tts_starting sentinel prevents TOCTOU race in pipeline creation
- C3: tts_active flag set after first player.append, not before synthesis
- I1: audioWorklet setMode() isolates VAD from PTT events
- I4: voiceInputModeRef prevents stale closure in connectAndSetupMedia
Bug Fixes:
- Fix tokio::spawn panic during Tauri setup (use tauri::async_runtime::spawn)
- Fix earshot VAD panic on out-of-range samples (clamp after resampling)
Dependencies:
- ort: rc.11 → rc.12 + coreml feature (zero binary size cost on macOS)
- Removed: rand, rand_distr, unicode-normalization (Supertonic-only)
Licenses: Kokoro model (Apache-2.0), misaki dicts (Apache-2.0), CMUdict (BSD), ort (MIT), ONNX Runtime (MIT). No GPL anywhere.
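The tiered G2P lookup can be pictured as a chain of fallbacks. The sketch below is illustrative only: the `G2p` type, the `Tier` enum, and the placeholder phoneme strings are hypothetical, the suffix tier handles just a trailing "s", and real phoneme output would come from the dictionaries named above.

```rust
use std::collections::HashMap;

/// Which stage produced the result (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Tier {
    Gold,
    CmuDict,
    SuffixRule,
    Spelled,
}

pub struct G2p {
    gold: HashMap<String, String>,    // tier 1: curated dictionary
    cmudict: HashMap<String, String>, // tier 2: assumed already converted to IPA
}

impl G2p {
    pub fn new(gold: HashMap<String, String>, cmudict: HashMap<String, String>) -> Self {
        Self { gold, cmudict }
    }

    /// Tries each tier in order and reports which one matched.
    pub fn lookup(&self, word: &str) -> (String, Tier) {
        let w = word.to_lowercase();
        if let Some(p) = self.gold.get(&w) {
            return (p.clone(), Tier::Gold);
        }
        if let Some(p) = self.cmudict.get(&w) {
            return (p.clone(), Tier::CmuDict);
        }
        // Tier 3 (simplified): strip a trailing "s" and retry the dictionaries.
        if let Some(stem) = w.strip_suffix('s') {
            if let Some(p) = self.gold.get(stem).or_else(|| self.cmudict.get(stem)) {
                return (format!("{p}+s"), Tier::SuffixRule);
            }
        }
        // Fallback: spell the word out letter by letter.
        let spelled = w.chars().map(|c| c.to_string()).collect::<Vec<_>>().join(" ");
        (spelled, Tier::Spelled)
    }
}
```

Ordering the tiers from most to least curated means the spelling fallback only fires for genuinely out-of-vocabulary words.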
Summary
Adds real-time voice huddles to Sprout. Humans talk via LiveKit WebRTC; each desktop GUI locally transcribes speech and posts it to an ephemeral Nostr text channel where agents read, respond, and get read aloud. Agents never touch audio — they see only text. Multiple humans can join the same huddle.
Push-to-talk (Ctrl+Space) is the default voice input mode. Voice activity detection (VAD) with barge-in remains as a switchable option.
What's New
Kokoro TTS (replaced Supertonic)
Multi-Human Huddles
- HuddleIndicator detects active huddles via dedicated kind:48100-48103 subscription, shows green glow headphone icon with participant count badge
- end_huddle requires is_creator (or explicit force flag for crash recovery). UI hides "End all" for non-creators.
- leave_huddle, never end_huddle
- isReconnecting state shown in HuddleBar
- animate-pulse on PTT dot, aria-label/role on avatars
Crossfire Fixes (from multi-model review: Opus + Codex + Gemini)
- end_huddle creator check + UI guard
- tts_starting sentinel prevents TOCTOU race in pipeline creation
- tts_active flag set after first player.append, not before synthesis
- setMode() isolates VAD from PTT events
- voiceInputModeRef prevents stale closure in connectAndSetupMedia
- tokio::spawn → tauri::async_runtime::spawn (panic during Tauri setup)
Relay
- HuddleService wired into AppState — enabled by default with dev credentials
- POST /api/huddles/{channel_id}/token — LiveKit token endpoint with auth + scope + membership checks
SDK
Desktop — Rust (desktop/src-tauri/src/huddle/)
- kokoro.rs — Kokoro-82M ONNX TTS: single ort session, CoreML with CPU fallback, three-tier G2P (misaki + CMUdict + morphological rules), ARPAbet→IPA conversion, contraction handling
- mod.rs — HuddleManager state machine, 16 Tauri commands, VoiceInputMode (PTT/VAD), session generation guard, tts_starting sentinel, auto-end on last human leave
- stt.rs — STT pipeline: rubato resample → earshot VAD (clamped) → sherpa-onnx Moonshine. PTT gating, barge-in with 320ms debounce
- tts.rs — TTS pipeline: Kokoro sentence-by-sentence lookahead → rodio playback. Cancel flag for PTT/barge-in, tts_active set after first append
- models.rs — Background model download (Moonshine + Kokoro + CMUdict), SHA-256 verification, atomic swap install
- preprocessing.rs — Text cleanup for TTS: strip markdown/code/URLs, numbers→words, unified split_sentences. 18 unit tests
- agents.rs — Agent enrollment, voice-mode guidelines (kind:48106)
Desktop — WebView (desktop/src/features/huddle/)
- startHuddle, joinHuddle, shared connectAndSetupMedia, voiceInputModeRef, fail-closed agent TTS filtering
- setMode() for VAD/PTT isolation
Push-to-Talk Design
New Dependencies
- sherpa-onnx
- ort
- ndarray
- earshot
- rubato
- rodio
- livekit-client
Model licenses: Kokoro (Apache-2.0), misaki dicts (Apache-2.0), CMUdict (BSD), Moonshine (MIT).
License audit: Zero GPL dependencies.
Testing
Follow-up
- supertonic.rs (dead code, zero references)