@zkewal zkewal commented Feb 9, 2026

Summary

Adds realtime streaming transcription to livekit-plugins-mistralai using Mistral's voxtral-mini-transcribe-realtime-2602 model via the first-party SDK's RealtimeTranscription API. The existing batch transcription path is preserved unchanged.

Why This Matters

The realtime model streams partial transcription deltas mid-utterance instead of waiting for the full audio, which reduces perceived latency. In our private evals, it also significantly reduces the batch model's tendency to hallucinate on silence. See #4754 for context.

Changes

stt.py (+470/-12)

  • Add SpeechStream(stt.RecognizeStream) — fully async streaming via RealtimeTranscription.transcribe_stream() with 50ms PCM audio chunks
  • Debounced finalization after VAD FlushSentinel (finalize_delay_ms, default 100ms), plus idle-based finalization (650ms) for the STTNode path where FlushSentinel is not used
  • Map Mistral events to LiveKit lifecycle: TextDelta → INTERIM_TRANSCRIPT, SegmentDelta/Done → FINAL_TRANSCRIPT, with START/END_OF_SPEECH
  • Unknown event recovery and duplicate final detection
  • Model-based routing: batch models use recognize(), realtime models use stream(). Calling recognize() with a realtime model raises a clear error
  • _build_capabilities() with graceful fallback for offline_recognize across livekit-agents versions
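
The two finalization paths above (a short debounce after a VAD flush, and a longer idle timeout when no flush arrives) can be sketched with plain asyncio. This is a minimal illustration, not the code in stt.py: `DebouncedFinalizer`, its method names, and the callback shape are hypothetical, and only the 100 ms and 650 ms defaults come from this description.

```python
import asyncio
from typing import Callable, List, Optional


class DebouncedFinalizer:
    """Hypothetical helper (not the PR's class): emits one final transcript
    once the stream has been quiet for a configurable delay.
    Methods must be called from inside a running event loop."""

    def __init__(
        self,
        on_final: Callable[[str], None],
        finalize_delay_ms: int = 100,  # debounce after a VAD flush
        idle_timeout_ms: int = 650,    # fallback when no flush ever arrives
    ) -> None:
        self._on_final = on_final
        self._flush_delay = finalize_delay_ms / 1000
        self._idle_delay = idle_timeout_ms / 1000
        self._parts: List[str] = []
        self._flush_pending = False
        self._timer: Optional[asyncio.Task] = None

    def add_text(self, text: str) -> None:
        """Record a transcript delta and restart the quiet-period timer."""
        self._parts.append(text)
        delay = self._flush_delay if self._flush_pending else self._idle_delay
        self._restart_timer(delay)

    def flush_requested(self) -> None:
        """Call when the VAD signals end of utterance (the flush sentinel)."""
        self._flush_pending = True
        self._restart_timer(self._flush_delay)

    def _restart_timer(self, delay: float) -> None:
        # Cancel any pending finalization and start a fresh countdown.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = asyncio.create_task(self._fire_after(delay))

    async def _fire_after(self, delay: float) -> None:
        await asyncio.sleep(delay)  # cancelled here if new input arrives
        text = "".join(self._parts).strip()
        self._parts.clear()
        self._flush_pending = False
        self._timer = None
        if text:
            self._on_final(text)


async def demo() -> List[str]:
    finals: List[str] = []
    fin = DebouncedFinalizer(finals.append, finalize_delay_ms=50)
    fin.add_text("hello ")
    fin.flush_requested()
    await asyncio.sleep(0.01)
    fin.add_text("world")  # a late delta restarts the 50 ms debounce
    await asyncio.sleep(0.3)
    return finals
```

Restarting the timer on every late delta is what keeps a trailing word from being split into a second final transcript; the trade-off is that each late delta pushes the final out by another full delay.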

models.py

  • Add voxtral-mini-transcribe-realtime-2602 to STTModels
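
With the realtime model name registered, the model-based routing listed under stt.py could look roughly like the sketch below. The set name, the helper functions, and the choice of `ValueError` are assumptions for illustration; the PR only states that `recognize()` with a realtime model raises a clear error.

```python
from typing import Set

# Assumption: only the model named in this PR is treated as realtime.
REALTIME_MODELS: Set[str] = {"voxtral-mini-transcribe-realtime-2602"}


def is_realtime_model(model: str) -> bool:
    """True if the model should go through the streaming path."""
    return model in REALTIME_MODELS


def pick_entry_point(model: str) -> str:
    """Name of the STT method a caller would be expected to use."""
    return "stream" if is_realtime_model(model) else "recognize"


def ensure_batch_model(model: str) -> None:
    """Guard for the batch path: reject realtime-only models early.
    The exception type is hypothetical."""
    if is_realtime_model(model):
        raise ValueError(
            f"Model {model!r} is realtime-only; use stream() instead of recognize()."
        )
```

Failing fast in the batch path gives the caller an actionable message instead of an opaque API error from the transcription endpoint.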

pyproject.toml

  • mistralai>=1.9.11 → mistralai[realtime]>=1.12.0 (adds WebSocket support)
  • livekit-agents>=1.4.1 → >=1.3.5 (no 1.4-specific APIs used; avoids OpenTelemetry version conflict with mistralai==1.12.0)
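
To see why the OpenTelemetry constraints clash, a rough check like the one below can help when debugging locally. It is a sketch only: the helper names are made up, the parser handles plain dotted releases rather than full PEP 440 versions, and the figures in the comments (a 1.39.0 floor on one side, 1.38.0 on the other) are the ones reported in the follow-up comment on this PR.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '1.38.0' (no pre-release tags)."""
    return tuple(int(part) for part in version.split("."))


def can_coexist(required_min: str, allowed_max: str) -> bool:
    """True if a '>= required_min' constraint and a '<= allowed_max'
    constraint on the same package share at least one version."""
    return parse_release(required_min) <= parse_release(allowed_max)


def installed_version(dist_name: str) -> Optional[str]:
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Figures as reported in this thread: one side needs opentelemetry-api >= 1.39.0,
# the other depends on 1.38.0, so no single version satisfies both.
conflict_free = can_coexist("1.39.0", "1.38.0")
```

Calling `installed_version("opentelemetry-api")` in the target environment shows which side of the conflict the resolver picked.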

__init__.py

  • Export SpeechStream

Usage

from livekit.plugins.mistralai import STT

# Streaming (new)
stt = STT(model="voxtral-mini-transcribe-realtime-2602")
stream = stt.stream()

# Batch (unchanged)
stt = STT(model="voxtral-mini-latest")
event = await stt.recognize(buffer)
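
A caller reading from the stream would see the lifecycle listed under Changes. The sketch below models that mapping with stand-in types so it runs without the SDK: `ProviderEvent`, its `kind` strings, and `map_events` are hypothetical and do not reproduce Mistral's or LiveKit's actual classes. The duplicate-final rule shown here is one simple interpretation, not necessarily the PR's.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional, Tuple


class EventType(Enum):
    START_OF_SPEECH = "start_of_speech"
    INTERIM_TRANSCRIPT = "interim_transcript"
    FINAL_TRANSCRIPT = "final_transcript"
    END_OF_SPEECH = "end_of_speech"


@dataclass
class ProviderEvent:
    """Stand-in for a provider event; kind is 'text_delta', 'segment_done',
    or anything else (treated as unknown)."""
    kind: str
    text: str = ""


def map_events(events: Iterable[ProviderEvent]) -> List[Tuple[EventType, str]]:
    out: List[Tuple[EventType, str]] = []
    speaking = False
    interim = ""
    last_final: Optional[str] = None
    for ev in events:
        if ev.kind == "text_delta":
            if not speaking:
                speaking = True
                out.append((EventType.START_OF_SPEECH, ""))
            interim += ev.text
            out.append((EventType.INTERIM_TRANSCRIPT, interim))
        elif ev.kind == "segment_done":
            final = (ev.text or interim).strip()
            interim = ""
            if not final:
                # Nothing to finalize; still close an open utterance.
                if speaking:
                    out.append((EventType.END_OF_SPEECH, ""))
                    speaking = False
                continue
            if not speaking and final == last_final:
                continue  # repeated final with no new deltas: drop it
            if not speaking:
                out.append((EventType.START_OF_SPEECH, ""))
            out.append((EventType.FINAL_TRANSCRIPT, final))
            out.append((EventType.END_OF_SPEECH, ""))
            speaking = False
            last_final = final
        else:
            continue  # unknown event: skip it and keep the stream alive
    return out


sample = [
    ProviderEvent("text_delta", "hel"),
    ProviderEvent("text_delta", "lo"),
    ProviderEvent("segment_done", "hello"),
    ProviderEvent("segment_done", "hello"),  # duplicate final
    ProviderEvent("mystery"),                # unknown event
]
mapped = map_events(sample)
```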

Testing & Validation

  • Verified batch backward compatibility (existing models unaffected)
  • Validated VAD flush finalization and idle-timeout finalization paths
  • Confirmed no dependency conflicts with livekit-agents>=1.3.5

Closes #4754

Add realtime streaming transcription using Mistral's
   voxtral-mini-transcribe-realtime-2602 model via the first-party SDK's
   RealtimeTranscription API.

   - Add SpeechStream (stt.RecognizeStream) with async audio generator
     and event-driven transcription processing
   - Support interim transcripts (TextDelta) and final transcripts
     (SegmentDelta/Done) with START/END_OF_SPEECH lifecycle events
   - Debounced finalization after VAD FlushSentinel with configurable
     finalize_delay_ms, plus idle-based finalization for STTNode path
   - Model-based routing: batch models use recognize(), realtime models
     use stream()
   - Add mistralai[realtime] dependency for WebSocket support

   Closes livekit#4754
@zkewal zkewal changed the title from "Feat/mistralai streaming stt" to "feat(mistralai): add streaming STT support for Voxtral Realtime" Feb 9, 2026

zkewal commented Feb 9, 2026

@chenghao-mou there is a dependency conflict over the opentelemetry-api package between livekit/agents and the mistralai first-party SDK, which provides the WebSocket support needed for the streaming model. livekit/agents requires opentelemetry-api >=1.39.0 (the floor was raised in the 1.3.6 release), while the mistralai Python SDK depends on version 1.38.0.

How do you suggest I handle this? In my local testing, I downgraded livekit/agents to 1.3.5 and was able to test the current PR.

More details here: https://github.com/livekit/agents/actions/runs/21839471789/job/63019576507?pr=4773#step:5:13

I submitted an issue to Mistral AI to raise the upper limit: mistralai/client-python#341

Development

Successfully merging this pull request may close these issues.

Add Streaming STT Support for Voxtral Realtime
