37 commits
87068d5
add interface draft
chenghao-mou Feb 5, 2026
e0d5ec1
Merge branch 'main' into feat/AGT-2520-multimodal-EOU
chenghao-mou Mar 6, 2026
8eebccc
draft
chenghao-mou Mar 11, 2026
f92fbc0
fix type issues
chenghao-mou Mar 11, 2026
d1086ff
refactor stream to support turn detector protocol
chenghao-mou Mar 12, 2026
0a02bb1
minor fixes
chenghao-mou Mar 12, 2026
168d0d7
minor fixes
chenghao-mou Mar 12, 2026
277db6e
WIP: use only ws stream
chenghao-mou Mar 24, 2026
03c0e2e
Merge branch 'main' into feat/AGT-2520-multimodal-EOU
chenghao-mou Mar 24, 2026
56b4796
fix uv.lock bad merge
chenghao-mou Mar 24, 2026
be9a550
WIP: more refactoring
chenghao-mou Mar 25, 2026
601229c
fix mypy
chenghao-mou Mar 25, 2026
c4d92f8
remove temp url
chenghao-mou Mar 25, 2026
e963d85
disable turn detection when agent is still speaking
chenghao-mou Mar 25, 2026
c529d79
minor refactoring
chenghao-mou Mar 29, 2026
09baed8
fix type issues
chenghao-mou Mar 29, 2026
3830638
wip
chenghao-mou Apr 10, 2026
f214aa0
clean up encoder
chenghao-mou Apr 20, 2026
c922f44
wip
chenghao-mou Apr 20, 2026
f94a0dd
Merge branch 'main' into feat/AGT-2520-multimodal-EOU
chenghao-mou Apr 20, 2026
604bfdc
update protos
chenghao-mou Apr 21, 2026
f9ec64a
minor fixes
chenghao-mou Apr 21, 2026
ddbf594
address comments
chenghao-mou Apr 21, 2026
d465564
add text fallback
chenghao-mou Apr 22, 2026
6e7d6bf
add text fallback
chenghao-mou Apr 22, 2026
200d634
fix threshold
chenghao-mou Apr 22, 2026
dbd11b0
remove temp deps
chenghao-mou Apr 22, 2026
60004dd
support realtime model
chenghao-mou Apr 22, 2026
6de53f4
fix type issues
chenghao-mou Apr 22, 2026
4ed8a82
add id in logs
chenghao-mou Apr 23, 2026
0db57ea
use threaded audio encoder
chenghao-mou Apr 24, 2026
bbcfc3a
close encoder
chenghao-mou Apr 24, 2026
7e04332
update dep
chenghao-mou Apr 27, 2026
04db92f
address comments
chenghao-mou Apr 30, 2026
46fd3bf
add cloud agent worker token
chenghao-mou Apr 30, 2026
e4e8ef6
Merge branch 'main' into feat/AGT-2520-multimodal-EOU
chenghao-mou Apr 30, 2026
fc94068
fix type issues
chenghao-mou Apr 30, 2026
2 changes: 1 addition & 1 deletion examples/README.md
@@ -54,7 +54,7 @@ To run the examples, you'll need:

- A [LiveKit Cloud](https://cloud.livekit.io) account or a local [LiveKit server](https://github.com/livekit/livekit)
- API keys for the model providers you want to use in a `.env` file
-- Python 3.9 or higher
+- Python 3.10 or higher
- [uv](https://docs.astral.sh/uv/)

### Environment file
11 changes: 9 additions & 2 deletions examples/voice_agents/basic_agent.py
@@ -20,7 +20,6 @@
from livekit.agents.beta import EndCallTool
from livekit.agents.llm import function_tool
from livekit.plugins import silero
-from livekit.plugins.turn_detector.multilingual import MultilingualModel

# uncomment to enable Krisp background voice/noise cancellation
# from livekit.plugins import noise_cancellation
@@ -98,7 +97,15 @@ async def entrypoint(ctx: JobContext) -> None:
         turn_handling=TurnHandlingOptions(
             # VAD and turn detection are used to determine when the user is speaking and when the agent should respond
             # See more at https://docs.livekit.io/agents/build/turns
-            turn_detection=MultilingualModel(),
+            # turn_detection=MultilingualModel(),
+            turn_detection=inference.MultimodalTurnDetector(
+                # TODO: @chenghao-mou remove this before merging
+                base_url="http://0.0.0.0:8080/v1",
+            ),
+            endpointing={
+                "min_delay": 0.5,
+                "max_delay": 3.0,
+            },
Comment on lines +100 to +108

🔴 Example basic_agent.py hardcodes localhost URL for turn detection

The basic example hardcodes base_url="http://0.0.0.0:8080/v1" for the MultimodalTurnDetector. This overrides the production default DEFAULT_BASE_URL = "https://agent-gateway.livekit.cloud/v1" (detector.py:24). Since this is the primary example users reference, anyone copying this code will get connection failures unless they happen to run a local turn detection service on port 8080. This appears to be a debugging leftover—the previous code used MultilingualModel() with no special URL.

Suggested change
-            # turn_detection=MultilingualModel(),
-            turn_detection=inference.MultimodalTurnDetector(
-                base_url="http://0.0.0.0:8080/v1",
-            ),
-            endpointing={
-                "min_delay": 1.5,
-                "max_delay": 3.0,
-            },
+            # turn_detection=MultilingualModel(),
+            turn_detection=inference.MultimodalTurnDetector(),
+            endpointing={
+                "min_delay": 1.5,
+                "max_delay": 3.0,
+            },
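
For reference, the precedence this comment relies on can be sketched in isolation. This is a minimal stand-in, not the library's `utils.resolve_env_var`: the function name `resolve_base_url` and its `env` parameter are hypothetical, and the ordering (explicit argument, then `LIVEKIT_INFERENCE_URL`, then `DEFAULT_BASE_URL`) is an assumption inferred from the constructor call in detector.py.

```python
from __future__ import annotations

import os
from typing import Mapping

# Value taken from detector.py in this PR.
DEFAULT_BASE_URL = "https://agent-gateway.livekit.cloud/v1"


def resolve_base_url(
    explicit: str | None = None,
    env: Mapping[str, str] | None = None,
) -> str:
    """Hypothetical helper: explicit argument, then env var, then default."""
    env = os.environ if env is None else env
    if explicit is not None:
        # A hardcoded argument wins over everything else, which is why the
        # example's local URL would shadow the production default.
        return explicit
    return env.get("LIVEKIT_INFERENCE_URL") or DEFAULT_BASE_URL
```

Under that assumption, dropping the `base_url` argument from the example lets users override the endpoint through the environment while everyone else gets the default.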

             interruption={
                 # sometimes background noise could interrupt the agent session, these are considered false positive interruptions
                 # when it's detected, you may resume the agent's speech
10 changes: 10 additions & 0 deletions livekit-agents/livekit/agents/inference/__init__.py
@@ -7,6 +7,12 @@
 from .llm import LLM, LLMModels, LLMStream
 from .stt import STT, STTModels
 from .tts import TTS, TTSModels
+from .turn_detection import (
+    MIN_SILENCE_DURATION_MS,
+    MultimodalTurnDetector,
+    TurnDetectionEvent,
+    TurnDetectionStream,
+)

 __all__ = [
     "STT",
@@ -16,8 +22,12 @@
     "STTModels",
     "TTSModels",
     "LLMModels",
+    "MultimodalTurnDetector",
+    "TurnDetectionStream",
+    "TurnDetectionEvent",
     "AdaptiveInterruptionDetector",
     "InterruptionDetectionError",
     "OverlappingSpeechEvent",
     "InterruptionDataFrameType",
+    "MIN_SILENCE_DURATION_MS",
 ]
10 changes: 10 additions & 0 deletions livekit-agents/livekit/agents/inference/turn_detection/__init__.py
@@ -0,0 +1,10 @@
from .detector import MultimodalTurnDetector, TurnDetectionEvent, TurnDetectorOptions
from .stream import MIN_SILENCE_DURATION_MS, TurnDetectionStream

__all__ = [
    "MultimodalTurnDetector",
    "TurnDetectionStream",
    "TurnDetectionEvent",
    "TurnDetectorOptions",
    "MIN_SILENCE_DURATION_MS",
]
137 changes: 137 additions & 0 deletions livekit-agents/livekit/agents/inference/turn_detection/detector.py
@@ -0,0 +1,137 @@
from __future__ import annotations

import os
import weakref
from dataclasses import dataclass
from typing import TYPE_CHECKING, Literal

import aiohttp

from ... import utils
from ...language import LanguageCode
from ...types import (
    DEFAULT_API_CONNECT_OPTIONS,
    NOT_GIVEN,
    APIConnectOptions,
    NotGivenOr,
)
from .languages import LANGUAGES

if TYPE_CHECKING:
    from .stream import TurnDetectionStream


DEFAULT_SAMPLE_RATE: int = 16000
DEFAULT_BASE_URL = "https://agent-gateway.livekit.cloud/v1"


@dataclass
class TurnDetectionEvent:
    type: Literal["eot_prediction"]
    end_of_turn_probability: float
    last_speaking_time: float
    detection_delay: float | None = None
    backend: Literal["multimodal", "text"] = "multimodal"


@dataclass
class TurnDetectorOptions:
    sample_rate: int
    base_url: str
    api_key: str
    api_secret: str
    conn_options: APIConnectOptions


class MultimodalTurnDetector:
    def __init__(
        self,
        *,
        base_url: NotGivenOr[str] = NOT_GIVEN,
        api_key: NotGivenOr[str] = NOT_GIVEN,
        api_secret: NotGivenOr[str] = NOT_GIVEN,
        sample_rate: int = DEFAULT_SAMPLE_RATE,
        http_session: aiohttp.ClientSession | None = None,
        conn_options: APIConnectOptions = DEFAULT_API_CONNECT_OPTIONS,
    ) -> None:
        lk_base_url = utils.resolve_env_var(
            base_url, "LIVEKIT_INFERENCE_URL", default=DEFAULT_BASE_URL
        )
        lk_api_key = utils.resolve_env_var(
            api_key, "LIVEKIT_INFERENCE_API_KEY", "LIVEKIT_API_KEY", default=""
        )
        lk_api_secret = utils.resolve_env_var(
            api_secret, "LIVEKIT_INFERENCE_API_SECRET", "LIVEKIT_API_SECRET", default=""
        )
        if not lk_api_secret:
            raise ValueError(
                "api_secret is required, either as argument or set LIVEKIT_API_SECRET env var"
            )
        if not lk_api_key:
            raise ValueError(
                "api_key is required, either as argument or set LIVEKIT_API_KEY env var"
            )

        self._worker_token = os.getenv("LIVEKIT_WORKER_TOKEN")
        self._opts = TurnDetectorOptions(
            sample_rate=sample_rate,
            base_url=lk_base_url,
            api_key=lk_api_key,
            api_secret=lk_api_secret,
            conn_options=conn_options,
        )
Comment thread
devin-ai-integration[bot] marked this conversation as resolved.

        self._session = http_session
        self._streams: weakref.WeakSet[TurnDetectionStream] = weakref.WeakSet()

    @property
    def model(self) -> str:
        return "eot-multimodal"

    @property
    def provider(self) -> str:
        return "livekit"

    def _ensure_session(self) -> aiohttp.ClientSession:
        if not self._session:
            self._session = utils.http_context.http_session()
        return self._session

    def stream(
        self,
        *,
        conn_options: APIConnectOptions = DEFAULT_API_CONNECT_OPTIONS,
    ) -> TurnDetectionStream:
        from .stream import TurnDetectionStream

        stream: TurnDetectionStream = TurnDetectionStream(
            detector=self,
            opts=self._opts,
            conn_options=conn_options,
        )
        self._streams.add(stream)
        return stream

    async def unlikely_threshold(
        self, language: LanguageCode | None, modality: Literal["multimodal", "text"] = "multimodal"
    ) -> float | None:
        thresholds = LANGUAGES.get(
            language.language if language is not None else "en", (0.35, 0.011)
        )
        if modality == "multimodal":
            return thresholds[0]
        else:
            return thresholds[1]

    async def supports_language(
        self, language: LanguageCode | None, modality: Literal["multimodal", "text"] = "multimodal"
    ) -> bool:
        # default to english if no language is provided
        lang = language.language if language is not None else "en"
        return lang in LANGUAGES
Comment on lines +126 to +131

🟡 supports_language ignores modality parameter, returning True for unsupported multimodal languages

MultimodalTurnDetector.supports_language checks only whether the language key exists in LANGUAGES, completely ignoring the modality parameter. For languages like "nl", "pt", "it", "ru", "tr", and "id" that have None as their multimodal (audio) threshold in languages.py:11-16, the method returns True even though multimodal inference cannot produce a usable threshold.

Impact in audio_recognition.py

At audio_recognition.py:1087, the caller checks supports_language without passing modality, so it defaults to "multimodal". Because the check returns True, the code proceeds to run the turn prediction and then calls unlikely_threshold, which returns None. The threshold-based delay adjustment at lines 1113–1117 is skipped (min_delay is always used), and the auto-flush at audio_recognition.py:1388 never triggers. As a result, the multimodal turn detector runs inference for these languages but can never properly gate turn completion.

Suggested change
-    async def supports_language(
-        self, language: LanguageCode | None, modality: Literal["multimodal", "text"] = "multimodal"
-    ) -> bool:
-        # default to english if no language is provided
-        lang = language.language if language is not None else "en"
-        return lang in LANGUAGES
+    async def supports_language(
+        self, language: LanguageCode | None, modality: Literal["multimodal", "text"] = "multimodal"
+    ) -> bool:
+        # default to english if no language is provided
+        lang = language.language if language is not None else "en"
+        thresholds = LANGUAGES.get(lang)
+        if thresholds is None:
+            return False
+        if modality == "multimodal":
+            return thresholds[0] is not None
+        return thresholds[1] is not None
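
The behavior of the suggested check can be exercised outside the class with a small standalone sketch. It is a simplification, not the method itself: the function is synchronous, takes a plain language string instead of a `LanguageCode`, and uses only a two-entry subset of the `LANGUAGES` table from this PR. The `Modality` alias is introduced here for illustration.

```python
from __future__ import annotations

from typing import Literal, Optional

Modality = Literal["multimodal", "text"]

# Subset of the table in languages.py: (audio threshold, text threshold).
LANGUAGES: dict[str, tuple[Optional[float], Optional[float]]] = {
    "en": (0.35, 0.011),
    "nl": (None, 0.0077),
}


def supports_language(lang: str | None, modality: Modality = "multimodal") -> bool:
    """Return True only if a threshold exists for this language and modality."""
    # Default to English when no language is provided, as the PR code does.
    thresholds = LANGUAGES.get(lang or "en")
    if thresholds is None:
        return False
    index = 0 if modality == "multimodal" else 1
    return thresholds[index] is not None
```

With this table, `"nl"` is rejected for the default multimodal modality but accepted for `"text"`, which is the distinction the comment asks for.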


    async def aclose(self) -> None:
        for stream in list(self._streams):
            await stream.aclose()
        self._streams.clear()
        self._session = None
17 changes: 17 additions & 0 deletions livekit-agents/livekit/agents/inference/turn_detection/languages.py
@@ -0,0 +1,17 @@
LANGUAGES = {
    # language code: (audio threshold, text threshold)
    "en": (0.35, 0.011),
    "fr": (0.35, 0.0078),
    "de": (0.35, 0.0062),
    "hi": (0.35, 0.0398),
    "ja": (0.35, 0.0096),
    "ko": (0.35, 0.0156),
    "zh": (0.35, 0.0066),
    "es": (0.35, 0.0058),
    "nl": (None, 0.0077),
    "pt": (None, 0.0069),
    "it": (None, 0.0037),
    "ru": (None, 0.0032),
    "tr": (None, 0.0045),
    "id": (None, 0.0132),
}
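
To make the role of these thresholds concrete, here is a rough sketch of how a caller might map an end-of-turn probability to an endpointing delay. The consuming code in audio_recognition.py is not part of this diff, so the function name `endpointing_delay` and its exact rule are assumptions based on the review comment above: a probability below the threshold waits for `max_delay`, and a missing threshold skips the adjustment so `min_delay` is always used. The default delays are the values from the basic_agent.py example.

```python
from __future__ import annotations

from typing import Optional


def endpointing_delay(
    probability: float,
    threshold: Optional[float],
    min_delay: float = 0.5,
    max_delay: float = 3.0,
) -> float:
    """Hypothetical rule: wait longer when the end of turn looks unlikely."""
    if threshold is not None and probability < threshold:
        # End of turn is unlikely, so give the user more time to continue.
        return max_delay
    # Either the turn looks finished or no threshold exists for this
    # language and modality, in which case no adjustment is possible.
    return min_delay
```

Under this reading, a `None` audio threshold (as for `"nl"` or `"pt"`) means the detector's prediction never lengthens the delay, which matches the impact described in the review comment.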