feat(azure): Add Azure Voice Live Realtime API support #4693
base: main
Conversation
@theomonnom @chenghao-mou @longcw could you help review?
I'm excited to see Azure Voice Live Realtime API support coming to LiveKit agents! I've been evaluating the Voice Live Realtime API for my app. Based on my use case requirements, it would be great if the following capabilities could be considered:
- BYOM (bring your own model) support: per Microsoft's BYOM documentation, the WebSocket URL supports a `profile` query parameter (`byom-azure-openai-realtime` for realtime models); a URL sketch follows this list.
- Switching output modalities mid-session: `modalities: ["text", "audio"]` ↔ `["text"]`
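
A hypothetical illustration of the `profile` query parameter on a Voice Live WebSocket URL (host, API version, and model are placeholders I'm assuming, not values taken from this PR):

```python
# Hypothetical Voice Live WebSocket URL using the BYOM profile parameter.
# Host, api-version, and model are placeholders; check Microsoft's BYOM
# documentation for the exact format.
VOICE_LIVE_BYOM_URL = (
    "wss://<your-resource>.cognitiveservices.azure.com/voice-live/realtime"
    "?api-version=2025-05-01-preview"
    "&model=gpt-4o"
    "&profile=byom-azure-openai-realtime"
)
```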
Thanks for your comments. This PR already allows users to specify different models. Here I create a Voice Live agent using the gpt-4o model:

```python
session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
        turn_detection=turn_detection,
        tool_choice="auto",  # Enable function calling
    )
)
```

See livekit-plugins/livekit-plugins-azure/README.md for more info.
The LiveKit agents framework provides built-in support for both voice and text input. For mixed I/O, you can configure your session to accept both voice and text input using RoomOptions. Here is an agent demo I ran successfully.
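
A minimal sketch of such a mixed-I/O setup, assuming the livekit-agents 1.x `RoomInputOptions`/`RoomOutputOptions` API (the entrypoint shape and option names are my assumptions, not code taken from this PR):

```python
import os

from livekit import agents
from livekit.agents import Agent, AgentSession, RoomInputOptions, RoomOutputOptions
from livekit.plugins import azure


async def entrypoint(ctx: agents.JobContext) -> None:
    await ctx.connect()

    session = AgentSession(
        llm=azure.realtime.RealtimeModel(
            endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
            api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        ),
    )
    await session.start(
        agent=Agent(instructions="You are a helpful assistant."),
        room=ctx.room,
        # Accept both spoken audio and typed text from the user.
        room_input_options=RoomInputOptions(audio_enabled=True, text_enabled=True),
        # Emit synthesized audio plus a text/transcription stream.
        room_output_options=RoomOutputOptions(audio_enabled=True, transcription_enabled=True),
    )
```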
Voice Live seems to have built-in noise reduction and echo cancellation (even when you don't explicitly set those options).
Thanks for the clarification! I tested it and confirmed that gpt-5.2 works without the profile parameter - Azure auto-detects the model type. Good to know!
I appreciate the pointer to RoomOptions for mixed I/O. My use case is specifically about dynamic switching mid-session: users start in voice mode, switch to text (typing), then back to voice within the same session.

The reason I was looking at updating modalities mid-session is cost. Azure Voice Live charges per token, and audio output tokens are significantly more expensive than text tokens. When a user switches to text mode, I'd ideally want to tell Azure to only generate text output (no audio), to avoid paying for audio tokens that won't be used.

In my testing (I might be wrong here), it seems like Azure Voice Live doesn't support this: session.update with modalities: ["text"] didn't seem to take effect, and the response still showed modalities: ['audio', 'text']. Have you seen different behavior? Would love to know if there's a supported way to switch to text-only output mid-session to save on audio token costs.
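
For reference, this is the shape of the attempt described above, assuming the raw Voice Live WebSocket protocol (which mirrors the OpenAI Realtime event format; `ws` is a placeholder for an already-open connection):

```python
import json


async def request_text_only_output(ws) -> None:
    # Ask the server to stop generating audio output mid-session.
    # In the test described above, the subsequent session.updated event
    # still reported modalities: ["audio", "text"].
    event = {"type": "session.update", "session": {"modalities": ["text"]}}
    await ws.send(json.dumps(event))
```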
Could this be created on top of the openai plugin? Like we did for the xai plugin? |
Missing language configuration causes multi-language misidentification

Another issue I encountered during integration testing: the Voice Live API incorrectly detects the user's language, responding in Japanese, German, French, or Chinese even when the user is speaking English only.

The root cause is that _configure_session doesn't set input_audio_transcription with a language parameter, so the API falls back to auto-detection, which is unreliable. Additionally, AzureStandardVoice is created without a locale, leaving the TTS language unconstrained.

Suggestion: expose language as an optional parameter on RealtimeModel.__init__ so users can configure it per their use case, whether constraining to a single language or supporting multilingual sessions.
Thanks for your suggestion. I've added an input_audio_transcription parameter that configures the language for both transcription and TTS. Please see the README for details.
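
A hypothetical sketch of passing this parameter (the exact field names and value shape are my assumptions, not taken from the PR; consult the README for the actual API):

```python
import os

from livekit.plugins import azure

# Hypothetical: the exact shape of input_audio_transcription is defined by
# the PR/README; a dict with a language hint is shown here for illustration.
llm = azure.realtime.RealtimeModel(
    endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
    api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
    model="gpt-4o",
    input_audio_transcription={"language": "en-US"},
)
```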
The current Azure plugin in livekit-plugins-azure implements Azure Voice Live, which cannot be built on top of the OpenAI plugin.
```python
        credential = DefaultAzureCredential()
    else:
        api_key = self._realtime_model._opts.api_key
        assert api_key is not None, "API key must be set when not using default credential"
        credential = AzureKeyCredential(api_key)
```
🔴 DefaultAzureCredential resource leak on every connection/reconnection
When use_default_credential=True, a new DefaultAzureCredential() is created as a local variable inside _run_connection() but is never closed via close(). The async DefaultAzureCredential from azure.identity.aio manages underlying HTTP sessions for token acquisition and must be explicitly closed.
Root Cause and Impact
At realtime_model.py:380, DefaultAzureCredential() is created each time _run_connection is called. Since _main_task (line 339) calls _run_connection in a retry loop, and notably reconnects without counting retries on server disconnect (lines 345-349), a new DefaultAzureCredential is created on every reconnection attempt. The old credential goes out of scope without being closed.
The connect() context manager at lines 387-391 takes the credential but does not own its lifecycle. Per Azure SDK conventions, the caller is responsible for closing credentials they create.
```python
async def _run_connection(self) -> None:
    if self._realtime_model._opts.use_default_credential:
        credential = DefaultAzureCredential()  # Created but never closed
    ...
    try:
        async with connect(..., credential=credential) as conn:
            ...
    finally:
        self._connection = None
        self._connection_ready.clear()
        # credential is leaked here
```

Impact: Each reconnection leaks HTTP client sessions held by DefaultAzureCredential. Over long-running sessions with periodic idle-timeout reconnects, this accumulates leaked resources (file descriptors, TCP connections to token endpoints).
Prompt for agents
In _run_connection (realtime_model.py), when use_default_credential is True, the DefaultAzureCredential must be properly closed after use. Wrap the credential in a try/finally or use it as an async context manager. For example, after the try/except/finally block at lines 386-414, add credential cleanup in the finally clause. Something like:

```python
finally:
    self._connection = None
    self._connection_ready.clear()
    if isinstance(credential, DefaultAzureCredential):
        await credential.close()
```

Alternatively, use `async with DefaultAzureCredential() as credential:` to ensure proper cleanup.
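
For concreteness, a minimal sketch of the context-manager variant, assuming azure.identity.aio and a hypothetical _connect_and_serve helper standing in for the existing connect(...) block:

```python
from azure.core.credentials import AzureKeyCredential
from azure.identity.aio import DefaultAzureCredential


async def _run_connection(self) -> None:
    if self._realtime_model._opts.use_default_credential:
        # The async context manager guarantees the credential's underlying
        # HTTP sessions are closed even if the connection loop raises.
        async with DefaultAzureCredential() as credential:
            await self._connect_and_serve(credential)  # hypothetical helper
    else:
        api_key = self._realtime_model._opts.api_key
        assert api_key is not None, "API key must be set when not using default credential"
        await self._connect_and_serve(AzureKeyCredential(api_key))
```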
Azure Voice Live does support generating text only, but there was a bug in the Azure realtime model plugin: it only handled RESPONSE_AUDIO_TRANSCRIPT_DELTA. When using modalities=["text"], the text is sent via RESPONSE_TEXT_DELTA, not RESPONSE_AUDIO_TRANSCRIPT_DELTA. I've fixed this bug.
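
A sketch of the dispatch fix described above, assuming the azure-ai-voicelive ServerEventType enum referenced in the comment (the function and callback names are illustrative, not copied from the diff):

```python
from azure.ai.voicelive.models import ServerEventType


def route_text_delta(event, emit_text) -> None:
    """Forward incremental response text regardless of output modality."""
    if event.type == ServerEventType.RESPONSE_AUDIO_TRANSCRIPT_DELTA:
        # Audio mode: text arrives as a transcript of the spoken response.
        emit_text(event.delta)
    elif event.type == ServerEventType.RESPONSE_TEXT_DELTA:
        # Text-only mode (modalities=["text"]): text arrives here instead,
        # so handling only the transcript event leaves the agent silent.
        emit_text(event.delta)
```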
Description
This PR adds support for Azure Voice Live, enabling real-time speech-to-speech conversations through the Azure plugin.
Fix #4716
What's New
Azure Voice Live Integration
- RealtimeModel class providing end-to-end speech-to-speech capabilities

Features
- Configurable voice (default: en-US-AvaMultilingualNeural)
- Optional per-turn audio saving (save_audio_per_turn=True)

Usage
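
A minimal usage example, mirroring the AgentSession configuration shown earlier in this thread (the environment variable names come from that example):

```python
import os

from livekit.agents import AgentSession
from livekit.plugins import azure

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
    )
)
```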
Environment Variables
- AZURE_VOICELIVE_ENDPOINT
- AZURE_VOICELIVE_API_KEY
- AZURE_VOICELIVE_MODEL (optional; defaults to gpt-4o)
- AZURE_VOICELIVE_VOICE (optional; defaults to en-US-AvaMultilingualNeural)
New Dependencies
- azure-ai-voicelive[aiohttp]>=1.0.0
- azure-identity>=1.15.0