
Conversation

@007DXR commented Feb 2, 2026

Description

This PR adds support for Azure Voice Live, enabling real-time speech-to-speech conversations through the Azure plugin.
Fix #4716

What's New

Azure Voice Live Integration

  • New RealtimeModel class providing end-to-end speech-to-speech capabilities
  • Full bidirectional audio streaming
  • Server-side VAD (Voice Activity Detection) with configurable thresholds
  • Automatic reconnection handling for connection resilience

Features

  • Speech-to-Speech: Direct audio input/output without separate STT/TTS pipeline
  • Function Calling: Built-in tool use for agentic workflows
  • Multilingual Support: Works with Azure's multilingual neural voices (e.g., en-US-AvaMultilingualNeural)
  • Interruption Handling: Graceful handling of user interruptions during responses
  • Metrics Collection: Token usage and TTFT (Time to First Token) tracking
  • Debug Mode: Optional audio saving per turn for debugging (save_audio_per_turn=True)

Usage

from livekit.agents import Agent, AgentSession
from livekit.plugins import azure

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        voice="en-US-AvaMultilingualNeural",
    )
)

await session.start(room=ctx.room, agent=Agent(instructions="You are helpful."))
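
For reference, a more fully configured model is sketched below, combining keyword arguments that appear in the examples later in this thread (endpoint and model selection, semantic VAD turn detection, function calling, and per-turn audio saving). Treat the exact parameter names and defaults as assumptions until the merged API is confirmed.

import os

from azure.ai.voicelive.models import AzureSemanticVadEn
from livekit.agents import AgentSession
from livekit.plugins import azure

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice="en-US-AvaMultilingualNeural",
        # Server-side semantic VAD with configurable thresholds
        turn_detection=AzureSemanticVadEn(threshold=0.5, silence_duration_ms=500),
        tool_choice="auto",  # enable function calling for agentic workflows
        save_audio_per_turn=True,  # debug mode: save audio per turn
    )
)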

Environment Variables

export AZURE_VOICELIVE_ENDPOINT=https://<region>.api.cognitive.microsoft.com/
export AZURE_VOICELIVE_API_KEY=<your-speech-key>

New Dependencies

  • azure-ai-voicelive[aiohttp]>=1.0.0
  • azure-identity>=1.15.0

@devin-ai-integration bot left a comment: Devin Review found 2 potential issues.

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@devin-ai-integration bot left a comment: Devin Review found 2 new potential issues.

@007DXR commented Feb 5, 2026

@theomonnom @chenghao-mou @longcw could you help review?

@luzhangtina commented

I'm excited to see Azure Voice Live Realtime API support coming to LiveKit agents! I've been evaluating Voice Live Realtime API for my app.

Based on my use case requirements, it would be great if the following capabilities could be considered:

  1. BYOM Profile Support
    My use case requires reliable function/tool calling with strict schema support, so I need to use models like gpt-5.1 or gpt-5.2 rather than the default gpt-realtime model. I chose Voice Live over the traditional STT→LLM→TTS pipeline due to its better latency performance, but I still need flexibility in model selection.

Per Microsoft's BYOM documentation, the WebSocket URL supports a profile query parameter:

byom-azure-openai-realtime - For realtime models
byom-azure-openai-chat-completion - For chat completion models (gpt-4o, gpt-5.1, gpt-5.2, etc.)
Would it be possible to expose this profile selection?

  2. Runtime Input Mode Switching
    My app allows users to switch between voice and text input mid-conversation. Some way to update these session properties at runtime would be great:

modalities: ["text", "audio"] ↔ ["text"]
turn_detection: enabled ↔ disabled

  3. Additional Session Configuration
    It would also be helpful to have options for:
  • Noise reduction: input_audio_noise_reduction
  • Echo cancellation: input_audio_echo_cancellation
  • TTS speed/rate control
  • Semantic VAD types (azure_semantic_vad_multilingual, etc.)

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@007DXR commented Feb 6, 2026

  1. BYOM Profile Support
    My use case requires reliable function/tool calling with strict schema support, so I need to use models like gpt-5.1 or gpt-5.2 rather than the default gpt-realtime model. I chose Voice Live over the traditional STT→LLM→TTS pipeline due to its better latency performance, but I still need flexibility in model selection.

Per Microsoft's BYOM documentation, the WebSocket URL supports a profile query parameter:

byom-azure-openai-realtime - For realtime models
byom-azure-openai-chat-completion - For chat completion models (gpt-4o, gpt-5.1, gpt-5.2, etc.)
Would it be possible to expose this profile selection?

Thanks for your comments. This PR already allows users to specify different models. Here I create a Voice Live agent using the gpt-4o model:

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
        turn_detection=turn_detection,
        tool_choice="auto",  # Enable function calling
    )
)

See livekit-plugins/livekit-plugins-azure/README.md for more info.

@007DXR commented Feb 6, 2026

2. Runtime Input Mode Switching
My app allows users to switch between voice and text input mid-conversation. Some way to update these session properties at runtime would be great:

modalities: ["text", "audio"] ↔ ["text"]
turn_detection: enabled ↔ disabled

The LiveKit Agents framework provides built-in support for both voice and text input. You can configure your session to accept both voice and text input using RoomOptions:

await session.start(
    agent=agent,
    room=ctx.room,
    room_options=room_io.RoomOptions(
        audio_input=room_io.AudioInputOptions(sample_rate=24000, num_channels=1),
        audio_output=room_io.AudioOutputOptions(sample_rate=24000, num_channels=1),
        text_input=True,  # Enable text input
        text_output=True,  # Enable text output
    ),
)

Here is an agent demo I ran successfully:

  1. Set up a LiveKit server
  2. Run this agent in dev mode
  3. Open https://meet.livekit.io/ and create a room. Your agent will join automatically.
"""
Voice and Text Realtime Agent

A multimodal agent that accepts both voice and text input, responding with voice and text output.
Uses Azure Voice Live API for low-latency speech-to-speech interactions.

Features:
- Voice input: Speak naturally, the agent listens and responds
- Text input: Type in the console or send via LiveKit TextStream to `lk.chat` topic
- Voice output: Agent responds with natural speech
- Text output: Agent responses printed to console and sent via `lk.transcription` topic
- Function calling: Weather lookup example

This is ideal for applications that need:
- Accessibility support (text fallback for hearing impaired)
- Quiet environments where voice isn't suitable
- Hybrid chat + voice interfaces

Environment Variables:
    AZURE_VOICELIVE_ENDPOINT - Azure Voice Live endpoint (wss://...)
    AZURE_VOICELIVE_API_KEY - Azure API key
    AZURE_VOICELIVE_MODEL - Azure model name (default: gpt-4o)
    AZURE_VOICELIVE_VOICE - Azure voice (default: en-US-AvaMultilingualNeural)
    LIVEKIT_URL - LiveKit server URL
    LIVEKIT_API_KEY - LiveKit API key
    LIVEKIT_API_SECRET - LiveKit API secret

Usage:
    python voice_and_text_realtime_agent.py dev
    Open https://meet.livekit.io/ and create a room. Your agent will join automatically.

Try:
 - Speak: "What's the weather like in Tokyo?"
 - send text via TextStream to `lk.chat` topic
"""

import logging
import os

from azure.ai.voicelive.models import AzureSemanticVadEn
from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    RunContext,
    cli,
    room_io,
)
from livekit.agents.llm import function_tool
from livekit.plugins import azure

logger = logging.getLogger("voice-text-realtime-agent")
logger.setLevel(logging.INFO)

load_dotenv()


class VoiceAndTextAgent(Agent):
    """An agent that handles both voice and text interactions."""

    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "You are a friendly, helpful assistant that can interact via voice and text. "
                "Keep your responses conversational and concise since users may be listening. "
                "Avoid using markdown, emojis, or special formatting in your responses. "
                "You can help with weather information and general questions."
            ),
        )

    async def on_enter(self):
        """Called when the agent becomes active in the session."""
        # Generate a greeting when the agent starts
        # allow_interruptions=False ensures the greeting completes
        self.session.generate_reply(
            instructions="Greet the user warmly and let them know they can speak or type to interact with you.",
            allow_interruptions=False,
        )

    @function_tool
    async def get_weather(
        self,
        context: RunContext,
        location: str,
    ) -> str:
        """Get the current weather for a location.

        Args:
            location: The city or location to get weather for (e.g., "Tokyo", "New York")
        """
        logger.info(f"Looking up weather for {location}")
        # In a real application, you would call a weather API here
        # This is a placeholder response
        return f"The weather in {location} is currently sunny with a temperature of 72°F (22°C)."


server = AgentServer()


@server.rtc_session()
async def entrypoint(ctx: JobContext):
    """Main entry point for the agent session."""
    logger.info(f"Starting voice and text realtime agent in room: {ctx.room.name}")

    # Configure Semantic VAD for intelligent turn detection
    turn_detection = AzureSemanticVadEn(
        threshold=0.5,  # Voice activity detection threshold (0.0-1.0)
        silence_duration_ms=500,  # Silence duration before turn ends
        prefix_padding_ms=300,  # Audio padding before speech
        speech_duration_ms=200,  # Minimum speech duration to trigger detection
        remove_filler_words=True,  # Remove filler words like "um", "uh"
    )

    # Create session with Azure Voice Live API
    session = AgentSession(
        llm=azure.realtime.RealtimeModel(
            endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
            api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
            model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
            voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
            turn_detection=turn_detection,  # Use semantic VAD
            tool_choice="auto",
        )
    )

    # Start the session with both voice and text I/O enabled
    await session.start(
        agent=VoiceAndTextAgent(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            # Voice I/O configuration
            audio_input=room_io.AudioInputOptions(
                sample_rate=24000,
                num_channels=1,
            ),
            audio_output=room_io.AudioOutputOptions(
                sample_rate=24000,
                num_channels=1,
            ),
            # Text I/O configuration
            # - Text input: Users can send text via TextStream to `lk.chat` topic
            # - Text output: Agent responses sent via TextStream to `lk.transcription` topic
            text_input=True,
            text_output=True,
        ),
    )


if __name__ == "__main__":
    cli.run_app(server)

@007DXR commented Feb 6, 2026

Additional Session Configuration
It would also be helpful to have options for:

  • Noise reduction: input_audio_noise_reduction
  • Echo cancellation: input_audio_echo_cancellation
  • TTS speed/rate control
  • Semantic VAD types (azure_semantic_vad_multilingual, etc.)

Voice Live has built-in noise reduction and echo cancellation (even when you don’t explicitly set options).
You can explicitly set Semantic VAD like this:

# Configure Semantic VAD for intelligent turn detection
# Options:
# - AzureSemanticVad: Default semantic VAD (multilingual)
# - AzureSemanticVadEn: English-only, optimized for English
# - AzureSemanticVadMultilingual: Explicit multilingual support
from azure.ai.voicelive.models import AzureSemanticVadEn

turn_detection = AzureSemanticVadEn(
    threshold=0.5,  # Voice activity detection threshold (0.0-1.0)
    silence_duration_ms=500,  # Silence duration before turn ends
    prefix_padding_ms=300,  # Audio padding before speech
    speech_duration_ms=200,  # Minimum speech duration to trigger detection
    remove_filler_words=True,  # Remove filler words like "um", "uh"
)

@luzhangtina commented

  1. BYOM Profile Support […]

Thanks for your comments. This PR already allows users to specify different models. Here I create a Voice Live agent using the gpt-4o model […]

Thanks for the clarification! I tested it and confirmed that gpt-5.2 works without the profile parameter - Azure auto-detects the model type. Good to know!

@luzhangtina commented

  2. Runtime Input Mode Switching […]

The LiveKit Agents framework provides built-in support for both voice and text input. You can configure your session to accept both using RoomOptions […] Here is an agent demo I ran successfully […]

I appreciate the pointer to RoomOptions for mixed I/O. My use case is specifically about dynamic switching mid-session — users start in voice mode, switch to text (typing), then back to voice within the same session.

The reason I was looking at updating modalities mid-session is cost — Azure Voice Live charges per token, and audio output tokens are significantly more expensive than text tokens. When a user switches to text mode, I'd ideally want to tell Azure to only generate text output (no audio), to avoid paying for audio tokens that won't be used.

In my testing (I might be wrong here), it seems like Azure Voice Live doesn't support this:

  • session.update with modalities: ["text"] didn't seem to take effect — the response still showed modalities: ['audio', 'text']
  • Initializing with modalities: ["text"] from the start resulted in no output at all
  • turn_detection couldn't be re-enabled after being set to None (turn_detection_type_change_not_allowed)
I ended up working around it by controlling audio at the agent level (session.input/output.set_audio_enabled()), which works functionally but Azure still generates audio tokens server-side — so no cost savings on the Azure side unfortunately.

Have you seen different behavior? Would love to know if there's a supported way to switch to text-only output mid-session to save on audio token costs.
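
For reference, the agent-level workaround described above is roughly the following sketch, using the session I/O toggles from livekit-agents (the set_audio_enabled calls mentioned above); it silences audio in the pipeline but, as noted, does not stop Azure from generating and billing audio tokens.

from livekit.agents import AgentSession


def switch_to_text_mode(session: AgentSession) -> None:
    # Stop forwarding microphone audio and stop playing agent audio.
    session.input.set_audio_enabled(False)
    session.output.set_audio_enabled(False)


def switch_to_voice_mode(session: AgentSession) -> None:
    # Re-enable audio I/O when the user switches back to voice.
    session.input.set_audio_enabled(True)
    session.output.set_audio_enabled(True)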

@theomonnom commented

Could this be created on top of the openai plugin? Like we did for the xai plugin?

@luzhangtina commented

Language configuration missing — causes multi-language misidentification

Another issue I encountered during integration testing: the Voice Live API incorrectly detects the user's language, responding in Japanese, German, French, or Chinese even when the user is speaking English only.

The root cause is that _configure_session doesn't set input_audio_transcription with a language parameter, so the API falls back to auto-detection which is unreliable. Additionally, AzureStandardVoice is created without a locale, leaving TTS language unconstrained.

Suggestion: Expose language as an optional parameter on RealtimeModel.__init__ so users can configure it per their use case, whether constraining to a single language or supporting multilingual sessions.

@CLAassistant commented Feb 8, 2026

CLA assistant check
All committers have signed the CLA.

@007DXR force-pushed the xinran/realtime_model branch from 7399c41 to 6ac5165 on February 8, 2026 16:26
@007DXR commented Feb 8, 2026

Language configuration missing — causes multi-language misidentification

Another issue I encountered during integration testing: the Voice Live API incorrectly detects the user's language, responding in Japanese, German, French, or Chinese even when the user is speaking English only.

The root cause is that _configure_session doesn't set input_audio_transcription with a language parameter, so the API falls back to auto-detection which is unreliable. Additionally, AzureStandardVoice is created without a locale, leaving TTS language unconstrained.

Suggestion: Expose language as an optional parameter on RealtimeModel.__init__ so users can configure it per their use case, whether constraining to a single language or supporting multilingual sessions.

Thanks for your suggestion. I've added an input_audio_transcription parameter that configures language for both transcription and TTS. Here is an example:

# Configure input audio transcription with language constraint
# This helps prevent language misidentification
input_audio_transcription = AudioInputTranscriptionOptions(
    model="whisper-1",
    language="en-US",  # Constrain to English for reliable detection
)

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
        input_audio_transcription=input_audio_transcription,
        turn_detection=turn_detection,
        tool_choice="auto",  # Enable function calling
    )
)

Please see README for details.

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@007DXR commented Feb 8, 2026

Could this be created on top of the openai plugin? Like we did for the xai plugin?

The Azure plugin in livekit-plugins-azure implements Azure Voice Live, which cannot be built on top of the OpenAI plugin. These are two different services:

  • Azure OpenAI Realtime - OpenAI's Realtime API hosted on Azure. This uses the same protocol as OpenAI and is already supported via livekit-plugins-openai.realtime.RealtimeModel.with_azure() (see the sketch after this list).

  • Azure Voice Live - Azure's native voice AI service (azure.ai.voicelive). This is a completely different service with its own SDK and protocol.
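
For completeness, the existing Azure OpenAI Realtime path (not this PR) is sketched below via the OpenAI plugin's with_azure() helper; the argument names here are assumptions based on that plugin and should be checked against its docs.

from livekit.agents import AgentSession
from livekit.plugins import openai

# Azure OpenAI Realtime, served through the OpenAI plugin rather than livekit-plugins-azure.
session = AgentSession(
    llm=openai.realtime.RealtimeModel.with_azure(
        azure_deployment="<realtime-deployment-name>",
        azure_endpoint="https://<resource>.openai.azure.com/",
        api_version="2024-10-01-preview",
        api_key="<azure-openai-api-key>",
    )
)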

@devin-ai-integration bot left a comment: Devin Review found 2 new potential issues.

Comment on lines +380 to +384
    credential = DefaultAzureCredential()
else:
    api_key = self._realtime_model._opts.api_key
    assert api_key is not None, "API key must be set when not using default credential"
    credential = AzureKeyCredential(api_key)

🔴 DefaultAzureCredential resource leak on every connection/reconnection

When use_default_credential=True, a new DefaultAzureCredential() is created as a local variable inside _run_connection() but is never closed via aclose(). The async DefaultAzureCredential from azure.identity.aio manages underlying HTTP sessions for token acquisition and must be explicitly closed.

Root Cause and Impact

At realtime_model.py:380, DefaultAzureCredential() is created each time _run_connection is called. Since _main_task (line 339) calls _run_connection in a retry loop — and notably reconnects without counting retries on server disconnect (line 345-349) — a new DefaultAzureCredential is created on every reconnection attempt. The old credential goes out of scope without being closed.

The connect() context manager at line 387-391 takes the credential but does not own its lifecycle. Per Azure SDK conventions, the caller is responsible for closing credentials they create.

async def _run_connection(self) -> None:
    if self._realtime_model._opts.use_default_credential:
        credential = DefaultAzureCredential()  # Created but never closed
    ...
    try:
        async with connect(..., credential=credential) as conn:
            ...
    finally:
        self._connection = None
        self._connection_ready.clear()
        # credential is leaked here

Impact: Each reconnection leaks HTTP client sessions held by DefaultAzureCredential. Over long-running sessions with periodic idle-timeout reconnects, this accumulates leaked resources (file descriptors, TCP connections to token endpoints).

Prompt for agents
In _run_connection (realtime_model.py), when use_default_credential is True, the DefaultAzureCredential must be properly closed after use. Wrap the credential in a try/finally or use it as an async context manager. For example, after the try/except/finally block at line 386-414, add credential cleanup in the finally clause. Something like:

finally:
    self._connection = None
    self._connection_ready.clear()
    if isinstance(credential, DefaultAzureCredential):
        await credential.close()

Alternatively, use 'async with DefaultAzureCredential() as credential:' to ensure proper cleanup.

@007DXR commented Feb 9, 2026

I ended up working around it by controlling audio at the agent level (session.input/output.set_audio_enabled()), which works functionally but Azure still generates audio tokens server-side — so no cost savings on the Azure side unfortunately.

Azure Voice Live supports generating text only, but there was a bug in the Azure realtime model plugin: it only handled RESPONSE_AUDIO_TRANSCRIPT_DELTA. When using modalities=["text"], the text is sent via RESPONSE_TEXT_DELTA, not RESPONSE_AUDIO_TRANSCRIPT_DELTA. I've fixed this bug.
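
For context, the fix amounts to routing both delta event types to the same text handler. A minimal, self-contained sketch of that idea follows; the string event names are assumptions mirroring the constants above, and emit_text is a hypothetical stand-in for the plugin's internal text-output path.

# Event names mirror the constants referenced above; in the real plugin they
# come from the azure-ai-voicelive SDK rather than local strings (assumption).
RESPONSE_AUDIO_TRANSCRIPT_DELTA = "response.audio_transcript.delta"
RESPONSE_TEXT_DELTA = "response.text.delta"


def route_text_delta(event_type: str, delta: str, emit_text) -> None:
    """Forward text from either delta event to a single handler (emit_text)."""
    if event_type in (RESPONSE_AUDIO_TRANSCRIPT_DELTA, RESPONSE_TEXT_DELTA):
        emit_text(delta)
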
I created a text-only agent demo, which ensures Azure Voice Live only generates text tokens, no audio tokens. You can try this agent in console mode: uv run .\text_only.py console --text

## This example demonstrates a text-only agent using direct text transport.
## Instead of using LiveKit's TextStream, this example shows how to:
## - Send text input via: `session.generate_reply(user_input="user's input text")`
## - Receive agent's response via `session.on("conversation_item_added", ev)`
## docs: https://docs.livekit.io/agents/build/events/#conversation_item_added

import asyncio
import logging
import os

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    cli,
)
from livekit.agents.voice.events import AgentStateChangedEvent
from livekit.plugins import azure

logger = logging.getLogger("voice-text-realtime-agent")
logger.setLevel(logging.INFO)

load_dotenv()

class MyAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a helpful assistant.",
        )

server = AgentServer()

@server.rtc_session()
async def entrypoint(ctx: JobContext):
    session = AgentSession(
        llm=azure.realtime.RealtimeModel(
            endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
            api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
            model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
            voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
            # turn_detection=turn_detection,  # Use semantic VAD
            tool_choice="auto",
            modalities=["text"]
        )
    )


    # Wait for session to be ready before sending text input
    session_ready = asyncio.Event()

    @session.on("agent_state_changed")
    def on_agent_state_changed(event: AgentStateChangedEvent):
        # Session is ready when agent transitions from "initializing" to another state
        if event.old_state == "initializing" and event.new_state in ("listening", "idle"):
            session_ready.set()
           

    # Disable audio and transcript I/O (for pure text mode)
    session.input.audio = None
    session.output.audio = None
    session.output.transcription = None

    await session.start(
        agent=MyAgent(),
    )

    # Wait for the session to be ready (agent is no longer "initializing")
    await session_ready.wait()

    # Example: Send text input directly to the agent
    # You can call this from anywhere (e.g., HTTP endpoint, websocket, etc.)
    await session.generate_reply(user_input="Hello, how can you help me today?")


if __name__ == "__main__":
    cli.run_app(server)

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.


Development

Successfully merging this pull request may close these issues:

livekit-plugins-azure doesn't support Azure Realtime API