
Conversation

@007DXR commented Feb 2, 2026

Description

This PR adds support for Azure Voice Live, enabling real-time speech-to-speech conversations through the Azure plugin.
Fix #4716

What's New

Azure Voice Live Integration

  • New RealtimeModel class providing end-to-end speech-to-speech capabilities
  • Full bidirectional audio streaming
  • Server-side VAD (Voice Activity Detection) with configurable thresholds
  • Automatic reconnection handling for connection resilience

Features

  • Speech-to-Speech: Direct audio input/output without separate STT/TTS pipeline
  • Function Calling: Built-in tool use for agentic workflows
  • Multilingual Support: Works with Azure's multilingual neural voices (e.g., en-US-AvaMultilingualNeural)
  • Interruption Handling: Graceful handling of user interruptions during responses
  • Metrics Collection: Token usage and TTFT (Time to First Token) tracking
  • Debug Mode: Optional audio saving per turn for debugging (save_audio_per_turn=True)

Usage

from livekit.agents import Agent, AgentSession
from livekit.plugins import azure

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        voice="en-US-AvaMultilingualNeural",
    )
)

await session.start(room=ctx.room, agent=Agent(instructions="You are helpful."))
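
For reference, a more fully configured model is sketched below, combining keyword arguments that appear in the examples later in this thread (endpoint and model selection, semantic VAD turn detection, function calling, and per-turn audio saving). Treat the exact parameter names and defaults as assumptions until the merged API is confirmed.

import os

from azure.ai.voicelive.models import AzureSemanticVadEn
from livekit.agents import AgentSession
from livekit.plugins import azure

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice="en-US-AvaMultilingualNeural",
        # Server-side semantic VAD with configurable thresholds
        turn_detection=AzureSemanticVadEn(threshold=0.5, silence_duration_ms=500),
        tool_choice="auto",  # enable function calling for agentic workflows
        save_audio_per_turn=True,  # debug mode: save audio per turn
    )
)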

Environment Variables

export AZURE_VOICELIVE_ENDPOINT=https://<region>.api.cognitive.microsoft.com/
export AZURE_VOICELIVE_API_KEY=<your-speech-key>

New Dependencies

  • azure-ai-voicelive[aiohttp]>=1.0.0
  • azure-identity>=1.15.0

@devin-ai-integration bot left a comment: Devin Review found 2 potential issues.

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@devin-ai-integration bot left a comment: Devin Review found 2 new potential issues.

@007DXR commented Feb 5, 2026

@theomonnom @chenghao-mou @longcw could you help review?

@luzhangtina commented

I'm excited to see Azure Voice Live Realtime API support coming to LiveKit agents! I've been evaluating Voice Live Realtime API for my app.

Based on my use case requirements, it would be great if the following capabilities could be considered:

  1. BYOM Profile Support
    My use case requires reliable function/tool calling with strict schema support, so I need to use models like gpt-5.1 or gpt-5.2 rather than the default gpt-realtime model. I chose Voice Live over the traditional STT→LLM→TTS pipeline due to its better latency performance, but I still need flexibility in model selection.

Per Microsoft's BYOM documentation, the WebSocket URL supports a profile query parameter:

byom-azure-openai-realtime - For realtime models
byom-azure-openai-chat-completion - For chat completion models (gpt-4o, gpt-5.1, gpt-5.2, etc.)
Would it be possible to expose this profile selection?

  2. Runtime Input Mode Switching
    My app allows users to switch between voice and text input mid-conversation. Some way to update these session properties at runtime would be great:

modalities: ["text", "audio"] ↔ ["text"]
turn_detection: enabled ↔ disabled

  3. Additional Session Configuration
    It would also be helpful to have options for:
  • Noise reduction: input_audio_noise_reduction
  • Echo cancellation: input_audio_echo_cancellation
  • TTS speed/rate control
  • Semantic VAD types (azure_semantic_vad_multilingual, etc.)

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@007DXR commented Feb 6, 2026

  1. BYOM Profile Support
    My use case requires reliable function/tool calling with strict schema support, so I need to use models like gpt-5.1 or gpt-5.2 rather than the default gpt-realtime model. I chose Voice Live over the traditional STT→LLM→TTS pipeline due to its better latency performance, but I still need flexibility in model selection.

Per Microsoft's BYOM documentation, the WebSocket URL supports a profile query parameter:

byom-azure-openai-realtime - For realtime models
byom-azure-openai-chat-completion - For chat completion models (gpt-4o, gpt-5.1, gpt-5.2, etc.)
Would it be possible to expose this profile selection?

Thanks for your comments. This PR already allows users to specify different models. Here I create a Voice Live agent using the gpt-4o model:

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
        turn_detection=turn_detection,
        tool_choice="auto",  # Enable function calling
    )
)

See livekit-plugins/livekit-plugins-azure/README.md for more info.

@007DXR commented Feb 6, 2026

2. Runtime Input Mode Switching
My app allows users to switch between voice and text input mid-conversation. Some way to update these session properties at runtime would be great:

modalities: ["text", "audio"] ↔ ["text"]
turn_detection: enabled ↔ disabled

The LiveKit Agents framework provides built-in support for both voice and text input. You can configure your session to accept both voice and text input using RoomOptions:

await session.start(
    agent=agent,
    room=ctx.room,
    room_options=room_io.RoomOptions(
        audio_input=room_io.AudioInputOptions(sample_rate=24000, num_channels=1),
        audio_output=room_io.AudioOutputOptions(sample_rate=24000, num_channels=1),
        text_input=True,  # Enable text input
        text_output=True,  # Enable text output
    ),
)

Here is an agent demo I ran successfully:

  1. Set up a LiveKit server
  2. Run this agent in dev mode
  3. Open https://meet.livekit.io/ and create a room. Your agent will join automatically.
"""
Voice and Text Realtime Agent

A multimodal agent that accepts both voice and text input, responding with voice and text output.
Uses Azure Voice Live API for low-latency speech-to-speech interactions.

Features:
- Voice input: Speak naturally, the agent listens and responds
- Text input: Type in the console or send via LiveKit TextStream to `lk.chat` topic
- Voice output: Agent responds with natural speech
- Text output: Agent responses printed to console and sent via `lk.transcription` topic
- Function calling: Weather lookup example

This is ideal for applications that need:
- Accessibility support (text fallback for hearing impaired)
- Quiet environments where voice isn't suitable
- Hybrid chat + voice interfaces

Environment Variables:
    AZURE_VOICELIVE_ENDPOINT - Azure Voice Live endpoint (wss://...)
    AZURE_VOICELIVE_API_KEY - Azure API key
    AZURE_VOICELIVE_MODEL - Azure model name (default: gpt-4o)
    AZURE_VOICELIVE_VOICE - Azure voice (default: en-US-AvaMultilingualNeural)
    LIVEKIT_URL - LiveKit server URL
    LIVEKIT_API_KEY - LiveKit API key
    LIVEKIT_API_SECRET - LiveKit API secret

Usage:
    python voice_and_text_realtime_agent.py dev
    Open https://meet.livekit.io/ and create a room. Your agent will join automatically.

Try:
 - Speak: "What's the weather like in Tokyo?"
 - send text via TextStream to `lk.chat` topic
"""

import logging
import os

from azure.ai.voicelive.models import AzureSemanticVadEn
from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    RunContext,
    cli,
    room_io,
)
from livekit.agents.llm import function_tool
from livekit.plugins import azure

logger = logging.getLogger("voice-text-realtime-agent")
logger.setLevel(logging.INFO)

load_dotenv()


class VoiceAndTextAgent(Agent):
    """An agent that handles both voice and text interactions."""

    def __init__(self) -> None:
        super().__init__(
            instructions=(
                "You are a friendly, helpful assistant that can interact via voice and text. "
                "Keep your responses conversational and concise since users may be listening. "
                "Avoid using markdown, emojis, or special formatting in your responses. "
                "You can help with weather information and general questions."
            ),
        )

    async def on_enter(self):
        """Called when the agent becomes active in the session."""
        # Generate a greeting when the agent starts
        # allow_interruptions=False ensures the greeting completes
        self.session.generate_reply(
            instructions="Greet the user warmly and let them know they can speak or type to interact with you.",
            allow_interruptions=False,
        )

    @function_tool
    async def get_weather(
        self,
        context: RunContext,
        location: str,
    ) -> str:
        """Get the current weather for a location.

        Args:
            location: The city or location to get weather for (e.g., "Tokyo", "New York")
        """
        logger.info(f"Looking up weather for {location}")
        # In a real application, you would call a weather API here
        # This is a placeholder response
        return f"The weather in {location} is currently sunny with a temperature of 72°F (22°C)."


server = AgentServer()


@server.rtc_session()
async def entrypoint(ctx: JobContext):
    """Main entry point for the agent session."""
    logger.info(f"Starting voice and text realtime agent in room: {ctx.room.name}")

    # Configure Semantic VAD for intelligent turn detection
    turn_detection = AzureSemanticVadEn(
        threshold=0.5,  # Voice activity detection threshold (0.0-1.0)
        silence_duration_ms=500,  # Silence duration before turn ends
        prefix_padding_ms=300,  # Audio padding before speech
        speech_duration_ms=200,  # Minimum speech duration to trigger detection
        remove_filler_words=True,  # Remove filler words like "um", "uh"
    )

    # Create session with Azure Voice Live API
    session = AgentSession(
        llm=azure.realtime.RealtimeModel(
            endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
            api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
            model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
            voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
            turn_detection=turn_detection,  # Use semantic VAD
            tool_choice="auto",
        )
    )

    # Start the session with both voice and text I/O enabled
    await session.start(
        agent=VoiceAndTextAgent(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            # Voice I/O configuration
            audio_input=room_io.AudioInputOptions(
                sample_rate=24000,
                num_channels=1,
            ),
            audio_output=room_io.AudioOutputOptions(
                sample_rate=24000,
                num_channels=1,
            ),
            # Text I/O configuration
            # - Text input: Users can send text via TextStream to `lk.chat` topic
            # - Text output: Agent responses sent via TextStream to `lk.transcription` topic
            text_input=True,
            text_output=True,
        ),
    )


if __name__ == "__main__":
    cli.run_app(server)

@007DXR commented Feb 6, 2026

Additional Session Configuration
It would also be helpful to have options for:

  • Noise reduction: input_audio_noise_reduction
  • Echo cancellation: input_audio_echo_cancellation
  • TTS speed/rate control
  • Semantic VAD types (azure_semantic_vad_multilingual, etc.)

Voice Live has built-in noise reduction and echo cancellation (even when you don’t explicitly set options).
You can explicitly set Semantic VAD like this:

# Configure Semantic VAD for intelligent turn detection
# Options:
# - AzureSemanticVad: Default semantic VAD (multilingual)
# - AzureSemanticVadEn: English-only, optimized for English
# - AzureSemanticVadMultilingual: Explicit multilingual support
from azure.ai.voicelive.models import AzureSemanticVadEn

turn_detection = AzureSemanticVadEn(
    threshold=0.5,  # Voice activity detection threshold (0.0-1.0)
    silence_duration_ms=500,  # Silence duration before turn ends
    prefix_padding_ms=300,  # Audio padding before speech
    speech_duration_ms=200,  # Minimum speech duration to trigger detection
    remove_filler_words=True,  # Remove filler words like "um", "uh"
)

@luzhangtina commented

  1. BYOM Profile Support […]

Thanks for your comments. This PR already allows users to specify different models. Here I create a Voice Live agent using the gpt-4o model […]

Thanks for the clarification! I tested it and confirmed that gpt-5.2 works without the profile parameter - Azure auto-detects the model type. Good to know!

@luzhangtina commented

  2. Runtime Input Mode Switching […]

The LiveKit Agents framework provides built-in support for both voice and text input. You can configure your session to accept both using RoomOptions […] Here is an agent demo I ran successfully […]

I appreciate the pointer to RoomOptions for mixed I/O. My use case is specifically about dynamic switching mid-session — users start in voice mode, switch to text (typing), then back to voice within the same session.

The reason I was looking at updating modalities mid-session is cost — Azure Voice Live charges per token, and audio output tokens are significantly more expensive than text tokens. When a user switches to text mode, I'd ideally want to tell Azure to only generate text output (no audio), to avoid paying for audio tokens that won't be used.

In my testing (I might be wrong here), it seems like Azure Voice Live doesn't support this:

  • session.update with modalities: ["text"] didn't seem to take effect — the response still showed modalities: ['audio', 'text']
  • Initializing with modalities: ["text"] from the start resulted in no output at all
  • turn_detection couldn't be re-enabled after being set to None (turn_detection_type_change_not_allowed)
I ended up working around it by controlling audio at the agent level (session.input/output.set_audio_enabled()), which works functionally but Azure still generates audio tokens server-side — so no cost savings on the Azure side unfortunately.

Have you seen different behavior? Would love to know if there's a supported way to switch to text-only output mid-session to save on audio token costs.
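
For reference, the agent-level workaround described above is roughly the following sketch, using the session I/O toggles from livekit-agents (the set_audio_enabled calls mentioned above); it silences audio in the pipeline but, as noted, does not stop Azure from generating and billing audio tokens.

from livekit.agents import AgentSession


def switch_to_text_mode(session: AgentSession) -> None:
    # Stop forwarding microphone audio and stop playing agent audio.
    session.input.set_audio_enabled(False)
    session.output.set_audio_enabled(False)


def switch_to_voice_mode(session: AgentSession) -> None:
    # Re-enable audio I/O when the user switches back to voice.
    session.input.set_audio_enabled(True)
    session.output.set_audio_enabled(True)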

@theomonnom commented

Could this be created on top of the openai plugin? Like we did for the xai plugin?

@luzhangtina commented

Language configuration missing — causes multi-language misidentification

Another issue I encountered during integration testing: the Voice Live API incorrectly detects the user's language, responding in Japanese, German, French, or Chinese even when the user is speaking English only.

The root cause is that _configure_session doesn't set input_audio_transcription with a language parameter, so the API falls back to auto-detection which is unreliable. Additionally, AzureStandardVoice is created without a locale, leaving TTS language unconstrained.

Suggestion: Expose language as an optional parameter on RealtimeModel.__init__ so users can configure it per their use case, whether constraining to a single language or supporting multilingual sessions.

@CLAassistant commented Feb 8, 2026

CLA assistant check
All committers have signed the CLA.

@007DXR force-pushed the xinran/realtime_model branch from 7399c41 to 6ac5165 on February 8, 2026 16:26
@007DXR commented Feb 8, 2026

Language configuration missing — causes multi-language misidentification

Another issue I encountered during integration testing: the Voice Live API incorrectly detects the user's language, responding in Japanese, German, French, or Chinese even when the user is speaking English only.

The root cause is that _configure_session doesn't set input_audio_transcription with a language parameter, so the API falls back to auto-detection which is unreliable. Additionally, AzureStandardVoice is created without a locale, leaving TTS language unconstrained.

Suggestion: Expose language as an optional parameter on RealtimeModel.__init__ so users can configure it per their use case, whether constraining to a single language or supporting multilingual sessions.

Thanks for your suggestion. I've added an input_audio_transcription parameter that configures language for both transcription and TTS. Here is an example:

# Configure input audio transcription with language constraint
# This helps prevent language misidentification
input_audio_transcription = AudioInputTranscriptionOptions(
    model="whisper-1",
    language="en-US",  # Constrain to English for reliable detection
)

session = AgentSession(
    llm=azure.realtime.RealtimeModel(
        endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
        api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
        model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
        voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
        input_audio_transcription=input_audio_transcription,
        turn_detection=turn_detection,
        tool_choice="auto",  # Enable function calling
    )
)

Please see README for details.

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.

@007DXR commented Feb 8, 2026

Could this be created on top of the openai plugin? Like we did for the xai plugin?

The Azure plugin in livekit-plugins-azure implements Azure Voice Live, which cannot be built on top of the OpenAI plugin. These are two different services:

  • Azure OpenAI Realtime - OpenAI's Realtime API hosted on Azure. This uses the same protocol as OpenAI and is already supported via livekit-plugins-openai.realtime.RealtimeModel.with_azure() (see the sketch after this list).

  • Azure Voice Live - Azure's native voice AI service (azure.ai.voicelive). This is a completely different service with its own SDK and protocol.
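
For completeness, the existing Azure OpenAI Realtime path (not this PR) is sketched below via the OpenAI plugin's with_azure() helper; the argument names here are assumptions based on that plugin and should be checked against its docs.

from livekit.agents import AgentSession
from livekit.plugins import openai

# Azure OpenAI Realtime, served through the OpenAI plugin rather than livekit-plugins-azure.
session = AgentSession(
    llm=openai.realtime.RealtimeModel.with_azure(
        azure_deployment="<realtime-deployment-name>",
        azure_endpoint="https://<resource>.openai.azure.com/",
        api_version="2024-10-01-preview",
        api_key="<azure-openai-api-key>",
    )
)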

@devin-ai-integration bot left a comment: Devin Review found 2 new potential issues.

Comment on lines +380 to +384
    credential = DefaultAzureCredential()
else:
    api_key = self._realtime_model._opts.api_key
    assert api_key is not None, "API key must be set when not using default credential"
    credential = AzureKeyCredential(api_key)

🔴 DefaultAzureCredential resource leak on every connection/reconnection

When use_default_credential=True, a new DefaultAzureCredential() is created as a local variable inside _run_connection() but is never closed via aclose(). The async DefaultAzureCredential from azure.identity.aio manages underlying HTTP sessions for token acquisition and must be explicitly closed.

Root Cause and Impact

At realtime_model.py:380, DefaultAzureCredential() is created each time _run_connection is called. Since _main_task (line 339) calls _run_connection in a retry loop — and notably reconnects without counting retries on server disconnect (line 345-349) — a new DefaultAzureCredential is created on every reconnection attempt. The old credential goes out of scope without being closed.

The connect() context manager at line 387-391 takes the credential but does not own its lifecycle. Per Azure SDK conventions, the caller is responsible for closing credentials they create.

async def _run_connection(self) -> None:
    if self._realtime_model._opts.use_default_credential:
        credential = DefaultAzureCredential()  # Created but never closed
    ...
    try:
        async with connect(..., credential=credential) as conn:
            ...
    finally:
        self._connection = None
        self._connection_ready.clear()
        # credential is leaked here

Impact: Each reconnection leaks HTTP client sessions held by DefaultAzureCredential. Over long-running sessions with periodic idle-timeout reconnects, this accumulates leaked resources (file descriptors, TCP connections to token endpoints).

Prompt for agents
In _run_connection (realtime_model.py), when use_default_credential is True, the DefaultAzureCredential must be properly closed after use. Wrap the credential in a try/finally or use it as an async context manager. For example, after the try/except/finally block at line 386-414, add credential cleanup in the finally clause. Something like:

finally:
    self._connection = None
    self._connection_ready.clear()
    if isinstance(credential, DefaultAzureCredential):
        await credential.close()

Alternatively, use 'async with DefaultAzureCredential() as credential:' to ensure proper cleanup.

@007DXR commented Feb 9, 2026

I ended up working around it by controlling audio at the agent level (session.input/output.set_audio_enabled()), which works functionally but Azure still generates audio tokens server-side — so no cost savings on the Azure side unfortunately.

Azure Voice Live supports generating text only, but there was a bug in the Azure realtime model plugin: it only handled RESPONSE_AUDIO_TRANSCRIPT_DELTA. When using modalities=["text"], the text is sent via RESPONSE_TEXT_DELTA, not RESPONSE_AUDIO_TRANSCRIPT_DELTA. I've fixed this bug.
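
For context, the fix amounts to routing both delta event types to the same text handler. A minimal, self-contained sketch of that idea follows; the string event names are assumptions mirroring the constants above, and emit_text is a hypothetical stand-in for the plugin's internal text-output path.

# Event names mirror the constants referenced above; in the real plugin they
# come from the azure-ai-voicelive SDK rather than local strings (assumption).
RESPONSE_AUDIO_TRANSCRIPT_DELTA = "response.audio_transcript.delta"
RESPONSE_TEXT_DELTA = "response.text.delta"


def route_text_delta(event_type: str, delta: str, emit_text) -> None:
    """Forward text from either delta event to a single handler (emit_text)."""
    if event_type in (RESPONSE_AUDIO_TRANSCRIPT_DELTA, RESPONSE_TEXT_DELTA):
        emit_text(delta)
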
I created a text-only agent demo, which ensures Azure Voice Live only generates text tokens, no audio tokens. You can try this agent in console mode: uv run .\text_only.py console --text

## This example demonstrates a text-only agent using direct text transport.
## Instead of using LiveKit's TextStream, this example shows how to:
## - Send text input via: `session.generate_reply(user_input="user's input text")`
## - Receive agent's response via `session.on("conversation_item_added", ev)`
## docs: https://docs.livekit.io/agents/build/events/#conversation_item_added

import asyncio
import logging
import os

from dotenv import load_dotenv

from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    cli,
)
from livekit.agents.voice.events import AgentStateChangedEvent
from livekit.plugins import azure

logger = logging.getLogger("voice-text-realtime-agent")
logger.setLevel(logging.INFO)

load_dotenv()

class MyAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You are a helpful assistant.",
        )

server = AgentServer()

@server.rtc_session()
async def entrypoint(ctx: JobContext):
    session = AgentSession(
        llm=azure.realtime.RealtimeModel(
            endpoint=os.getenv("AZURE_VOICELIVE_ENDPOINT"),
            api_key=os.getenv("AZURE_VOICELIVE_API_KEY"),
            model=os.getenv("AZURE_VOICELIVE_MODEL", "gpt-4o"),
            voice=os.getenv("AZURE_VOICELIVE_VOICE", "en-US-AvaMultilingualNeural"),
            # turn_detection=turn_detection,  # Use semantic VAD
            tool_choice="auto",
            modalities=["text"]
        )
    )


    # Wait for session to be ready before sending text input
    session_ready = asyncio.Event()

    @session.on("agent_state_changed")
    def on_agent_state_changed(event: AgentStateChangedEvent):
        # Session is ready when agent transitions from "initializing" to another state
        if event.old_state == "initializing" and event.new_state in ("listening", "idle"):
            session_ready.set()
           

    # Disable audio and transcript I/O (for pure text mode)
    session.input.audio = None
    session.output.audio = None
    session.output.transcription = None

    await session.start(
        agent=MyAgent(),
    )

    # Wait for the session to be ready (agent is no longer "initializing")
    await session_ready.wait()

    # Example: Send text input directly to the agent
    # You can call this from anywhere (e.g., HTTP endpoint, websocket, etc.)
    await session.generate_reply(user_input="Hello, how can you help me today?")


if __name__ == "__main__":
    cli.run_app(server)

@devin-ai-integration bot left a comment: Devin Review found 1 new potential issue.


Development

Successfully merging this pull request may close these issues:

livekit-plugins-azure doesn't support Azure Realtime API