# Gemini CLI: Interactive Mode (-i) and Task Mode (-t) Analysis
This document outlines the functionality of the interactive (-i) and task (-t) modes in the Gemini CLI application.
## Default Mode (Single Prompt)

When run with just a prompt (e.g., `gemini "explain rust lifetimes"`), the CLI performs a single interaction:

1. **Load History:** Optionally loads previous conversation history if `save_history` is enabled and a session ID is present (either from a previous run or the `GEMINI_SESSION_ID` environment variable). History is disabled if `--disable-history` is used. A new chat context is used if `--new-chat` is passed.
2. **Prepare Prompt:** Formats the user's prompt. If the `-c` (`--command-help`) flag is used, it prepends "Provide the Linux command for: " to the prompt.
3. **MCP Integration (Tools):**
   - Checks if an MCP (Multi-Capability Provider) host is running and configured.
   - If available, retrieves the capabilities (tools and resources) announced by the connected MCP servers (e.g., filesystem server, command server).
   - Translates these capabilities into `FunctionDeclaration` objects compatible with the Gemini API's `tools` parameter. This allows the model to request actions like reading files or executing commands.
4. **API Call:** Sends the formatted prompt, conversation history, system prompt, and available tool declarations to the Gemini API.
5. **Tool Execution Handling:**
   - If the API response includes a request to call a function (tool), the CLI identifies the target MCP server and tool name (e.g., `filesystem/read_file`).
   - It forwards the tool name and arguments to the `McpHost`, which routes the request to the appropriate server (running as a separate process or integrated).
   - Crucially, for potentially sensitive tools like `execute_command`, user confirmation is required by default. The user sees the requested command and must approve (yes), deny (no), or always allow (a) for the current session.
   - The result (or error) from the tool execution is captured.
6. **Follow-up API Call (If Tools Used):** If tools were executed, their results are sent back to the Gemini API in a follow-up request, allowing the model to generate a final response based on the tool output.
7. **Display Response:** Prints the final text response from the Gemini API to the terminal, formatting it nicely (e.g., rendering markdown).
8. **Save History:** Appends the user prompt and the final assistant response to the chat history file (if enabled).
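The capability translation in step 3 can be sketched as follows. This is an illustrative sketch, not the CLI's actual code: the function name, the dictionary shapes, and the sample server data are all hypothetical.

```python
# Hypothetical sketch: flattening MCP tool capabilities into
# Gemini-style function declarations (all names are illustrative).

def mcp_tools_to_declarations(servers):
    """Flatten per-server tool lists into API-ready declarations.

    Tool names are namespaced as "<server>/<tool>" so the CLI can
    route a later function call back to the right MCP server.
    """
    declarations = []
    for server_name, tools in servers.items():
        for tool in tools:
            declarations.append({
                "name": f"{server_name}/{tool['name']}",
                "description": tool.get("description", ""),
                # MCP servers announce a JSON Schema for arguments; a
                # compatible schema object is passed to the API's
                # tools parameter.
                "parameters": tool.get("input_schema", {"type": "object"}),
            })
    return declarations

# Illustrative capability data as two servers might announce it.
servers = {
    "filesystem": [{
        "name": "read_file",
        "description": "Read a file",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}}},
    }],
    "command": [{"name": "execute_command",
                 "description": "Run a shell command"}],
}
decls = mcp_tools_to_declarations(servers)
```

The namespacing is what makes step 5 possible: splitting a returned function name on `/` recovers the server to route to.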
## Interactive Mode (-i)

Invoked using `gemini -i`. This mode provides a persistent chat session.

1. **Load History:** Loads the history for the current session, similar to the default mode.
2. **REPL Loop:** Enters a Read-Eval-Print Loop:
   - **Read:** Uses the `rustyline` library to provide a user-friendly input prompt with history (up/down arrows) and editing capabilities.
   - **Eval:**
     - **Commands:** Checks if the input starts with `/` (e.g., `/quit`, `/new`, `/history`, `/clear`, `/save`, `/load`, `/help`). These commands manage the chat session itself.
     - **Prompt Processing:** If it's not a command, the input is treated as a user prompt. It then follows a similar flow to the Default Mode's steps 3-7 (MCP Integration, API Call, Tool Handling, Display Response) but within the loop, continuously updating the session's chat history in memory.
   - **Print:** Displays the assistant's response.
   - **Loop:** Continues until the user enters `/quit`.
3. **Save History:** Saves the complete session history to the file upon exiting (if enabled).

Key difference: Maintains context across multiple turns within the same execution, allowing for conversational follow-ups. Still leverages MCP tools.
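The Eval step's dispatch between session commands and prompts can be sketched as below. This is a hedged illustration of the described control flow, not the Rust implementation; the function and session shape are hypothetical.

```python
# Hypothetical sketch of the Eval step: slash-commands manage the
# session; anything else is treated as a prompt for the model.

def eval_input(line, session):
    line = line.strip()
    if line.startswith("/"):
        cmd = line.split()[0]
        if cmd == "/quit":
            return "exit"          # leave the REPL; history saved on exit
        elif cmd == "/clear":
            session["history"].clear()
            return "cleared"
        else:
            return f"unknown command {cmd}"
    # Not a command: record it and run the default-mode prompt
    # pipeline (MCP integration, API call, tool handling, display).
    session["history"].append({"role": "user", "content": line})
    return "prompt"

session = {"history": []}
```

Keeping command handling before prompt handling means a typo like `/quiit` is reported rather than silently sent to the model.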
## Task Mode (-t)

Invoked using `gemini -t "your task description"`. This mode attempts to let the AI autonomously complete a multi-step task.

1. **Load History:** Loads history for the session.
2. **Initialization:** The initial prompt sent to the model includes the user's task description and encourages the use of available tools.
3. **Autonomous Loop:** Enters a loop that runs for a maximum number of iterations (e.g., 10):
   - **API Call (Planning/Action):** Sends the current history (including previous steps and tool results) to the Gemini API. The prompt explicitly asks the model to determine the next step or action needed to progress on the task, heavily encouraging function/tool use.
   - **Tool Execution:**
     - If the model requests a tool call (e.g., `command/execute_command`), the CLI intercepts it.
     - **User Confirmation:** As in default mode, potentially dangerous actions like command execution require user confirmation (`y`/`n`/`a`). This is a critical safety/oversight mechanism.
     - The tool is executed via the `McpHost`.
   - **API Call (Observation):** Sends the tool execution results back to the model.
   - **Model Response:** The model processes the tool results and outputs its next thought, action, or a final result.
   - **Display:** Prints the model's response/action for the user to observe.
   - **Completion Check:** Checks if the model's output indicates task completion (e.g., contains "Task Completed").
   - **Loop/Exit:** Continues the loop or exits if the task is complete or the maximum iteration limit is reached.
4. **Save History:** Saves the full sequence of steps, tool calls, and results (if enabled).

Key difference: Designed for the AI to work iteratively towards a goal, using tools to interact with the environment, with user oversight for command execution.
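The autonomous loop's skeleton (bounded iterations plus a completion-marker check) can be sketched as follows. The function, marker constant, and stand-in model are illustrative assumptions, not the CLI's real code.

```python
# Hypothetical sketch of task mode's loop: bounded iterations, a
# completion-marker check, and a pluggable model call.

MAX_ITERATIONS = 10
COMPLETION_MARKER = "Task Completed"

def run_task(task, call_model, max_iterations=MAX_ITERATIONS):
    history = [{"role": "user",
                "content": f"Task: {task}. Determine the next step."}]
    for _ in range(max_iterations):
        reply = call_model(history)          # planning/action API call
        history.append({"role": "assistant", "content": reply})
        if COMPLETION_MARKER in reply:       # completion check
            return history
        # Observation/next-step prompt for the following iteration.
        history.append({"role": "user",
                        "content": "Continue with the next step."})
    return history                            # iteration limit reached

# Fake model that finishes on its third turn, for illustration only.
turns = iter(["Step 1: list files", "Step 2: edit file", "Task Completed"])
result = run_task("tidy the repo", lambda h: next(turns))
```

The substring check is the weak point this document's "Structured Signaling" enhancement later addresses: a model merely *mentioning* "Task Completed" would end the loop early.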
## Interactive Task Mode (-i -t)

When both the -i (interactive mode) and -t (task mode) flags are provided, the CLI enters a special "interactive task mode" that combines the benefits of both approaches. This mode allows the AI to work on a specific task while enabling the user to provide guidance, feedback, or course corrections throughout the process.

The implementation:

1. **Initialization:**
   - The task description is incorporated into the system prompt, instructing the AI to focus on completing that specific task.
   - A special initial message is added to the chat history, asking the AI to analyze the task and create a step-by-step plan.
   - The user is informed that they're entering interactive task mode and can guide the AI's work.
2. **Interactive Task Flow:**
   - The AI begins by analyzing the provided task and outlining a plan.
   - The user can provide guidance, answer questions, or redirect the AI at any point.
   - The AI can use available tools (like filesystem or command execution) to make progress on the task.
   - The standard interactive chat commands (like `/quit`, `/help`) remain available.
3. **Benefits:**
   - **User Oversight:** Provides more user control than pure task mode, allowing for course correction if the AI misunderstands or takes a wrong approach.
   - **Semi-Autonomous Operation:** The AI can still make task progress by utilizing tools, rather than requiring the user to explicitly handle each step.
   - **Collaborative Problem-Solving:** Creates a more balanced interaction where the AI handles routine implementation details while the user provides high-level direction.

Example usage: `gemini -i -t "Create a simple React component that displays a counter with increment and decrement buttons"`

The AI would start by outlining the steps to create the React component, then begin implementing it, potentially using filesystem tools to create the necessary files, while allowing the user to provide feedback or make adjustments throughout the process.
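The initialization step described above (task in the system prompt, planning request seeded into history) can be sketched like this. The function name, prompt wording, and message shapes are hypothetical illustrations of the described behavior.

```python
# Hypothetical sketch of interactive-task-mode initialization.

def init_interactive_task(task):
    # The task is baked into the system prompt so every turn stays
    # focused on it, even as the user steers the conversation.
    system_prompt = (
        "You are assisting in an interactive task session. "
        f"Focus on completing this task: {task}"
    )
    # A seeded first message asks for a plan before any action.
    history = [{
        "role": "user",
        "content": "Analyze the task and create a step-by-step plan "
                   "before acting.",
    }]
    return system_prompt, history

sp, hist = init_interactive_task(
    "Create a simple React component that displays a counter")
```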
## Auto-Continuation Enhancement

This enhancement to the standard interactive (-i) mode allows the AI model to continue the conversation proactively after its response, without requiring explicit user prompts at every turn.

Implemented functionality:

1. **Signal for User Input:**
   - The model is instructed via the system prompt to end its message with the exact phrase `AWAITING_USER_RESPONSE` when it specifically needs user input to proceed.
   - If this signal is not present, the system automatically continues the conversation with a "Continue." prompt.
   - The signal is removed from the displayed output to keep the conversation natural.
2. **Auto-Continuation:**
   - When the model doesn't signal for user input, it is immediately re-prompted to continue its line of reasoning or explanation.
   - A counter tracks consecutive model turns to prevent potential infinite loops, with a default limit of 5 consecutive turns.
   - When the limit is reached, the system automatically pauses and prompts the user for input.
3. **User Control:**
   - The user can interrupt the model's sequence at any time by entering a new prompt or command.
   - Traditional commands like `/quit` still function normally.
4. **Tool Integration:**
   - The model can still use tools (filesystem, command execution, etc.) during its autonomous turns.
   - After tool execution, the model can either continue autonomously or signal for user input.

Benefits:

- **Extended Reasoning:** Allows the model to develop complex ideas across multiple turns without interruption.
- **Multi-Step Explanations:** Enables the model to break down detailed explanations into logical segments.
- **Autonomous Problem Solving:** The model can work through a problem step-by-step, using both reasoning and tools.
- **Natural Flow:** Creates a more conversational experience where the model can elaborate when needed and ask for guidance when appropriate.

This enhancement is particularly useful for complex tasks like debugging, step-by-step tutorials, or detailed explanations where forcing the model to compress everything into a single response would be limiting.
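The decision logic above (signal detection, signal stripping, and the 5-turn cap) reduces to a small function. This is a sketch of the described behavior, not the actual implementation; the function name is hypothetical.

```python
# Hypothetical sketch of the auto-continuation decision: strip the
# signal from displayed text, and cap consecutive autonomous turns.

SIGNAL = "AWAITING_USER_RESPONSE"
MAX_CONSECUTIVE_TURNS = 5

def next_action(model_output, consecutive_turns):
    """Return (display_text, action), where action is 'wait' or 'continue'."""
    if SIGNAL in model_output:
        # Model explicitly asked for user input; hide the marker so
        # the displayed conversation reads naturally.
        return model_output.replace(SIGNAL, "").rstrip(), "wait"
    if consecutive_turns >= MAX_CONSECUTIVE_TURNS:
        # Safety valve against infinite auto-continue loops.
        return model_output, "wait"
    # No signal and under the limit: re-prompt with "Continue."
    return model_output, "continue"
```

A `wait` outcome hands the prompt back to the user; a `continue` outcome feeds the model a "Continue." turn and increments the counter.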
## Planned Enhancements

Based on analysis of the current implementation, the following enhancements are planned to improve the robustness, usability, and flexibility of Gemini CLI's interactive and task modes:

1. **Structured Signaling:**
   - Replace the text-based `AWAITING_USER_RESPONSE` marker with a more robust JSON-based control token mechanism
   - Allow multiple signal types (e.g., `NEED_INPUT`, `TASK_COMPLETE`, `PROGRESS_UPDATE`)
   - Implementation: Update system prompts and response processing to handle structured signals
2. **Progress Tracking:**
   - Enable the model to report task progress percentage
   - Add a structured format for reporting subtask completion
   - Implementation: Enhance the system prompt with progress reporting instructions
3. **Visual Indicators:**
   - Add a clear status line showing the current mode (interactive, task, combined)
   - Use differently colored prompts to distinguish auto-continue from waiting for input
   - Implementation: Add terminal coloring and status indicators in CLI output
4. **Command Accessibility:**
   - Allow special commands during auto-continue cycles
   - Add an `/interrupt` command to force the model to pause and await input
   - Implementation: Add a keyboard interrupt handler and command processing
5. **Configuration Options:**
   - Add CLI flags to customize behavior:
     - `--max-consecutive-turns=N` (default: 5)
     - `--auto-continue=(on|off)` (default: on)
     - `--progress-reporting=(on|off)` (default: on)
   - Implementation: Update the CLI args parser and pass values to the appropriate functions
6. **Tool Execution Refactoring:**
   - Extract duplicated tool execution logic into shared utility functions
   - Standardize error handling and recovery mechanisms
   - Implementation: Create `execute_tool_with_confirmation()` in the utils module
7. **History Management:**
   - Standardize history summarization across all modes
   - Implement improved token tracking and management
   - Implementation: Update the history module with consistent summarization logic
8. **Error Recovery:**
   - Add retry mechanisms for failed API calls
   - Implement graceful degradation for tool failures
   - Implementation: Add resilience patterns with exponential backoff
9. **Tool Confirmation Standardization:**
   - Ensure consistent security prompts across all modes
   - Add configurable security levels for different tool types
   - Implementation: Centralize confirmation logic in a dedicated module
10. **Execution Limits:**
    - Add timeouts for tool execution
    - Implement total execution time limits for task mode
    - Implementation: Add timeout parameters to tool execution functions
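To make the structured-signaling idea concrete, here is one possible shape for parsing a JSON control token from the end of a model message. The wire format (a trailing JSON line with a `signal` field) is an assumption for illustration; the actual mechanism may differ.

```python
# Hypothetical sketch of JSON-based control tokens replacing the bare
# AWAITING_USER_RESPONSE string. The trailing-JSON-line format is an
# illustrative assumption, not a specified protocol.
import json

KNOWN_SIGNALS = {"NEED_INPUT", "TASK_COMPLETE", "PROGRESS_UPDATE"}

def parse_control_signal(message):
    """Split a model message into (display_text, signal_dict_or_None).

    Expects an optional trailing line such as:
        {"signal": "PROGRESS_UPDATE", "progress": 40}
    """
    lines = message.rstrip().splitlines()
    if lines:
        try:
            candidate = json.loads(lines[-1])
            if isinstance(candidate, dict) and candidate.get("signal") in KNOWN_SIGNALS:
                return "\n".join(lines[:-1]).rstrip(), candidate
        except ValueError:
            pass  # last line was not JSON; treat whole message as text
    return message, None
```

Compared with substring matching, this avoids false positives (prose that merely mentions a marker) and lets one signal carry structured payloads such as a progress percentage.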
## Implementation Status

All planned enhancements have been implemented across the three phases:

1. **Phase 1 (Short-term):** ✓
   - Command accessibility improvements
   - Visual indicators for mode status
   - Tool execution refactoring
2. **Phase 2 (Medium-term):** ✓
   - Configuration options via CLI flags
   - Structured signaling mechanism
   - Standardized history management
3. **Phase 3 (Long-term):** ✓
   - Advanced progress tracking
   - Enhanced error recovery
   - Security and reliability features

These enhancements have significantly improved the Gemini CLI's robustness, user experience, and capabilities. The tool now provides a more intuitive interface, better error handling, and enhanced security while maintaining flexibility for various use cases.
## Future Directions

While the planned enhancements have been completed, potential future improvements could include:

1. **Plugin System:**
   - Develop an extensible plugin architecture for custom tools and capabilities
   - Create a standardized API for third-party integrations
   - Implement a package management system for plugins
2. **Advanced Context Management:**
   - **Storage & Scalability (LanceDB Integration):**
     - Add the `lancedb` crate as a dependency to `memory/Cargo.toml`.
     - Replace the JSON file storage in the `memory` crate with LanceDB (embedded mode).
     - Define a LanceDB table schema within `MemoryStore` mirroring the `Memory` struct, including a vector column.
     - Initialize the LanceDB connection (`lancedb::connect`) within `MemoryStore::load` (or a new constructor).
     - Refactor the `MemoryStore` CRUD methods (`add_memory`, `update_memory`, `get_by_key`, `get_by_tag`, `get_all_memories`, `delete_by_key`) to interact with the LanceDB table using the Rust SDK.
   - **Semantic Retrieval (LanceDB + E5 Integration):**
     - Implement embedding generation logic via a Python-based MCP server:
       - Create a new Python project for the MCP server (e.g., `mcp_embedding_server`).
       - Add dependencies: `lancedb`, `sentence-transformers`, `torch`, `pydantic` (the minimal set required for JSON-RPC handling over stdio).
       - Implement the MCP server logic using `stdio` transport (read JSON-RPC from stdin, write to stdout).
       - Expose an `embed` tool via the server.
       - The `embed` tool should accept text and a model variant (large/base/small) as input.
       - Load the selected `sentence-transformers` E5 model on server startup for consistent performance.
       - Ensure the server correctly prefixes inputs with `"query: "` or `"passage: "` before encoding.
       - Return the generated embedding vector.
     - Add configuration (CLI flag or config file) to select the E5 model variant (large/base/small) to be passed to the MCP server.
     - Call the `embed` tool on the MCP server via the `memory::broker` interface to generate embeddings for memory `value` fields before storing them in LanceDB.
     - Store the generated embeddings in the designated vector column of the LanceDB table during `add_memory` and `update_memory`.
     - Implement vector similarity search using `table.search(query_vector).limit(top_k).execute()` within `MemoryStore`.
     - Add a new retrieval method like `get_semantically_similar(query_text: &str, top_k: usize) -> Result<Vec<Memory>, Error>`:
       - This method should take raw query text.
       - Prefix the query text with `"query: "`.
       - Call the MCP server's `embed` tool to generate the embedding for the prefixed query using the selected E5 model.
       - Perform the LanceDB vector search using the generated query embedding.
       - Return the retrieved `Memory` objects.
   - **Enhanced Retrieval Strategies:**
     - Implement hybrid search combining semantic similarity and keyword/tag matching. (`search_memories`)
     - Add time-based filtering options (e.g., `get_recent(duration: Duration)`, `get_in_range(start: u64, end: u64)`).
     - Develop an interface for complex queries combining multiple filters (tags, time, semantic, keywords). (`search_memories`)
     - Consider allowing user configuration of retrieval strategy parameters (e.g., weighting). (Implemented via RRF hybrid search)
   - **Token Management & Summarization:**
     - Implement accurate token counting for memory content (`tiktoken_rs`).
     - Develop strategies for selecting/summarizing memories to fit context windows, prioritizing by relevance and recency. (Implemented as a configurable adapter layer in `memory::contextual` that allows callers to use retrieved data within token budget constraints)
     - Re-implement summarization logic, potentially as an MCP tool callable via the `memory::broker` interface. (Implemented via the `memory_summarization` MCP tool with configurable summarization strategies)
   - **Context Visualization & Navigation Support:**
     - Enhance the `Memory` struct with additional metadata (e.g., source session ID, related entities, confidence score).
     - Explore adding relationship tracking between memories. (Fully implemented via an expanded `related_keys` field with bidirectional relationship types)
     - Provide methods in `MemoryStore` to export data suitable for graph visualization or advanced analysis (e.g., `export_all_memories_json`).
     - Add a web-based visualization interface for exploring memory connections. (Implemented using D3.js with the `/memory visualize` command)
     - Implement memory graph navigation commands in the CLI. (Added a `/memory graph` command with path-finding between related memories)
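The RRF (Reciprocal Rank Fusion) hybrid search mentioned under retrieval strategies can be sketched in a few lines. This is an illustrative, language-agnostic sketch of the fusion step only (the semantic and keyword rankings themselves would come from LanceDB and tag matching); `k=60` is the conventional RRF constant.

```python
# Hypothetical sketch of Reciprocal Rank Fusion: fuse a semantic
# ranking and a keyword/tag ranking into one score per memory key.

def rrf_fuse(rankings, k=60):
    """rankings: list of ordered key lists (best first). Returns fused order."""
    scores = {}
    for ranking in rankings:
        for rank, key in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1) for the keys it ranks.
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["m3", "m1", "m2"]   # vector-similarity order (illustrative)
keyword  = ["m3", "m1", "m4"]   # tag/keyword-match order (illustrative)
fused = rrf_fuse([semantic, keyword])
```

RRF needs only rank positions, not comparable scores, which is why it is a common way to combine vector search with keyword search without tuning weights.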
3. **Multimodal Interactions:** (Detailed Plan)

   Assume Python-based MCP servers communicating via `stdio`.

   - **Image Modality (`mcp_image_server.py`)**
     - Goal: Handle image generation and analysis requests.
     - A. MCP Tool: `image_generation`
       - Sub-task 1: Define Tool Schema: Inputs (`prompt`, `output_path`, `return_format`, `model_preference`), Outputs (`image_data`, `format`, `message`).
       - Sub-task 2: Research & Select Backend: Prioritize Vertex AI (Imagen) via `google-cloud-aiplatform`; consider Stability AI (`stability-sdk` API or local `diffusers`).
       - Sub-task 3: Implement Server Logic: `stdio` loop, parse request, handle auth (env vars), call backend API, process image response (URL/base64/save), format MCP response.
       - Sub-task 4: Implement Error Handling: API errors, network issues, invalid params, file errors.
       - Sub-task 5: Configuration: Define & document required env vars (e.g., `GOOGLE_APPLICATION_CREDENTIALS`, `STABILITY_API_KEY`).
       - Sub-task 6: Testing: Unit & integration tests.
     - B. MCP Tool: `image_analysis`
       - Sub-task 1: Define Tool Schema: Inputs (`prompt`, `image_input`, `input_type`), Outputs (`analysis_text`, `message`).
       - Sub-task 2: Research & Select Backend: Prioritize Gemini Pro Vision (via `google-generativeai` or `google-cloud-aiplatform`).
       - Sub-task 3: Implement Server Logic: Extend the server, parse request, handle auth, load image data (path/URL/base64), call the Gemini Vision API, format MCP response.
       - Sub-task 4: Implement Error Handling: API errors, invalid image inputs, network issues.
       - Sub-task 5: Configuration: Define & document required env vars.
       - Sub-task 6: Testing: Unit & integration tests.
   - **Audio Modality (`mcp_audio_server.py`)**
     - Goal: Handle STT and TTS, focusing on low-latency options.
     - A. MCP Tool: `audio_transcribe` (STT)
       - Sub-task 1: Define Tool Schema: Inputs (`audio_input`, `input_type`, `language_code`, `real_time`), Outputs (`transcribed_text`, `message`).
       - Sub-task 2: Research & Select Backend: Prioritize Google Cloud Speech-to-Text (streaming via `google-cloud-speech`) for real-time use. Offer `faster-whisper` as an offline/local alternative.
       - Sub-task 3: Implement Server Logic: `stdio` loop, parse request, handle auth. Implement Cloud STT streaming logic OR call local `faster-whisper`. Format MCP response.
       - Sub-task 4: Implement Error Handling: API/file/streaming errors, invalid formats.
       - Sub-task 5: Configuration: Define env vars for cloud keys, and model paths for local use.
       - Sub-task 6: Testing: Unit & integration tests, with specific streaming tests.
     - B. MCP Tool: `audio_speak` (TTS)
       - Sub-task 1: Define Tool Schema: Inputs (`text_to_speak`, `output_path`, `return_format`, `language_code`, `voice_name`, `real_time`), Outputs (`audio_data`, `format`, `message`).
       - Sub-task 2: Research & Select Backend: Prioritize Google Cloud TTS (`google-cloud-texttospeech`) or ElevenLabs (`elevenlabs`) for quality/streaming. Offer Piper TTS (local, via executable/bindings) as a fast local alternative.
       - Sub-task 3: Implement Server Logic: `stdio` loop, parse request, handle auth. Call the selected backend, handling streaming output if applicable. Format MCP response.
       - Sub-task 4: Implement Error Handling: API errors, invalid text, file saving errors.
       - Sub-task 5: Configuration: Define env vars for cloud keys, and model paths for Piper.
       - Sub-task 6: Testing: Unit & integration tests, streaming playback tests.
   - **Document Modality (`mcp_document_server.py`)**
     - Goal: Handle parsing, extraction, and potentially summarization/querying.
     - A. MCP Tool: `document_extract`
       - Sub-task 1: Define Tool Schema: Inputs (`document_input`, `input_type`, `output_format`, `extract_metadata`), Outputs (`extracted_content`, `metadata`, `format`, `message`).
       - Sub-task 2: Research & Select Backend: Prioritize `PyMuPDF` (PDF) and `python-docx` (DOCX). Consider `pypandoc` (requires the `pandoc` executable) for broader text extraction, and `unstructured-io` for advanced layout-aware parsing.
       - Sub-task 3: Implement Server Logic: `stdio` loop, parse request, determine file type, call the appropriate library, format output, format MCP response.
       - Sub-task 4: Implement Error Handling: File not found, unsupported formats, parsing errors.
       - Sub-task 5: Configuration: Library installation (potentially `pandoc` PATH setup).
       - Sub-task 6: Testing: Unit & integration tests with sample docs.
     - B. MCP Tools: `document_summarize` / `document_query` (Recommend Host Implementation)
       - Sub-task 1: Define Tool Schema: Summarize (Inputs: `document_input`, `input_type`, `summary_length`, `focus_topic`; Outputs: `summary`). Query (Inputs: `document_input`, `input_type`, `query`; Outputs: `answer`, `relevant_snippets`).
       - Sub-task 2: Design Approach: Recommend implementing these in the Host CLI, using the `document_extract` MCP tool. An MCP server implementation is complex, requiring calls back to the Host's LLM (via `Sampling` or a custom mechanism) and chunking logic.
       - Sub-tasks 3-6 (If MCP Server): Implement extraction, chunking, prompt formulation, host LLM requests, result processing, error handling, and testing (highly dependent on Host capabilities).
     - C. MCP Tool: `document_convert` (Optional/Advanced)
       - Sub-task 1: Define Tool Schema: Inputs (`document_input`, `input_type`, `target_format`, `output_path`), Outputs (`output_path`).
       - Sub-task 2: Research & Select Backend: `pypandoc` (requires the `pandoc` executable).
       - Sub-tasks 3-6: Implement server logic using `pypandoc`, error handling, configuration (ensure `pandoc` is in PATH), and testing.
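Every modality above shares the same "Sub-task 3" skeleton: a `stdio` loop that reads JSON-RPC requests, dispatches to a tool handler, and writes responses. A minimal sketch of that shape is below. Real MCP servers involve an initialization handshake and richer framing; this only illustrates line-delimited dispatch, and the `echo` tool is a stand-in.

```python
# Hypothetical minimal skeleton of the recurring "stdio loop" in these
# servers: line-delimited JSON-RPC in, JSON-RPC out. Not a complete
# MCP implementation (no handshake/capabilities exchange).
import json
import sys

TOOLS = {
    "echo": lambda args: {"text": args.get("text", "")},  # stand-in tool
}

def handle_request(raw_line):
    req = json.loads(raw_line)
    handler = TOOLS.get(req.get("method"))
    if handler is None:
        # Standard JSON-RPC "method not found" error.
        return {"jsonrpc": "2.0", "id": req.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": req.get("id"),
            "result": handler(req.get("params", {}))}

def serve(stdin=sys.stdin, stdout=sys.stdout):
    # One request per line; flush after each response so the host
    # process reading our stdout is never left waiting.
    for line in stdin:
        if line.strip():
            stdout.write(json.dumps(handle_request(line)) + "\n")
            stdout.flush()
```

Each modality server would populate `TOOLS` with its real handlers (`image_generation`, `audio_transcribe`, `document_extract`, and so on) and layer error handling on top.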
These future directions would further expand the utility and flexibility of the Gemini CLI, making it an even more powerful tool for AI-assisted productivity.