Conversation
New plugin that registers desktop automation tools (click, double_click, type_text, key_press, scroll, mouse_move, open_path) on any LLM via the standard FunctionRegistry. Backed by pyautogui and designed to work with Realtime models that receive screen-share frames. Made-with: Cursor
**Note: Reviews paused.** It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions:
📝 Walkthrough

Adds a Computer Use plugin and example: a configurable grid overlay on shared video frames, LLM tool registration for grid-targeted desktop actions, pyautogui-backed action implementations, tests, docs, and workspace/pyproject updates to include the plugin and example.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent as Agent
    participant LLM as LLM
    participant Grid as Grid
    participant Processor as GridOverlayProcessor
    participant Track as VideoTrack
    participant PyAutoGUI as PyAutoGUI
    Agent->>LLM: register computer-use tools (cols, rows)
    LLM->>Grid: instantiate/configure grid
    Agent->>Processor: process_video(track, shared_forwarder?)
    Track->>Processor: deliver raw video frame
    Processor->>Grid: draw_overlay(frame)
    Grid-->>Processor: annotated frame
    Processor->>Track: enqueue annotated frame
    Agent->>LLM: request action (e.g., click cell="C2")
    LLM-->>Agent: selected tool + params
    Agent->>PyAutoGUI: execute action at computed screen coords
    PyAutoGUI-->>Agent: result/confirmation
```
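The "computed screen coords" step at the end of this flow is simple arithmetic. Below is a minimal sketch of how a cell label plus sub-cell position could resolve to pixels; the names `cell_to_coords` and `SUB_CELL_OFFSETS` are illustrative, not the plugin's actual API.

```python
# Illustrative cell-reference math for a labeled grid (default 15x15).
# Column letters map to x, row numbers to y, with fractional sub-cell offsets.
SUB_CELL_OFFSETS = {
    "center": (0.5, 0.5),
    "top-left": (0.25, 0.25),
    "top-right": (0.75, 0.25),
    "bottom-left": (0.25, 0.75),
    "bottom-right": (0.75, 0.75),
}

def cell_to_coords(cell: str, screen_w: int, screen_h: int,
                   cols: int = 15, rows: int = 15,
                   position: str = "center") -> tuple[int, int]:
    """Convert a cell label like "C2" into absolute screen coordinates."""
    col = ord(cell[0].upper()) - ord("A")  # "C" -> column index 2
    row = int(cell[1:]) - 1                # "2" -> row index 1
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError(f"cell {cell!r} outside {cols}x{rows} grid")
    fx, fy = SUB_CELL_OFFSETS[position]
    # Each cell spans screen_w/cols pixels; the offset picks a point inside it.
    x = int((col + fx) * screen_w / cols)
    y = int((row + fy) * screen_h / rows)
    return x, y
```

Keeping this math in one place is also why sharing a single `Grid` between tools and overlay (the later `grid=` parameter) matters: both must agree on `cols`/`rows` or clicks land in the wrong cell.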
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
…istration
- Replace x/y virtual coordinates with cell-based targeting (e.g. "H8")
- Add configurable Grid class with cols/rows params (default 15x15)
- Add GridOverlayProcessor to draw labeled grid on screen share frames
- Add sub-cell positioning (top-left, center, bottom-right, etc.)
- Replace ComputerUseToolkit class with plain `register(llm)` function
- Add computer use example with Gemini Realtime + grid overlay
- Update README with new API and grid documentation

Made-with: Cursor
Force-pushed a91015c to 37048f7
Actionable comments posted: 8
🧹 Nitpick comments (1)
plugins/computer_use/tests/test_computer_use.py (1)
7-11: Test the public package surface, not `_actions` directly.

This couples the tests to a private module path and turns internal refactors into test breakage. If these helpers are meant to be supported, re-export them from `vision_agents.plugins.computer_use`; otherwise, exercise them through `register()`/`FunctionRegistry` instead.

As per coding guidelines, "Never import from private modules (`_foo`) outside of the package's own `__init__.py`. Use the public re-export instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/computer_use/tests/test_computer_use.py` around lines 7 - 11, Tests are importing private helpers (key_press, make_grid_actions, type_text) from vision_agents.plugins.computer_use._actions; replace those direct imports by using the package's public surface—either import the re-exported symbols from vision_agents.plugins.computer_use (if the package __init__ exposes key_press/make_grid_actions/type_text) or exercise the behavior via the public registration API (call register() / retrieve implementations through FunctionRegistry) so tests target the public API rather than the _actions module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/10_computer_use_example/instructions.md`:
- Line 5: The Critical Rule's tool list is missing the double_click tool; update
the sentence that currently enumerates click, mouse_move, type_text, key_press,
scroll, open_path to also include double_click and add a brief example (e.g.,
"If the user says 'double-click on X', call the `double_click` tool") so it
matches the plugin README and the implemented toolkit; modify the wording around
the rule and any example lines that reference clicking to include `double_click`
(refer to the sentence listing the tools and examples like "If the user says
'click on X'").
In `@examples/10_computer_use_example/README.md`:
- Around line 33-38: Update the fenced code block for the `.env` example so it
includes a language specifier (e.g., change the opening backticks to ```dotenv)
to ensure proper rendering and linting; locate the `.env` example code block in
README.md (the block showing GOOGLE_API_KEY, STREAM_API_KEY, STREAM_API_SECRET)
and add the language label to the opening fence.
- Around line 50-58: Update the "Available actions" table to match the plugin's
actual cell-based API: replace coordinate-based signatures with the real
function signatures such as click(cell, position, button), double_click(cell,
position, button), type_text(text) (or type_text(cell, text) if the plugin
targets a cell), key_press(keys) (unchanged), scroll(cell, direction, clicks,
position), mouse_move(cell, position), and open_path(path); ensure each table
row uses the exact function names and parameter order used by the plugin (e.g.,
click, double_click, type_text, key_press, scroll, mouse_move, open_path) so the
README aligns with the plugin README and instructions.md.
In `@plugins/computer_use/tests/test_computer_use.py`:
- Around line 7-11: The test module currently imports desktop automation helpers
(key_press, make_grid_actions, type_text) at collection time which triggers real
mouse/keyboard actions; change this by making the tests explicitly opt-in to a
real GUI session: remove those top-level imports and instead import key_press,
make_grid_actions and type_text inside the individual tests that need them (or
call pytest.importorskip for the plugin), and add a pytest marker or a pre-check
(e.g., `@pytest.mark.gui` or a fixture like require_display) that calls
pytest.skip when no GUI/display is available so tests only run when a real GUI
is explicitly requested.
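The opt-in guard the prompt describes can be sketched as follows. The helper names (`display_available`, `require_display`) are assumptions for illustration, not the project's existing fixtures:

```python
import os
import sys

import pytest

def display_available() -> bool:
    """Best-effort check for a usable GUI session (illustrative helper)."""
    if sys.platform.startswith("linux"):
        # Headless Linux CI typically has neither variable set.
        return bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY"))
    return True  # macOS/Windows interactive sessions normally have a display

@pytest.fixture
def require_display():
    """Skip GUI-dependent tests unless a real desktop session is present."""
    if not display_available():
        pytest.skip("no GUI display; run these tests in a real desktop session")
```

A test that needs the real helpers would then take `require_display` as an argument and import `key_press`/`type_text` inside its body, so collection alone never touches the mouse or keyboard.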
In `@plugins/computer_use/vision_agents/plugins/computer_use/_actions.py`:
- Around line 176-197: The open_path function should validate the provided path
before spawning the subprocess: ensure the path is absolute (use os.path.isabs
or pathlib.Path.is_absolute), verify it exists (os.path.exists or Path.exists)
and is a file or directory as expected (Path.is_file/Path.is_dir), and
optionally resolve symlinks (os.path.realpath or Path.resolve()) to normalize
it; if validation fails, return a clear error string (e.g., "Invalid path: must
be absolute and exist: {path}") instead of invoking the OS opener. Keep these
checks at the top of open_path, and only build/execute the cmd and
create_subprocess_exec after the path passes validation. Ensure you reference
open_path when making changes.
- Around line 153-160: The logger currently emits raw typed text in type_text
and full filesystem paths in other logger.debug calls, which can leak secrets;
create and use a small redaction helper (e.g., redact_text(value: str) and
redact_path(path: str)) that returns a short preview (first N chars) plus a
"<redacted>" marker or masks the middle, then replace direct uses of text/path
in logger.debug calls (e.g., the logger.debug in type_text and any
logger.debug/logger.error calls that print paths around lines 198-203) to log
redact_text(text) or redact_path(path) instead; apply the same helper
consistently across the module.
- Around line 16-17: The module currently sets pyautogui.FAILSAFE and
pyautogui.PAUSE at import time — remove those top-level assignments so importing
_actions no longer flips the process-wide failsafe; instead, set pyautogui.PAUSE
and (only if absolutely necessary) pyautogui.FAILSAFE inside an explicit runtime
initializer or at the start of the action functions that perform automation
(e.g., inside the functions in this module that call pyautogui), or expose a
configuration/init function (e.g., init_pyautogui) the plugin activation path
can call; ensure tests that import _actions do not change global failsafe state.
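One shape the suggested initializer could take is sketched below; `init_pyautogui` is an assumed name, and the plugin may prefer setting these inside each action function instead:

```python
def init_pyautogui(pause: float = 0.05, failsafe: bool = True) -> None:
    """Configure pyautogui explicitly at activation time, not on import.

    Importing the module that defines this function has no side effects;
    the pyautogui import happens only when the plugin is actually activated.
    """
    import pyautogui  # lazy: loading _actions alone must not flip global state

    pyautogui.PAUSE = pause        # delay between successive automation calls
    pyautogui.FAILSAFE = failsafe  # keep the move-to-corner abort enabled by default
```

The plugin's activation path (e.g. `register(llm)`) would call this once, and tests that merely import the module observe no change to process-wide pyautogui settings.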
In `@plugins/computer_use/vision_agents/plugins/computer_use/_processor.py`:
- Around line 44-48: The published QueuedVideoTrack returned by
publish_video_track currently can remain waiting after shutdown; update the
shutdown flow to explicitly terminate the track so downstream consumers stop
waiting: add a shutdown handler that calls a termination method on the track
(e.g., a stop/terminate/end_of_stream method on self._video_track) as part of
setting self._shutdown True, and ensure QueuedVideoTrack exposes and uses that
method to inject an end-of-stream frame/stop the async frame generator (so
close() alone is not relied on). Keep publish_video_track returning the same
track but ensure shutdown triggers the track's termination logic so consumers
receive EOF and stop waiting.
---
Nitpick comments:
In `@plugins/computer_use/tests/test_computer_use.py`:
- Around line 7-11: Tests are importing private helpers (key_press,
make_grid_actions, type_text) from vision_agents.plugins.computer_use._actions;
replace those direct imports by using the package's public surface—either import
the re-exported symbols from vision_agents.plugins.computer_use (if the package
__init__ exposes key_press/make_grid_actions/type_text) or exercise the behavior
via the public registration API (call register() / retrieve implementations
through FunctionRegistry) so tests target the public API rather than the
_actions module.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b84127f4-edf7-4595-896a-02ce62d8ef8f
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- examples/10_computer_use_example/.gitignore
- examples/10_computer_use_example/README.md
- examples/10_computer_use_example/computer_use_example.py
- examples/10_computer_use_example/instructions.md
- examples/10_computer_use_example/pyproject.toml
- plugins/computer_use/README.md
- plugins/computer_use/pyproject.toml
- plugins/computer_use/tests/__init__.py
- plugins/computer_use/tests/test_computer_use.py
- plugins/computer_use/vision_agents/plugins/computer_use/__init__.py
- plugins/computer_use/vision_agents/plugins/computer_use/_actions.py
- plugins/computer_use/vision_agents/plugins/computer_use/_grid.py
- plugins/computer_use/vision_agents/plugins/computer_use/_processor.py
- plugins/computer_use/vision_agents/plugins/computer_use/_toolkit.py
- pyproject.toml
plugins/computer_use/vision_agents/plugins/computer_use/_actions.py
```python
async def type_text(text: str) -> str:
    """Type a string of text into the currently focused element.

    Args:
        text: The text to type.
    """
    await _run_sync(pyautogui.write, text, interval=0.03)
    logger.debug("type_text(%r)", text[:80])
```
Redact raw text and path values from logs.
Line 160 logs the typed payload and Lines 200-203 log the full filesystem path. Both can contain secrets or PII, so normal tool use will leak sensitive data into debug/error logs.
Suggested fix

```diff
 async def type_text(text: str) -> str:
 @@
-    logger.debug("type_text(%r)", text[:80])
+    logger.debug("type_text(len=%d)", len(text))
 @@
-    logger.error("open_path(%r) failed: %s", path, err)
+    logger.error("open_path failed: %s", err)
 @@
-    logger.debug("open_path(%r)", path)
+    logger.debug("open_path() succeeded")
```

Also applies to: 198-203
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@plugins/computer_use/vision_agents/plugins/computer_use/_actions.py` around
lines 153 - 160, The logger currently emits raw typed text in type_text and full
filesystem paths in other logger.debug calls, which can leak secrets; create and
use a small redaction helper (e.g., redact_text(value: str) and
redact_path(path: str)) that returns a short preview (first N chars) plus a
"<redacted>" marker or masks the middle, then replace direct uses of text/path
in logger.debug calls (e.g., the logger.debug in type_text and any
logger.debug/logger.error calls that print paths around lines 198-203) to log
redact_text(text) or redact_path(path) instead; apply the same helper
consistently across the module.
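A minimal sketch of the redaction helpers the prompt asks for is below; the names, preview length, and output format are assumptions, not the plugin's existing API:

```python
import os.path

def redact_text(value: str, preview: int = 4) -> str:
    """Return a short preview plus a marker instead of the raw payload."""
    if len(value) <= preview:
        return "<redacted>"
    return f"{value[:preview]}…<redacted len={len(value)}>"

def redact_path(path: str) -> str:
    """Keep only the final component so full directory trees never hit the logs."""
    return f"…/{os.path.basename(path)}"
```

Logging calls would then use `logger.debug("type_text(%s)", redact_text(text))` and `logger.error("open_path(%s) failed: %s", redact_path(path), err)` so length and shape survive for debugging while the payload does not.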
```python
async def open_path(path: str) -> str:
    """Open a file or folder using the OS default handler.

    Args:
        path: Absolute path to the file or folder to open.
    """
    system = platform.system()
    if system == "Darwin":
        cmd = ["open", path]
    elif system == "Linux":
        cmd = ["xdg-open", path]
    elif system == "Windows":
        cmd = ["explorer", path]
    else:
        return f"Unsupported platform: {system}"

    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
    )
    _, stderr = await proc.communicate()
```
Validate path before invoking the OS opener.
The docstring and tool description say this action accepts an absolute path, but there is no check before the subprocess is spawned. A relative value will resolve against the agent's working directory, which makes the tool non-deterministic and easy for the model to mis-target.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@plugins/computer_use/vision_agents/plugins/computer_use/_actions.py` around
lines 176 - 197, The open_path function should validate the provided path before
spawning the subprocess: ensure the path is absolute (use os.path.isabs or
pathlib.Path.is_absolute), verify it exists (os.path.exists or Path.exists) and
is a file or directory as expected (Path.is_file/Path.is_dir), and optionally
resolve symlinks (os.path.realpath or Path.resolve()) to normalize it; if
validation fails, return a clear error string (e.g., "Invalid path: must be
absolute and exist: {path}") instead of invoking the OS opener. Keep these
checks at the top of open_path, and only build/execute the cmd and
create_subprocess_exec after the path passes validation. Ensure you reference
open_path when making changes.
```python
        self._video_track = QueuedVideoTrack()
        self._shutdown = False

    def publish_video_track(self) -> aiortc.VideoStreamTrack:
        return self._video_track
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate the file mentioned in the review
find . -type f -name "_processor.py" | grep -i computer_use
```

Repository: GetStream/Vision-Agents
Length of output: 138

🏁 Script executed:

```shell
# Get the full content of the processor file to understand the shutdown logic
cat -n plugins/computer_use/vision_agents/plugins/computer_use/_processor.py
```

Repository: GetStream/Vision-Agents
Length of output: 3837

🏁 Script executed:

```shell
# Search for the QueuedVideoTrack definition and usage
rg "QueuedVideoTrack" --type py -B 2 -A 5
```

Repository: GetStream/Vision-Agents
Length of output: 27213

🏁 Script executed:

```shell
# Check the QueuedVideoTrack implementation and stop() method
cat -n agents-core/vision_agents/core/utils/video_track.py | head -100
```

Repository: GetStream/Vision-Agents
Length of output: 3687

🏁 Script executed:

```shell
# Look for the recv() method in QueuedVideoTrack to understand behavior after stop
rg "async def recv" agents-core/vision_agents/core/utils/video_track.py -A 10
```

Repository: GetStream/Vision-Agents
Length of output: 981

🏁 Script executed:

```shell
# Check if any other similar processors implement track.stop() in their close methods
rg "async def close" --type py -A 3 | grep -A 3 "_video_track.stop"
```

Repository: GetStream/Vision-Agents
Length of output: 408

🏁 Script executed:

```shell
# Verify the line numbers mentioned in "Also applies to: 94-96"
sed -n '88,96p' plugins/computer_use/vision_agents/plugins/computer_use/_processor.py
```

Repository: GetStream/Vision-Agents
Length of output: 436
Stop the published track during shutdown.
Line 48 returns a long-lived QueuedVideoTrack, but close() only detaches the frame handler. After shutdown, downstream consumers can keep waiting on a track that will never produce another frame.
Suggested fix

```diff
 async def close(self) -> None:
     self._shutdown = True
     await self.stop_processing()
+    self._video_track.stop()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@plugins/computer_use/vision_agents/plugins/computer_use/_processor.py` around
lines 44 - 48, The published QueuedVideoTrack returned by publish_video_track
currently can remain waiting after shutdown; update the shutdown flow to
explicitly terminate the track so downstream consumers stop waiting: add a
shutdown handler that calls a termination method on the track (e.g., a
stop/terminate/end_of_stream method on self._video_track) as part of setting
self._shutdown True, and ensure QueuedVideoTrack exposes and uses that method to
inject an end-of-stream frame/stop the async frame generator (so close() alone
is not relied on). Keep publish_video_track returning the same track but ensure
shutdown triggers the track's termination logic so consumers receive EOF and
stop waiting.
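The sentinel/EOF pattern the prompt describes can be shown with a dependency-free sketch. The class and method names below are illustrative: the real `QueuedVideoTrack` lives in agents-core, and an aiortc track would raise `MediaStreamError` rather than `EOFError`:

```python
import asyncio

_EOF = object()  # sentinel injected into the queue on stop

class QueuedTrack:
    """Toy queue-backed track showing how stop() unblocks pending recv()."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._stopped = False

    async def recv(self):
        frame = await self._queue.get()
        if frame is _EOF:
            # aiortc tracks would raise MediaStreamError here instead.
            raise EOFError("track stopped")
        return frame

    def stop(self) -> None:
        if not self._stopped:
            self._stopped = True
            self._queue.put_nowait(_EOF)  # wake any consumer blocked on recv()
```

The key point is that `close()` setting a flag is not enough: a consumer already awaiting the queue never re-checks the flag, so termination must flow through the queue itself.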
Made-with: Cursor
… to instructions Made-with: Cursor
Actionable comments posted: 2
♻️ Duplicate comments (2)
examples/10_computer_use_example/README.md (1)
33-38: ⚠️ Potential issue | 🟡 Minor

Use the `dotenv` language identifier for the `.env` code block.

The code block at line 34 shows `.env` file content but uses `bash` as the language identifier. Since this is environment variable syntax (not bash commands), use `dotenv`, `env`, or `properties` for accurate syntax highlighting and linting compliance.

📝 Suggested fix

````diff
 3. Set up your `.env`:
-   ```bash
+   ```dotenv
    GOOGLE_API_KEY=your_google_key
    STREAM_API_KEY=your_stream_key
    STREAM_API_SECRET=your_stream_secret
    ```
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/10_computer_use_example/README.md` around lines 33 - 38, Update the fenced code block that currently uses "bash" for the .env snippet to use an environment-file language identifier (e.g., "dotenv" or "env") so syntax highlighting/linting is correct; locate the fenced block containing the three keys (GOOGLE_API_KEY, STREAM_API_KEY, STREAM_API_SECRET) and replace the opening backticks language tag from "bash" to "dotenv".

examples/10_computer_use_example/instructions.md (1)
5-5: ⚠️ Potential issue | 🟡 Minor

Add an explicit `double_click` example.

The tool list now includes `double_click`, but the behavioral examples still only teach `click` and `mouse_move`. Adding a "double-click on X" example here will make the prompt more likely to select the dedicated tool.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/10_computer_use_example/instructions.md` at line 5, Add an explicit double-click example sentence alongside the existing click/mouse_move examples so the prompt demonstrates calling the double_click tool; specifically, add a line such as "If the user says 'double-click on X', call the `double_click` tool (do not just describe it)" and ensure the surrounding guidance still enforces using tool functions like `click`, `double_click`, `mouse_move`, `type_text`, `key_press`, `scroll`, and `open_path` rather than describing actions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/10_computer_use_example/instructions.md`:
- Line 18: The example "Use keyboard shortcuts." currently uses macOS-only
examples (cmd+tab, cmd+space, Spotlight) which will be wrong on other OSes;
update the instruction around the "Use keyboard shortcuts." line to either (A)
explicitly scope the guidance to macOS by adding a note like "On macOS: use cmd
shortcuts (e.g. cmd+c, cmd+tab, cmd+space to open Spotlight)" or (B) make it
OS-neutral by referencing the `key_press` action and using generic modifier
placeholders (e.g. "use modifier+key (Ctrl/⌘/Alt) such as copy: Ctrl/⌘+C, app
switch: Alt+Tab") and remove Spotlight-specific references so the guidance
applies across platforms.
- Line 17: Update the guideline that currently advises "Prefer open_path for
files and folders" to specify that open_path should only be used with absolute
filesystem paths (reference the open_path tool name in the rule), and change the
agent behavior: when the user supplies only a name or relative identifier the
agent must either ask for the absolute path or use alternative tools/strategies
(e.g., file browser/navigation actions) rather than calling open_path with an
invalid argument; ensure the rule text makes clear the contract is absolute-path
based and instructs prompting for the absolute path when missing.
---
Duplicate comments:
In `@examples/10_computer_use_example/instructions.md`:
- Line 5: Add an explicit double-click example sentence alongside the existing
click/mouse_move examples so the prompt demonstrates calling the double_click
tool; specifically, add a line such as "If the user says 'double-click on X',
call the `double_click` tool (do not just describe it)" and ensure the
surrounding guidance still enforces using tool functions like `click`,
`double_click`, `mouse_move`, `type_text`, `key_press`, `scroll`, and
`open_path` rather than describing actions.
In `@examples/10_computer_use_example/README.md`:
- Around line 33-38: Update the fenced code block that currently uses "bash" for
the .env snippet to use an environment-file language identifier (e.g., "dotenv"
or "env") so syntax highlighting/linting is correct; locate the fenced block
containing the three keys (GOOGLE_API_KEY, STREAM_API_KEY, STREAM_API_SECRET)
and replace the opening backticks language tag from "bash" to "dotenv".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 08c0f006-90e5-4695-bf4d-28ff211a0407
📒 Files selected for processing (2)
examples/10_computer_use_example/README.mdexamples/10_computer_use_example/instructions.md
|
|
||
1. **Always use tools.** When asked to perform an action, call the tool immediately. Say briefly what you'll do, then call the tool.
2. **Use cell references.** Look at the grid labels on screen and pass the `cell` parameter (e.g. "C2") for coordinate-based tools.
3. **Prefer open_path for files and folders.** If the user asks to open something by name or path, use `open_path` instead of trying to find and double-click an icon.
Restrict open_path guidance to absolute paths.
This rule currently tells the model to use open_path even when the user only provides a file/folder name, but the tool contract is absolute-path based. That will push the agent toward invalid calls instead of asking for the path or navigating another way.
Suggested wording

```diff
-3. **Prefer open_path for files and folders.** If the user asks to open something by name or path, use `open_path` instead of trying to find and double-click an icon.
+3. **Prefer open_path for files and folders.** If the user provides an absolute file or folder path, use `open_path`. If they only provide a name, ask for the path or locate it through the UI instead of guessing.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/10_computer_use_example/instructions.md` at line 17, Update the
guideline that currently advises "Prefer open_path for files and folders" to
specify that open_path should only be used with absolute filesystem paths
(reference the open_path tool name in the rule), and change the agent behavior:
when the user supplies only a name or relative identifier the agent must either
ask for the absolute path or use alternative tools/strategies (e.g., file
browser/navigation actions) rather than calling open_path with an invalid
argument; ensure the rule text makes clear the contract is absolute-path based
and instructs prompting for the absolute path when missing.
1. **Always use tools.** When asked to perform an action, call the tool immediately. Say briefly what you'll do, then call the tool.
2. **Use cell references.** Look at the grid labels on screen and pass the `cell` parameter (e.g. "C2") for coordinate-based tools.
3. **Prefer open_path for files and folders.** If the user asks to open something by name or path, use `open_path` instead of trying to find and double-click an icon.
4. **Use keyboard shortcuts.** When possible, prefer `key_press` over clicking through menus (e.g. `cmd+c` to copy, `cmd+tab` to switch apps, `cmd+space` to open Spotlight).
Avoid macOS-only shortcut examples in a generic prompt.
cmd+tab, cmd+space, and Spotlight are macOS-specific, so this guidance will produce wrong actions on other desktops unless the example is explicitly macOS-only. Either scope the example to macOS up front or make the shortcut advice OS-neutral.
Suggested wording

```diff
-4. **Use keyboard shortcuts.** When possible, prefer `key_press` over clicking through menus (e.g. `cmd+c` to copy, `cmd+tab` to switch apps, `cmd+space` to open Spotlight).
+4. **Use keyboard shortcuts.** When possible, prefer `key_press` over clicking through menus, using shortcuts appropriate for the current OS (e.g. `cmd+c` on macOS or `ctrl+c` on Windows/Linux).
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/10_computer_use_example/instructions.md` at line 18, The example
"Use keyboard shortcuts." currently uses macOS-only examples (cmd+tab,
cmd+space, Spotlight) which will be wrong on other OSes; update the instruction
around the "Use keyboard shortcuts." line to either (A) explicitly scope the
guidance to macOS by adding a note like "On macOS: use cmd shortcuts (e.g.
cmd+c, cmd+tab, cmd+space to open Spotlight)" or (B) make it OS-neutral by
referencing the `key_press` action and using generic modifier placeholders (e.g.
"use modifier+key (Ctrl/⌘/Alt) such as copy: Ctrl/⌘+C, app switch: Alt+Tab") and
remove Spotlight-specific references so the guidance applies across platforms.
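If the instructions go the OS-neutral route, the prompt text can stay generic while the code picks the modifier. A tiny illustrative helper (the name `primary_modifier` and the pyautogui key strings are assumptions based on pyautogui's documented key names):

```python
import platform

def primary_modifier() -> str:
    """Pick the platform's primary shortcut modifier for pyautogui hotkeys.

    pyautogui uses "command" for the macOS ⌘ key and "ctrl" elsewhere.
    """
    return "command" if platform.system() == "Darwin" else "ctrl"
```

A copy shortcut then becomes `pyautogui.hotkey(primary_modifier(), "c")` regardless of the desktop the agent runs on.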
Both now accept an optional `grid=` parameter so tools and overlay share a single source of truth for grid dimensions. Made-with: Cursor
This reverts commit 9980adc.
Actionable comments posted: 1
🧹 Nitpick comments (1)
plugins/computer_use/tests/test_computer_use.py (1)
101-102: Prefer public API over private attribute access.

Line 101 reaches into `_functions` directly. Consider deriving registered names from the public `get_tool_schemas()` method for consistency with the rest of the test class.

♻️ Suggested refactor

```diff
-    registered = set(llm.function_registry._functions.keys())
+    schemas = llm.function_registry.get_tool_schemas()
+    registered = {s["name"] for s in schemas}
     assert registered == EXPECTED_TOOLS
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@plugins/computer_use/tests/test_computer_use.py` around lines 101 - 102, The test is accessing the private attribute _functions on llm.function_registry; instead, call the public get_tool_schemas() on function_registry, extract the tool names from its returned schemas (e.g., map schema["name"] or equivalent key) to build the registered set, and assert that equals EXPECTED_TOOLS; update the assertion that currently uses registered = set(llm.function_registry._functions.keys()) to derive registered from llm.function_registry.get_tool_schemas() so the test uses the public API (keep EXPECTED_TOOLS unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/computer_use/vision_agents/plugins/computer_use/_processor.py`:
- Around line 91-93: Replace the bare except in the draw_overlay call with a
targeted exception tuple: catch the likely errors raised by PIL and av (e.g.,
AttributeError, TypeError, ValueError, OSError, PIL.UnidentifiedImageError,
av.AVError) and bind the exception to a variable so you can log it via
logger.exception when falling back to annotated = frame; update imports to
reference PIL.UnidentifiedImageError and av.AVError so the except clause refers
to concrete exception classes instead of Exception.
---
Nitpick comments:
In `@plugins/computer_use/tests/test_computer_use.py`:
- Around line 101-102: The test is accessing the private attribute _functions on
llm.function_registry; instead, call the public get_tool_schemas() on
function_registry, extract the tool names from its returned schemas (e.g., map
schema["name"] or equivalent key) to build the registered set, and assert that
equals EXPECTED_TOOLS; update the assertion that currently uses registered =
set(llm.function_registry._functions.keys()) to derive registered from
llm.function_registry.get_tool_schemas() so the test uses the public API (keep
EXPECTED_TOOLS unchanged).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eeb7291a-42fc-4a8e-8dc4-10eab919f130
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- examples/10_computer_use_example/computer_use_example.py
- plugins/computer_use/README.md
- plugins/computer_use/tests/test_computer_use.py
- plugins/computer_use/vision_agents/plugins/computer_use/_processor.py
- plugins/computer_use/vision_agents/plugins/computer_use/_toolkit.py
🚧 Files skipped from review as they are similar to previous changes (1)
- plugins/computer_use/README.md
```python
        except Exception:
            logger.exception("draw_overlay failed, forwarding original frame")
            annotated = frame
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check what exceptions draw_overlay can raise
ast-grep --pattern $'def draw_overlay($_, $_):
  $$$
'
```

Repository: GetStream/Vision-Agents
Length of output: 49

🏁 Script executed:

```shell
# Search for draw_overlay method definition
rg "def draw_overlay" --type py -A 5
```

Repository: GetStream/Vision-Agents
Length of output: 707

🏁 Script executed:

```shell
# Examine the _processor.py file around the exception handler
fd "_processor.py" --type f -x cat -n {} \; | head -100
```

Repository: GetStream/Vision-Agents
Length of output: 4560

🏁 Script executed:

```shell
# Search for what grid object is and its draw_overlay method
rg "self\._grid\s*=" --type py -B 2 -A 2
```

Repository: GetStream/Vision-Agents
Length of output: 612

🏁 Script executed:

```shell
# Get the full draw_overlay method from _grid.py
fd "_grid.py" --type f | xargs cat -n | head -200
```

Repository: GetStream/Vision-Agents
Length of output: 5666

🏁 Script executed:

```shell
# Search for exceptions that could be raised in image processing operations
rg "raise|except" plugins/computer_use/vision_agents/plugins/computer_use/_grid.py
```

Repository: GetStream/Vision-Agents
Length of output: 439

🏁 Script executed:

```shell
# Look at the full context of the try-except in _processor.py around line 91
cat -n plugins/computer_use/vision_agents/plugins/computer_use/_processor.py | sed -n '85,98p'
```

Repository: GetStream/Vision-Agents
Length of output: 697
Catch specific exceptions instead of bare `Exception`.

The code violates the coding guideline that prohibits `except Exception`. Since `draw_overlay` relies on PIL and av library calls that can raise various exceptions (AttributeError, TypeError, ValueError, OSError, etc.), you should catch specific exception types rather than a catch-all.

Suggested fix

```diff
     try:
         annotated = self._grid.draw_overlay(frame)
-    except Exception:
+    except (AttributeError, TypeError, ValueError, OSError):
         logger.exception("draw_overlay failed, forwarding original frame")
         annotated = frame
```

Adjust the exception tuple based on what PIL and av actually raise for the operations in draw_overlay.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@plugins/computer_use/vision_agents/plugins/computer_use/_processor.py` around
lines 91 - 93, Replace the bare except in the draw_overlay call with a targeted
exception tuple: catch the likely errors raised by PIL and av (e.g.,
AttributeError, TypeError, ValueError, OSError, PIL.UnidentifiedImageError,
av.AVError) and bind the exception to a variable so you can log it via
logger.exception when falling back to annotated = frame; update imports to
reference PIL.UnidentifiedImageError and av.AVError so the except clause refers
to concrete exception classes instead of Exception.
- A `computer-use` plugin that lets any VLM control the user's desktop via screen share. Tools include `click`, `double_click`, `type_text`, `key_press`, `scroll`, `mouse_move`, and `open_path`, all backed by PyAutoGUI.
- A `GridOverlayProcessor` that draws a labeled grid (default 15x15, columns A-O, rows 1-15) on screen share frames, so the model can target UI elements by cell reference (e.g. `cell="H8"`) with optional sub-cell positioning (`position="top-right"`). Grid dimensions are customizable via `cols`/`rows` params.
- A `computer_use.register(llm)` function for tools and `computer_use.GridOverlayProcessor(fps=2)` for the processor.
- An example (`examples/10_computer_use_example/`) using Gemini Realtime with the grid overlay.

Summary by CodeRabbit
New Features
Documentation
Tests
Chores