feat: add open-operator — AI agent browser automation #614
Add an `opencli operate` command that uses an LLM-driven loop to autonomously control the browser and complete tasks described in natural language. Successful operations can be saved as reusable CLI skills via `--save-as`.

Architecture:
- Phase 1: CDP passthrough via chrome.debugger (allowlisted methods)
- Phase 2: DOM context builder (reuses existing dom-snapshot + coordinates)
- Phase 3: Agent loop (context → LLM → execute → observe → repeat)
- Phase 4: CLI integration (`opencli operate`/`op`)
- Phase 5: Skill sedimentation (trace → YAML pipeline)

New dependencies: zod, @anthropic-ai/sdk
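The Phase 3 loop can be sketched as follows. The `AgentBackend` and `AgentStep` interfaces here are illustrative stand-ins, not the actual types in `src/agent/`:

```typescript
// Hypothetical sketch of the agent loop: context → LLM → execute → observe → repeat.
interface AgentStep { action: string; args?: Record<string, unknown> }
interface AgentBackend {
  buildContext(): Promise<string>;              // Phase 2: DOM snapshot + coordinates
  decide(context: string): Promise<AgentStep>;  // LLM call
  execute(step: AgentStep): Promise<void>;      // CDP / JS execution
}

async function runAgentLoop(backend: AgentBackend, maxSteps = 30): Promise<boolean> {
  for (let i = 0; i < maxSteps; i++) {
    const context = await backend.buildContext(); // observe
    const step = await backend.decide(context);   // reason
    if (step.action === "done") return true;      // task complete
    await backend.execute(step);                  // act, then loop to observe again
  }
  return false; // step budget exhausted without a done action
}
```

The step cap is an assumption here; the point is that the loop re-observes the page after every action instead of trusting the LLM's model of the page state.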
- ActionExecutor now tries nativeClick/nativeType first, catches errors, and falls back to page.click/typeText (JS injection) automatically
- Add an empty-response guard in LLMClient for third-party API proxies
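The native-first fallback can be illustrated with a small wrapper; the two callbacks stand in for the real CDP-backed and JS-injected implementations:

```typescript
// Illustrative native-first click with automatic JS-injection fallback.
async function clickWithFallback(
  nativeClick: () => Promise<void>, // CDP Input.dispatchMouseEvent (isTrusted: true)
  jsClick: () => Promise<void>,     // injected page.click
): Promise<"native" | "js"> {
  try {
    await nativeClick();
    return "native";
  } catch {
    // Native input can fail (e.g. detached target, debugger conflicts);
    // fall back to JS injection rather than failing the step.
    await jsClick();
    return "js";
  }
}
```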
Replace YAML skill sedimentation with intelligent TypeScript adapter generation:
- Stage 1: Rich context capture — network interceptor captures all fetch/XHR responses with bodies, plus the agent thinking/memory log
- Stage 2: API discovery — scores captured requests by field overlap with extracted data, recommends the optimal strategy (PUBLIC/COOKIE/UI)
- Stage 3: LLM code generation — sends the full context (API responses, auth state, action trace, reference patterns) to generate production TS adapters
- Stage 4: Validation & self-repair — imports the generated adapter to verify syntax, feeds errors back to the LLM for auto-fix (2 retries)

The generated .ts adapters can discover and use APIs directly instead of replaying brittle UI actions, producing much more stable skills.
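The Stage 2 scoring idea can be sketched like this; the types and the substring-based matching are assumptions, not the actual api-discovery implementation:

```typescript
// Score each captured response by how many of the extracted data fields
// appear (as JSON keys) in its body; pick the best-overlapping candidate.
interface CapturedResponse { url: string; body: string }

function scoreByFieldOverlap(resp: CapturedResponse, fields: string[]): number {
  if (fields.length === 0) return 0;
  const hits = fields.filter((f) => resp.body.includes(`"${f}"`)).length;
  return hits / fields.length;
}

function bestApiCandidate(
  responses: CapturedResponse[],
  fields: string[],
): CapturedResponse | undefined {
  return responses
    .map((r) => ({ r, score: scoreByFieldOverlap(r, fields) }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)[0]?.r;
}
```

A response that already contains every extracted field is a strong hint that the skill can call that endpoint directly instead of replaying UI actions.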
Replace import-based validation (which fails due to path resolution) with static syntax checks that catch common LLM code-generation issues:
- page.evaluate() called with an arrow function instead of a string
- page.waitForSelector (doesn't exist on IPage)
- Missing .js in import paths (ESM requirement)
- Missing cli() call or registry import

Also add these constraints to the generation prompt so the LLM avoids them in the first place.
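Checks of this kind can run on the generated source text without importing the module. A minimal sketch covering three of the rules above (the rule wording and regexes are assumptions):

```typescript
// Static lint pass over generated adapter source; returns human-readable errors
// that can be fed back to the LLM for self-repair.
function lintGeneratedAdapter(src: string): string[] {
  const errors: string[] = [];
  if (/page\.evaluate\(\s*(\(|async)/.test(src))
    errors.push("page.evaluate() must receive a string, not a function");
  if (src.includes("page.waitForSelector"))
    errors.push("page.waitForSelector does not exist on IPage");
  // Relative ESM imports must carry an explicit .js extension.
  for (const m of src.matchAll(/from\s+['"](\.[^'"]+)['"]/g)) {
    if (!m[1].endsWith(".js")) errors.push(`relative import missing .js: ${m[1]}`);
  }
  return errors;
}
```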
Security:
- Sanitize response bodies before writing the trace to disk (redact tokens, passwords, API keys)
- CSRF/bearer tokens stored as boolean flags only, never actual values
- Path traversal protection on --save-as (alphanumeric/dash/underscore only)

Robustness:
- LLM response parsing: require a done action with code, no JSON.stringify fallback
- needsAuth: check auth-related cookie patterns, not all cookies
- Import path regex: fix contradictory && condition
- Call recordFinalSnapshot before trace finalization
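A minimal redaction pass of the sort described above might look like this; the key list and JSON-shaped pattern are assumptions, not the project's actual sanitizer:

```typescript
// Mask likely secret values in a response body before persisting it to the trace.
const SECRET_KEYS = /"(token|password|api[_-]?key|secret|authorization)"\s*:\s*"[^"]*"/gi;

function sanitizeBody(body: string): string {
  // Keep the key so the trace stays readable, but drop the value entirely.
  return body.replace(SECRET_KEYS, (_m, key: string) => `"${key}": "[REDACTED]"`);
}
```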
…mpt, actions

Closes all high- and medium-priority gaps vs Browser Use:

Planning System (#1):
- PlanItem state machine (pending/current/done/skipped)
- LLM can output a `plan` field to update/create plans
- Plan auto-advances on successful steps
- Replan nudge after 3 consecutive failures

Self-Evaluation (#3):
- New `evaluationPreviousGoal` field in AgentResponse
- Pre-done verification rules in the system prompt (5-step checklist)
- `success` field on DoneAction for explicit failure signaling

Action System (#4):
- New actions: select_dropdown, switch_tab, open_tab, close_tab, search_page
- Auto-detect <select> and redirect to select_dropdown
- Element scroll (scroll within a specific element by index)
- Wait capped at 10s

Loop Detection (#5):
- SHA-256 hashed sliding window (15 steps)
- 3 severity tiers: mild (4x), strong (7x), critical (10x)
- Page fingerprint stall detection (URL + element count + DOM hash)

System Prompt (#6):
- Expanded from 65 to ~170 lines with structured sections
- Action chaining rules (page-changing vs safe)
- Reasoning pattern guidance
- Examples for evaluation, memory, planning

LLM Timeout (#7):
- Configurable `llmTimeout` (default 60s)
- Promise-based timeout wrapper

Message Compaction (#8):
- Builds a structured summary of compacted messages
- Extracts URLs visited, goals achieved, past errors
- Maintains Anthropic API user/assistant alternation

AX Tree Enrichment (#9):
- Fetches accessibility role/name via CDP when available
- Enriches ElementInfo with axRole/axName
- Falls back to DOM attributes if CDP is unavailable

Sensitive Data Masking (#10):
- Configurable sensitivePatterns map
- Applied to all user messages before the LLM

Prompt Caching (#2):
- System prompt uses cache_control: ephemeral
- Last user message uses cache_control: ephemeral
- Token tracking includes cache_read and cache_creation

Screenshot Control (#11):
- Configurable maxScreenshotDim (default 1200px)
- Zero-size element filtering in DOM context
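The loop-detection scheme (#5) can be sketched as hashing each step and counting repeats inside the 15-step window, mapping counts to the three severity tiers; the function shapes are illustrative:

```typescript
import { createHash } from "node:crypto";

const WINDOW = 15; // sliding window of recent step hashes

// Hash a step's (action, target) pair so equal steps compare cheaply.
function stepHash(action: string, target: string): string {
  return createHash("sha256").update(`${action}:${target}`).digest("hex");
}

// Count how often the incoming step repeats within the window and map the
// count to the severity tiers: mild (4x), strong (7x), critical (10x).
function loopSeverity(
  history: string[],
  next: string,
): "none" | "mild" | "strong" | "critical" {
  const window = [...history.slice(-(WINDOW - 1)), next];
  const repeats = window.filter((h) => h === next).length;
  if (repeats >= 10) return "critical";
  if (repeats >= 7) return "strong";
  if (repeats >= 4) return "mild";
  return "none";
}
```

Escalating tiers let the agent first nudge itself (mild) before forcing a replan or abort (critical).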
…tion, timeout

#1 AX tree: remove dead CDP calls (DOM.getDocument + Accessibility.getFullAXTree were called but axLookup was never used). Replace with a single batched evaluate() that reads ARIA attributes for up to 100 elements in one call.

#2 Loop detection: detectLoop() now uses only previously recorded state (no domContext param). Fixes an off-by-one where the current step wasn't yet recorded.

#3 Message compaction: prevent consecutive user messages by merging the summary into the preceding user message when roles collide, and skipping duplicate roles at the tail boundary.

#4 JS injection: all evaluate() calls now use JSON.stringify for user-controlled values (element indices, option text, scroll amounts) instead of template interpolation.

#5 updatePlan: moved after the consecutiveErrors update so plan advancement uses the current step's error state, not the previous step's.

#6 LLM timeout: pass an AbortController signal to the Anthropic SDK so timed-out requests are actually cancelled instead of continuing in the background.
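Fix #4 is worth illustrating: serializing user-controlled values with `JSON.stringify` before splicing them into an `evaluate()` script closes the injection hole that raw template interpolation leaves open. The `buildSelectScript` helper below is a hypothetical stand-in for the real call sites:

```typescript
// Build a script for evaluate() with user-controlled values safely embedded.
function buildSelectScript(index: number, optionText: string): string {
  // JSON.stringify escapes quotes, backslashes and newlines, so option text
  // like `"); alert(1); ("` cannot break out of the string literal.
  return `(() => {
    const el = document.querySelectorAll("select")[${JSON.stringify(index)}];
    const want = ${JSON.stringify(optionText)};
    for (const opt of el.options) if (opt.text === want) el.value = opt.value;
  })()`;
}
```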
- Generated TS skills use '@jackwener/opencli/registry' instead of relative '../../registry.js' (fixes module resolution for user CLIs)
- Tell the LLM to use plain Error() instead of importing error classes
- AX tree: single batched evaluate(), removed dead CDP calls
- Loop detection: uses only previously recorded state (off-by-one fix)
- Message compaction: prevents consecutive user messages (API compliance)
- JS injection: JSON.stringify all user-controlled values in evaluate()
- updatePlan: moved after the consecutiveErrors update
- LLM timeout: AbortController signal passed to the Anthropic SDK
Critical:
- C1: Sanitize API response bodies before sending to the LLM prompt (was only sanitized at disk-write time, not in the generation prompt)
- C2: CSRF JS returns a boolean flag only, never extracts the actual value
- C3: Remove Runtime.evaluate from the CDP allowlist (use the 'exec' action)

Important:
- I1: Add generateRaw() to LLMClient for code generation — no longer forces AgentResponse JSON wrapping around TypeScript output
- I2: Preserve memory fields in the message compaction summary (was discarded, contradicting the "refer to your memory" prompt)
- I3: Re-install the network interceptor at the start of each step (fetch/XHR monkey-patches are destroyed on navigation)
- I4: Use nativeKeyPress for Control+a in the type action (consistent with nativeType for the actual text input)
- I5: Remove .sort() from the loop-detection hash — action order matters ([click,type] and [type,click] are different sequences)
- I6: Replace the greedy extractJson regex with a balanced-brace parser (prevents matching a stray } after the JSON object)
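The balanced-brace parser from I6 can be sketched as a single scan that tracks brace depth and string state, returning the first complete top-level object; a greedy `/\{[\s\S]*\}/` would instead match through any stray `}` after it:

```typescript
// Extract the first complete top-level JSON object from mixed LLM output.
function extractJson(text: string): string | null {
  const start = text.indexOf("{");
  if (start < 0) return null;
  let depth = 0, inString = false, escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (escaped) { escaped = false; continue; }   // skip char after a backslash
    if (ch === "\\") { escaped = true; continue; }
    if (ch === '"') inString = !inString;
    if (inString) continue;                       // braces inside strings don't count
    if (ch === "{") depth++;
    else if (ch === "}" && --depth === 0) return text.slice(start, i + 1);
  }
  return null; // unbalanced — no complete object found
}
```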
Summary

Add an `opencli operate` command — an AI agent that autonomously controls the browser to complete tasks described in natural language, then saves successful operations as reusable TypeScript CLI skills.

End-to-end verified: successfully tested on example.com (data extraction) and Twitter/X (DM listing with login state reuse).

Key capabilities

- `opencli operate "task"` — LLM-driven browser control loop (observe → reason → act → repeat)
- `opencli operate --save-as site/name` — saves successful operations as `.ts` adapters via LLM code generation
- `Input.dispatchMouseEvent`/`KeyEvent` for `isTrusted: true` events, with automatic JS injection fallback

Architecture

New files:
- `src/agent/agent-loop.ts`
- `src/agent/action-executor.ts`
- `src/agent/dom-context.ts`
- `src/agent/prompts.ts`
- `src/agent/llm-client.ts` (`ANTHROPIC_BASE_URL`)
- `src/agent/types.ts`
- `src/agent/trace-recorder.ts`
- `src/agent/api-discovery.ts`
- `src/agent/skill-saver.ts`

Security

- `--save-as` names validated against path traversal (`[a-zA-Z0-9_-]` only)

Dependencies

- `zod` — runtime validation of LLM structured output
- `@anthropic-ai/sdk` — Anthropic API client

Test plan

- `opencli operate --help` shows correct usage
- `opencli operate --url https://example.com "extract heading"` — completes in 1 step
- `opencli operate --url https://x.com "get my DMs"` — navigates, extracts 10 DMs with login state
- `opencli operate --save-as test/heading "extract heading"` — generates a `.ts` adapter
- `npm run build` succeeds (392 manifest entries)
- `opencli test heading`