
feat: add open-operator — AI agent browser automation #614

Open
jackwener wants to merge 9 commits into main from open-operator

Conversation

@jackwener
Owner

Summary

Add the `opencli operate` command — an AI agent that autonomously controls the browser to complete tasks described in natural language, then saves successful operations as reusable TypeScript CLI skills.

End-to-end verified: Successfully tested on example.com (data extraction) and Twitter/X (DM listing with login state reuse).

Key capabilities

  • opencli operate "task" — LLM-driven browser control loop (observe → reason → act → repeat)
  • opencli operate --save-as site/name — Saves successful operations as .ts adapters via LLM code generation
  • Native CDP Input.dispatchMouseEvent/dispatchKeyEvent for isTrusted: true events, with automatic JS-injection fallback
  • Rich trace capture — Network interceptor captures API responses for intelligent strategy selection
  • API discovery — Analyzes captured requests to find the "golden API", recommends optimal strategy (PUBLIC/COOKIE/UI)
  • Self-repair — Generated adapters are syntax-validated; errors are fed back to LLM for auto-fix

Architecture

CLI → Agent Loop → DOM Snapshot + LLM → Execute Actions → Observe → Repeat
                                                              ↓
                                              Rich Trace (actions + network + auth)
                                                              ↓
                                              API Discovery → LLM Code Gen → .ts Adapter
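The control loop above can be sketched in a few lines. This is an illustrative skeleton only — the interface names (`AgentAction`, `AgentStep`) and callback shapes are assumptions, not the PR's actual types from src/agent/types.ts:

```typescript
// Hypothetical shapes; the real definitions live in src/agent/types.ts.
interface AgentAction { type: string; done?: boolean; result?: unknown }
interface AgentStep { snapshot: string; action: AgentAction }

async function agentLoop(
  task: string,
  observe: () => Promise<string>,                                       // DOM snapshot
  decide: (task: string, history: AgentStep[]) => Promise<AgentAction>, // LLM call
  execute: (a: AgentAction) => Promise<void>,
  maxSteps = 20,
): Promise<AgentAction | null> {
  const history: AgentStep[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const snapshot = await observe();           // observe
    const action = await decide(task, history); // reason
    history.push({ snapshot, action });
    if (action.done) return action;             // task complete
    await execute(action);                      // act, then repeat
  }
  return null; // step budget exhausted without a done action
}
```

The step budget and early return on `done` are the two pieces that keep the loop bounded.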

New files

| File | Purpose |
| --- | --- |
| src/agent/agent-loop.ts | Core LLM-driven control loop |
| src/agent/action-executor.ts | Action dispatch with CDP→JS fallback |
| src/agent/dom-context.ts | DOM snapshot + element coordinate map |
| src/agent/prompts.ts | System prompt & step message builder |
| src/agent/llm-client.ts | Anthropic SDK wrapper (supports ANTHROPIC_BASE_URL) |
| src/agent/types.ts | Zod schemas for actions & responses |
| src/agent/trace-recorder.ts | Rich context capture (network + auth + thinking) |
| src/agent/api-discovery.ts | API scoring & strategy recommendation |
| src/agent/skill-saver.ts | LLM-powered TS adapter generation + validation |

Security

  • CDP passthrough uses a method allowlist (22 permitted methods)
  • Network response bodies are sanitized before writing to disk (tokens/passwords redacted)
  • Auth tokens stored as boolean flags only, never actual values
  • --save-as names validated against path traversal ([a-zA-Z0-9_-] only)
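Two of the checks above — skill-name validation and secret redaction — are easy to sketch. Function names and the exact key patterns here are illustrative, not the PR's implementation:

```typescript
// site/name, restricted to safe characters — rejects "../" and absolute paths.
const SKILL_NAME_RE = /^[a-zA-Z0-9_-]+\/[a-zA-Z0-9_-]+$/;

function validateSaveAs(name: string): boolean {
  return SKILL_NAME_RE.test(name);
}

// Illustrative key patterns; the real redaction list may differ.
const SECRET_KEY_RE = /(token|password|secret|api[_-]?key|authorization)/i;

function sanitize(obj: unknown): unknown {
  if (Array.isArray(obj)) return obj.map(sanitize);
  if (obj && typeof obj === "object") {
    return Object.fromEntries(
      Object.entries(obj as Record<string, unknown>).map(([k, v]) =>
        SECRET_KEY_RE.test(k) ? [k, "[REDACTED]"] : [k, sanitize(v)],
      ),
    );
  }
  return obj;
}
```

Redacting by key name (rather than value pattern) is conservative: it may over-redact, but never writes a recognizable credential to disk.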

Dependencies

  • zod — Runtime validation of LLM structured output
  • @anthropic-ai/sdk — Anthropic API client

Test plan

  • opencli operate --help shows correct usage
  • opencli operate --url https://example.com "extract heading" — completes in 1 step
  • opencli operate --url https://x.com "get my DMs" — navigates, extracts 10 DMs with login state
  • opencli operate --save-as test/heading "extract heading" — generates .ts adapter
  • TypeScript compilation passes (main project + extension)
  • npm run build succeeds (392 manifest entries)
  • Replay generated adapter: opencli test heading

Add `opencli operate` command that uses an LLM-driven loop to
autonomously control the browser and complete tasks described in
natural language. Successful operations can be saved as reusable
CLI skills via --save-as.

Architecture:
- Phase 1: CDP passthrough via chrome.debugger (allowlisted methods)
- Phase 2: DOM context builder (reuses existing dom-snapshot + coordinates)
- Phase 3: Agent loop (context → LLM → execute → observe → repeat)
- Phase 4: CLI integration (`opencli operate/op`)
- Phase 5: Skill sedimentation (trace → YAML pipeline)

New dependencies: zod, @anthropic-ai/sdk
- ActionExecutor now tries nativeClick/nativeType first, catches errors,
  and falls back to page.click/typeText (JS injection) automatically
- Add empty response guard in LLMClient for third-party API proxies

Replace YAML skill sedimentation with intelligent TypeScript adapter
generation:

- Stage 1: Rich context capture — network interceptor captures all
  fetch/XHR responses with bodies, plus agent thinking/memory log
- Stage 2: API discovery — scores captured requests by field overlap
  with extracted data, recommends optimal strategy (PUBLIC/COOKIE/UI)
- Stage 3: LLM code generation — sends full context (API responses,
  auth state, action trace, reference patterns) to generate production
  TS adapters
- Stage 4: Validation & self-repair — imports generated adapter to
  verify syntax, feeds errors back to LLM for auto-fix (2 retries)

The generated .ts adapters can discover and use APIs directly instead
of replaying brittle UI actions, producing much more stable skills.
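Stage 4's generate → validate → repair cycle can be sketched as a small retry loop. The function names and prompt wording are assumptions; only the shape (feed validation errors back to the LLM, bounded retries) comes from the description above:

```typescript
type Generate = (prompt: string) => Promise<string>;
// Returns null when the code passes, otherwise the error text to feed back.
type Validate = (code: string) => string | null;

async function generateWithRepair(
  prompt: string,
  generate: Generate,
  validate: Validate,
  retries = 2, // matches the "2 retries" above
): Promise<string> {
  let code = await generate(prompt);
  for (let i = 0; i < retries; i++) {
    const err = validate(code);
    if (!err) return code;
    // Feed the validation error back so the LLM can fix its own output.
    code = await generate(
      `${prompt}\n\nYour previous code failed validation:\n${err}\nFix it.`,
    );
  }
  const finalErr = validate(code);
  if (finalErr) throw new Error(`adapter still invalid after ${retries} repairs: ${finalErr}`);
  return code;
}
```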

Replace import-based validation (fails due to path resolution) with
static syntax checks that catch common LLM code generation issues:
- page.evaluate() with arrow function instead of string
- page.waitForSelector (doesn't exist on IPage)
- Missing .js in import paths (ESM requirement)
- Missing cli() call or registry import

Also add these constraints to the generation prompt so the LLM
avoids them in the first place.
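The four checks listed above lend themselves to simple pattern matching. These regexes are an illustrative sketch, not the PR's actual patterns:

```typescript
// Static checks for common LLM code-generation pitfalls; patterns are illustrative.
function staticCheck(code: string): string[] {
  const errors: string[] = [];
  // page.evaluate must receive a string, not an arrow function
  if (/page\.evaluate\(\s*(?:async\s*)?\(/.test(code))
    errors.push("page.evaluate() called with a function; pass a string");
  // this method does not exist on the project's IPage interface
  if (/page\.waitForSelector/.test(code))
    errors.push("page.waitForSelector does not exist on IPage");
  // ESM requires explicit .js extensions on relative imports
  for (const m of code.matchAll(/from\s+['"](\.[^'"]+)['"]/g))
    if (!m[1].endsWith(".js")) errors.push(`import "${m[1]}" missing .js extension`);
  // every adapter must register itself via cli()
  if (!/\bcli\(/.test(code)) errors.push("missing cli() registration call");
  return errors;
}
```

Each message doubles as the repair-prompt feedback, which is why the strings are phrased as instructions rather than bare error codes.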

Security:
- Sanitize response bodies before writing trace to disk (redact tokens,
  passwords, API keys)
- CSRF/bearer tokens stored as boolean flags only, never actual values
- Path traversal protection on --save-as (alphanumeric/dash/underscore only)

Robustness:
- LLM response parsing: require done action with code, no JSON.stringify fallback
- needsAuth: check auth-related cookie patterns, not all cookies
- Import path regex: fix contradictory && condition
- Call recordFinalSnapshot before trace finalization
…mpt, actions

Closes all high and medium priority gaps vs Browser Use:

Planning System (#1):
- PlanItem state machine (pending/current/done/skipped)
- LLM can output `plan` field to update/create plans
- Plan auto-advances on successful steps
- Replan nudge after 3 consecutive failures
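A minimal sketch of the PlanItem state machine and auto-advance described above — the field names and the exact advancement rule are assumptions:

```typescript
type PlanState = "pending" | "current" | "done" | "skipped";
interface PlanItem { text: string; state: PlanState }

// On a successful step: mark the current item done and promote the next pending one.
function advancePlan(plan: PlanItem[]): void {
  const i = plan.findIndex((p) => p.state === "current");
  if (i === -1) {
    // No current item yet: promote the first pending one.
    const first = plan.find((p) => p.state === "pending");
    if (first) first.state = "current";
    return;
  }
  plan[i].state = "done";
  const next = plan.slice(i + 1).find((p) => p.state === "pending");
  if (next) next.state = "current";
}
```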

Self-Evaluation (#3):
- New `evaluationPreviousGoal` field in AgentResponse
- Pre-done verification rules in system prompt (5-step checklist)
- `success` field on DoneAction for explicit failure signaling

Action System (#4):
- New actions: select_dropdown, switch_tab, open_tab, close_tab, search_page
- Auto-detect <select> and redirect to select_dropdown
- Element scroll (scroll within a specific element by index)
- Wait capped at 10s

Loop Detection (#5):
- SHA-256 hashed sliding window (15 steps)
- 3 severity tiers: mild (4x), strong (7x), critical (10x)
- Page fingerprint stall detection (URL + element count + DOM hash)
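The hashed sliding window and severity tiers can be sketched as follows; the class name and the exact string that gets hashed (here, a serialized action) are assumptions, while the window size and thresholds mirror the numbers above:

```typescript
import { createHash } from "node:crypto";

const WINDOW = 15; // sliding window of the last 15 steps

function hashAction(action: string): string {
  return createHash("sha256").update(action).digest("hex");
}

class LoopDetector {
  private window: string[] = [];

  // Record a step and report how often its action repeats within the window.
  record(action: string): "none" | "mild" | "strong" | "critical" {
    this.window.push(hashAction(action));
    if (this.window.length > WINDOW) this.window.shift();
    const last = this.window[this.window.length - 1];
    const repeats = this.window.filter((h) => h === last).length;
    if (repeats >= 10) return "critical";
    if (repeats >= 7) return "strong";
    if (repeats >= 4) return "mild";
    return "none";
  }
}
```

Hashing keeps the window cheap to store and compare even when serialized actions are large.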

System Prompt (#6):
- Expanded from 65 to ~170 lines with structured sections
- Action chaining rules (page-changing vs safe)
- Reasoning pattern guidance
- Examples for evaluation, memory, planning

LLM Timeout (#7):
- Configurable `llmTimeout` (default 60s)
- Promise-based timeout wrapper
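A sketch of the timeout wrapper, combined with the AbortController fix described later in this PR (#6 under the follow-up commit): the signal is handed to the request so a timed-out call is actually cancelled, not left running. The wrapper shape is an assumption; the Anthropic SDK does accept an abort signal on requests:

```typescript
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>, // e.g. wraps the Anthropic SDK call
  ms = 60_000,                              // default 60s, as above
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    // Because run() receives the signal, abort cancels the underlying
    // request instead of leaving it to continue in the background.
    return await run(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```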

Message Compaction (#8):
- Builds structured summary of compacted messages
- Extracts URLs visited, goals achieved, past errors
- Maintains Anthropic API user/assistant alternation

AX Tree Enrichment (#9):
- Fetches accessibility role/name via CDP when available
- Enriches ElementInfo with axRole/axName
- Falls back to DOM attributes if CDP unavailable

Sensitive Data Masking (#10):
- Configurable sensitivePatterns map
- Applied to all user messages before LLM

Prompt Caching (#2):
- System prompt uses cache_control: ephemeral
- Last user message uses cache_control: ephemeral
- Token tracking includes cache_read and cache_creation
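The request shape for the two cache breakpoints looks roughly like this (the model id and prompt text are placeholders; `cache_control: { type: "ephemeral" }` is the Anthropic API's actual marker for cacheable prefixes):

```typescript
const request = {
  model: "claude-sonnet-4-20250514", // placeholder model id
  max_tokens: 1024,
  system: [
    {
      type: "text" as const,
      text: "You are a browser-operating agent...", // long, stable system prompt
      cache_control: { type: "ephemeral" as const }, // breakpoint 1: system prompt
    },
  ],
  messages: [
    {
      role: "user" as const,
      content: [
        {
          type: "text" as const,
          text: "Current DOM snapshot: ...",
          cache_control: { type: "ephemeral" as const }, // breakpoint 2: latest user turn
        },
      ],
    },
  ],
};
```

Marking the last user message lets each step reuse the cached prefix of the whole conversation so far, which is where most of the token volume lives in an agent loop.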

Screenshot Control (#11):
- Configurable maxScreenshotDim (default 1200px)
- Zero-size element filtering in DOM context
…tion, timeout

#1 AX tree: remove dead CDP calls (DOM.getDocument + Accessibility.getFullAXTree
   were called but axLookup never used). Replace with single batched evaluate()
   that reads ARIA attributes for up to 100 elements in one call.

#2 Loop detection: detectLoop() now uses only previously recorded state (no
   domContext param). Fixes off-by-one where current step wasn't yet recorded.

#3 Message compaction: prevent consecutive user messages by merging summary
   into preceding user message if roles collide, and skipping duplicate roles
   at the tail boundary.

#4 JS injection: all evaluate() calls now use JSON.stringify for user-controlled
   values (element indices, option text, scroll amounts) instead of template
   interpolation.

#5 updatePlan: moved after consecutiveErrors update so plan advancement uses
   current step's error state, not the previous step's.

#6 LLM timeout: pass AbortController signal to Anthropic SDK so timed-out
   requests are actually cancelled instead of continuing in the background.

- Generated TS skills use '@jackwener/opencli/registry' instead of
  relative '../../registry.js' (fixes module resolution for user CLIs)
- Tell LLM to use plain Error() instead of importing error classes
- AX tree: single batched evaluate(), removed dead CDP calls
- Loop detection: uses only previously recorded state (off-by-one fix)
- Message compaction: prevents consecutive user messages (API compliance)
- JS injection: JSON.stringify all user-controlled values in evaluate()
- updatePlan: moved after consecutiveErrors update
- LLM timeout: AbortController signal passed to Anthropic SDK

Critical:
- C1: Sanitize API response bodies before sending to LLM prompt
  (was only sanitized at disk-write time, not in generation prompt)
- C2: CSRF JS returns boolean flag only, never extracts actual value
- C3: Remove Runtime.evaluate from CDP allowlist (use 'exec' action)

Important:
- I1: Add generateRaw() to LLMClient for code generation — no longer
  forces AgentResponse JSON wrapping around TypeScript output
- I2: Preserve memory fields in message compaction summary (was
  discarded, contradicting "refer to your memory" prompt)
- I3: Re-install network interceptor at start of each step (fetch/XHR
  monkey-patches are destroyed on navigation)
- I4: Use nativeKeyPress for Control+a in type action (consistent with
  nativeType for the actual text input)
- I5: Remove .sort() from loop detection hash — action order matters
  ([click,type] and [type,click] are different sequences)
- I6: Replace greedy extractJson regex with balanced-brace parser
  (prevents matching stray } after the JSON object)
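The balanced-brace parser from I6 can be sketched as a single scan that tracks brace depth and string state, stopping at the matching close brace instead of the last `}` in the text (which is where a greedy regex goes wrong). This is an illustrative implementation, not the PR's exact code:

```typescript
function extractJson(text: string): unknown | null {
  const start = text.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === "\\") i++;               // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        // Matching close brace found: parse exactly this span.
        try { return JSON.parse(text.slice(start, i + 1)); }
        catch { return null; }
      }
    }
  }
  return null; // unbalanced braces
}
```

Tracking string state matters because braces inside JSON string values must not affect the depth count.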
