
feat: add open-operator — AI agent browser automation #614

Open
jackwener wants to merge 9 commits into main from open-operator

Conversation

@jackwener
Owner

Summary

Add the `opencli operate` command — an AI agent that autonomously controls the browser to complete tasks described in natural language, then saves successful operations as reusable TypeScript CLI skills.

End-to-end verified: Successfully tested on example.com (data extraction) and Twitter/X (DM listing with login state reuse).

Key capabilities

  • opencli operate "task" — LLM-driven browser control loop (observe → reason → act → repeat)
  • opencli operate --save-as site/name — Saves successful operations as .ts adapters via LLM code generation
  • Native CDP Input.dispatchMouseEvent/dispatchKeyEvent for isTrusted: true events, with automatic JS-injection fallback
  • Rich trace capture — Network interceptor captures API responses for intelligent strategy selection
  • API discovery — Analyzes captured requests to find the "golden API", recommends optimal strategy (PUBLIC/COOKIE/UI)
  • Self-repair — Generated adapters are syntax-validated; errors are fed back to LLM for auto-fix

Architecture

CLI → Agent Loop → DOM Snapshot + LLM → Execute Actions → Observe → Repeat
                                                              ↓
                                              Rich Trace (actions + network + auth)
                                                              ↓
                                              API Discovery → LLM Code Gen → .ts Adapter
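The control loop above can be sketched in a few lines. This is an illustrative skeleton only — the interface names (`AgentAction`, `AgentStep`) and callback shapes are assumptions, not the PR's actual types from src/agent/types.ts:

```typescript
// Hypothetical shapes; the real definitions live in src/agent/types.ts.
interface AgentAction { type: string; done?: boolean; result?: unknown }
interface AgentStep { snapshot: string; action: AgentAction }

async function agentLoop(
  task: string,
  observe: () => Promise<string>,                                       // DOM snapshot
  decide: (task: string, history: AgentStep[]) => Promise<AgentAction>, // LLM call
  execute: (a: AgentAction) => Promise<void>,
  maxSteps = 20,
): Promise<AgentAction | null> {
  const history: AgentStep[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const snapshot = await observe();           // observe
    const action = await decide(task, history); // reason
    history.push({ snapshot, action });
    if (action.done) return action;             // task complete
    await execute(action);                      // act, then repeat
  }
  return null; // step budget exhausted without a done action
}
```

The step budget and early return on `done` are the two pieces that keep the loop bounded.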

New files

| File | Purpose |
| --- | --- |
| src/agent/agent-loop.ts | Core LLM-driven control loop |
| src/agent/action-executor.ts | Action dispatch with CDP→JS fallback |
| src/agent/dom-context.ts | DOM snapshot + element coordinate map |
| src/agent/prompts.ts | System prompt & step message builder |
| src/agent/llm-client.ts | Anthropic SDK wrapper (supports ANTHROPIC_BASE_URL) |
| src/agent/types.ts | Zod schemas for actions & responses |
| src/agent/trace-recorder.ts | Rich context capture (network + auth + thinking) |
| src/agent/api-discovery.ts | API scoring & strategy recommendation |
| src/agent/skill-saver.ts | LLM-powered TS adapter generation + validation |

Security

  • CDP passthrough uses a method allowlist (22 permitted methods)
  • Network response bodies are sanitized before writing to disk (tokens/passwords redacted)
  • Auth tokens stored as boolean flags only, never actual values
  • --save-as names validated against path traversal ([a-zA-Z0-9_-] only)
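Two of the checks above — skill-name validation and secret redaction — are easy to sketch. Function names and the exact key patterns here are illustrative, not the PR's implementation:

```typescript
// site/name, restricted to safe characters — rejects "../" and absolute paths.
const SKILL_NAME_RE = /^[a-zA-Z0-9_-]+\/[a-zA-Z0-9_-]+$/;

function validateSaveAs(name: string): boolean {
  return SKILL_NAME_RE.test(name);
}

// Illustrative key patterns; the real redaction list may differ.
const SECRET_KEY_RE = /(token|password|secret|api[_-]?key|authorization)/i;

function sanitize(obj: unknown): unknown {
  if (Array.isArray(obj)) return obj.map(sanitize);
  if (obj && typeof obj === "object") {
    return Object.fromEntries(
      Object.entries(obj as Record<string, unknown>).map(([k, v]) =>
        SECRET_KEY_RE.test(k) ? [k, "[REDACTED]"] : [k, sanitize(v)],
      ),
    );
  }
  return obj;
}
```

Redacting by key name (rather than value pattern) is conservative: it may over-redact, but never writes a recognizable credential to disk.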

Dependencies

  • zod — Runtime validation of LLM structured output
  • @anthropic-ai/sdk — Anthropic API client

Test plan

  • opencli operate --help shows correct usage
  • opencli operate --url https://example.com "extract heading" — completes in 1 step
  • opencli operate --url https://x.com "get my DMs" — navigates, extracts 10 DMs with login state
  • opencli operate --save-as test/heading "extract heading" — generates .ts adapter
  • TypeScript compilation passes (main project + extension)
  • npm run build succeeds (392 manifest entries)
  • Replay generated adapter: opencli test heading

Add `opencli operate` command that uses an LLM-driven loop to
autonomously control the browser and complete tasks described in
natural language. Successful operations can be saved as reusable
CLI skills via --save-as.

Architecture:
- Phase 1: CDP passthrough via chrome.debugger (allowlisted methods)
- Phase 2: DOM context builder (reuses existing dom-snapshot + coordinates)
- Phase 3: Agent loop (context → LLM → execute → observe → repeat)
- Phase 4: CLI integration (`opencli operate/op`)
- Phase 5: Skill sedimentation (trace → YAML pipeline)

New dependencies: zod, @anthropic-ai/sdk
- ActionExecutor now tries nativeClick/nativeType first, catches errors,
  and falls back to page.click/typeText (JS injection) automatically
- Add empty response guard in LLMClient for third-party API proxies

Replace YAML skill sedimentation with intelligent TypeScript adapter
generation:

- Stage 1: Rich context capture — network interceptor captures all
  fetch/XHR responses with bodies, plus agent thinking/memory log
- Stage 2: API discovery — scores captured requests by field overlap
  with extracted data, recommends optimal strategy (PUBLIC/COOKIE/UI)
- Stage 3: LLM code generation — sends full context (API responses,
  auth state, action trace, reference patterns) to generate production
  TS adapters
- Stage 4: Validation & self-repair — imports generated adapter to
  verify syntax, feeds errors back to LLM for auto-fix (2 retries)

The generated .ts adapters can discover and use APIs directly instead
of replaying brittle UI actions, producing much more stable skills.
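Stage 4's generate → validate → repair cycle can be sketched as a small retry loop. The function names and prompt wording are assumptions; only the shape (feed validation errors back to the LLM, bounded retries) comes from the description above:

```typescript
type Generate = (prompt: string) => Promise<string>;
// Returns null when the code passes, otherwise the error text to feed back.
type Validate = (code: string) => string | null;

async function generateWithRepair(
  prompt: string,
  generate: Generate,
  validate: Validate,
  retries = 2, // matches the "2 retries" above
): Promise<string> {
  let code = await generate(prompt);
  for (let i = 0; i < retries; i++) {
    const err = validate(code);
    if (!err) return code;
    // Feed the validation error back so the LLM can fix its own output.
    code = await generate(
      `${prompt}\n\nYour previous code failed validation:\n${err}\nFix it.`,
    );
  }
  const finalErr = validate(code);
  if (finalErr) throw new Error(`adapter still invalid after ${retries} repairs: ${finalErr}`);
  return code;
}
```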

Replace import-based validation (fails due to path resolution) with
static syntax checks that catch common LLM code generation issues:
- page.evaluate() with arrow function instead of string
- page.waitForSelector (doesn't exist on IPage)
- Missing .js in import paths (ESM requirement)
- Missing cli() call or registry import

Also add these constraints to the generation prompt so the LLM
avoids them in the first place.
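The four checks listed above lend themselves to simple pattern matching. These regexes are an illustrative sketch, not the PR's actual patterns:

```typescript
// Static checks for common LLM code-generation pitfalls; patterns are illustrative.
function staticCheck(code: string): string[] {
  const errors: string[] = [];
  // page.evaluate must receive a string, not an arrow function
  if (/page\.evaluate\(\s*(?:async\s*)?\(/.test(code))
    errors.push("page.evaluate() called with a function; pass a string");
  // this method does not exist on the project's IPage interface
  if (/page\.waitForSelector/.test(code))
    errors.push("page.waitForSelector does not exist on IPage");
  // ESM requires explicit .js extensions on relative imports
  for (const m of code.matchAll(/from\s+['"](\.[^'"]+)['"]/g))
    if (!m[1].endsWith(".js")) errors.push(`import "${m[1]}" missing .js extension`);
  // every adapter must register itself via cli()
  if (!/\bcli\(/.test(code)) errors.push("missing cli() registration call");
  return errors;
}
```

Each message doubles as the repair-prompt feedback, which is why the strings are phrased as instructions rather than bare error codes.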

Security:
- Sanitize response bodies before writing trace to disk (redact tokens,
  passwords, API keys)
- CSRF/bearer tokens stored as boolean flags only, never actual values
- Path traversal protection on --save-as (alphanumeric/dash/underscore only)

Robustness:
- LLM response parsing: require done action with code, no JSON.stringify fallback
- needsAuth: check auth-related cookie patterns, not all cookies
- Import path regex: fix contradictory && condition
- Call recordFinalSnapshot before trace finalization
…mpt, actions

Closes all high and medium priority gaps vs Browser Use:

Planning System (#1):
- PlanItem state machine (pending/current/done/skipped)
- LLM can output `plan` field to update/create plans
- Plan auto-advances on successful steps
- Replan nudge after 3 consecutive failures
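A minimal sketch of the PlanItem state machine and auto-advance described above — the field names and the exact advancement rule are assumptions:

```typescript
type PlanState = "pending" | "current" | "done" | "skipped";
interface PlanItem { text: string; state: PlanState }

// On a successful step: mark the current item done and promote the next pending one.
function advancePlan(plan: PlanItem[]): void {
  const i = plan.findIndex((p) => p.state === "current");
  if (i === -1) {
    // No current item yet: promote the first pending one.
    const first = plan.find((p) => p.state === "pending");
    if (first) first.state = "current";
    return;
  }
  plan[i].state = "done";
  const next = plan.slice(i + 1).find((p) => p.state === "pending");
  if (next) next.state = "current";
}
```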

Self-Evaluation (#3):
- New `evaluationPreviousGoal` field in AgentResponse
- Pre-done verification rules in system prompt (5-step checklist)
- `success` field on DoneAction for explicit failure signaling

Action System (#4):
- New actions: select_dropdown, switch_tab, open_tab, close_tab, search_page
- Auto-detect <select> and redirect to select_dropdown
- Element scroll (scroll within a specific element by index)
- Wait capped at 10s

Loop Detection (#5):
- SHA-256 hashed sliding window (15 steps)
- 3 severity tiers: mild (4x), strong (7x), critical (10x)
- Page fingerprint stall detection (URL + element count + DOM hash)
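The hashed sliding window and severity tiers can be sketched as follows; the class name and the exact string that gets hashed (here, a serialized action) are assumptions, while the window size and thresholds mirror the numbers above:

```typescript
import { createHash } from "node:crypto";

const WINDOW = 15; // sliding window of the last 15 steps

function hashAction(action: string): string {
  return createHash("sha256").update(action).digest("hex");
}

class LoopDetector {
  private window: string[] = [];

  // Record a step and report how often its action repeats within the window.
  record(action: string): "none" | "mild" | "strong" | "critical" {
    this.window.push(hashAction(action));
    if (this.window.length > WINDOW) this.window.shift();
    const last = this.window[this.window.length - 1];
    const repeats = this.window.filter((h) => h === last).length;
    if (repeats >= 10) return "critical";
    if (repeats >= 7) return "strong";
    if (repeats >= 4) return "mild";
    return "none";
  }
}
```

Hashing keeps the window cheap to store and compare even when serialized actions are large.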

System Prompt (#6):
- Expanded from 65 to ~170 lines with structured sections
- Action chaining rules (page-changing vs safe)
- Reasoning pattern guidance
- Examples for evaluation, memory, planning

LLM Timeout (#7):
- Configurable `llmTimeout` (default 60s)
- Promise-based timeout wrapper
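A sketch of the timeout wrapper, combined with the AbortController fix described later in this PR (#6 under the follow-up commit): the signal is handed to the request so a timed-out call is actually cancelled, not left running. The wrapper shape is an assumption; the Anthropic SDK does accept an abort signal on requests:

```typescript
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>, // e.g. wraps the Anthropic SDK call
  ms = 60_000,                              // default 60s, as above
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    // Because run() receives the signal, abort cancels the underlying
    // request instead of leaving it to continue in the background.
    return await run(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```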

Message Compaction (#8):
- Builds structured summary of compacted messages
- Extracts URLs visited, goals achieved, past errors
- Maintains Anthropic API user/assistant alternation

AX Tree Enrichment (#9):
- Fetches accessibility role/name via CDP when available
- Enriches ElementInfo with axRole/axName
- Falls back to DOM attributes if CDP unavailable

Sensitive Data Masking (#10):
- Configurable sensitivePatterns map
- Applied to all user messages before LLM

Prompt Caching (#2):
- System prompt uses cache_control: ephemeral
- Last user message uses cache_control: ephemeral
- Token tracking includes cache_read and cache_creation
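The request shape for the two cache breakpoints looks roughly like this (the model id and prompt text are placeholders; `cache_control: { type: "ephemeral" }` is the Anthropic API's actual marker for cacheable prefixes):

```typescript
const request = {
  model: "claude-sonnet-4-20250514", // placeholder model id
  max_tokens: 1024,
  system: [
    {
      type: "text" as const,
      text: "You are a browser-operating agent...", // long, stable system prompt
      cache_control: { type: "ephemeral" as const }, // breakpoint 1: system prompt
    },
  ],
  messages: [
    {
      role: "user" as const,
      content: [
        {
          type: "text" as const,
          text: "Current DOM snapshot: ...",
          cache_control: { type: "ephemeral" as const }, // breakpoint 2: latest user turn
        },
      ],
    },
  ],
};
```

Marking the last user message lets each step reuse the cached prefix of the whole conversation so far, which is where most of the token volume lives in an agent loop.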

Screenshot Control (#11):
- Configurable maxScreenshotDim (default 1200px)
- Zero-size element filtering in DOM context
…tion, timeout

#1 AX tree: remove dead CDP calls (DOM.getDocument + Accessibility.getFullAXTree
   were called but axLookup never used). Replace with single batched evaluate()
   that reads ARIA attributes for up to 100 elements in one call.

#2 Loop detection: detectLoop() now uses only previously recorded state (no
   domContext param). Fixes off-by-one where current step wasn't yet recorded.

#3 Message compaction: prevent consecutive user messages by merging summary
   into preceding user message if roles collide, and skipping duplicate roles
   at the tail boundary.

#4 JS injection: all evaluate() calls now use JSON.stringify for user-controlled
   values (element indices, option text, scroll amounts) instead of template
   interpolation.

#5 updatePlan: moved after consecutiveErrors update so plan advancement uses
   current step's error state, not the previous step's.

#6 LLM timeout: pass AbortController signal to Anthropic SDK so timed-out
   requests are actually cancelled instead of continuing in the background.

- Generated TS skills use '@jackwener/opencli/registry' instead of
  relative '../../registry.js' (fixes module resolution for user CLIs)
- Tell LLM to use plain Error() instead of importing error classes
- AX tree: single batched evaluate(), removed dead CDP calls
- Loop detection: uses only previously recorded state (off-by-one fix)
- Message compaction: prevents consecutive user messages (API compliance)
- JS injection: JSON.stringify all user-controlled values in evaluate()
- updatePlan: moved after consecutiveErrors update
- LLM timeout: AbortController signal passed to Anthropic SDK

Critical:
- C1: Sanitize API response bodies before sending to LLM prompt
  (was only sanitized at disk-write time, not in generation prompt)
- C2: CSRF JS returns boolean flag only, never extracts actual value
- C3: Remove Runtime.evaluate from CDP allowlist (use 'exec' action)

Important:
- I1: Add generateRaw() to LLMClient for code generation — no longer
  forces AgentResponse JSON wrapping around TypeScript output
- I2: Preserve memory fields in message compaction summary (was
  discarded, contradicting "refer to your memory" prompt)
- I3: Re-install network interceptor at start of each step (fetch/XHR
  monkey-patches are destroyed on navigation)
- I4: Use nativeKeyPress for Control+a in type action (consistent with
  nativeType for the actual text input)
- I5: Remove .sort() from loop detection hash — action order matters
  ([click,type] and [type,click] are different sequences)
- I6: Replace greedy extractJson regex with balanced-brace parser
  (prevents matching stray } after the JSON object)
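The balanced-brace parser from I6 can be sketched as a single scan that tracks brace depth and string state, stopping at the matching close brace instead of the last `}` in the text (which is where a greedy regex goes wrong). This is an illustrative implementation, not the PR's exact code:

```typescript
function extractJson(text: string): unknown | null {
  const start = text.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === "\\") i++;               // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        // Matching close brace found: parse exactly this span.
        try { return JSON.parse(text.slice(start, i + 1)); }
        catch { return null; }
      }
    }
  }
  return null; // unbalanced braces
}
```

Tracking string state matters because braces inside JSON string values must not affect the depth count.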
