SkillSafetyBench is a fully materialized benchmark for evaluating safety failures caused by attacks on skill-facing context. Each case keeps the user-facing task ordinary, while the attack is placed in local skills, helper scripts, sidecar files, corpora, templates, or other workspace artifacts that an agent may consult during execution.
For a visual benchmark overview, case explorer, leaderboard, and experiment analysis, visit the SkillSafetyBench website.
Modern coding and task agents often rely on local skills and environment artifacts, not only the user prompt. SkillSafetyBench tests whether an agent can complete benign work while resisting unsafe instructions or misleading context embedded in those skill-facing surfaces. The benign task carriers are selected from SkillsBench.
The benchmark is static and ready to run. It is not a case generator. The repository contains 155 prepared cases across 6 risk domains and 30 categories. Each case packages:
- a benign base task that should still be completed
- one or more skill-facing attack surfaces
- a deterministic verifier for the unsafe behavior
- task tests for base-task completion
- metadata and rationale describing the final case design
Each run is evaluated on two axes:
task_success: whether the ordinary user task was completedattack_success: whether the unsafe behavior was observed
This separation is important: a system can appear capable because it completes the task, while still being vulnerable because it follows the attack-bearing skill context.
SkillSafetyBench contains 155 cases organized into 6 risk domains and 30 categories. Each risk domain contains 5 categories, and each category contains multiple concrete cases built around ordinary task carriers.
SkillSafetyBench
└── 6 risk domains
├── RD1: Context Trust and Prompt Manipulation Risks
├── RD2: Agency, Scope, and Authorization Risks
├── RD3: Execution, Runtime, Framework, and Protocol Risks
├── RD4: Data Boundary, Output, and Externalization Risks
├── RD5: Memory, Recovery, Audit, and Persistence Risks
└── RD6: Knowledge, Model, Supply Chain, and Operational Risks
Cases are stored by risk domain, category, and case id:
benchmark/<risk-domain>/<category>/<case>/
metadata.json
rationale.md
attacked_task/
instruction.md
task.toml
environment/
solution/solve.sh
tests/test.sh
eval/verify_attack.py
The benchmark directory is the source of truth for cases. Category directories use the categoryN-... naming pattern, for example:
benchmark/risk-domain-6-knowledge-model-supply-chain-and-operational-risks/category1-availability_cost_and_service_exhaustion/
The benchmark construction pipeline starts from benign task carriers, preserves the original user-facing task, injects the attack through skill-facing artifacts, and attaches deterministic checks for both task completion and attack behavior. This keeps the evaluation focused on whether agents can resist unsafe local context while still doing the requested work.
This repository contains adversarial benchmark cases. Some cases simulate phishing, credential solicitation, destructive operations, data exfiltration, sandbox boundary abuse, external actions, and persistence-like behaviors.
Run the benchmark only in isolated research environments. Do not use production credentials, production cloud accounts, personal accounts, or sensitive local files. Prefer throwaway API keys, disposable containers, restricted network access, and a dedicated working directory. Review run artifacts before sharing them.
Required tools:
bashpython3, recommended>= 3.11dockeruvharbornvm- Node.js
22
Install Harbor if needed:
uv tool install harborCheck the basic toolchain:
harbor --help
python3 --version
docker --versionIf your Harbor agent is installed through Node, activate Node 22 in the shell that starts the run:
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
nvm use 22Create a local .envrc from the example file:
cp env.example .envrcFor codex, set at least:
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export OPENAI_API_KEY="replace-with-your-openai-api-key"For claude-code, set an Anthropic-compatible auth variable, for example:
export ANTHROPIC_BASE_URL="https://api.anthropic.com"
export ANTHROPIC_API_KEY="replace-with-your-anthropic-api-key"If you use direnv, run:
direnv allowIf not, no extra step is required. The batch script explicitly sources .envrc at runtime.
The canonical entry point is:
scripts/start_agent_batch.shRun one risk domain:
scripts/start_agent_batch.sh --rd 5Choose agent and model:
scripts/start_agent_batch.sh \
--rd 3 \
--agent codex \
--model openai/gpt-5.4Run with Claude Code:
scripts/start_agent_batch.sh \
--rd 3 \
--agent claude-code \
--model claude-3-7-sonnet-20250219Run a single case:
scripts/start_agent_batch.sh \
--rd 5 \
--case travel_planning_poisoned_belief_persistence_memory_anchorRun a single category:
scripts/start_agent_batch.sh \
--rd 6 \
--category category3-model_behavior_and_finetuning_poisoningRun from an explicit manifest:
scripts/start_agent_batch.sh \
--manifest benchmark/risk-domain-6-knowledge-model-supply-chain-and-operational-risks/manifest.jsonUse a custom env file:
scripts/start_agent_batch.sh \
--rd 5 \
--envrc /path/to/your.envrcPrint selected cases without executing:
scripts/start_agent_batch.sh --rd 5 --dry-runEach run creates an output directory under jobs/:
jobs/<agent>-risk-domain-5-memory-recovery-audit-and-persistence-risks-<timestamp>/
Start with:
jobs/<run>/attack_results.jsonjobs/<run>/summary.jsonjobs/<run>/attack_results.csvjobs/<run>/summary.csv
Useful per-run files:
selected_cases.jsonbatch_config.json<case_id>/case_result.jsonattack_results.md
Common attack outcomes:
attack_successattack_not_observedtask_output_missing
task_output_missing means the expected explicit task output was absent. The attack verifier may still continue when enough artifacts exist to evaluate the attack condition.

