
feat(swe-bench): add an iterative predictor with ~70% success rate #1414

Closed
Jerryguan777 wants to merge 23 commits into NVIDIA:develop from Jerryguan777:feat/iterative-predictor

Conversation


@Jerryguan777 Jerryguan777 commented Jan 15, 2026

Implemented an iterative agent that solves SWE-bench problems by executing bash commands step by step, observing the results, and generating a patch. It achieved a 70% success rate (7/10 instances resolved) in an initial evaluation.

  • Add IterativeAgent
  • Add config_iterative.yml
  • Add git tools
  • Add SweBenchPredictorIterativeConfig
  • Register iterative predictor and git tool
  • Update README.md
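
The plan/execute/observe cycle described above can be sketched as follows. This is an illustrative sketch only; `run_iterative_session` and `stub_planner` are hypothetical names, not the actual `IterativeAgent` API added by this PR:

```python
import subprocess

def run_iterative_session(plan_next_command, max_steps=10, timeout=60):
    """Run an observe/act loop: ask the planner for a shell command,
    execute it, and feed the combined output back as the next
    observation. The planner returns None when it is done."""
    history = []
    for _ in range(max_steps):
        command = plan_next_command(history)
        if command is None:  # planner has enough information to emit a patch
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        history.append((command, result.stdout + result.stderr))
    return history

# A stub planner standing in for the LLM: issue one command, then stop.
def stub_planner(history):
    return "echo patched" if not history else None

transcript = run_iterative_session(stub_planner)
```

In the real predictor the planner role is played by the configured LLM, and the step limit and timeout would come from the YAML config rather than function defaults.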

How Has This Been Tested?

```
export ANTHROPIC_API_KEY=sk-xxxxxx

nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml
Running 10 instances...
10 ran successfully, 0 failed: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [04:30<00:00, 27.07s/it]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 10
Instances submitted: 10
Instances completed: 10
Instances incomplete: 0
Instances resolved: 7
Instances unresolved: 3
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 2
Report written to nv_predictor.nat_iterative_1.json
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:194 - Completed swe_bench run nat_iterative_1
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:200 - SWE_bench report and logs written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_reports directory
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:275 - Profiler is not enabled. Skipping profiling.
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:366 - Original config file copied to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_original.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:384 - Effective config (with overrides) saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_effective.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:427 - Configuration metadata saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_metadata.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:449 - Workflow output written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/workflow_output.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:460 - Evaluation results written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_output.json
```

=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 480.70s

Per evaluator results:
| Evaluator   |   Avg Score | Output File           |
|-------------|-------------|-----------------------|
| swe_bench   |         0.7 | swe_bench_output.json |
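
The `swe_bench` average score is simply the fraction of submitted instances that were resolved. A quick sanity check against the counts in the log above (the dict keys here are illustrative, not the report's actual field names):

```python
# Counts taken from the evaluation log above.
counts = {"submitted": 10, "completed": 10, "resolved": 7,
          "unresolved": 3, "empty_patches": 0, "errors": 0}

# Average score = resolved / submitted.
avg_score = counts["resolved"] / counts["submitted"]

# Resolved and unresolved instances should account for every completed run.
assert counts["resolved"] + counts["unresolved"] == counts["completed"]
print(round(avg_score, 2))  # → 0.7
```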

Description

Closes #1397

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added an iterative LLM-driven predictor for SWE-bench with multi-step planning, configurable step limits/timeouts, per-step observability, robust command validation, and an integrated git-repo workspace tool for isolated setup, checkout, and cleanup.
  • Documentation

    • README and example workflow updated with benchmark context, evaluation guidance, metrics framing, harder benchmarks, and concrete onboarding steps for adding new predictors.
  • Tests

    • Extensive test suite validating iterative flows, safety checks, error recovery, timeouts, workspace isolation, and integrations.
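
The "isolated setup, checkout, and cleanup" behavior of the git-repo workspace tool could look roughly like the context manager below. This is a hypothetical sketch, not the tool shipped in this PR; the class name and constructor arguments are assumptions:

```python
import shutil
import subprocess
import tempfile

class GitWorkspace:
    """Sketch of an isolated workspace: clone a repo into a temporary
    directory, optionally check out a specific commit, and remove the
    whole directory on exit regardless of what the agent did inside."""

    def __init__(self, repo_url, commit=None):
        self.repo_url = repo_url
        self.commit = commit
        self.path = None

    def __enter__(self):
        self.path = tempfile.mkdtemp(prefix="swe_ws_")
        subprocess.run(["git", "clone", self.repo_url, self.path],
                       check=True, capture_output=True)
        if self.commit:
            subprocess.run(["git", "checkout", self.commit],
                           cwd=self.path, check=True, capture_output=True)
        return self.path

    def __exit__(self, *exc):
        # Best-effort cleanup so a failed run never leaves stale checkouts.
        shutil.rmtree(self.path, ignore_errors=True)
```

Using a context manager here guarantees cleanup even when a step times out or raises, which matches the workspace-isolation behavior the tests describe.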


Labels

  • DO NOT MERGE — PR should not be merged; see PR for details
  • feature request — New feature or request
  • non-breaking — Non-breaking change


Successfully merging this pull request may close these issues.

Add Iterative Predictor for Improved SWE-bench Issue Resolution
