
feat(swe-bench): add an iterative predictor with ~70% success rate #1414

Closed
Jerryguan777 wants to merge 23 commits into NVIDIA:develop from Jerryguan777:feat/iterative-predictor

Conversation


@Jerryguan777 Jerryguan777 commented Jan 15, 2026

Implemented an iterative agent that solves SWE-bench problems by executing bash commands step by step, observing the results, and generating a patch. It achieved a 70% success rate (7/10 instances resolved) in an initial evaluation.

  • Add IterativeAgent
  • Add config_iterative.yml
  • Add git tools
  • Add SweBenchPredictorIterativeConfig
  • Register iterative predictor and git tool
  • Update README.md
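
The plan/execute/observe cycle described above can be sketched as follows. This is an illustrative sketch only; `run_iterative_session` and `stub_planner` are hypothetical names, not the actual `IterativeAgent` API added by this PR:

```python
import subprocess

def run_iterative_session(plan_next_command, max_steps=10, timeout=60):
    """Run an observe/act loop: ask the planner for a shell command,
    execute it, and feed the combined output back as the next
    observation. The planner returns None when it is done."""
    history = []
    for _ in range(max_steps):
        command = plan_next_command(history)
        if command is None:  # planner has enough information to emit a patch
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        history.append((command, result.stdout + result.stderr))
    return history

# A stub planner standing in for the LLM: issue one command, then stop.
def stub_planner(history):
    return "echo patched" if not history else None

transcript = run_iterative_session(stub_planner)
```

In the real predictor the planner role is played by the configured LLM, and the step limit and timeout would come from the YAML config rather than function defaults.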

How Has This Been Tested?

```
export ANTHROPIC_API_KEY=sk-xxxxxx

nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml
Running 10 instances...
10 ran successfully, 0 failed: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [04:30<00:00, 27.07s/it]
All instances run.
Cleaning cached images...
Removed 0 images.
Total instances: 10
Instances submitted: 10
Instances completed: 10
Instances incomplete: 0
Instances resolved: 7
Instances unresolved: 3
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 2
Report written to nv_predictor.nat_iterative_1.json
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:194 - Completed swe_bench run nat_iterative_1
2026-01-15 16:11:45 - INFO     - nat.eval.swe_bench_evaluator.evaluate:200 - SWE_bench report and logs written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_reports directory
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:275 - Profiler is not enabled. Skipping profiling.
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:366 - Original config file copied to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_original.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:384 - Effective config (with overrides) saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_effective.yml
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:427 - Configuration metadata saved to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/config_metadata.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:449 - Workflow output written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/workflow_output.json
2026-01-15 16:11:46 - INFO     - nat.eval.evaluate:460 - Evaluation results written to .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/swe_bench_output.json
```

=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 480.70s

Per evaluator results:
| Evaluator   |   Avg Score | Output File           |
|-------------|-------------|-----------------------|
| swe_bench   |         0.7 | swe_bench_output.json |
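
The `swe_bench` average score is simply the fraction of submitted instances that were resolved. A quick sanity check against the counts in the log above (the dict keys here are illustrative, not the report's actual field names):

```python
# Counts taken from the evaluation log above.
counts = {"submitted": 10, "completed": 10, "resolved": 7,
          "unresolved": 3, "empty_patches": 0, "errors": 0}

# Average score = resolved / submitted.
avg_score = counts["resolved"] / counts["submitted"]

# Resolved and unresolved instances should account for every completed run.
assert counts["resolved"] + counts["unresolved"] == counts["completed"]
print(round(avg_score, 2))  # → 0.7
```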

Description

Closes #1397

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added an iterative LLM-driven predictor for SWE-bench with multi-step planning, configurable step limits/timeouts, per-step observability, robust command validation, and an integrated git-repo workspace tool for isolated setup, checkout, and cleanup.
  • Documentation

    • README and example workflow updated with benchmark context, evaluation guidance, metrics framing, harder benchmarks, and concrete onboarding steps for adding new predictors.
  • Tests

    • Extensive test suite validating iterative flows, safety checks, error recovery, timeouts, workspace isolation, and integrations.
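
The "isolated setup, checkout, and cleanup" behavior of the git-repo workspace tool could look roughly like the context manager below. This is a hypothetical sketch, not the tool shipped in this PR; the class name and constructor arguments are assumptions:

```python
import shutil
import subprocess
import tempfile

class GitWorkspace:
    """Sketch of an isolated workspace: clone a repo into a temporary
    directory, optionally check out a specific commit, and remove the
    whole directory on exit regardless of what the agent did inside."""

    def __init__(self, repo_url, commit=None):
        self.repo_url = repo_url
        self.commit = commit
        self.path = None

    def __enter__(self):
        self.path = tempfile.mkdtemp(prefix="swe_ws_")
        subprocess.run(["git", "clone", self.repo_url, self.path],
                       check=True, capture_output=True)
        if self.commit:
            subprocess.run(["git", "checkout", self.commit],
                           cwd=self.path, check=True, capture_output=True)
        return self.path

    def __exit__(self, *exc):
        # Best-effort cleanup so a failed run never leaves stale checkouts.
        shutil.rmtree(self.path, ignore_errors=True)
```

Using a context manager here guarantees cleanup even when a step times out or raises, which matches the workspace-isolation behavior the tests describe.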


Labels

  • DO NOT MERGE — PR should not be merged; see PR for details
  • feature request — New feature or request
  • non-breaking — Non-breaking change


Successfully merging this pull request may close these issues.

Add Iterative Predictor for Improved SWE-bench Issue Resolution
