feat: add fault injection utilities for testing fault tolerance by yashaswikarnati · Pull Request #1868 · NVIDIA-NeMo/RL

yashaswikarnati · 2026-02-03T05:21:54Z

Add configurable fault injection infrastructure to test fault tolerance mechanisms in GRPO training. Supports injecting various fault types (GPU errors, signals, exceptions) at specific training sections and ranks.

Changes:

Add FaultInjector class for controller-side fault planning
Add @with_fault_injection decorator for worker methods
Integrate with DTensor, Megatron policy workers and vLLM generation
Add fault_tolerance config section with MTTI and deterministic modes

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

New Features
- Added fault injection configuration capability to enable controlled fault simulation during training and generation with configurable sections, ranks, fault types, and timing strategies.
Tests
- Added comprehensive unit tests for fault injection utilities.
Chores
- Updated configuration schema with fault tolerance settings.

github-actions · 2026-02-03T05:22:15Z

⚠️ File Consistency Check

Check based on commit: 5752dde (PR #1868 from yash/fault-injection-testing)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Update nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py if necessary to maintain consistency
If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

Modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py
Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

github-actions · 2026-02-03T05:24:54Z

⚠️ File Consistency Check

Check based on commit: 594ed2e (PR #1868 from yash/fault-injection-testing)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Update nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py if necessary to maintain consistency
If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

Modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py
Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

coderabbitai · 2026-02-03T05:26:15Z

📝 Walkthrough

Walkthrough

A comprehensive fault injection framework is introduced for testing fault tolerance in NeMo RL systems. A new utilities module provides configurable fault injection capabilities with global registry, while GRPO training is updated to support fault tolerance configuration. Policy workers and generation workers are instrumented with fault injection decorators, and unit tests validate the framework behavior.

Changes

Cohort / File(s)	Summary
Core Fault Injection Module `nemo_rl/utils/fault_injection.py`	Implements fault injection framework with FaultInjector class for configuration parsing, FaultPlan dataclass for fault specifications, global injector registry (set_global_fault_injector/get_global_fault_injector), with_fault_injection decorator for method instrumentation, and worker-side dispatch helpers (dispatch_fault_if_target, check_workload_exception). Includes Section enum (TRAINING, GENERATION, ENVIRONMENT), FaultToleranceConfig and FaultInjectionConfig TypedDicts, and support for deterministic timing via seed and delay calculations.
GRPO Algorithm Integration `nemo_rl/algorithms/grpo.py`, `examples/configs/grpo_math_1B.yaml`	Extends MasterConfig TypedDict with fault_tolerance field (NotRequired[FaultToleranceConfig]). GRPO training initialization reads fault tolerance configuration, creates and registers FaultInjector globally if enabled. YAML configuration adds fault_tolerance block with fields for enabled, target_section, target_rank, fault_type, delay_seconds, and optional MTDI/MTTI timing distributions and seed.
Worker Instrumentation `nemo_rl/models/policy/workers/dtensor_policy_worker.py`, `nemo_rl/models/policy/workers/megatron_policy_worker.py`, `nemo_rl/models/generation/vllm/vllm_worker.py`	Applies with_fault_injection decorator to worker methods: train method in both policy workers decorated with Section.TRAINING, and generate method in VllmGenerationWorker decorated with Section.GENERATION. Imports fault injection utilities (Section, with_fault_injection) into each worker module.
Fault Injection Tests `tests/unit/utils/test_fault_injection.py`	Comprehensive unit test suite validating FaultInjector lifecycle, disabled/enabled configurations, section and rank targeting, MTDI/MTTI-like timing behavior with seed determinism, FaultPlan dispatch logic, and decorator integration with worker-side exception handling. Uses fixtures for injector reset, mocks for dispatch and exception verification, and parameterized tests for configuration scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Config as Config/<br/>Master
    participant GRPO as GRPO Train
    participant Injector as Fault Injector
    participant Worker as Policy Worker
    participant Dispatch as Dispatch<br/>Helper

    Config->>GRPO: Load fault_tolerance config
    GRPO->>Injector: Create & register if enabled
    GRPO->>Worker: Start training loop
    Note over Worker: Method decorated with<br/>@with_fault_injection
    Worker->>Injector: Get fault plan for section
    Injector-->>Worker: Return optional FaultPlan
    Worker->>Dispatch: dispatch_fault_if_target(plan, rank)
    alt Plan exists & rank matches
        Dispatch->>Dispatch: Clear workload exceptions
        Dispatch->>Worker: Issue fault with delay
    else Plan missing or rank mismatch
        Dispatch-->>Worker: Skip injection
    end
    Worker->>Worker: check_workload_exception()
    Note over Worker: Fault triggered if scheduled

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

yfw
terrykong

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title clearly and specifically describes the main change: adding fault injection utilities for testing fault tolerance in the GRPO training system, which is the primary focus of all file changes.
Test Results For Major Changes	✅ Passed	PR introduces major testing infrastructure changes (305+ lines) with comprehensive unit test coverage (7 test functions) demonstrating feature robustness and behavior.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch yash/fault-injection-testing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Add configurable fault injection infrastructure to test fault tolerance mechanisms in GRPO training. Supports injecting various fault types (GPU errors, signals, exceptions) at specific training sections and ranks. Changes: - Add FaultInjector class for controller-side fault planning - Add @with_fault_injection decorator for worker methods - Integrate with DTensor, Megatron policy workers and vLLM generation - Add fault_tolerance config section with MTTI and deterministic modes - Add comprehensive unit tests (12 tests covering all behaviors) The injector uses a one-shot model and supports both fixed delays and exponential distribution (MTTI) for realistic fault timing simulation.

github-actions · 2026-02-03T05:33:11Z

ℹ️ File Consistency Check

Check based on commit: 1ecb315 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

terrykong

generally lgtm. small comments

github-actions · 2026-02-03T06:06:07Z

ℹ️ File Consistency Check

Check based on commit: 0493b6e (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

yashaswikarnati · 2026-02-03T18:54:38Z

@slikhite-1 FYI

github-actions · 2026-02-03T19:16:04Z

ℹ️ File Consistency Check

Check based on commit: 37f85d7 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

terrykong · 2026-02-04T01:02:48Z

@yashaswikarnati FYI there is lint and DCO checks to fix

Signed-off-by: ykarnati <ykarnati@nvidia.com>

github-actions · 2026-02-04T02:30:06Z

ℹ️ File Consistency Check

Check based on commit: 535520e (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

- Replace TypedDict with @DataClass for FaultInjectionConfig - Add default values directly in config class - Simplify FaultInjector constructor (remove verbose try/except) - Update tests to use typed config instead of dicts

github-actions · 2026-02-04T06:36:29Z

ℹ️ File Consistency Check

Check based on commit: 5f9a05d (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

github-actions · 2026-02-04T07:08:30Z

ℹ️ File Consistency Check

Check based on commit: cbaae52 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

…list

github-actions · 2026-02-04T07:18:04Z

ℹ️ File Consistency Check

Check based on commit: fed5eaf (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

Signed-off-by: ykarnati <ykarnati@nvidia.com>

github-actions · 2026-02-04T07:20:50Z

ℹ️ File Consistency Check

Check based on commit: bb9c9f9 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/workers/dtensor_policy_worker.py
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

terrykong

@yashaswikarnati actually could you revert the typeddict -> dataclass change since this is actually a config exposed at the yaml level? currently we use typeddict's everywhere and for consistency we should match everything

we can revisit dataclasses in general in a later PR. i'm okay with dataclasses in non config uses, but for configs we should try to be consistent

yashaswikarnati requested review from a team as code owners February 3, 2026 05:21

yashaswikarnati force-pushed the yash/fault-injection-testing branch from 5752dde to 594ed2e Compare February 3, 2026 05:24

yashaswikarnati force-pushed the yash/fault-injection-testing branch from 594ed2e to 1ecb315 Compare February 3, 2026 05:32

terrykong reviewed Feb 3, 2026

View reviewed changes

Comment thread nemo_rl/utils/fault_injection.py Outdated

Comment thread nemo_rl/utils/fault_injection.py Outdated

Raise ValueError for invalid fault config instead of silently disabling

0493b6e

fix lint issues

37f85d7

Fix import order in test_fault_injection.py

535520e

Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati added the CI:L1 Run doctests, unit tests, and functional tests label Feb 4, 2026

yashaswikarnati temporarily deployed to nemo-ci February 4, 2026 02:32 — with GitHub Actions Inactive

rau=ise valueerror when to time specified

cbaae52

refactor: Raise error for missing timing config, add to pyrefly white…

fed5eaf

…list

chore: empty commit for DCO sign-off

bb9c9f9

Signed-off-by: ykarnati <ykarnati@nvidia.com>

yashaswikarnati added the CI:L0 Run doctests and unit tests label Feb 4, 2026

yashaswikarnati temporarily deployed to nemo-ci February 4, 2026 07:29 — with GitHub Actions Inactive

terrykong reviewed Feb 4, 2026

View reviewed changes

yashaswikarnati temporarily deployed to nemo-ci February 4, 2026 09:44 — with GitHub Actions Inactive

terrykong added the super-v3 label Feb 12, 2026

terrykong assigned yashaswikarnati Feb 12, 2026

Conversation

yashaswikarnati commented Feb 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Feb 3, 2026

⚠️ File Consistency Check

⚠️ DTensor Policy Worker Synchronization Warning

Uh oh!

github-actions Bot commented Feb 3, 2026

⚠️ File Consistency Check

⚠️ DTensor Policy Worker Synchronization Warning

Uh oh!

coderabbitai Bot commented Feb 3, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Uh oh!

github-actions Bot commented Feb 3, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

terrykong left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Feb 3, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

yashaswikarnati commented Feb 3, 2026

Uh oh!

github-actions Bot commented Feb 3, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

terrykong commented Feb 4, 2026

Uh oh!

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

terrykong left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yashaswikarnati commented Feb 3, 2026 •

edited by coderabbitai Bot

Loading