Skip to content

feat: add fault injection utilities for testing fault tolerance#1868

Open
yashaswikarnati wants to merge 8 commits intomainfrom
yash/fault-injection-testing
Open

feat: add fault injection utilities for testing fault tolerance#1868
yashaswikarnati wants to merge 8 commits intomainfrom
yash/fault-injection-testing

Conversation

@yashaswikarnati
Copy link
Copy Markdown
Contributor

@yashaswikarnati yashaswikarnati commented Feb 3, 2026

Add configurable fault injection infrastructure to test fault tolerance mechanisms in GRPO training. Supports injecting various fault types (GPU errors, signals, exceptions) at specific training sections and ranks.

Changes:

  • Add FaultInjector class for controller-side fault planning
  • Add @with_fault_injection decorator for worker methods
  • Integrate with DTensor, Megatron policy workers and vLLM generation
  • Add fault_tolerance config section with MTTI and deterministic modes

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added fault injection configuration capability to enable controlled fault simulation during training and generation with configurable sections, ranks, fault types, and timing strategies.
  • Tests

    • Added comprehensive unit tests for fault injection utilities.
  • Chores

    • Updated configuration schema with fault tolerance settings.

@yashaswikarnati yashaswikarnati requested review from a team as code owners February 3, 2026 05:21
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 3, 2026

⚠️ File Consistency Check

Check based on commit: 5752dde (PR #1868 from yash/fault-injection-testing)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@yashaswikarnati yashaswikarnati force-pushed the yash/fault-injection-testing branch from 5752dde to 594ed2e Compare February 3, 2026 05:24
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 3, 2026

⚠️ File Consistency Check

Check based on commit: 594ed2e (PR #1868 from yash/fault-injection-testing)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

A comprehensive fault injection framework is introduced for testing fault tolerance in NeMo RL systems. A new utilities module provides configurable fault injection capabilities with global registry, while GRPO training is updated to support fault tolerance configuration. Policy workers and generation workers are instrumented with fault injection decorators, and unit tests validate the framework behavior.

Changes

Cohort / File(s) Summary
Core Fault Injection Module
nemo_rl/utils/fault_injection.py
Implements fault injection framework with FaultInjector class for configuration parsing, FaultPlan dataclass for fault specifications, global injector registry (set_global_fault_injector/get_global_fault_injector), with_fault_injection decorator for method instrumentation, and worker-side dispatch helpers (dispatch_fault_if_target, check_workload_exception). Includes Section enum (TRAINING, GENERATION, ENVIRONMENT), FaultToleranceConfig and FaultInjectionConfig TypedDicts, and support for deterministic timing via seed and delay calculations.
GRPO Algorithm Integration
nemo_rl/algorithms/grpo.py, examples/configs/grpo_math_1B.yaml
Extends MasterConfig TypedDict with fault_tolerance field (NotRequired[FaultToleranceConfig]). GRPO training initialization reads fault tolerance configuration, creates and registers FaultInjector globally if enabled. YAML configuration adds fault_tolerance block with fields for enabled, target_section, target_rank, fault_type, delay_seconds, and optional MTDI/MTTI timing distributions and seed.
Worker Instrumentation
nemo_rl/models/policy/workers/dtensor_policy_worker.py, nemo_rl/models/policy/workers/megatron_policy_worker.py, nemo_rl/models/generation/vllm/vllm_worker.py
Applies with_fault_injection decorator to worker methods: train method in both policy workers decorated with Section.TRAINING, and generate method in VllmGenerationWorker decorated with Section.GENERATION. Imports fault injection utilities (Section, with_fault_injection) into each worker module.
Fault Injection Tests
tests/unit/utils/test_fault_injection.py
Comprehensive unit test suite validating FaultInjector lifecycle, disabled/enabled configurations, section and rank targeting, MTDI/MTTI-like timing behavior with seed determinism, FaultPlan dispatch logic, and decorator integration with worker-side exception handling. Uses fixtures for injector reset, mocks for dispatch and exception verification, and parameterized tests for configuration scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Config as Config/<br/>Master
    participant GRPO as GRPO Train
    participant Injector as Fault Injector
    participant Worker as Policy Worker
    participant Dispatch as Dispatch<br/>Helper

    Config->>GRPO: Load fault_tolerance config
    GRPO->>Injector: Create & register if enabled
    GRPO->>Worker: Start training loop
    Note over Worker: Method decorated with<br/>@with_fault_injection
    Worker->>Injector: Get fault plan for section
    Injector-->>Worker: Return optional FaultPlan
    Worker->>Dispatch: dispatch_fault_if_target(plan, rank)
    alt Plan exists & rank matches
        Dispatch->>Dispatch: Clear workload exceptions
        Dispatch->>Worker: Issue fault with delay
    else Plan missing or rank mismatch
        Dispatch-->>Worker: Skip injection
    end
    Worker->>Worker: check_workload_exception()
    Note over Worker: Fault triggered if scheduled
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • yfw
  • terrykong
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and specifically describes the main change: adding fault injection utilities for testing fault tolerance in the GRPO training system, which is the primary focus of all file changes.
Test Results For Major Changes ✅ Passed PR introduces major testing infrastructure changes (305+ lines) with comprehensive unit test coverage (7 test functions) demonstrating feature robustness and behavior.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch yash/fault-injection-testing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Add configurable fault injection infrastructure to test fault tolerance
mechanisms in GRPO training. Supports injecting various fault types
(GPU errors, signals, exceptions) at specific training sections and ranks.

Changes:
- Add FaultInjector class for controller-side fault planning
- Add @with_fault_injection decorator for worker methods
- Integrate with DTensor, Megatron policy workers and vLLM generation
- Add fault_tolerance config section with MTTI and deterministic modes
- Add comprehensive unit tests (12 tests covering all behaviors)

The injector uses a one-shot model and supports both fixed delays and
exponential distribution (MTTI) for realistic fault timing simulation.
@yashaswikarnati yashaswikarnati force-pushed the yash/fault-injection-testing branch from 594ed2e to 1ecb315 Compare February 3, 2026 05:32
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 3, 2026

ℹ️ File Consistency Check

Check based on commit: 1ecb315 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

Copy link
Copy Markdown
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm. small comments

Comment thread nemo_rl/utils/fault_injection.py Outdated
Comment thread nemo_rl/utils/fault_injection.py Outdated
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 3, 2026

ℹ️ File Consistency Check

Check based on commit: 0493b6e (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@yashaswikarnati
Copy link
Copy Markdown
Contributor Author

@slikhite-1 FYI

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 3, 2026

ℹ️ File Consistency Check

Check based on commit: 37f85d7 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@terrykong
Copy link
Copy Markdown
Collaborator

@yashaswikarnati FYI there is lint and DCO checks to fix

Signed-off-by: ykarnati <ykarnati@nvidia.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

Check based on commit: 535520e (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@yashaswikarnati yashaswikarnati added the CI:L1 Run doctests, unit tests, and functional tests label Feb 4, 2026
- Replace TypedDict with @DataClass for FaultInjectionConfig
- Add default values directly in config class
- Simplify FaultInjector constructor (remove verbose try/except)
- Update tests to use typed config instead of dicts
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

Check based on commit: 5f9a05d (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

Check based on commit: cbaae52 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

Check based on commit: fed5eaf (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

Signed-off-by: ykarnati <ykarnati@nvidia.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Feb 4, 2026

ℹ️ File Consistency Check

Check based on commit: bb9c9f9 (PR #1868 from yash/fault-injection-testing)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@yashaswikarnati yashaswikarnati added the CI:L0 Run doctests and unit tests label Feb 4, 2026
Copy link
Copy Markdown
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashaswikarnati actually could you revert the typeddict -> dataclass change since this is actually a config exposed at the yaml level? currently we use typeddict's everywhere and for consistency we should match everything

we can revisit dataclasses in general in a later PR. i'm okay with dataclasses in non config uses, but for configs we should try to be consistent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L0 Run doctests and unit tests CI:L1 Run doctests, unit tests, and functional tests super-v3

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants