feat: add fault injection utilities for testing fault tolerance#1868
feat: add fault injection utilities for testing fault tolerance#1868yashaswikarnati wants to merge 8 commits intomainfrom
Conversation
|
5752dde to
594ed2e
Compare
|
📝 WalkthroughWalkthroughA comprehensive fault injection framework is introduced for testing fault tolerance in NeMo RL systems. A new utilities module provides configurable fault injection capabilities with global registry, while GRPO training is updated to support fault tolerance configuration. Policy workers and generation workers are instrumented with fault injection decorators, and unit tests validate the framework behavior. Changes
Sequence Diagram(s)sequenceDiagram
participant Config as Config/<br/>Master
participant GRPO as GRPO Train
participant Injector as Fault Injector
participant Worker as Policy Worker
participant Dispatch as Dispatch<br/>Helper
Config->>GRPO: Load fault_tolerance config
GRPO->>Injector: Create & register if enabled
GRPO->>Worker: Start training loop
Note over Worker: Method decorated with<br/>@with_fault_injection
Worker->>Injector: Get fault plan for section
Injector-->>Worker: Return optional FaultPlan
Worker->>Dispatch: dispatch_fault_if_target(plan, rank)
alt Plan exists & rank matches
Dispatch->>Dispatch: Clear workload exceptions
Dispatch->>Worker: Issue fault with delay
else Plan missing or rank mismatch
Dispatch-->>Worker: Skip injection
end
Worker->>Worker: check_workload_exception()
Note over Worker: Fault triggered if scheduled
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Add configurable fault injection infrastructure to test fault tolerance mechanisms in GRPO training. Supports injecting various fault types (GPU errors, signals, exceptions) at specific training sections and ranks. Changes: - Add FaultInjector class for controller-side fault planning - Add @with_fault_injection decorator for worker methods - Integrate with DTensor, Megatron policy workers and vLLM generation - Add fault_tolerance config section with MTTI and deterministic modes - Add comprehensive unit tests (12 tests covering all behaviors) The injector uses a one-shot model and supports both fixed delays and exponential distribution (MTTI) for realistic fault timing simulation.
594ed2e to
1ecb315
Compare
ℹ️ File Consistency CheckCheck based on commit: 1ecb315 (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
terrykong
left a comment
There was a problem hiding this comment.
generally lgtm. small comments
ℹ️ File Consistency CheckCheck based on commit: 0493b6e (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
|
@slikhite-1 FYI |
ℹ️ File Consistency CheckCheck based on commit: 37f85d7 (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
|
@yashaswikarnati FYI there is lint and DCO checks to fix |
Signed-off-by: ykarnati <ykarnati@nvidia.com>
ℹ️ File Consistency CheckCheck based on commit: 535520e (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
- Replace TypedDict with @DataClass for FaultInjectionConfig - Add default values directly in config class - Simplify FaultInjector constructor (remove verbose try/except) - Update tests to use typed config instead of dicts
ℹ️ File Consistency CheckCheck based on commit: 5f9a05d (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
ℹ️ File Consistency CheckCheck based on commit: cbaae52 (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
ℹ️ File Consistency CheckCheck based on commit: fed5eaf (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
Signed-off-by: ykarnati <ykarnati@nvidia.com>
ℹ️ File Consistency CheckCheck based on commit: bb9c9f9 (PR #1868 from ✅ DTensor Policy Worker Synchronization CheckBoth DTensor policy worker files were modified in this PR:
Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning. |
terrykong
left a comment
There was a problem hiding this comment.
@yashaswikarnati actually could you revert the typeddict -> dataclass change since this is actually a config exposed at the yaml level? currently we use typeddict's everywhere and for consistency we should match everything
we can revisit dataclasses in general in a later PR. i'm okay with dataclasses in non config uses, but for configs we should try to be consistent
Add configurable fault injection infrastructure to test fault tolerance mechanisms in GRPO training. Supports injecting various fault types (GPU errors, signals, exceptions) at specific training sections and ranks.
Changes:
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Tests
Chores