Mdp#1849
Signed-off-by: Anna Shors <ashors@nvidia.com>
terrykong left a comment
i know this is a draft and things are in flux, but leaving some feedback from a first pass. this definitely needs more passes.
cc @ashors1 @ananthsub @yaoyu-33 for design of mcore inf in the policy worker
my opinion is we should move the inference code into a mixin so that the regular training policy methods are clearly separated from the generation ones, because there are now many of them. after separating, the MegatronInferenceMixin can be added as one of the parent classes (multiple inheritance). right now the megatron policy worker class has ballooned significantly and is very intimidating.
one piece of general feedback: i would love to see more of this boilerplate pushed into megatron inference itself. the amount of code change needed seems like a lot, and we have to set state that i would imagine the mcore inference APIs could handle (like the local/none thing)
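roughly what i have in mind (a sketch only; all class names here are illustrative, not existing classes):

```python
class MegatronInferenceMixin:
    """Generation-side state and methods, kept out of the training path."""

    def init_inference_engine(self) -> None: ...
    def generate(self, prompts: list[str]) -> list[str]: ...
    def pause_inference_engine(self) -> None: ...


class PolicyWorkerBase:
    """Stand-in for the existing training-side worker methods."""

    def train(self, batch: dict) -> None: ...


class MegatronPolicyWorker(MegatronInferenceMixin, PolicyWorkerBase):
    """Multiple inheritance keeps the two concerns clearly separated."""
```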
self.dynamic_inference_engine = None
self.inference_client = None
self.inference_context = None
self.inference_wrapped_model = None
self._inference_engine_initialized = False
self._inference_engine_paused = True  # Start paused since we begin with training
self._inference_loop = None  # Event loop for inference operations
self._inference_thread = None  # Thread running the event loop
high level q: why does mcore inference require so much bookkeeping by the application?
Hmm, yeah, we definitely have to do a better job of abstracting all of this out. There is a task tracking that. Mcore inference with the coordinator is what requires all of this; plain mcore inference is a little better. (We need to do a lot more here, especially because we are using Ray on top of everything and running colocated, which means we need to suspend/resume the engine, etc.) I agree we should make this simpler. Will definitely push for this soon.
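as a first step, a single state object could replace the loose attributes (a sketch only; none of these names exist yet):

```python
from dataclasses import dataclass


@dataclass
class McoreInferenceState:
    # Single home for the bookkeeping currently spread across the worker.
    engine: object | None = None
    client: object | None = None
    context: object | None = None
    wrapped_model: object | None = None
    initialized: bool = False
    paused: bool = True  # colocated setup starts in training mode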
model_cfg = cfg_from_pretrained.model
cfg_from_pretrained.logger = LoggerConfig()
# Ensure make_vocab_size_divisible_by has a reasonable default (128 is standard)
@ananthsub @yaoyu-33 can you comment on this?
i feel this is potentially an unsafe thing to default to, especially since we don't currently have a way to chop off the padded vocab during HF export.
this is good to have in general, though, because i know vocab parallel will have issues without it; it's probably just not a good thing to enable globally yet.
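for concreteness, the export-side fix would be something like this (a sketch only; assumes the padded vocab sits on dim 0 of the embedding / output-layer weights):

```python
import torch


def trim_padded_vocab(weight: torch.Tensor, true_vocab_size: int) -> torch.Tensor:
    # Drop the rows added by make_vocab_size_divisible_by padding before HF export.
    assert weight.size(0) >= true_vocab_size
    return weight[:true_vocab_size].contiguous()
```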
# Setting moe_router_dtype to higher precision (e.g. fp64) can improve numerical stability,
# especially when using many experts.
model_cfg.moe_router_dtype = self.cfg["megatron_cfg"]["moe_router_dtype"]
model_cfg.moe_token_dispatcher_type = "alltoall"
why is this hard coded? can this be plumbed in through the config?
RL/nemo_rl/models/policy/__init__.py, Line 193 in dacac7e
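e.g. something like this, mirroring how moe_router_dtype is read (the megatron_cfg key here is a suggestion, it doesn't exist yet):

```python
# Read the dispatcher type from config instead of hard coding it;
# "moe_token_dispatcher_type" as a megatron_cfg key is hypothetical.
model_cfg.moe_token_dispatcher_type = self.cfg["megatron_cfg"].get(
    "moe_token_dispatcher_type", "alltoall"
)
```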
unified_memory_level = mcore_generation_config["unified_memory_level"]
model_config = self.model.config
# Enable CUDA graphs for inference
model_config.cuda_graph_impl = "local"
another place where it's set to "local" (vs "none"). is this potentially error prone, since there are many places where this needs to be set?
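e.g. a single helper that both call sites go through would prevent the flags drifting (a sketch; the helper name is made up):

```python
def set_cuda_graphs_for_inference(model_config, enabled: bool) -> None:
    # One place that flips the impl, so "local"/"none" can't get out of sync.
    model_config.cuda_graph_impl = "local" if enabled else "none"
```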
Removed this.
self._inference_engine_paused = True
print(f"[Rank {self.rank}] paused inference engine")

async def pause_engine(self):
should this be protected? it doesn't appear to be something the user needs to be aware of
Suggested change:
- async def pause_engine(self):
+ async def _pause_engine(self):
self._inference_engine_paused = False

def pause_inference_engine(self):
shall we use the sleep/wake nomenclature to match the other inference engines, at least in the nemo-rl API? i think it's okay that mcore inf calls it pause/resume, but that could be a little confusing if we ever do partial rollouts or in-flight weight updates, where an actual pause may be needed.
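concretely, something like this at the nemo-rl boundary (a sketch; the underscore methods are placeholders standing in for whatever mcore exposes):

```python
class McoreGenerationAdapter:
    # nemo-rl API speaks sleep/wake; mcore keeps pause/resume internally.
    def sleep(self) -> None:
        self._pause_engine()   # mcore-side pause (placeholder name)

    def wake(self) -> None:
        self._resume_engine()  # mcore-side resume (placeholder name)

    def _pause_engine(self) -> None: ...
    def _resume_engine(self) -> None: ...
```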
Force-pushed from 0c7d97e to e970643
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information