Currently in #1300, there is the following restriction when using the mp backend:
```python
inference_engine_size = dp_size * tp_size * pp_size
num_gpus_per_node = cfg.trainer.placement.policy_num_gpus_per_node
if inference_engine_size > num_gpus_per_node and ie_cfg.distributed_executor_backend == "mp":
    raise ValueError(
        "Each inference engine must fit within a single node with the vLLM mp backend. Use the ray backend for per engine multi-node serving instead."
    )
```
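For concreteness, a hedged example of a configuration that would trip this check; the sizes below are illustrative, not taken from the PR:

```python
# Illustrative sizes only: a 16-way tensor-parallel engine on 8-GPU nodes.
dp_size, tp_size, pp_size = 1, 16, 1
inference_engine_size = dp_size * tp_size * pp_size  # 16
num_gpus_per_node = 8
# 16 > 8 with distributed_executor_backend == "mp" -> the ValueError above fires;
# per-engine multi-node serving currently requires the ray backend.
```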
This can be lifted by adding logic to create multiple placement groups per engine (creating one Ray actor for every `min(inference_engine_size, num_gpus_per_node)` GPUs, and setting `master_address` and `headless` as needed), as sketched below.
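A minimal sketch of that logic, assuming Ray's `placement_group` API and an engine size that divides evenly into per-node groups; the helper name here is hypothetical, not the actual SkyRL implementation:

```python
import ray
from ray.util.placement_group import placement_group

# Hypothetical sketch: split one multi-node engine into per-node placement
# groups of min(inference_engine_size, num_gpus_per_node) GPUs each.
def create_engine_placement_groups(inference_engine_size: int, num_gpus_per_node: int):
    gpus_per_group = min(inference_engine_size, num_gpus_per_node)
    assert inference_engine_size % gpus_per_group == 0, "assumes even divisibility"
    num_groups = inference_engine_size // gpus_per_group
    # One bundle per GPU; STRICT_PACK forces each group onto a single node.
    pgs = [
        placement_group([{"GPU": 1, "CPU": 1}] * gpus_per_group, strategy="STRICT_PACK")
        for _ in range(num_groups)
    ]
    ray.get([pg.ready() for pg in pgs])  # wait until every group is scheduled
    return pgs
```

One actor per placement group would then be launched; presumably the group containing rank 0 exposes its node IP as `master_address`, and the remaining groups start headless and join it.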
We also have the following restriction:
```python
if cfg.generator.inference_engine.distributed_executor_backend == "mp":
    raise ValueError(
        "the mp backend for vLLM is not yet fully supported for the new inference backend. See https://github.com/NovaSky-AI/SkyRL/issues/1309. Use the ray backend instead."
    )
```
Colocated mode with the mp backend is blocked on #1291, which enables colocated + mp on the new inference stack; non-colocated mode has flaky issues with the new native weight-syncing APIs. See below for details. cc: @hao-aaron