[Bug] train.py num_rollout==0 error #1785

@LittleYouEr

Bug Description

I completed the demo training in example/search-r1/, and now I want to run an eval-only test.
I found this in train.py:

    # special case for eval-only
    if args.num_rollout == 0 and args.eval_interval is not None:
        ray.get(rollout_manager.eval.remote(rollout_id=0))

So I set num_rollout to 0 and also set the eval arguments, but an error occurred.

Has anyone had a similar experience, or does anyone know how to run an eval-only test?

Steps to Reproduce

Here are my run arguments:

ROLLOUT_ARGS=(
   --prompt-data /mnt/workspace/data/al_training/slime/data/train.parquet
   --input-key prompt
   --label-key reward_model
   --apply-chat-template
   --rollout-shuffle 
   --num-rollout 0
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 1024
   --rollout-temperature 1
   
   --eval-interval 25
   --eval-prompt-data nq_hotpotqa /mnt/workspace/data/al_training/slime/data/test.parquet
   --eval-input-key prompt
   --eval-label-key reward_model
   --n-samples-per-eval-prompt 1
   
   --global-batch-size 256
   --balance-data
)
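
As a quick check, the flags above do appear to satisfy the eval-only condition quoted from train.py. This is a minimal sketch, not slime's actual argument parser: it assumes both flags parse as integers and that `--eval-interval` defaults to `None`.

```python
import argparse

# Hypothetical, stripped-down parser: only the two flags that the eval-only
# check in train.py reads. Types and defaults are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("--num-rollout", type=int, default=None)
parser.add_argument("--eval-interval", type=int, default=None)

# Same values as in ROLLOUT_ARGS above.
args = parser.parse_args(["--num-rollout", "0", "--eval-interval", "25"])

# Condition copied from the train.py snippet in the bug description.
is_eval_only = args.num_rollout == 0 and args.eval_interval is not None
print(is_eval_only)
```

If that holds, the arguments themselves are not what blocks the eval-only path.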

Expected Behavior

An eval-only run should execute when I set the right parameters.

Actual Behavior

An AssertionError is raised while the training actor initializes (see Logs).

Environment

I use the slime:test Docker image, pulled on 20250324 from the Docker source.

Logs

```
Traceback (most recent call last):
  File "/root/slime/train.py", line 106, in <module>
    train(args)
  File "/root/slime/train.py", line 20, in train
    actor_model, critic_model = create_training_models(args, pgs, rollout_manager)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/ray/placement_group.py", line 150, in create_training_models
    start_rollout_ids = ray.get(
                        ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2981, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1012, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::MegatronTrainRayActor.init() (pid=910684, ip=172.17.0.2, actor_id=a39a1f56d660f292ad0b1a8402000000, repr=<slime.backends.megatron_utils.actor.MegatronTrainRayActor object at 0x7f2c1ff54680>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/utils/timer.py", line 97, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 91, in init
    (self.model, self.optimizer, self.opt_param_scheduler, loaded_rollout_id) = initialize_model_and_optimizer(
                                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/model.py", line 773, in initialize_model_and_optimizer
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(args, role)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/model.py", line 125, in setup_model_and_optimizer
    opt_param_scheduler = get_optimizer_param_scheduler(args, optimizer)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/model.py", line 63, in get_optimizer_param_scheduler
    opt_param_scheduler = OptimizerParamScheduler(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megatron-LM/megatron/core/optimizer_param_scheduler.py", line 73, in __init__
    assert self.lr_decay_steps > 0
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
```
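
My guess at what the traceback means, as a toy sketch rather than slime or Megatron code. It assumes that `lr_decay_steps` is derived from `num_rollout` (so it becomes 0 here) and that the eval-only branch in train.py runs only after model setup has finished; neither is confirmed by anything in this issue. `ToyParamScheduler`, `toy_lr_decay_steps`, and `toy_train` are hypothetical stand-ins.

```python
class ToyParamScheduler:
    """Stand-in for the scheduler in the traceback; keeps only the failing check."""

    def __init__(self, lr_decay_steps: int) -> None:
        self.lr_decay_steps = lr_decay_steps
        # Same assertion as optimizer_param_scheduler.py line 73 in the log.
        assert self.lr_decay_steps > 0


def toy_lr_decay_steps(num_rollout: int, steps_per_rollout: int = 1) -> int:
    # ASSUMPTION: decay steps scale with num_rollout. The real formula is
    # not shown in this issue.
    return num_rollout * steps_per_rollout


def toy_train(num_rollout: int) -> str:
    # Setup comes first, mirroring create_training_models being called
    # from train() in the traceback.
    ToyParamScheduler(toy_lr_decay_steps(num_rollout))
    # ASSUMPTION: the eval-only special case is reached only after setup.
    if num_rollout == 0:
        return "eval only"
    return "train"


try:
    result = toy_train(num_rollout=0)
except AssertionError:
    result = "AssertionError during setup"
print(result)
```

Under those assumptions the assertion fires during setup, before the eval-only branch could run, which would match the log above.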

Additional Context

No response

Pre-submission Checklist

  • I have read the CONTRIBUTING.md and understand the collaboration scope.
  • I have read the documentation and my issue is not addressed there.
  • I have searched for existing issues and this is not a duplicate.
  • I have provided a minimal, reproducible example.
