
Multiprocessed EVAL #307

Open
WaelDLZ wants to merge 13 commits into 3.0_beta from wbd/eval_refacto

Conversation


@WaelDLZ WaelDLZ commented Feb 19, 2026

Introducing a multiprocessed eval that makes it possible to evaluate the model on 10k maps very quickly.

For now I put it in a separate eval function called eval_womd, but it should later replace the eval function (I think we should have separate eval and render functions, by the way).

What it does:

  • num_maps is the number of maps you want to evaluate (say 10k)
  • If you have 16 workers, each will evaluate 10k/16 = 625 maps
  • Each worker then processes its 625 maps in chunks of eval_batch_size (say 128). So each worker will instantiate an env of 128 maps 4 times, followed by an env of 113 maps.
  • I removed the old eval_batch_size, as it was actually useless. Now you just fix a value for num_agents like you would in training, and we put as many maps as possible inside the env, making sure that no map is cropped.
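The chunking arithmetic above can be sketched as follows (illustrative names, not the actual PR code):

```python
def plan_eval(num_maps, num_workers, eval_batch_size):
    """Split num_maps evenly across workers, then chunk each worker's
    share into batches of eval_batch_size plus a smaller final batch."""
    per_worker = num_maps // num_workers
    chunks = []
    remaining = per_worker
    while remaining > 0:
        chunk = min(eval_batch_size, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return per_worker, chunks

per_worker, chunks = plan_eval(10_000, 16, 128)
# per_worker == 625; chunks == [128, 128, 128, 128, 113]
```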

I tried to handle every edge case gracefully and tested on a lot of configs.

At the end the command:
puffer eval_womd puffer_drive --eval.num-maps=10000

Prints a summary of the metrics and logs every map's results in a CSV file, allowing you to identify which maps failed, or to compute more advanced statistics like boxplots.
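As a hedged sketch of consuming that per-map CSV (the column names "map_id" and "collision_rate" are illustrative, not the PR's actual schema):

```python
import csv
import statistics

def summarize(path, metric="collision_rate"):
    """Read the per-map results CSV and report aggregate stats plus the
    single worst-performing map for a given metric (hypothetical columns)."""
    with open(path) as f:
        rows = list(csv.DictReader(f))
    values = [float(r[metric]) for r in rows]
    worst = max(rows, key=lambda r: float(r[metric]))
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "worst_map": worst["map_id"],
    }
```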


IMPORTANT THINGS TO READ:

  • This PR currently breaks the WOSAC eval, which is why I'm merging it into 3.0 rather than 2.0. I plan to add WOSAC functionality later in the week.
  • I kept things very simple on purpose, so for now I just save the metrics in result.csv, but we should think about a proper filename and path for this.
  • Once we validate this way of evaluating, it will be easy to extend it to run at the end of every training epoch and send the results to wandb.
  • The code should also be easy to adapt to run Gigaflow evaluations at scale, but I'm not sure the Gigaflow code is mature enough yet.

WaelDLZ and others added 3 commits January 16, 2026 18:23
Add a Flag to build_ocean so Raylib can work on Debian 11
…gn choices, like using a subprocess or not, separate eval from rendering...
@WaelDLZ WaelDLZ requested a review from Victorbares February 19, 2026 23:06
@WaelDLZ WaelDLZ self-assigned this Feb 19, 2026
@WaelDLZ WaelDLZ added the enhancement New feature or request label Feb 24, 2026
…ocessing.

Next steps:

- Add multiprocessing
- Save logs in a nice csv (with nice filenames and stuff)
- Add a nice looking tqdm
- Add support to run this during training
- Add support for Gigaflow and eventually WOSAC
Next steps:

- Make some tests on the cluster
- Add a nice tqdm integration
- log the results to a csv
@WaelDLZ WaelDLZ force-pushed the wbd/eval_refacto branch from f73240f to 6df1be9 Compare March 1, 2026 16:09
@WaelDLZ WaelDLZ changed the title WIP: refacto of the eval utilities. Multiprocessed EVAL Mar 1, 2026
@WaelDLZ WaelDLZ marked this pull request as ready for review March 1, 2026 19:33

greptile-apps bot commented Mar 1, 2026

Greptile Summary

This PR implements multiprocessed evaluation for WOMD (Waymo Open Motion Dataset) by distributing map batches across workers. The changes add eval mode detection via eval_batch_size, track map progress with eval_map_counter, and create a new eval_womd() function that parallelizes evaluation across workers.

Major changes:

  • Added batch-based eval mode in binding.c and drive.py with map counter tracking
  • Modified vec_log to return per-episode metrics in eval mode vs aggregated metrics in training
  • Implemented eval_womd() function to distribute maps across workers and collect results
  • Ensured SDC (self-driving car) is initialized first for WOMD consistency

Critical issues found:

  • Infinite loop bug in eval_womd if workers finish before collecting all expected results
  • Map index wrapping via modulo can cause the same maps to be evaluated multiple times
  • Missing error handling for edge case where a single map has more agents than num_agents

Confidence Score: 2/5

  • This PR has critical logic bugs that will cause incorrect behavior in production
  • Score reflects two critical logic bugs: (1) infinite loop when workers complete early, and (2) map reprocessing due to index wrapping. Both will cause incorrect evaluation results or hangs
  • Pay close attention to pufferlib/pufferl.py (infinite loop) and pufferlib/ocean/drive/binding.c (map wrapping)
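A hedged sketch of the first failure mode: if the collection loop only exits once num_maps results have arrived, a worker finishing early (or a dropped result) hangs it forever. One defensive shape is to also stop when every worker reports done. All names here are illustrative, not the PR's code:

```python
def collect_results(poll, workers_done, num_maps):
    """Collect per-episode results until num_maps are in, but bail out if
    all workers have finished and no more results are coming."""
    results = []
    while len(results) < num_maps:
        batch = poll()  # per-episode infos from workers; may be empty
        results.extend(batch)
        if not batch and workers_done():
            break  # avoid spinning forever waiting for missing results
    return results
```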

Important Files Changed

Filename Overview
pufferlib/config/ocean/drive.ini Updated eval config with num_agents=512, num_maps=10000, eval_batch_size=128, and control_mode setting
pufferlib/ocean/drive/binding.c Added eval mode logic with batch processing; contains logic bug where map indices wrap via modulo, potentially reprocessing maps
pufferlib/ocean/drive/drive.h Ensures SDC is initialized first in active agents for WOMD evaluation consistency
pufferlib/ocean/drive/drive.py Added eval_batch_size, map counter tracking, and worker termination logic; map_counter could exceed num_maps causing wrapping
pufferlib/ocean/env_binding.h Modified vec_log to support per-episode metrics in eval mode vs aggregated in training; potential division by zero not guarded
pufferlib/pufferl.py Added eval_womd function with multiprocessing; contains infinite loop bug if workers finish before num_maps results collected

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[eval_womd starts] --> B[Distribute maps across workers]
    B --> C{num_workers > num_maps?}
    C -->|Yes| ERR[Raise error]
    C -->|No| D[Create worker envs with map ranges]
    D --> E[Initialize vecenv and policy]
    E --> F{maps_processed < num_maps?}
    F -->|Yes| G[Reset RNN state if needed]
    G --> H[Run episode_length timesteps]
    H --> I[Collect info from workers]
    I --> J{info_list has results?}
    J -->|Yes| K[Process results, increment maps_processed]
    J -->|No| L{All workers finished?}
    L -->|Yes| M[INFINITE LOOP BUG]
    L -->|No| F
    K --> F
    F -->|No| N[Close vecenv]
    N --> O[Save results to CSV]
    O --> P[Print average metrics]
    
    subgraph Worker Lifecycle
    W1[Worker loads batch] --> W2{eval_map_counter >= eval_last_map?}
    W2 -->|Yes| W3[Set resample_freq=0, stop processing]
    W2 -->|No| W4[Process maps]
    W4 --> W5{Resample needed?}
    W5 -->|Yes| W6[Load next batch]
    W6 --> W1
    W5 -->|No| W4
    end
    
    style M fill:#f99
    style ERR fill:#f99

Last reviewed commit: b1e9707


@greptile-apps greptile-apps bot left a comment


6 files reviewed, 7 comments


- while (use_all_maps ? map_idx < max_envs : total_agent_count < num_agents && env_count < max_envs) {
-     int map_id = use_all_maps ? map_idx++ : rand() % num_maps;
+ while (total_agent_count < num_agents && env_count < max_envs) {
+     int map_id = (eval_mode ? map_idx++ : rand()) % num_maps;

modulo operation causes map index wrapping, potentially reprocessing maps

when map_idx exceeds num_maps, the modulo wraps it back to 0, which could cause the same maps to be evaluated multiple times if fewer than expected environments are created in earlier batches

if a worker is assigned maps 0-99 but only creates 95 envs in the first batch, it will start at map 95 next time, then wrap to maps 0-89, causing duplication
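The wrapping behavior described in this comment can be reproduced in a few lines (illustrative Python, not the binding.c code):

```python
def batch_map_ids(start_idx, batch_size, num_maps):
    """Map indices for one batch, wrapping via modulo like the C code."""
    return [(start_idx + i) % num_maps for i in range(batch_size)]

# Worker assigned maps 0-99, but only 95 envs fit in the first batch:
first = batch_map_ids(0, 95, 100)    # maps 0..94
second = batch_map_ids(95, 95, 100)  # maps 95..99, then wraps to 0..89
duplicates = sorted(set(first) & set(second))
# duplicates == [0, 1, ..., 89] -- 90 maps evaluated twice
```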

Author


It prevents the code from crashing if a map is skipped, but it will also be useful in the CARLA setting where we want to sample the same 5 maps multiple times.

Once again, no maps should be skipped if we initialize the SDC first.

Comment on lines +237 to +242
// In case a map has more agents than num_agents, the eval will break.
// I don't put an assert here; in a coming PR we should set MAX_AGENTS directly in Python
// and add an assert to avoid this.
if (uncomplete_last_env) {
    break;
}

unhandled edge case: map with more agents than num_agents will break eval

the comment acknowledges this but doesn't add error handling - consider validating or asserting that no single map exceeds num_agents
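One way to add the missing validation, sketched on the Python side under the assumption that per-map agent counts are available (the accessor is hypothetical):

```python
def validate_maps(agents_per_map, num_agents):
    """Reject configs where any single map carries more agents than
    num_agents, instead of silently breaking the eval."""
    oversized = [i for i, n in enumerate(agents_per_map) if n > num_agents]
    if oversized:
        raise ValueError(
            f"maps {oversized} exceed num_agents={num_agents}; "
            "raise num_agents or drop these maps"
        )
```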

Comment on lines +1499 to +1506
// If we have a SDC index (WOMD), initialize it first:
int sdc_index = env->sdc_track_index;
if (sdc_index >= 0) {
    active_agent_indices[0] = sdc_index;
    env->num_created_agents++;
    env->active_agent_count++;
    env->agents[sdc_index].active_agent = 1;
}

initialized SDC (self-driving car) as first active agent to ensure consistent agent ordering for WOMD evaluation

Comment on lines +398 to +399
if self.eval_mode and self.resample_frequency == 0 and self.tick >= self.episode_length:
    return (self.observations, self.rewards, self.terminals, self.truncations, [])

skipped further processing when eval worker completes all assigned maps, preventing unnecessary computation


greptile-apps bot commented Mar 1, 2026

Additional Comments (1)

pufferlib/ocean/env_binding.h
potential division by zero if goals_sampled_this_episode is 0

same issue as eval mode path - should check before division
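The guard Greptile suggests is a one-liner; sketched here in Python for illustration (the C fix in env_binding.h would mirror it):

```python
def goal_success_rate(goals_reached, goals_sampled_this_episode):
    """Avoid division by zero when no goals were sampled this episode."""
    if goals_sampled_this_episode == 0:
        return 0.0
    return goals_reached / goals_sampled_this_episode
```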

- num_maps = 20
+ num_agents = 512
+ num_maps = 10000
+ eval_batch_size = 128


maybe we should pull out eval configs into a separate struct or config?

Author


Ideally we should have multiple config structures somewhere so we can easily switch between:

  • eval in womd (self-play, log-replay, mix-play)
  • eval in gigaflow
  • wosac eval
  • ...

I didn't think of a way to do it that isn't messy; I think we should first decide how we want to structure the code in general, so I kept things very basic in this PR.
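One possible shape for the per-flavor eval configs discussed above, as a hypothetical sketch (names and fields are illustrative only):

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Hypothetical per-flavor eval settings, selected by name."""
    num_maps: int
    num_agents: int
    control_mode: str  # e.g. "self-play", "log-replay", "mix-play"

EVAL_CONFIGS = {
    "womd_self_play": EvalConfig(num_maps=10_000, num_agents=512,
                                 control_mode="self-play"),
    "womd_log_replay": EvalConfig(num_maps=10_000, num_agents=512,
                                  control_mode="log-replay"),
}
```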


Labels

enhancement New feature or request


2 participants