
Multiprocessed EVAL #307

Open
WaelDLZ wants to merge 13 commits into 3.0_beta from wbd/eval_refacto

Conversation


@WaelDLZ WaelDLZ commented Feb 19, 2026

Introducing a multiprocessed eval that makes it possible to evaluate the model on 10k maps very quickly.

For now I put it in a separate eval function called eval_womd, but it should later replace the eval function (I think we should have separate eval and render functions, by the way).

What it does:

  • num_maps is the number of maps you want to evaluate (say 10k)
  • If you have 16 workers, each will evaluate 10k/16 = 625 maps
  • Each worker then processes its 625 maps in chunks of eval_batch_size (say 128). So each worker will instantiate an env of 128 maps 4 times, followed by an env of 113 maps.
  • I removed the old eval_batch_size, as it was actually useless. Now you just fix a value for num_agents like you would in training, and we put as many maps as possible inside the env, making sure that no map is cropped.
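The chunking arithmetic above can be sketched as follows (illustrative names, not the actual PR code):

```python
def plan_eval(num_maps, num_workers, eval_batch_size):
    """Split num_maps evenly across workers, then chunk each worker's
    share into batches of eval_batch_size plus a smaller final batch."""
    per_worker = num_maps // num_workers
    chunks = []
    remaining = per_worker
    while remaining > 0:
        chunk = min(eval_batch_size, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return per_worker, chunks

per_worker, chunks = plan_eval(10_000, 16, 128)
# per_worker == 625; chunks == [128, 128, 128, 128, 113]
```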

I tried to handle every edge case gracefully and tested on a lot of configs.

At the end the command:
puffer eval_womd puffer_drive --eval.num-maps=10000

Prints a summary of the metrics and logs every map's results in a CSV file, allowing you to identify which maps failed, or to compute more advanced statistics like boxplots.
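As a hedged sketch of consuming that per-map CSV (the column names "map_id" and "collision_rate" are illustrative, not the PR's actual schema):

```python
import csv
import statistics

def summarize(path, metric="collision_rate"):
    """Read the per-map results CSV and report aggregate stats plus the
    single worst-performing map for a given metric (hypothetical columns)."""
    with open(path) as f:
        rows = list(csv.DictReader(f))
    values = [float(r[metric]) for r in rows]
    worst = max(rows, key=lambda r: float(r[metric]))
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "worst_map": worst["map_id"],
    }
```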


IMPORTANT THINGS TO READ:

  • This PR currently breaks the WOSAC eval, which is why I'm merging it into 3.0 rather than 2.0. I plan to add WOSAC functionality later in the week.
  • I kept things very simple on purpose, so for now I just save the metrics in result.csv, but we should think about a proper filename and path for this.
  • Once we validate this way of evaluating, it will be easy to extend it to run at the end of every training epoch and send the results to wandb.
  • The code should also be easy to adapt to run Gigaflow evaluations at scale, but I'm not sure the Gigaflow code is mature enough yet.

WaelDLZ and others added 3 commits January 16, 2026 18:23
Add a Flag to build_ocean so Raylib can work on Debian 11
…gn choices, like using a subprocess or not, separate eval from rendering...
@WaelDLZ WaelDLZ requested a review from Victorbares February 19, 2026 23:06
@WaelDLZ WaelDLZ self-assigned this Feb 19, 2026
@WaelDLZ WaelDLZ added the enhancement New feature or request label Feb 24, 2026
…ocessing.

Next steps:

- Add multiprocessing
- Save logs in a nice csv (with nice filenames and stuff)
- Add a nice looking tqdm
- Add support to run this during training
- Add support for Gigaflow and eventually WOSAC
Next steps:

- Make some tests on the cluster
- Add a nice tqdm integration
- log the results to a csv
@WaelDLZ WaelDLZ force-pushed the wbd/eval_refacto branch from f73240f to 6df1be9 Compare March 1, 2026 16:09
@WaelDLZ WaelDLZ changed the title WIP: refacto of the eval utilities. Multiprocessed EVAL Mar 1, 2026
@WaelDLZ WaelDLZ marked this pull request as ready for review March 1, 2026 19:33

greptile-apps bot commented Mar 1, 2026

Greptile Summary

This PR implements multiprocessed evaluation for WOMD (Waymo Open Motion Dataset) by distributing map batches across workers. The changes add eval mode detection via eval_batch_size, track map progress with eval_map_counter, and create a new eval_womd() function that parallelizes evaluation across workers.

Major changes:

  • Added batch-based eval mode in binding.c and drive.py with map counter tracking
  • Modified vec_log to return per-episode metrics in eval mode vs aggregated metrics in training
  • Implemented eval_womd() function to distribute maps across workers and collect results
  • Ensured SDC (self-driving car) is initialized first for WOMD consistency

Critical issues found:

  • Infinite loop bug in eval_womd if workers finish before collecting all expected results
  • Map index wrapping via modulo can cause the same maps to be evaluated multiple times
  • Missing error handling for edge case where a single map has more agents than num_agents

Confidence Score: 2/5

  • This PR has critical logic bugs that will cause incorrect behavior in production
  • Score reflects two critical logic bugs: (1) infinite loop when workers complete early, and (2) map reprocessing due to index wrapping. Both will cause incorrect evaluation results or hangs
  • Pay close attention to pufferlib/pufferl.py (infinite loop) and pufferlib/ocean/drive/binding.c (map wrapping)
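A hedged sketch of the first failure mode: if the collection loop only exits once num_maps results have arrived, a worker finishing early (or a dropped result) hangs it forever. One defensive shape is to also stop when every worker reports done. All names here are illustrative, not the PR's code:

```python
def collect_results(poll, workers_done, num_maps):
    """Collect per-episode results until num_maps are in, but bail out if
    all workers have finished and no more results are coming."""
    results = []
    while len(results) < num_maps:
        batch = poll()  # per-episode infos from workers; may be empty
        results.extend(batch)
        if not batch and workers_done():
            break  # avoid spinning forever waiting for missing results
    return results
```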

Important Files Changed

Filename Overview
pufferlib/config/ocean/drive.ini Updated eval config with num_agents=512, num_maps=10000, eval_batch_size=128, and control_mode setting
pufferlib/ocean/drive/binding.c Added eval mode logic with batch processing; contains logic bug where map indices wrap via modulo, potentially reprocessing maps
pufferlib/ocean/drive/drive.h Ensures SDC is initialized first in active agents for WOMD evaluation consistency
pufferlib/ocean/drive/drive.py Added eval_batch_size, map counter tracking, and worker termination logic; map_counter could exceed num_maps causing wrapping
pufferlib/ocean/env_binding.h Modified vec_log to support per-episode metrics in eval mode vs aggregated in training; potential division by zero not guarded
pufferlib/pufferl.py Added eval_womd function with multiprocessing; contains infinite loop bug if workers finish before num_maps results collected

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[eval_womd starts] --> B[Distribute maps across workers]
    B --> C{num_workers > num_maps?}
    C -->|Yes| ERR[Raise error]
    C -->|No| D[Create worker envs with map ranges]
    D --> E[Initialize vecenv and policy]
    E --> F{maps_processed < num_maps?}
    F -->|Yes| G[Reset RNN state if needed]
    G --> H[Run episode_length timesteps]
    H --> I[Collect info from workers]
    I --> J{info_list has results?}
    J -->|Yes| K[Process results, increment maps_processed]
    J -->|No| L{All workers finished?}
    L -->|Yes| M[INFINITE LOOP BUG]
    L -->|No| F
    K --> F
    F -->|No| N[Close vecenv]
    N --> O[Save results to CSV]
    O --> P[Print average metrics]
    
    subgraph Worker Lifecycle
    W1[Worker loads batch] --> W2{eval_map_counter >= eval_last_map?}
    W2 -->|Yes| W3[Set resample_freq=0, stop processing]
    W2 -->|No| W4[Process maps]
    W4 --> W5{Resample needed?}
    W5 -->|Yes| W6[Load next batch]
    W6 --> W1
    W5 -->|No| W4
    end
    
    style M fill:#f99
    style ERR fill:#f99

Last reviewed commit: b1e9707


@greptile-apps greptile-apps bot left a comment


6 files reviewed, 7 comments


- while (use_all_maps ? map_idx < max_envs : total_agent_count < num_agents && env_count < max_envs) {
-     int map_id = use_all_maps ? map_idx++ : rand() % num_maps;
+ while (total_agent_count < num_agents && env_count < max_envs) {
+     int map_id = (eval_mode ? map_idx++ : rand()) % num_maps;

modulo operation causes map index wrapping, potentially reprocessing maps

when map_idx exceeds num_maps, the modulo wraps it back to 0, which could cause the same maps to be evaluated multiple times if fewer than expected environments are created in earlier batches

if a worker is assigned maps 0-99 but only creates 95 envs in the first batch, it will start at map 95 next time, then wrap to maps 0-89, causing duplication
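The wrapping behavior described in this comment can be reproduced in a few lines (illustrative Python, not the binding.c code):

```python
def batch_map_ids(start_idx, batch_size, num_maps):
    """Map indices for one batch, wrapping via modulo like the C code."""
    return [(start_idx + i) % num_maps for i in range(batch_size)]

# Worker assigned maps 0-99, but only 95 envs fit in the first batch:
first = batch_map_ids(0, 95, 100)    # maps 0..94
second = batch_map_ids(95, 95, 100)  # maps 95..99, then wraps to 0..89
duplicates = sorted(set(first) & set(second))
# duplicates == [0, 1, ..., 89] -- 90 maps evaluated twice
```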

Author


It prevents the code from crashing if a map is skipped, but it will also be useful in the CARLA setting where we want to sample the same 5 maps multiple times.

Once again, no maps should be skipped if we initialize the SDC first.

Comment on lines +237 to +242
// In case a map has more agents than num_agents, the eval will break.
// I don't put an assert here; in a coming PR we should set MAX_AGENTS directly in Python
// and add an assert to avoid this.
if (uncomplete_last_env) {
    break;
}

unhandled edge case: map with more agents than num_agents will break eval

the comment acknowledges this but doesn't add error handling - consider validating or asserting that no single map exceeds num_agents
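One way to add the missing validation, sketched on the Python side under the assumption that per-map agent counts are available (the accessor is hypothetical):

```python
def validate_maps(agents_per_map, num_agents):
    """Reject configs where any single map carries more agents than
    num_agents, instead of silently breaking the eval."""
    oversized = [i for i, n in enumerate(agents_per_map) if n > num_agents]
    if oversized:
        raise ValueError(
            f"maps {oversized} exceed num_agents={num_agents}; "
            "raise num_agents or drop these maps"
        )
```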

Comment on lines +1499 to +1506
// If we have a SDC index (WOMD), initialize it first:
int sdc_index = env->sdc_track_index;
if (sdc_index >= 0) {
    active_agent_indices[0] = sdc_index;
    env->num_created_agents++;
    env->active_agent_count++;
    env->agents[sdc_index].active_agent = 1;
}

initialized SDC (self-driving car) as first active agent to ensure consistent agent ordering for WOMD evaluation

Comment on lines +398 to +399
if self.eval_mode and self.resample_frequency == 0 and self.tick >= self.episode_length:
    return (self.observations, self.rewards, self.terminals, self.truncations, [])

skipped further processing when eval worker completes all assigned maps, preventing unnecessary computation


greptile-apps bot commented Mar 1, 2026

Additional Comments (1)

pufferlib/ocean/env_binding.h
potential division by zero if goals_sampled_this_episode is 0

same issue as eval mode path - should check before division
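The guard Greptile suggests is a one-liner; sketched here in Python for illustration (the C fix in env_binding.h would mirror it):

```python
def goal_success_rate(goals_reached, goals_sampled_this_episode):
    """Avoid division by zero when no goals were sampled this episode."""
    if goals_sampled_this_episode == 0:
        return 0.0
    return goals_reached / goals_sampled_this_episode
```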

- num_maps = 20
+ num_agents = 512
+ num_maps = 10000
+ eval_batch_size = 128


maybe we should pull out eval configs into a separate struct or config?

Author


Ideally we should have multiple config structures somewhere so we can easily switch between:

  • eval in womd (self-play, log-replay, mix-play)
  • eval in gigaflow
  • wosac eval
  • ...

I didn't think of a way to do it that isn't messy; I think we should first decide how we want to structure the code in general, so I kept things very basic in this PR.
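One possible shape for the per-flavor eval configs discussed above, as a hypothetical sketch (names and fields are illustrative only):

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Hypothetical per-flavor eval settings, selected by name."""
    num_maps: int
    num_agents: int
    control_mode: str  # e.g. "self-play", "log-replay", "mix-play"

EVAL_CONFIGS = {
    "womd_self_play": EvalConfig(num_maps=10_000, num_agents=512,
                                 control_mode="self-play"),
    "womd_log_replay": EvalConfig(num_maps=10_000, num_agents=512,
                                  control_mode="log-replay"),
}
```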


Labels

enhancement New feature or request


2 participants