
ricky/merge conflicts 3.0 beta and 3.0 #325

Open

riccardosavorgnan wants to merge 16 commits into 3.0_beta from ricky/merge_conflicts_3.0_beta_and_3.0

Conversation

@riccardosavorgnan
Collaborator

eugenevinitsky and others added 16 commits January 3, 2026 15:45
Authored by vinitsky.eugene@gmail.com
* Make SDC agent 0 and make sure it is always initialized and controlled

* Oops little fix

* Small fix: Make sure to increment num_actors.

---------

Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
* Fix incorrect argument name.

* Small revert.

* Incorporate PR feedback.
* quick commit so you can read the code

* Improve naming of sampling argument to better describe its function.

* Ensure that initialization works with Carla maps or other, when sdc_index=-1.

* Simplify CARLA compatibility

* Replace num_maps by wosac_num_maps in all the eval scripts

* Comment about random baseline.

* Bug fix: do not reweight by the total weight (0.95).

* Add this back for now.

* Delete agent shrinking code.

* Ignore ttc metric when agents are not vehicles

* Update WOSAC weights to align with 2024 challenge since we don't have the traffic light metric.

* Update table

* Revert MAX_AGENTS to original.

* Update formatting.

---------

Co-authored-by: Wael Boumediene Doulazmi <wbd2016@cs713.hpc.nyu.edu>
Co-authored-by: Waël Doulazmi <waeldoulazmi@gmail.com>
* Visualizer and control mode, init mode config description updates in docs

* Fix Wrong names in control modes

* add configs to drive.c, init_only_controlled and demo bug fixes, fix tables in docs

* minor fix

* Remove control_tracks_to_predict in another place
* Step 0: Adding the scenario_id in the Drive Struct and loading it in the Wosac bindings

- Requires recompiling the maps, though
- Didn't think about CARLA compatibility yet

Next steps:

- Do the same work with track_ids (should be easy)
- Adapt the SMART evaluation code (I hope easy as well)

* Step 2: Give the real WOMD ids in the evaluator bindings. Also give is_track_to_predict as a bool.

Next step: update the import trajectories evaluations script.

* Evaluate imported trajectories is way better now

* Warning in evaluate_imported_trajectories

* Update the map_000.bin file

* Checked that CARLA is still fine, removed comment
* Improve random map resampling code.

* Refactor env: Create separate resampling function.

* Works with wosac_use_map_as_resampling_target = False.

* More elegant and clean solution to map resampling.

* Clean up.

* Put batch iteration in the WOSACEvaluator class to avoid repetition (keep puffer code clean).

* Improve naming.

* Clean up code and drop duplicates.

* Fix util functions so that we can run wosac during training.

* Log more metrics to wandb.

* Remove unused variable map_idex.

* Fix human replay eval.

* Drop last scenario from batch as a safety measure.
@riccardosavorgnan riccardosavorgnan marked this pull request as ready for review March 3, 2026 04:11
@greptile-apps

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR merges a large batch of improvements from 3.0 into 3.0_beta, covering WOSAC evaluation robustness, simulator correctness fixes, improved demo/visualizer CLI tooling, documentation updates, and CI branch alignment.

Key changes:

  • WOSAC evaluation overhaul: WOSACEvaluator gains a batched evaluate() loop that samples scenarios until a target count is reached, a ground_truth eval mode for upper-bound scoring, and a collect_wosac_random_baseline() method. The metric pipeline is corrected — log-likelihoods are now averaged per-scenario first, then exponentiated, which is the mathematically sound approach.
  • Metric weight alignment: wosac.ini is updated to the 2024 WOSAC challenge config (traffic_light_violation weight = 0.0, time_to_collision and distance_to_road_edge weights adjusted to 0.1 each). Weights still sum to 1.0, so the removed normalization division in _compute_metametric is correct.
  • Simulator correctness: SDC is now guaranteed to be initialized first (drive.h), collision/observation loops guard against zero static_agent_count, scenario_id is stored as a fixed char[16] buffer instead of a pointer, and env_config.h correctly parses string-valued init_mode/control_mode settings.
  • Trajectory alignment improvement: KD-tree spatial alignment in evaluate_imported_trajectories.py replaced with exact (scenario_id, id) pair lookup, removing the scipy dependency.
  • Per-scenario metric columns are global scalars (evaluator.py lines 601–608): np.mean([series1, series2, …]) produces a single scalar broadcast to all rows. Per-scenario breakdowns of kinematic_metrics, interactive_metrics, and map_based_metrics are therefore incorrect; only the final aggregate is unaffected.
  • Duplicate include in drive.c (lines 4 & 6): #include "../env_config.h" appears twice — harmless due to header guards but should be removed.
  • Redundant branch in utils.py: elif len(model_files) > 0 is always True after if not model_files: returns False.
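The corrected log-likelihood pipeline called out above can be sketched in a few lines (illustrative only; the function name and array shapes are assumptions, not the actual evaluator code). Averaging in log space and then exponentiating is the geometric mean of the likelihoods, which behaves very differently from the arithmetic mean of the raw likelihoods:

```python
import numpy as np

def realism_score(log_likelihoods: np.ndarray) -> float:
    """Aggregate per-metric log-likelihoods for a single scenario.

    exp(mean(log L_i)) is the geometric mean of the likelihoods L_i,
    which is the mathematically sound aggregation described above.
    """
    return float(np.exp(np.mean(log_likelihoods)))

# Likelihoods 0.5 and 0.125: geometric mean = sqrt(0.5 * 0.125) = 0.25,
# whereas the naive arithmetic mean would give 0.3125.
score = realism_score(np.log([0.5, 0.125]))
```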

Confidence Score: 3/5

  • This PR is mostly safe but contains a logic issue in per-scenario metric aggregation that produces misleading per-scene data and a duplicate include in C source.
  • The core evaluation pipeline change (average log-likelihoods then exponentiate) is mathematically sound, simulator safety guards are improved, and the _compute_metametric normalization removal is valid. However, the per-scenario kinematic/interactive/map-based metric columns are computed as global scalars rather than per-scenario means due to incorrect use of np.mean([series1, series2, ...]). This doesn't break the aggregate WandB metrics but produces misleading per-scenario data in the returned DataFrame. The duplicate include in drive.c and the redundant branch in utils.py are minor cleanup items.
  • pufferlib/ocean/benchmark/evaluator.py (per-scenario metric columns), pufferlib/ocean/drive/drive.c (duplicate include)

Important Files Changed

Filename Overview
pufferlib/ocean/benchmark/evaluator.py Major overhaul of WOSAC evaluation: adds batched evaluate() loop, ground-truth eval mode, random baseline, and correct log-likelihood→likelihood pipeline (average log-likelihoods per scene, then exponentiate). The _compute_metametric normalization was correctly removed because weights now sum to 1.0. Key issue: per-scenario kinematic/interactive/map-based metric columns are assigned a global scalar instead of per-scenario means.
pufferlib/ocean/drive/drive.c Added CLI argument parsing for demo/visualizer tool and error handling when no active agents are found. Contains a duplicate #include "../env_config.h" (lines 4 and 6) introduced by the merge, which is harmless due to header guards but should be cleaned up.
pufferlib/ocean/drive/drive.h Changed scenario_id from char* to a fixed char[16] buffer, added SDC-first initialization logic, fixed collision_check/observation to guard static_agent_count > 0, renamed get_track_id_or_placeholder to is_in_track_to_predicts (now returns bool), and updated c_get_global_ground_truth_trajectories to expose is_vehicle and is_track_to_predict fields.
pufferlib/ocean/drive/drive.py Removed use_all_maps parameter, refactored resampling into a standalone resample_maps() method, updated ground-truth trajectory retrieval to include is_vehicle, is_track_to_predict, and string scenario_id (S16 dtype). The truncations reset was correctly removed from step() because the C layer uses memset to clear it.
pufferlib/pufferl.py Eval interval fix: changed epoch % eval_interval to (epoch - 1) % eval_interval to avoid running at epoch 0; updated eval flow to use new batched WOSACEvaluator.evaluate() and exposes new WandB metrics (kinematic, interactive, map-based scores, and std of realism score).
pufferlib/utils.py WOSAC subprocess now gracefully handles missing model files (runs with random policy instead of early-return). Contains a redundant elif len(model_files) > 0 condition that should be a plain else:. New eval hyperparameters (wosac_batch_size, wosac_target_scenarios, wosac_scenario_pool_size) are forwarded to the subprocess.
pufferlib/ocean/env_binding.h Updated ground-truth trajectory bindings to accept two new arrays (is_vehicle, is_track_to_predict) and changed scenario_id from int* to char* (16-byte fixed string). Argument count checks updated correctly (non-vec: 7→9, vec: 8→10 reflecting additional drive env handle + 2 new arrays).
pufferlib/ocean/env_config.h Fixed config parsing for init_mode and control_mode to accept human-readable string values (e.g., "create_all_valid", "control_vehicles") instead of raw integers, with fallback warnings for unknown values.
pufferlib/ocean/benchmark/evaluate_imported_trajectories.py Replaced KD-tree spatial alignment with an exact (scenario_id, id) pair lookup (align_trajectories), removing the scipy dependency. Now uses is_track_to_predict flag instead of id >= 0 to count evaluated agents. Logic is cleaner and more robust.
pufferlib/ocean/benchmark/wosac.ini Updated from 2025 to 2024 WOSAC challenge config: time_to_collision weight increased from 0.05 to 0.1, traffic_light_violation set to 0.0, distance_to_road_edge weight increased from 0.05 to 0.1. Weights still sum to 1.0, so removing the normalization in _compute_metametric is valid.
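Because the updated wosac.ini weights sum to 1.0, `_compute_metametric` can be a plain weighted sum with no trailing normalization. A minimal sketch of that invariant (function and key names are assumptions; only the three weights called out in the table come from the config, and `other_metrics` is a placeholder lumping the remaining weight):

```python
def compute_metametric(likelihoods: dict, weights: dict) -> float:
    # Weights sum to 1.0, so no division by the total weight is needed.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * likelihoods[name] for name, w in weights.items())

weights = {
    "time_to_collision": 0.1,        # from wosac.ini (2024 config)
    "distance_to_road_edge": 0.1,    # from wosac.ini (2024 config)
    "traffic_light_violation": 0.0,  # dropped in the 2024 config
    "other_metrics": 0.8,            # placeholder for the remaining weights
}
perfect = {name: 1.0 for name in weights}
score = compute_metametric(perfect, weights)  # 1.0 when every likelihood is 1.0
```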

Sequence Diagram

```mermaid
sequenceDiagram
    participant PL as pufferl.py (eval)
    participant E as WOSACEvaluator
    participant VE as VecEnv (Drive)
    participant C as C Simulator

    PL->>E: evaluate(args, vecenv, policy)
    loop Until target scenarios covered OR max_batches
        E->>VE: resample_maps() [batch > 0]
        VE->>C: vec_close / binding.shared / env_init / vec_reset
        E->>VE: collect_ground_truth_trajectories()
        VE->>C: vec_get_global_ground_truth_trajectories()<br/>(x, y, z, heading, valid, id, is_vehicle, is_track_to_predict, scenario_id)
        C-->>VE: GT trajectory arrays
        alt eval_mode == "policy"
            E->>VE: collect_simulated_trajectories(policy)
            VE->>C: vec_step (num_rollouts × sim_steps)
            C-->>VE: simulated x/y/heading
        else eval_mode == "ground_truth"
            E->>E: repeat GT trajectories as simulated
        end
        E->>E: compute_metrics(gt, sim, agent_state, road_edges)
        Note over E: avg log-likelihoods per scenario → exp → realism_meta_score
        E->>E: track unique scenario_ids
    end
    E-->>PL: df_scene_level (per-scenario scores)
    PL->>PL: df.mean() → aggregate metrics dict
    PL->>PL: log to WandB
```

Last reviewed commit: b5adcd6
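The exact (scenario_id, id) pair lookup that replaced the KD-tree spatial alignment in evaluate_imported_trajectories.py can be sketched as a pandas merge (illustrative; DataFrame and column names are assumptions, not the actual `align_trajectories` code):

```python
import pandas as pd

def align_trajectories(gt: pd.DataFrame, imported: pd.DataFrame) -> pd.DataFrame:
    """Join imported trajectories to ground truth on exact key pairs.

    An inner merge keeps only rows whose (scenario_id, id) pair exists in
    both tables, with no spatial nearest-neighbor search (and no scipy).
    """
    return gt.merge(imported, on=["scenario_id", "id"], suffixes=("_gt", "_sim"))

gt = pd.DataFrame({"scenario_id": ["a", "a"], "id": [1, 2], "x": [0.0, 1.0]})
imported = pd.DataFrame({"scenario_id": ["a"], "id": [2], "x": [1.1]})
aligned = align_trajectories(gt, imported)  # one matched row: ("a", 2)
```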


@greptile-apps greptile-apps bot left a comment


27 files reviewed, 3 comments


Comment on lines +4 to 6
```c
#include "../env_config.h"
#include <string.h>
#include "../env_config.h"
```

Duplicate #include "../env_config.h"

env_config.h is included twice — once on line 4 and again on line 6. Although the header's #ifndef ENV_CONFIG_H guard prevents a double-definition compilation error, the second include is clearly an artifact from the merge and should be removed.

Suggested change

```diff
 #include "../env_config.h"
 #include <string.h>
-#include "../env_config.h"
```

Comment on lines +124 to +128
```python
if not model_files:
    print("No model files found for WOSAC evaluation. Running WOSAC with random policy.")
elif len(model_files) > 0:
    latest_cpt = max(model_files, key=os.path.getctime)
    cmd.extend(["--load-model-path", latest_cpt])
```

Redundant elif len(model_files) > 0 condition

After the if not model_files: branch, model_files is guaranteed to be non-empty, so elif len(model_files) > 0 is always True at that point — it's dead conditional logic. Replacing it with a plain else: makes the intent clearer and avoids the misleading branch.
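The suggested cleanup can be sketched as follows (a minimal, self-contained stand-in for the utils.py logic; the wrapper function and the injectable `getctime` parameter are assumptions added for testability):

```python
import os

def extend_cmd_with_checkpoint(cmd, model_files, getctime=os.path.getctime):
    if not model_files:
        print("No model files found for WOSAC evaluation. Running WOSAC with random policy.")
    else:  # model_files is guaranteed non-empty here, so no length re-check is needed
        latest_cpt = max(model_files, key=getctime)
        cmd.extend(["--load-model-path", latest_cpt])
    return cmd

# Usage with a fake ctime lookup so the sketch runs without real files:
ctimes = {"model_1.pt": 100.0, "model_2.pt": 200.0}
cmd = extend_cmd_with_checkpoint(["python", "eval.py"], list(ctimes), ctimes.get)
# cmd now ends with ["--load-model-path", "model_2.pt"]
```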

@greptile-apps

greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

pufferlib/ocean/benchmark/evaluator.py
Per-scenario metric columns receive a global scalar, not per-scenario averages

np.mean([series1, series2, series3, series4]) where each series_i is a pandas Series of length N (one row per scenario) does not compute the element-wise (axis=1) mean. NumPy stacks the four series into a shape (4, N) array and then takes the mean over all 4 × N elements, producing a single scalar. That scalar is then broadcast to every row of df_scene_level.

As a result, df_scene_level["kinematic_metrics"], ["interactive_metrics"], and ["map_based_metrics"] are identical constants across all scenarios. While the final aggregate logged to WandB happens to be numerically correct (mean of a constant column = that constant), any per-scenario inspection of these columns will be misleading, and the intent is clearly to compute per-scenario means.

Fix by using DataFrame.mean(axis=1):

```python
kinematic_cols = [
    "likelihood_linear_speed",
    "likelihood_linear_acceleration",
    "likelihood_angular_speed",
    "likelihood_angular_acceleration",
]
interactive_cols = [
    "likelihood_collision_indication",
    "likelihood_distance_to_nearest_object",
    "likelihood_time_to_collision",
]
map_cols = [
    "likelihood_distance_to_road_edge",
    "likelihood_offroad_indication",
]

df_scene_level["kinematic_metrics"] = df_scene_level[kinematic_cols].mean(axis=1)
df_scene_level["interactive_metrics"] = df_scene_level[interactive_cols].mean(axis=1)
df_scene_level["map_based_metrics"] = df_scene_level[map_cols].mean(axis=1)
```
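The pitfall described above is easy to reproduce in isolation (illustrative toy data, two metrics over two scenarios):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0.2, 0.8])  # metric 1, one value per scenario
s2 = pd.Series([0.4, 0.6])  # metric 2, one value per scenario

# np.mean stacks the Series into a (2, N) array and averages ALL elements,
# yielding a single scalar that then broadcasts to every row:
scalar = np.mean([s1, s2])  # ≈ 0.5, not per-scenario

# DataFrame.mean(axis=1) gives the intended per-row (per-scenario) means:
per_row = pd.concat([s1, s2], axis=1).mean(axis=1)  # ≈ [0.3, 0.7]
```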
