
ricky/merge conflicts 3.0 beta and 3.0 #325

Open

riccardosavorgnan wants to merge 16 commits into 3.0_beta from ricky/merge_conflicts_3.0_beta_and_3.0

Conversation

@riccardosavorgnan
Collaborator

eugenevinitsky and others added 16 commits January 3, 2026 15:45
Authored by vinitsky.eugene@gmail.com
* Make SDC agent 0 and make sure it is always initialized and controlled

* Oops little fix

* Small fix: Make sure to increment num_actors.

---------

Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
* Fix incorrect argument name.

* Small revert.

* Incorporate PR feedback.
* quick commit so you can read the code

* Improve naming of sampling argument to better describe its function.

* Ensure that initialization works with Carla maps or other, when sdc_index=-1.

* Simplify CARLA compatibility

* Replace num_maps by wosac_num_maps in all the eval scripts

* Comment about random baseline.

* Bug fix: do not reweight by the total weight (0.95).

* Add this back for now.

* Delete agent shrinking code.

* Ignore ttc metric when agents are not vehicles

* Update WOSAC weights to align with 2024 challenge since we don't have the traffic light metric.

* Update table

* Revert MAX_AGENTS to original.

* Update formatting.

---------

Co-authored-by: Wael Boumediene Doulazmi <wbd2016@cs713.hpc.nyu.edu>
Co-authored-by: Waël Doulazmi <waeldoulazmi@gmail.com>
* Visualizer and control mode, init mode config description updates in docs

* Fix Wrong names in control modes

* add configs to drive.c, init_only_controlled and demo bug fixes, fix tables in docs

* minor fix

* Remove control_tracks_to_predict in another place
* Step 0: Adding the scenario_id in the Drive Struct and loading it in the Wosac bindings

- Requires recompiling the maps, though
- Didn't think about CARLA compatibility yet

Next steps:

- Do the same work with track_ids (should be easy)
- Adapt the SMART evaluation code (I hope easy as well)

* Step 2: Give the real WOMD ids in the evaluator bindings. Also give is_track_to_predict as a bool.

Next step: update the import trajectories evaluations script.

* Evaluate imported trajectories is way better now

* Warning in evaluate_imported_trajectories

* Update the map_000.bin file

* Checked that CARLA is still fine, removed comment
* Improve random map resampling code.

* Refactor env: Create separate resampling function.

* Works with wosac_use_map_as_resampling_target = False.

* More elegant and clean solution to map resampling.

* Clean up.

* Put batch iteration in the WOSACEvaluator class to avoid repetition (keep puffer code clean).

* Improve naming.

* Clean up code and drop duplicates.

* Fix util functions so that we can run wosac during training.

* Log more metrics to wandb.

* Remove unused variable map_idex.

* Fix human replay eval.

* Drop last scenario from batch as a safety measure.
@riccardosavorgnan riccardosavorgnan marked this pull request as ready for review March 3, 2026 04:11
@greptile-apps

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR merges a large batch of improvements from 3.0 into 3.0_beta, covering WOSAC evaluation robustness, simulator correctness fixes, improved demo/visualizer CLI tooling, documentation updates, and CI branch alignment.

Key changes:

  • WOSAC evaluation overhaul: WOSACEvaluator gains a batched evaluate() loop that samples scenarios until a target count is reached, a ground_truth eval mode for upper-bound scoring, and a collect_wosac_random_baseline() method. The metric pipeline is corrected — log-likelihoods are now averaged per-scenario first, then exponentiated, which is the mathematically sound approach.
  • Metric weight alignment: wosac.ini is updated to the 2024 WOSAC challenge config (traffic_light_violation weight = 0.0, time_to_collision and distance_to_road_edge weights adjusted to 0.1 each). Weights still sum to 1.0, so the removed normalization division in _compute_metametric is correct.
  • Simulator correctness: SDC is now guaranteed to be initialized first (drive.h), collision/observation loops guard against zero static_agent_count, scenario_id is stored as a fixed char[16] buffer instead of a pointer, and env_config.h correctly parses string-valued init_mode/control_mode settings.
  • Trajectory alignment improvement: KD-tree spatial alignment in evaluate_imported_trajectories.py replaced with exact (scenario_id, id) pair lookup, removing the scipy dependency.
  • Per-scenario metric columns are global scalars (evaluator.py lines 601–608): np.mean([series1, series2, …]) produces a single scalar broadcast to all rows. Per-scenario breakdowns of kinematic_metrics, interactive_metrics, and map_based_metrics are therefore incorrect; only the final aggregate is unaffected.
  • Duplicate include in drive.c (lines 4 & 6): #include "../env_config.h" appears twice — harmless due to header guards but should be removed.
  • Redundant branch in utils.py: elif len(model_files) > 0 is always True after if not model_files: returns False.
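The corrected log-likelihood pipeline called out above can be sketched in a few lines (illustrative only; the function name and array shapes are assumptions, not the actual evaluator code). Averaging in log space and then exponentiating is the geometric mean of the likelihoods, which behaves very differently from the arithmetic mean of the raw likelihoods:

```python
import numpy as np

def realism_score(log_likelihoods: np.ndarray) -> float:
    """Aggregate per-metric log-likelihoods for a single scenario.

    exp(mean(log L_i)) is the geometric mean of the likelihoods L_i,
    which is the mathematically sound aggregation described above.
    """
    return float(np.exp(np.mean(log_likelihoods)))

# Likelihoods 0.5 and 0.125: geometric mean = sqrt(0.5 * 0.125) = 0.25,
# whereas the naive arithmetic mean would give 0.3125.
score = realism_score(np.log([0.5, 0.125]))
```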

Confidence Score: 3/5

  • This PR is mostly safe but contains a logic issue in per-scenario metric aggregation that produces misleading per-scene data and a duplicate include in C source.
  • The core evaluation pipeline change (average log-likelihoods then exponentiate) is mathematically sound, simulator safety guards are improved, and the _compute_metametric normalization removal is valid. However, the per-scenario kinematic/interactive/map-based metric columns are computed as global scalars rather than per-scenario means due to incorrect use of np.mean([series1, series2, ...]). This doesn't break the aggregate WandB metrics but produces misleading per-scenario data in the returned DataFrame. The duplicate include in drive.c and the redundant branch in utils.py are minor cleanup items.
  • pufferlib/ocean/benchmark/evaluator.py (per-scenario metric columns), pufferlib/ocean/drive/drive.c (duplicate include)

Important Files Changed

Filename Overview
pufferlib/ocean/benchmark/evaluator.py Major overhaul of WOSAC evaluation: adds batched evaluate() loop, ground-truth eval mode, random baseline, and correct log-likelihood→likelihood pipeline (average log-likelihoods per scene, then exponentiate). The _compute_metametric normalization was correctly removed because weights now sum to 1.0. Key issue: per-scenario kinematic/interactive/map-based metric columns are assigned a global scalar instead of per-scenario means.
pufferlib/ocean/drive/drive.c Added CLI argument parsing for demo/visualizer tool and error handling when no active agents are found. Contains a duplicate #include "../env_config.h" (lines 4 and 6) introduced by the merge, which is harmless due to header guards but should be cleaned up.
pufferlib/ocean/drive/drive.h Changed scenario_id from char* to a fixed char[16] buffer, added SDC-first initialization logic, fixed collision_check/observation to guard static_agent_count > 0, renamed get_track_id_or_placeholder to is_in_track_to_predicts (now returns bool), and updated c_get_global_ground_truth_trajectories to expose is_vehicle and is_track_to_predict fields.
pufferlib/ocean/drive/drive.py Removed use_all_maps parameter, refactored resampling into a standalone resample_maps() method, updated ground-truth trajectory retrieval to include is_vehicle, is_track_to_predict, and string scenario_id (S16 dtype). The truncations reset was correctly removed from step() because the C layer uses memset to clear it.
pufferlib/pufferl.py Eval interval fix: changed epoch % eval_interval to (epoch - 1) % eval_interval to avoid running at epoch 0; updated eval flow to use new batched WOSACEvaluator.evaluate() and exposes new WandB metrics (kinematic, interactive, map-based scores, and std of realism score).
pufferlib/utils.py WOSAC subprocess now gracefully handles missing model files (runs with random policy instead of early-return). Contains a redundant elif len(model_files) > 0 condition that should be a plain else:. New eval hyperparameters (wosac_batch_size, wosac_target_scenarios, wosac_scenario_pool_size) are forwarded to the subprocess.
pufferlib/ocean/env_binding.h Updated ground-truth trajectory bindings to accept two new arrays (is_vehicle, is_track_to_predict) and changed scenario_id from int* to char* (16-byte fixed string). Argument count checks updated correctly (non-vec: 7→9, vec: 8→10 reflecting additional drive env handle + 2 new arrays).
pufferlib/ocean/env_config.h Fixed config parsing for init_mode and control_mode to accept human-readable string values (e.g., "create_all_valid", "control_vehicles") instead of raw integers, with fallback warnings for unknown values.
pufferlib/ocean/benchmark/evaluate_imported_trajectories.py Replaced KD-tree spatial alignment with an exact (scenario_id, id) pair lookup (align_trajectories), removing the scipy dependency. Now uses is_track_to_predict flag instead of id >= 0 to count evaluated agents. Logic is cleaner and more robust.
pufferlib/ocean/benchmark/wosac.ini Updated from 2025 to 2024 WOSAC challenge config: time_to_collision weight increased from 0.05 to 0.1, traffic_light_violation set to 0.0, distance_to_road_edge weight increased from 0.05 to 0.1. Weights still sum to 1.0, so removing the normalization in _compute_metametric is valid.
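Because the updated wosac.ini weights sum to 1.0, `_compute_metametric` can be a plain weighted sum with no trailing normalization. A minimal sketch of that invariant (function and key names are assumptions; only the three weights called out in the table come from the config, and `other_metrics` is a placeholder lumping the remaining weight):

```python
def compute_metametric(likelihoods: dict, weights: dict) -> float:
    # Weights sum to 1.0, so no division by the total weight is needed.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * likelihoods[name] for name, w in weights.items())

weights = {
    "time_to_collision": 0.1,        # from wosac.ini (2024 config)
    "distance_to_road_edge": 0.1,    # from wosac.ini (2024 config)
    "traffic_light_violation": 0.0,  # dropped in the 2024 config
    "other_metrics": 0.8,            # placeholder for the remaining weights
}
perfect = {name: 1.0 for name in weights}
score = compute_metametric(perfect, weights)  # 1.0 when every likelihood is 1.0
```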

Sequence Diagram

```mermaid
sequenceDiagram
    participant PL as pufferl.py (eval)
    participant E as WOSACEvaluator
    participant VE as VecEnv (Drive)
    participant C as C Simulator

    PL->>E: evaluate(args, vecenv, policy)
    loop Until target scenarios covered OR max_batches
        E->>VE: resample_maps() [batch > 0]
        VE->>C: vec_close / binding.shared / env_init / vec_reset
        E->>VE: collect_ground_truth_trajectories()
        VE->>C: vec_get_global_ground_truth_trajectories()<br/>(x, y, z, heading, valid, id, is_vehicle, is_track_to_predict, scenario_id)
        C-->>VE: GT trajectory arrays
        alt eval_mode == "policy"
            E->>VE: collect_simulated_trajectories(policy)
            VE->>C: vec_step (num_rollouts × sim_steps)
            C-->>VE: simulated x/y/heading
        else eval_mode == "ground_truth"
            E->>E: repeat GT trajectories as simulated
        end
        E->>E: compute_metrics(gt, sim, agent_state, road_edges)
        Note over E: avg log-likelihoods per scenario → exp → realism_meta_score
        E->>E: track unique scenario_ids
    end
    E-->>PL: df_scene_level (per-scenario scores)
    PL->>PL: df.mean() → aggregate metrics dict
    PL->>PL: log to WandB
```

Last reviewed commit: b5adcd6
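The exact (scenario_id, id) pair lookup that replaced the KD-tree spatial alignment in evaluate_imported_trajectories.py can be sketched as a pandas merge (illustrative; DataFrame and column names are assumptions, not the actual `align_trajectories` code):

```python
import pandas as pd

def align_trajectories(gt: pd.DataFrame, imported: pd.DataFrame) -> pd.DataFrame:
    """Join imported trajectories to ground truth on exact key pairs.

    An inner merge keeps only rows whose (scenario_id, id) pair exists in
    both tables, with no spatial nearest-neighbor search (and no scipy).
    """
    return gt.merge(imported, on=["scenario_id", "id"], suffixes=("_gt", "_sim"))

gt = pd.DataFrame({"scenario_id": ["a", "a"], "id": [1, 2], "x": [0.0, 1.0]})
imported = pd.DataFrame({"scenario_id": ["a"], "id": [2], "x": [1.1]})
aligned = align_trajectories(gt, imported)  # one matched row: ("a", 2)
```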


@greptile-apps greptile-apps bot left a comment


27 files reviewed, 3 comments


Comment on lines +4 to 6
```c
#include "../env_config.h"
#include <string.h>
#include "../env_config.h"
```

Duplicate #include "../env_config.h"

env_config.h is included twice — once on line 4 and again on line 6. Although the header's #ifndef ENV_CONFIG_H guard prevents a double-definition compilation error, the second include is clearly an artifact from the merge and should be removed.

Suggested change

```diff
 #include "../env_config.h"
 #include <string.h>
-#include "../env_config.h"
```

Comment on lines +124 to +128
```python
if not model_files:
    print("No model files found for WOSAC evaluation. Running WOSAC with random policy.")
elif len(model_files) > 0:
    latest_cpt = max(model_files, key=os.path.getctime)
    cmd.extend(["--load-model-path", latest_cpt])
```

Redundant elif len(model_files) > 0 condition

After the if not model_files: branch, model_files is guaranteed to be non-empty, so elif len(model_files) > 0 is always True at that point — it's dead conditional logic. Replacing it with a plain else: makes the intent clearer and avoids the misleading branch.
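The suggested cleanup can be sketched as follows (a minimal, self-contained stand-in for the utils.py logic; the wrapper function and the injectable `getctime` parameter are assumptions added for testability):

```python
import os

def extend_cmd_with_checkpoint(cmd, model_files, getctime=os.path.getctime):
    if not model_files:
        print("No model files found for WOSAC evaluation. Running WOSAC with random policy.")
    else:  # model_files is guaranteed non-empty here, so no length re-check is needed
        latest_cpt = max(model_files, key=getctime)
        cmd.extend(["--load-model-path", latest_cpt])
    return cmd

# Usage with a fake ctime lookup so the sketch runs without real files:
ctimes = {"model_1.pt": 100.0, "model_2.pt": 200.0}
cmd = extend_cmd_with_checkpoint(["python", "eval.py"], list(ctimes), ctimes.get)
# cmd now ends with ["--load-model-path", "model_2.pt"]
```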

@greptile-apps

greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

pufferlib/ocean/benchmark/evaluator.py
Per-scenario metric columns receive a global scalar, not per-scenario averages

np.mean([series1, series2, series3, series4]) where each series_i is a pandas Series of length N (one row per scenario) does not compute the element-wise (axis=1) mean. NumPy stacks the four series into a shape (4, N) array and then takes the mean over all 4 × N elements, producing a single scalar. That scalar is then broadcast to every row of df_scene_level.

As a result, df_scene_level["kinematic_metrics"], ["interactive_metrics"], and ["map_based_metrics"] are identical constants across all scenarios. While the final aggregate logged to WandB happens to be numerically correct (mean of a constant column = that constant), any per-scenario inspection of these columns will be misleading, and the intent is clearly to compute per-scenario means.

Fix by using DataFrame.mean(axis=1):

```python
kinematic_cols = [
    "likelihood_linear_speed",
    "likelihood_linear_acceleration",
    "likelihood_angular_speed",
    "likelihood_angular_acceleration",
]
interactive_cols = [
    "likelihood_collision_indication",
    "likelihood_distance_to_nearest_object",
    "likelihood_time_to_collision",
]
map_cols = [
    "likelihood_distance_to_road_edge",
    "likelihood_offroad_indication",
]

df_scene_level["kinematic_metrics"] = df_scene_level[kinematic_cols].mean(axis=1)
df_scene_level["interactive_metrics"] = df_scene_level[interactive_cols].mean(axis=1)
df_scene_level["map_based_metrics"] = df_scene_level[map_cols].mean(axis=1)
```
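The pitfall described above is easy to reproduce in isolation (illustrative toy data, two metrics over two scenarios):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0.2, 0.8])  # metric 1, one value per scenario
s2 = pd.Series([0.4, 0.6])  # metric 2, one value per scenario

# np.mean stacks the Series into a (2, N) array and averages ALL elements,
# yielding a single scalar that then broadcasts to every row:
scalar = np.mean([s1, s2])  # ≈ 0.5, not per-scenario

# DataFrame.mean(axis=1) gives the intended per-row (per-scenario) means:
per_row = pd.concat([s1, s2], axis=1).mean(axis=1)  # ≈ [0.3, 0.7]
```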
