Dense startup time profiling #5072
Conversation
- Add error handling for YAML whitelist loading (OSError, YAMLError, type validation)
- Use None instead of 0.0 for missing Timer sub-timings with info log
- Warn when IsaacLab source directory is missing (empty prefix list)
- Warn on unmatched whitelist patterns to catch typos
- Use try/finally for cProfile disable to handle exceptions cleanly
- Fix orphaned docstring (convert to comment)
- Improve main() and module docstrings for accuracy
- Remove phase numbering from section comments (avoid confusion)
- Fix Timer comment wording (backends -> environment types)
- Fix top_n docstring (remove "per phase")
- Remove dead print_startup_summary function
The humanoid observations pattern needs a leading wildcard to match the full module path (isaaclab_tasks.manager_based.classic.humanoid...).
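A minimal sketch of why the leading wildcard matters, using Python's `fnmatch` (the label string below is illustrative, modeled on the module path mentioned above):

```python
from fnmatch import fnmatch

# Full dotted module path as it appears in profile labels
label = "isaaclab_tasks.manager_based.classic.humanoid.humanoid_env_cfg:observations"

# Without a leading wildcard, the pattern must match from the start of the string
print(fnmatch(label, "humanoid*"))   # False: the label starts with "isaaclab_tasks"
print(fnmatch(label, "*humanoid*"))  # True: the leading wildcard absorbs the module prefix
```

This is why a pattern intended to match a nested module needs `*` on both sides.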
Greptile Summary

This PR introduces a dense startup-time profiling benchmark.

Three logic-level issues were found.

Confidence Score: 2/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as __main__
    participant LS as launch_simulation()
    participant M as main()
    participant GYM as gym.make()
    participant ENV as env
    participant BP as BaseIsaacLabBenchmark
    participant BE as Backend (SummaryMetrics etc.)
    CLI->>LS: app_launch_profile.enable() → with launch_simulation(env_cfg, args_cli)
    LS-->>CLI: context entered (app_launch_profile.disable())
    CLI->>M: main(env_cfg, app_launch_profile, app_launch_wall_ms)
    M->>GYM: env_creation_profile.enable() → gym.make(task, cfg)
    GYM-->>M: env
    M->>ENV: env.reset() [inner try/finally disables env_creation_profile]
    ENV-->>M: obs
    M->>ENV: env.step(actions) [first_step_profile try/finally]
    ENV-->>M: obs, reward, ...
    M->>BP: benchmark.add_measurement(phase, SingleMeasurement) ×N phases
    M->>BP: benchmark.update_manual_recorders()
    M->>BP: benchmark._finalize_impl()
    BP->>BE: backend.finalize(output_path, output_filename)
    BE-->>CLI: JSON + console summary printed
    M->>ENV: env.close() [outer finally]
```
Last reviewed commit: "Document reliance on..."
- Make --task required to fail early on missing argument
- Validate whitelist YAML values are list[str] per phase
- Warn on unknown phase names in whitelist config
- Wrap post-env-creation code in try/finally for env.close()
- Remove [UNMATCHED] prefix from placeholder labels to keep dashboard keys stable (matches docstring/RST contract)
- Move fnmatch import to function top, replace warnings.warn with print("[WARNING]") for consistency
- Remove stale "Launch Isaac Sim Simulator first" comment
- Use _ for unused lineno in pstats tuple unpacking
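The whitelist validation and unknown-phase warning described in the list above could look roughly like this sketch (the phase names in `KNOWN_PHASES`, the function name, and the config shape are assumptions for illustration, not the PR's actual code):

```python
# Hypothetical phase names; the benchmark's real phase set may differ.
KNOWN_PHASES = {"imports", "app_launch", "env_creation", "reset", "first_step"}

def validate_whitelist(config: dict) -> dict:
    """Validate that each phase maps to a list of glob-pattern strings."""
    validated = {}
    for phase, patterns in config.items():
        if phase not in KNOWN_PHASES:
            # Warn instead of failing, so a typo in one phase name
            # does not silently drop the rest of the config.
            print(f"[WARNING] Unknown phase '{phase}' in whitelist config; skipping.")
            continue
        if not isinstance(patterns, list) or not all(isinstance(p, str) for p in patterns):
            raise TypeError(f"Whitelist entry for phase '{phase}' must be a list of strings.")
        validated[phase] = patterns
    return validated
```

Raising on a malformed value while only warning on an unknown key matches the commit's split between hard validation and soft typo detection.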
@greptile
Move the try/finally guarding env.close() to wrap everything after gym.make() succeeds, so env.reset() failures also trigger cleanup.
@greptile
```python
{"name": "seed", "data": args_cli.seed},
{"name": "num_envs", "data": args_cli.num_envs},
{"name": "top_n", "data": args_cli.top_n},
{"name": "presets", "data": get_preset_string(hydra_args)},
```
env_creation_profile left enabled on BaseException
except Exception does not catch KeyboardInterrupt, SystemExit, or any other BaseException subclass. If gym.make() is interrupted by a signal or a hard abort, the profile object is left in the enabled state for the remainder of the process lifetime, which silently continues accumulating CPU-time into the stale profile.
The standard pattern is to use a finally clause (or a bare except) so the profile is always stopped:
```python
env_creation_profile.enable()
try:
    env = gym.make(args_cli.task, cfg=env_cfg)
except BaseException:
    env_creation_profile.disable()
    raise
```

Or, more idiomatically, fold the disable into a try/finally around the whole block, the same way first_step_profile is handled a few lines below.
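A runnable sketch of the try/finally variant, with `gym.make()` replaced by a hypothetical stand-in that simulates a Ctrl-C mid-creation:

```python
import cProfile

def make_env():
    # Hypothetical stand-in for gym.make(); simulates an interrupt during creation.
    raise KeyboardInterrupt

profile = cProfile.Profile()
interrupted = False
try:
    profile.enable()
    try:
        env = make_env()
    finally:
        profile.disable()  # runs even for KeyboardInterrupt/SystemExit
except KeyboardInterrupt:
    interrupted = True  # the profiler was stopped before the interrupt propagated
```

Because `finally` runs for every BaseException subclass, the profile can never be left accumulating CPU time after an abort.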
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Antoine RICHARD <antoiner@nvidia.com>
except Exception misses KeyboardInterrupt and SystemExit, leaving the profiler enabled for the rest of the process lifetime.
Startup Profiling Benchmark — Code Review
Overall this is a solid, well-structured addition. The five-phase profiling architecture is clean, the whitelist/fnmatch system is genuinely useful for dashboard stability, and the error handling has clearly gotten multiple passes of improvement. A few issues below — one actual bug, a couple of design nits.
Verdict: Approve with nits. The seed bug (comment 1) should be fixed before merge; the rest are minor.
```python
# Override config with CLI args
env_cfg.scene.num_envs = args_cli.num_envs if args_cli.num_envs is not None else env_cfg.scene.num_envs
env_cfg.sim.device = args_cli.device if args_cli.device is not None else env_cfg.sim.device
env_cfg.seed = args_cli.seed
```
Bug: unconditional seed override clobbers config default.
```python
env_cfg.seed = args_cli.seed
```

When --seed is not passed, args_cli.seed is None. This unconditionally overwrites whatever env_cfg.seed was — either the task config's default or the conditional assignment at module-level line 166.
Line 166 already does the right thing:
```python
env_cfg.seed = args_cli.seed if args_cli.seed is not None else env_cfg.seed
```

This line should match that pattern, or just be removed since the module-level assignment already ran:

```python
# Either:
env_cfg.seed = args_cli.seed if args_cli.seed is not None else env_cfg.seed
# Or just delete this line — line 166 already handled it.
```
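The None-means-unset pattern behind this fix generalizes to any argparse-driven override. A self-contained sketch (the parser here is illustrative, not the script's actual argument setup):

```python
import argparse

# Illustrative parser; the real script defines its arguments elsewhere.
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None)  # None signals "not passed"

config_default_seed = 42  # stand-in for the task config's default seed

# Run without --seed: the config default must survive.
args = parser.parse_args([])
seed = args.seed if args.seed is not None else config_default_seed

# Run with --seed 7: the CLI value wins.
args2 = parser.parse_args(["--seed", "7"])
seed2 = args2.seed if args2.seed is not None else config_default_seed
```

An unconditional `config_seed = args.seed` would silently set the seed to None in the first case, which is exactly the bug flagged above.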
```python
# -- Create the benchmark instance ------------------------------------------

env_cfg.seed = args_cli.seed if args_cli.seed is not None else env_cfg.seed
```
Nit: module-level seed assignment is redundant with main(). This line correctly does the conditional assignment, but then main() line 206 unconditionally overwrites it (see other comment). If you fix line 206 to be conditional too, consider removing one of the two assignments to avoid confusion about which one "wins".
Since env_cfg is passed into main() and main() also sets num_envs and device there, it'd be cleaner to do all config overrides in one place — inside main() — and remove this line.
```python
# Finalize benchmark output
benchmark.update_manual_recorders()
benchmark._finalize_impl()
```
Minor: calling private _finalize_impl() directly.
I know other benchmark scripts (benchmark_non_rl.py) do this too, so this is a pre-existing pattern rather than a problem you introduced. Still worth noting: if BaseIsaacLabBenchmark ever gains a public finalize() wrapper (which it should), this should switch to it.
```python
imports_profile.disable()

if torch.cuda.is_available() and torch.cuda.is_initialized():
```
Design note: cuda.is_initialized() check may miss lazy CUDA init.
At this point in the script, CUDA may not yet be initialized (torch was just imported but no tensor ops have run). If CUDA isn't initialized, is_initialized() returns False and the synchronize is skipped — which is correct, since there's nothing to sync. But it means the wall-clock imports_time_end doesn't account for any deferred GPU init that gets triggered later.
This is fine for a profiling script (you're measuring import time, not GPU init time), but worth a comment explaining the intent — especially since the same pattern at lines 225 and 245 does catch active GPU work.
```python
for func_key, (_, _, tottime, cumtime, callers) in stats.stats.items():
    filename, _, funcname = func_key
    if _is_isaaclab(filename):
        label = _make_label(filename, funcname)
```
Observation (not a bug): external functions can appear with abbreviated labels. When an external function (e.g., torch.nn.modules.linear:forward) is directly called by an IsaacLab function, _make_label truncates the path to site-packages-relative or last-3-components. This is fine for readability but means the label isn't a valid Python import path — it's a display-only string. Worth a brief docstring note on _make_label that labels are for display, not programmatic use.
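A sketch of the display-only truncation described here, following the "last 3 components" rule from the comment (the helper name and exact behavior are assumptions; the PR's `_make_label` may differ):

```python
def make_display_label(filename: str, funcname: str) -> str:
    """Shorten a file path to its last 3 components for display.

    The result is a human-readable label, NOT a valid Python import path:
    the module prefix above the last 3 components is discarded.
    """
    parts = filename.replace("\\", "/").split("/")
    stem = "/".join(parts[-3:])
    if stem.endswith(".py"):
        stem = stem[:-3]
    return f"{stem.replace('/', '.')}:{funcname}"

# A deep site-packages path collapses to a short, dotted display label.
label = make_display_label(
    "/usr/lib/python3.10/site-packages/torch/nn/modules/linear.py", "forward"
)
```

Trying to `import` such a label would fail, which is why a docstring note marking it display-only is worthwhile.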
```python
# Sample random actions
actions = (
    torch.rand(env.unwrapped.num_envs, env.unwrapped.single_action_space.shape[0], device=env.unwrapped.device)
```
Nit: single_action_space may not exist for all env types.
For DirectMARLEnv, the action space structure is different — single_action_space exists but might be a Dict space, not a Box, so .shape[0] would fail. The type signature of main() accepts DirectMARLEnvCfg, but this line assumes a flat action space.
Since this is a benchmarking script (not a general-purpose tool), this is acceptable if the docs/help text clarify it's for single-agent RL envs. But if MARL envs are intended to be supported, this needs a check.
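If such a check is desired, a guard along these lines could fail fast on non-flat spaces (the space object below is a plain stand-in, not a gymnasium class, and the function name is hypothetical):

```python
from types import SimpleNamespace

def flat_action_dim(action_space) -> int:
    """Return the flat action dimension, rejecting composite (Dict-style) spaces."""
    shape = getattr(action_space, "shape", None)
    if not shape:  # Dict/composite spaces expose no usable flat shape
        raise TypeError(
            "This benchmark assumes a flat (Box-like) action space; "
            f"got {type(action_space).__name__} without a usable .shape"
        )
    return shape[0]

# Usage with a stand-in for a flat Box space:
box_like = SimpleNamespace(shape=(6,))
dim = flat_action_dim(box_like)
```

The explicit TypeError turns a confusing attribute error deep inside `torch.rand` into a message naming the actual limitation.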
```python
if isinstance(measurement, StatisticalMeasurement):
    unit_str = f" {measurement.unit.strip()}" if (measurement.unit and measurement.unit.strip()) else ""
    value = f"{self._format_scalar(measurement.mean)}{unit_str}"
elif isinstance(measurement, SingleMeasurement):
```
Good addition. The catch-all rendering for unknown phases is clean and doesn't break existing behavior. One minor style nit: the unit_str construction is duplicated between the frametime_phase block above and this new block. Could extract a helper like _format_measurement_value(measurement) to DRY it up, but that's cosmetic.
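The suggested extraction could look roughly like this (the `Measurement` dataclass is a minimal stand-in for the benchmark's measurement types, and the helper name follows the comment's suggestion):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Measurement:
    """Minimal stand-in for SingleMeasurement/StatisticalMeasurement."""
    value: float
    unit: Optional[str] = None

def format_measurement_value(m: Measurement) -> str:
    # Append the unit only when it is present and non-blank,
    # mirroring the duplicated unit_str construction in both branches.
    unit_str = f" {m.unit.strip()}" if (m.unit and m.unit.strip()) else ""
    return f"{m.value:.3f}{unit_str}"
```

Both the frametime_phase block and the new catch-all branch could then call the one helper, so a future change to unit formatting happens in one place.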
@AntoineRichard nice feature
Description
Allows for dense startup time profiling.
Type of change
Checklist

- [ ] I have run the `pre-commit` checks with `./isaaclab.sh --format`
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [ ] I have added my name to `CONTRIBUTORS.md` or my name already exists there