Skip to content

perf(ovrtx): batched tile extraction — single kernel launch for all envs#5136

Open
ncournia-nv wants to merge 1 commit intoisaac-sim:developfrom
ncournia-nv:perf/ovrtx-optimizations
Open

perf(ovrtx): batched tile extraction — single kernel launch for all envs#5136
ncournia-nv wants to merge 1 commit intoisaac-sim:developfrom
ncournia-nv:perf/ovrtx-optimizations

Conversation

@ncournia-nv
Copy link
Copy Markdown
Collaborator

@ncournia-nv ncournia-nv commented Apr 1, 2026

Summary

Replaces per-environment tile extraction loops with batched GPU kernels that extract all tiles in a single launch. This eliminates N-1 redundant kernel launches per frame per data type (where N = num_envs).

Changes

  • New kernels: extract_all_rgba_tiles_kernel and extract_all_depth_tiles_kernel — 3D-indexed kernels (env_idx, y, x) that extract all environment tiles from the tiled framebuffer in one launch
  • Renderer: _extract_rgba_tiles() and _extract_depth_tiles() now call the batched kernels instead of looping over environments with per-tile kernels
  • Cleanup: Removed unused per-env kernel imports (extract_tile_from_tiled_buffer_kernel, extract_depth_tile_from_tiled_buffer_kernel)

Performance

Task: Isaac-Repose-Cube-Shadow-Vision-Benchmark-Direct-v0
Config: 1225 envs, 5 epochs, Newton + OVRTX renderer, L40 GPU

Avg FPS (step) Wall time
Baseline (develop) 5,657 4m 11s
This PR 6,867 3m 59s
Speedup 1.21× 12s saved

The improvement comes from reducing kernel launch overhead — at 1225 envs, each frame previously required 1225 separate kernel launches per data type. The batched approach does it in one.

Measured without nsys profiling to avoid instrumentation skew. Under nsys the delta appears larger (~3×) because profiling overhead disproportionately affects the per-env loop.

@github-actions github-actions bot added the isaac-lab Related to Isaac Lab team label Apr 1, 2026
@ncournia-nv ncournia-nv force-pushed the perf/ovrtx-optimizations branch from dc82943 to 97ed520 Compare April 1, 2026 00:36
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #5136

Reviewer: 🤖 Bob (automated)

Summary

Replaces N per-env Warp kernel launches with single 3D-batched launches for tile extraction, and defers extraction from render() to write_output(). Well-motivated optimization with solid benchmark numbers (+23.7% FPS).

Design Assessment

Batched kernels (commit 1): Clean, correct, textbook GPU optimization. Eliminates O(num_envs) kernel launch overhead. The 3D (num_envs, height, width) dispatch maps naturally to the tile grid. No correctness concerns.

Deferred extraction (commit 2): Architecturally sound — separates render kick from buffer consumption. The lazy _ensure_extraction() guard pattern is simple and effective. However, there are several issues around state management and frame lifetime that need attention.

Architecture Impact

Low risk. Changes are confined to OVRTXRenderer internals. The BaseRenderer interface contract (render() then write_output()) is preserved. No downstream API changes.

Implementation Verdict

🟡 Approve with nits. The core optimization is correct and valuable. There are pre-commit CI failures that must be fixed, plus a few state management hygiene issues.

CI Status

  • pre-commit: ❌ FAIL — unused imports + quote style (see inline comments)
  • Other checks: pending/skipping (blocked on pre-commit)

Branch Status

  • Mergeable: ✅ MERGEABLE

Findings Summary

# Severity File Issue
1 🔴 blocker ovrtx_renderer.py Unused imports fail pre-commit (CI red)
2 🔴 blocker ovrtx_renderer.py Single quotes in _ensure_extraction fail pre-commit
3 🟡 medium ovrtx_renderer.py _pending_* attrs set via monkey-patching, not initialized in __init__
4 🟡 medium ovrtx_renderer.py _pending_products stored but never read — dead state
5 🟢 low ovrtx_renderer.py cleanup() doesn't clear deferred extraction state
6 🟢 nit ovrtx_renderer_kernels.py Old single-env kernels are now dead code (not in this PR's scope, but worth noting)

create_camera_transforms_kernel,
extract_all_depth_tiles_kernel,
extract_all_rgba_tiles_kernel,
extract_depth_tile_from_tiled_buffer_kernel,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocker: Unused imports fail pre-commit CI.

The old per-env kernels extract_depth_tile_from_tiled_buffer_kernel and extract_tile_from_tiled_buffer_kernel are still imported but no longer called anywhere in this file. Pre-commit's unused-import linter catches this — CI is red.

Suggested change
extract_depth_tile_from_tiled_buffer_kernel,
from .ovrtx_renderer_kernels import (
DEVICE,
create_camera_transforms_kernel,
extract_all_depth_tiles_kernel,
extract_all_rgba_tiles_kernel,
generate_random_colors_from_ids_kernel,
sync_newton_transforms_kernel,
)


def _ensure_extraction(self, render_data: OVRTXRenderData) -> None:
"""Extract tiles from the pending frame if not already done."""
if getattr(render_data, '_extraction_done', True):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocker: Single quotes fail pre-commit quote style.

Project convention is double quotes. Pre-commit flags '_extraction_done' and '_pending_frame'.

Suggested change
if getattr(render_data, '_extraction_done', True):
if getattr(render_data, "_extraction_done", True):
return
frame = getattr(render_data, "_pending_frame", None)

# Store frame for deferred extraction instead of extracting immediately
render_data._pending_frame = products[product_path].frames[0]
render_data._pending_products = products
render_data._extraction_done = False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium: Monkey-patched attributes on OVRTXRenderData.

_pending_frame, _pending_products, and _extraction_done are dynamically set on render_data instances in render() without being declared in OVRTXRenderData.__init__. This forces _ensure_extraction to use defensive getattr(..., default) calls.

Prefer initializing these in OVRTXRenderData.__init__:

self._pending_frame = None
self._pending_products = None
self._extraction_done = True

This makes the state machine explicit, enables IDE autocompletion/type checking, and removes the need for getattr guards.

)
# Store frame for deferred extraction instead of extracting immediately
render_data._pending_frame = products[product_path].frames[0]
render_data._pending_products = products
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium: _pending_products is stored but never read.

render_data._pending_products = products is set in render() and cleared in _ensure_extraction(), but nothing ever reads it. _process_render_frame() only uses the frame object, not the products dict.

If this is reserved for future use (e.g., multi-product extraction), add a comment explaining the intent. Otherwise, remove the dead state to keep the deferred extraction state machine minimal.

self._process_render_frame(render_data, frame, render_data.warp_buffers)
render_data._extraction_done = True
render_data._pending_frame = None
render_data._pending_products = None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Nit: cleanup() should clear deferred state.

If cleanup() is called between render() and write_output() (e.g. during error teardown), the stale _pending_frame reference keeps the OVRTX frame object alive longer than expected, potentially holding GPU resources. Consider:

if render_data is not None:
    render_data.sensor = None
    render_data._pending_frame = None
    render_data._pending_products = None  
    render_data._extraction_done = True

@ncournia-nv ncournia-nv force-pushed the perf/ovrtx-optimizations branch from 97ed520 to 8bb4cc0 Compare April 1, 2026 00:38
@ncournia-nv ncournia-nv marked this pull request as ready for review April 1, 2026 21:07
@ncournia-nv ncournia-nv force-pushed the perf/ovrtx-optimizations branch from 8bb4cc0 to c3efe96 Compare April 1, 2026 21:10
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR delivers a ~24 % throughput improvement for vision-based RL training by making two targeted changes to OVRTXRenderer:

  • Batched tile extraction — replaces N per-environment Warp kernel launches (one per env) with a single 3-D kernel (num_envs, H, W) for both RGBA and depth buffers, dramatically reducing kernel-launch overhead at scale (e.g. 1,225 launches → 1 per buffer type).
  • Deferred tile extraction — moves the tile copy from render() to write_output() via an _ensure_extraction() lazy pattern, allowing physics/training compute to overlap with rendering between steps.

The core logic is sound: tile-coordinate arithmetic in the new kernels exactly mirrors the old per-env kernels, _pending_products is kept alive to prevent the renderer from freeing GPU-mapped memory before extraction, and exception/empty-frame paths all correctly set _extraction_done = True to avoid stale state. All remaining findings are P2 style or defensive-hardening suggestions.

Confidence Score: 5/5

  • Safe to merge — all findings are P2 style/hardening suggestions with no correctness risk on the happy path.
  • Both functional changes (batched kernels and deferred extraction) are logically correct. The batched index arithmetic is identical to the old per-env logic. The deferred state is guarded with getattr defaults that prevent crashes even when attributes are absent. Exception and empty-frame paths correctly flush pending state. The only concerns are: an unused suffix parameter, missing __init__ declarations for the three deferred-state fields, and the lack of a guard against a double-render overwriting an unconsumed frame — none of which affect normal operation.
  • No files require special attention; minor cleanup suggestions are confined to ovrtx_renderer.py.

Important Files Changed

Filename Overview
source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py Introduces deferred tile extraction via _ensure_extraction() / _pending_frame pattern and switches to batched single-kernel launches; logic is correct but the three deferred-state attributes are not initialised in OVRTXRenderData.__init__ and a double-render without extraction silently drops the first frame.
source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer_kernels.py Adds extract_all_rgba_tiles_kernel and extract_all_depth_tiles_kernel — 3-D Warp kernels that replace N per-env launches with a single (num_envs, H, W) dispatch; index arithmetic mirrors the old per-env kernels and is correct.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant OVRTXRenderer
    participant OVRTXRenderData
    participant OVRTX as ovrtx.Renderer
    participant Warp

    Note over Caller,Warp: Per-step training loop

    Caller->>OVRTXRenderer: render(render_data)
    OVRTXRenderer->>OVRTX: _renderer.step(render_products)
    OVRTX-->>OVRTXRenderer: products (frames)
    OVRTXRenderer->>OVRTXRenderData: _pending_frame = frames[0]
    OVRTXRenderer->>OVRTXRenderData: _extraction_done = False
    Note over Caller,Warp: Training / physics work runs here (overlap)

    Caller->>OVRTXRenderer: write_output(render_data, "rgba", tensor)
    OVRTXRenderer->>OVRTXRenderer: _ensure_extraction(render_data)
    alt _extraction_done == False
        OVRTXRenderer->>OVRTXRenderer: _process_render_frame(render_data, frame, warp_buffers)
        OVRTXRenderer->>Warp: launch extract_all_rgba_tiles_kernel(num_envs, H, W)
        OVRTXRenderer->>Warp: launch extract_all_depth_tiles_kernel(num_envs, H, W)
        OVRTXRenderer->>OVRTXRenderData: _extraction_done = True
        OVRTXRenderer->>OVRTXRenderData: _pending_frame = None
    else _extraction_done == True
        Note over OVRTXRenderer: no-op, already extracted
    end
    OVRTXRenderer->>Warp: wp.copy(output_tensor ← warp_buffer["rgba"])
    Warp-->>Caller: output tensor filled
Loading

Comments Outside Diff (2)

  1. source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py, line 423-443 (link)

    P2 Unused suffix parameter retained in signature

    The suffix: str = "" parameter is accepted but never referenced in the method body. It was presumably meaningful in the per-env kernel dispatch (e.g. for tagging log output), but the new batched implementation dropped all uses. Keeping it silently encourages callers to pass values that have no effect.

  2. source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py, line 64-103 (link)

    P2 Deferred-state attributes not initialized in __init__

    _pending_frame, _pending_products, and _extraction_done are only set inside render() and accessed (defensively via getattr) in _ensure_extraction(). If any code path calls write_output() (which calls _ensure_extraction()) before render() has run once, the getattr(..., True) default silently short-circuits — this works, but it means the missing-attribute branch is the "normal" code path on the very first call rather than an edge case.

    Initializing them explicitly in __init__ makes the lifecycle clear and avoids relying on getattr defaults for routine operation:

    # Inside OVRTXRenderData.__init__, after creating warp_buffers:
    self._pending_frame = None
    self._pending_products = None
    self._extraction_done = True

Reviews (1): Last reviewed commit: "perf(ovrtx): deferred tile extraction - ..." | Re-trigger Greptile

Comment on lines +534 to +544
)
product_path = self._render_product_paths[0]
if product_path in products and len(products[product_path].frames) > 0:
self._process_render_frame(
render_data,
products[product_path].frames[0],
render_data.warp_buffers,
)
# Store frame for deferred extraction instead of extracting immediately
render_data._pending_frame = products[product_path].frames[0]
render_data._pending_products = products
render_data._extraction_done = False
else:
render_data._pending_frame = None
render_data._pending_products = None
render_data._extraction_done = True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent frame drop when render() is called before the previous extraction completes

If render() is invoked a second time while _extraction_done is still False (i.e. a pending frame has not yet been consumed by write_output()), the pending frame is silently overwritten. Because the call to _renderer.step() at the top of render() replaces _pending_frame unconditionally, the first frame is lost with no warning.

Under the expected single-step call pattern this won't happen, but a defensive guard (log-warning + eager extraction) would make it robust to mis-use:

# At the start of the try block, before calling _renderer.step():
if not getattr(render_data, "_extraction_done", True):
    logger.warning("render() called before previous frame was consumed; forcing extraction.")
    self._ensure_extraction(render_data)

@ncournia-nv ncournia-nv force-pushed the perf/ovrtx-optimizations branch from 9cf03d0 to 4056829 Compare April 3, 2026 06:17
@ncournia-nv ncournia-nv changed the title perf(ovrtx): batched + deferred tile extraction for OVRTXRenderer perf(ovrtx): batched tile extraction — single kernel launch for all envs Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant