perf(ovrtx): batched tile extraction — single kernel launch for all envs by ncournia-nv · Pull Request #5136 · isaac-sim/IsaacLab

ncournia-nv · 2026-04-01T00:33:29Z

Summary

Replaces per-environment tile extraction loops with batched GPU kernels that extract all tiles in a single launch. This eliminates N-1 redundant kernel launches per frame per data type (where N = num_envs).

Changes

New kernels: extract_all_rgba_tiles_kernel and extract_all_depth_tiles_kernel — 3D-indexed kernels (env_idx, y, x) that extract all environment tiles from the tiled framebuffer in one launch
Renderer: _extract_rgba_tiles() and _extract_depth_tiles() now call the batched kernels instead of looping over environments with per-tile kernels
Cleanup: Removed unused per-env kernel imports (extract_tile_from_tiled_buffer_kernel, extract_depth_tile_from_tiled_buffer_kernel)

Performance

Task: Isaac-Repose-Cube-Shadow-Vision-Benchmark-Direct-v0
Config: 1225 envs, 5 epochs, Newton + OVRTX renderer, L40 GPU

	Avg FPS (step)	Wall time
Baseline (develop)	5,657	4m 11s
This PR	6,867	3m 59s
Speedup	1.21×	12s saved

The improvement comes from reducing kernel launch overhead — at 1225 envs, each frame previously required 1225 separate kernel launches per data type. The batched approach does it in one.

Measured without nsys profiling to avoid instrumentation skew. Under nsys the delta appears larger (~3×) because profiling overhead disproportionately affects the per-env loop.

isaaclab-review-bot

Code Review — PR #5136

Reviewer: 🤖 Bob (automated)

Summary

Replaces N per-env Warp kernel launches with single 3D-batched launches for tile extraction, and defers extraction from render() to write_output(). Well-motivated optimization with solid benchmark numbers (+23.7% FPS).

Design Assessment

Batched kernels (commit 1): Clean, correct, textbook GPU optimization. Eliminates O(num_envs) kernel launch overhead. The 3D (num_envs, height, width) dispatch maps naturally to the tile grid. No correctness concerns.

Deferred extraction (commit 2): Architecturally sound — separates render kick from buffer consumption. The lazy _ensure_extraction() guard pattern is simple and effective. However, there are several issues around state management and frame lifetime that need attention.

Architecture Impact

Low risk. Changes are confined to OVRTXRenderer internals. The BaseRenderer interface contract (render() then write_output()) is preserved. No downstream API changes.

Implementation Verdict

🟡 Approve with nits. The core optimization is correct and valuable. There are pre-commit CI failures that must be fixed, plus a few state management hygiene issues.

CI Status

pre-commit: ❌ FAIL — unused imports + quote style (see inline comments)
Other checks: pending/skipping (blocked on pre-commit)

Branch Status

Mergeable: ✅ MERGEABLE

Findings Summary

#	Severity	File	Issue
1	🔴 blocker	ovrtx_renderer.py	Unused imports fail pre-commit (CI red)
2	🔴 blocker	ovrtx_renderer.py	Single quotes in `_ensure_extraction` fail pre-commit
3	🟡 medium	ovrtx_renderer.py	`_pending_*` attrs set via monkey-patching, not initialized in `__init__`
4	🟡 medium	ovrtx_renderer.py	`_pending_products` stored but never read — dead state
5	🟢 low	ovrtx_renderer.py	`cleanup()` doesn't clear deferred extraction state
6	🟢 nit	ovrtx_renderer_kernels.py	Old single-env kernels are now dead code (not in this PR's scope, but worth noting)

isaaclab-review-bot · 2026-04-01T00:37:07Z

source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py

    create_camera_transforms_kernel,
+    extract_all_depth_tiles_kernel,
+    extract_all_rgba_tiles_kernel,
    extract_depth_tile_from_tiled_buffer_kernel,


🔴 Blocker: Unused imports fail pre-commit CI.

The old per-env kernels extract_depth_tile_from_tiled_buffer_kernel and extract_tile_from_tiled_buffer_kernel are still imported but no longer called anywhere in this file. Pre-commit's unused-import linter catches this — CI is red.

Suggested change

extract_depth_tile_from_tiled_buffer_kernel,

from .ovrtx_renderer_kernels import (

DEVICE,

create_camera_transforms_kernel,

extract_all_depth_tiles_kernel,

extract_all_rgba_tiles_kernel,

generate_random_colors_from_ids_kernel,

sync_newton_transforms_kernel,

)

isaaclab-review-bot · 2026-04-01T00:37:07Z

source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py

+
+    def _ensure_extraction(self, render_data: OVRTXRenderData) -> None:
+        """Extract tiles from the pending frame if not already done."""
+        if getattr(render_data, '_extraction_done', True):


🔴 Blocker: Single quotes fail pre-commit quote style.

Project convention is double quotes. Pre-commit flags '_extraction_done' and '_pending_frame'.

Suggested change

if getattr(render_data, '_extraction_done', True):

if getattr(render_data, "_extraction_done", True):

return

frame = getattr(render_data, "_pending_frame", None)

isaaclab-review-bot · 2026-04-01T00:37:07Z

source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py

+                # Store frame for deferred extraction instead of extracting immediately
+                render_data._pending_frame = products[product_path].frames[0]
+                render_data._pending_products = products
+                render_data._extraction_done = False


🟡 Medium: Monkey-patched attributes on OVRTXRenderData.

_pending_frame, _pending_products, and _extraction_done are dynamically set on render_data instances in render() without being declared in OVRTXRenderData.__init__. This forces _ensure_extraction to use defensive getattr(..., default) calls.

Prefer initializing these in OVRTXRenderData.__init__:

self._pending_frame = None self._pending_products = None self._extraction_done = True

This makes the state machine explicit, enables IDE autocompletion/type checking, and removes the need for getattr guards.

isaaclab-review-bot · 2026-04-01T00:37:07Z

source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py

-                )
+                # Store frame for deferred extraction instead of extracting immediately
+                render_data._pending_frame = products[product_path].frames[0]
+                render_data._pending_products = products


🟡 Medium: _pending_products is stored but never read.

render_data._pending_products = products is set in render() and cleared in _ensure_extraction(), but nothing ever reads it. _process_render_frame() only uses the frame object, not the products dict.

If this is reserved for future use (e.g., multi-product extraction), add a comment explaining the intent. Otherwise, remove the dead state to keep the deferred extraction state machine minimal.

isaaclab-review-bot · 2026-04-01T00:37:07Z

source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py

+            self._process_render_frame(render_data, frame, render_data.warp_buffers)
+        render_data._extraction_done = True
+        render_data._pending_frame = None
+        render_data._pending_products = None


🟢 Nit: cleanup() should clear deferred state.

If cleanup() is called between render() and write_output() (e.g. during error teardown), the stale _pending_frame reference keeps the OVRTX frame object alive longer than expected, potentially holding GPU resources. Consider:

if render_data is not None: render_data.sensor = None render_data._pending_frame = None render_data._pending_products = None render_data._extraction_done = True

greptile-apps · 2026-04-01T21:11:55Z

Greptile Summary

This PR delivers a ~24 % throughput improvement for vision-based RL training by making two targeted changes to OVRTXRenderer:

Batched tile extraction — replaces N per-environment Warp kernel launches (one per env) with a single 3-D kernel (num_envs, H, W) for both RGBA and depth buffers, dramatically reducing kernel-launch overhead at scale (e.g. 1,225 launches → 1 per buffer type).
Deferred tile extraction — moves the tile copy from render() to write_output() via an _ensure_extraction() lazy pattern, allowing physics/training compute to overlap with rendering between steps.

The core logic is sound: tile-coordinate arithmetic in the new kernels exactly mirrors the old per-env kernels, _pending_products is kept alive to prevent the renderer from freeing GPU-mapped memory before extraction, and exception/empty-frame paths all correctly set _extraction_done = True to avoid stale state. All remaining findings are P2 style or defensive-hardening suggestions.

Confidence Score: 5/5

Safe to merge — all findings are P2 style/hardening suggestions with no correctness risk on the happy path.
Both functional changes (batched kernels and deferred extraction) are logically correct. The batched index arithmetic is identical to the old per-env logic. The deferred state is guarded with getattr defaults that prevent crashes even when attributes are absent. Exception and empty-frame paths correctly flush pending state. The only concerns are: an unused suffix parameter, missing __init__ declarations for the three deferred-state fields, and the lack of a guard against a double-render overwriting an unconsumed frame — none of which affect normal operation.
No files require special attention; minor cleanup suggestions are confined to ovrtx_renderer.py.

Important Files Changed

Filename	Overview
source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py	Introduces deferred tile extraction via `_ensure_extraction()` / `_pending_frame` pattern and switches to batched single-kernel launches; logic is correct but the three deferred-state attributes are not initialised in `OVRTXRenderData.__init__` and a double-render without extraction silently drops the first frame.
source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer_kernels.py	Adds `extract_all_rgba_tiles_kernel` and `extract_all_depth_tiles_kernel` — 3-D Warp kernels that replace N per-env launches with a single `(num_envs, H, W)` dispatch; index arithmetic mirrors the old per-env kernels and is correct.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant OVRTXRenderer
    participant OVRTXRenderData
    participant OVRTX as ovrtx.Renderer
    participant Warp

    Note over Caller,Warp: Per-step training loop

    Caller->>OVRTXRenderer: render(render_data)
    OVRTXRenderer->>OVRTX: _renderer.step(render_products)
    OVRTX-->>OVRTXRenderer: products (frames)
    OVRTXRenderer->>OVRTXRenderData: _pending_frame = frames[0]
    OVRTXRenderer->>OVRTXRenderData: _extraction_done = False
    Note over Caller,Warp: Training / physics work runs here (overlap)

    Caller->>OVRTXRenderer: write_output(render_data, "rgba", tensor)
    OVRTXRenderer->>OVRTXRenderer: _ensure_extraction(render_data)
    alt _extraction_done == False
        OVRTXRenderer->>OVRTXRenderer: _process_render_frame(render_data, frame, warp_buffers)
        OVRTXRenderer->>Warp: launch extract_all_rgba_tiles_kernel(num_envs, H, W)
        OVRTXRenderer->>Warp: launch extract_all_depth_tiles_kernel(num_envs, H, W)
        OVRTXRenderer->>OVRTXRenderData: _extraction_done = True
        OVRTXRenderer->>OVRTXRenderData: _pending_frame = None
    else _extraction_done == True
        Note over OVRTXRenderer: no-op, already extracted
    end
    OVRTXRenderer->>Warp: wp.copy(output_tensor ← warp_buffer["rgba"])
    Warp-->>Caller: output tensor filled

Comments Outside Diff (2)

source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py, line 423-443 (link)

Unused suffix parameter retained in signature

The suffix: str = "" parameter is accepted but never referenced in the method body. It was presumably meaningful in the per-env kernel dispatch (e.g. for tagging log output), but the new batched implementation dropped all uses. Keeping it silently encourages callers to pass values that have no effect.
source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py, line 64-103 (link)

Deferred-state attributes not initialized in __init__

_pending_frame, _pending_products, and _extraction_done are only set inside render() and accessed (defensively via getattr) in _ensure_extraction(). If any code path calls write_output() (which calls _ensure_extraction()) before render() has run once, the getattr(..., True) default silently short-circuits — this works, but it means the missing-attribute branch is the "normal" code path on the very first call rather than an edge case.

Initializing them explicitly in __init__ makes the lifecycle clear and avoids relying on getattr defaults for routine operation:
```
# Inside OVRTXRenderData.__init__, after creating warp_buffers:
self._pending_frame = None
self._pending_products = None
self._extraction_done = True
```

_{Reviews (1): Last reviewed commit: "perf(ovrtx): deferred tile extraction - ..." | Re-trigger Greptile}

greptile-apps · 2026-04-01T21:12:05Z

source/isaaclab_ov/isaaclab_ov/renderers/ovrtx_renderer.py

            )
            product_path = self._render_product_paths[0]
            if product_path in products and len(products[product_path].frames) > 0:
-                self._process_render_frame(
-                    render_data,
-                    products[product_path].frames[0],
-                    render_data.warp_buffers,
-                )
+                # Store frame for deferred extraction instead of extracting immediately
+                render_data._pending_frame = products[product_path].frames[0]
+                render_data._pending_products = products
+                render_data._extraction_done = False
+            else:
+                render_data._pending_frame = None
+                render_data._pending_products = None
+                render_data._extraction_done = True


Silent frame drop when render() is called before the previous extraction completes

If render() is invoked a second time while _extraction_done is still False (i.e. a pending frame has not yet been consumed by write_output()), the pending frame is silently overwritten. Because the call to _renderer.step() at the top of render() replaces _pending_frame unconditionally, the first frame is lost with no warning.

Under the expected single-step call pattern this won't happen, but a defensive guard (log-warning + eager extraction) would make it robust to mis-use:

# At the start of the try block, before calling _renderer.step(): if not getattr(render_data, "_extraction_done", True): logger.warning("render() called before previous frame was consumed; forcing extraction.") self._ensure_extraction(render_data)

github-actions bot added the isaac-lab Related to Isaac Lab team label Apr 1, 2026

ncournia-nv force-pushed the perf/ovrtx-optimizations branch from dc82943 to 97ed520 Compare April 1, 2026 00:36

isaaclab-review-bot bot reviewed Apr 1, 2026

View reviewed changes

ncournia-nv force-pushed the perf/ovrtx-optimizations branch from 97ed520 to 8bb4cc0 Compare April 1, 2026 00:38

ncournia-nv marked this pull request as ready for review April 1, 2026 21:07

ncournia-nv force-pushed the perf/ovrtx-optimizations branch from 8bb4cc0 to c3efe96 Compare April 1, 2026 21:10

greptile-apps bot reviewed Apr 1, 2026

View reviewed changes

perf(ovrtx): batched tile extraction - single kernel launch for all envs

4056829

ncournia-nv force-pushed the perf/ovrtx-optimizations branch from 9cf03d0 to 4056829 Compare April 3, 2026 06:17

ncournia-nv changed the title ~~perf(ovrtx): batched + deferred tile extraction for OVRTXRenderer~~ perf(ovrtx): batched tile extraction — single kernel launch for all envs Apr 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(ovrtx): batched tile extraction — single kernel launch for all envs#5136

perf(ovrtx): batched tile extraction — single kernel launch for all envs#5136
ncournia-nv wants to merge 1 commit intoisaac-sim:developfrom
ncournia-nv:perf/ovrtx-optimizations

ncournia-nv commented Apr 1, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot bot left a comment

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Uh oh!

greptile-apps bot commented Apr 1, 2026 •

edited

Loading

Comments Outside Diff (2)

Uh oh!

greptile-apps bot Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-        if getattr(render_data, '_extraction_done', True):
+        if getattr(render_data, "_extraction_done", True):
+            return
+        frame = getattr(render_data, "_pending_frame", None)

Conversation

ncournia-nv commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Performance

Uh oh!

isaaclab-review-bot bot left a comment

Choose a reason for hiding this comment

Code Review — PR #5136

Summary

Design Assessment

Architecture Impact

Implementation Verdict

CI Status

Branch Status

Findings Summary

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

isaaclab-review-bot bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

greptile-apps bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ncournia-nv commented Apr 1, 2026 •

edited

Loading

greptile-apps bot commented Apr 1, 2026 •

edited

Loading