perf(ovrtx): batched tile extraction — single kernel launch for all envs#5136
perf(ovrtx): batched tile extraction — single kernel launch for all envs#5136ncournia-nv wants to merge 1 commit intoisaac-sim:developfrom
Conversation
dc82943 to
97ed520
Compare
There was a problem hiding this comment.
Code Review — PR #5136
Reviewer: 🤖 Bob (automated)
Summary
Replaces N per-env Warp kernel launches with single 3D-batched launches for tile extraction, and defers extraction from render() to write_output(). Well-motivated optimization with solid benchmark numbers (+23.7% FPS).
Design Assessment
Batched kernels (commit 1): Clean, correct, textbook GPU optimization. Eliminates O(num_envs) kernel launch overhead. The 3D (num_envs, height, width) dispatch maps naturally to the tile grid. No correctness concerns.
Deferred extraction (commit 2): Architecturally sound — separates render kick from buffer consumption. The lazy _ensure_extraction() guard pattern is simple and effective. However, there are several issues around state management and frame lifetime that need attention.
Architecture Impact
Low risk. Changes are confined to OVRTXRenderer internals. The BaseRenderer interface contract (render() then write_output()) is preserved. No downstream API changes.
Implementation Verdict
🟡 Approve with nits. The core optimization is correct and valuable. There are pre-commit CI failures that must be fixed, plus a few state management hygiene issues.
CI Status
- pre-commit: ❌ FAIL — unused imports + quote style (see inline comments)
- Other checks: pending/skipping (blocked on pre-commit)
Branch Status
- Mergeable: ✅ MERGEABLE
Findings Summary
| # | Severity | File | Issue |
|---|---|---|---|
| 1 | 🔴 blocker | ovrtx_renderer.py | Unused imports fail pre-commit (CI red) |
| 2 | 🔴 blocker | ovrtx_renderer.py | Single quotes in _ensure_extraction fail pre-commit |
| 3 | 🟡 medium | ovrtx_renderer.py | _pending_* attrs set via monkey-patching, not initialized in __init__ |
| 4 | 🟡 medium | ovrtx_renderer.py | _pending_products stored but never read — dead state |
| 5 | 🟢 low | ovrtx_renderer.py | cleanup() doesn't clear deferred extraction state |
| 6 | 🟢 nit | ovrtx_renderer_kernels.py | Old single-env kernels are now dead code (not in this PR's scope, but worth noting) |
| create_camera_transforms_kernel, | ||
| extract_all_depth_tiles_kernel, | ||
| extract_all_rgba_tiles_kernel, | ||
| extract_depth_tile_from_tiled_buffer_kernel, |
There was a problem hiding this comment.
🔴 Blocker: Unused imports fail pre-commit CI.
The old per-env kernels extract_depth_tile_from_tiled_buffer_kernel and extract_tile_from_tiled_buffer_kernel are still imported but no longer called anywhere in this file. Pre-commit's unused-import linter catches this — CI is red.
| extract_depth_tile_from_tiled_buffer_kernel, | |
| from .ovrtx_renderer_kernels import ( | |
| DEVICE, | |
| create_camera_transforms_kernel, | |
| extract_all_depth_tiles_kernel, | |
| extract_all_rgba_tiles_kernel, | |
| generate_random_colors_from_ids_kernel, | |
| sync_newton_transforms_kernel, | |
| ) |
|
|
||
| def _ensure_extraction(self, render_data: OVRTXRenderData) -> None: | ||
| """Extract tiles from the pending frame if not already done.""" | ||
| if getattr(render_data, '_extraction_done', True): |
There was a problem hiding this comment.
🔴 Blocker: Single quotes fail pre-commit quote style.
Project convention is double quotes. Pre-commit flags '_extraction_done' and '_pending_frame'.
| if getattr(render_data, '_extraction_done', True): | |
| if getattr(render_data, "_extraction_done", True): | |
| return | |
| frame = getattr(render_data, "_pending_frame", None) |
| # Store frame for deferred extraction instead of extracting immediately | ||
| render_data._pending_frame = products[product_path].frames[0] | ||
| render_data._pending_products = products | ||
| render_data._extraction_done = False |
There was a problem hiding this comment.
🟡 Medium: Monkey-patched attributes on OVRTXRenderData.
_pending_frame, _pending_products, and _extraction_done are dynamically set on render_data instances in render() without being declared in OVRTXRenderData.__init__. This forces _ensure_extraction to use defensive getattr(..., default) calls.
Prefer initializing these in OVRTXRenderData.__init__:
self._pending_frame = None
self._pending_products = None
self._extraction_done = TrueThis makes the state machine explicit, enables IDE autocompletion/type checking, and removes the need for getattr guards.
| ) | ||
| # Store frame for deferred extraction instead of extracting immediately | ||
| render_data._pending_frame = products[product_path].frames[0] | ||
| render_data._pending_products = products |
There was a problem hiding this comment.
🟡 Medium: _pending_products is stored but never read.
render_data._pending_products = products is set in render() and cleared in _ensure_extraction(), but nothing ever reads it. _process_render_frame() only uses the frame object, not the products dict.
If this is reserved for future use (e.g., multi-product extraction), add a comment explaining the intent. Otherwise, remove the dead state to keep the deferred extraction state machine minimal.
| self._process_render_frame(render_data, frame, render_data.warp_buffers) | ||
| render_data._extraction_done = True | ||
| render_data._pending_frame = None | ||
| render_data._pending_products = None |
There was a problem hiding this comment.
🟢 Nit: cleanup() should clear deferred state.
If cleanup() is called between render() and write_output() (e.g. during error teardown), the stale _pending_frame reference keeps the OVRTX frame object alive longer than expected, potentially holding GPU resources. Consider:
if render_data is not None:
render_data.sensor = None
render_data._pending_frame = None
render_data._pending_products = None
render_data._extraction_done = True97ed520 to
8bb4cc0
Compare
8bb4cc0 to
c3efe96
Compare
Greptile SummaryThis PR delivers a ~24 % throughput improvement for vision-based RL training by making two targeted changes to
The core logic is sound: tile-coordinate arithmetic in the new kernels exactly mirrors the old per-env kernels, Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant OVRTXRenderer
participant OVRTXRenderData
participant OVRTX as ovrtx.Renderer
participant Warp
Note over Caller,Warp: Per-step training loop
Caller->>OVRTXRenderer: render(render_data)
OVRTXRenderer->>OVRTX: _renderer.step(render_products)
OVRTX-->>OVRTXRenderer: products (frames)
OVRTXRenderer->>OVRTXRenderData: _pending_frame = frames[0]
OVRTXRenderer->>OVRTXRenderData: _extraction_done = False
Note over Caller,Warp: Training / physics work runs here (overlap)
Caller->>OVRTXRenderer: write_output(render_data, "rgba", tensor)
OVRTXRenderer->>OVRTXRenderer: _ensure_extraction(render_data)
alt _extraction_done == False
OVRTXRenderer->>OVRTXRenderer: _process_render_frame(render_data, frame, warp_buffers)
OVRTXRenderer->>Warp: launch extract_all_rgba_tiles_kernel(num_envs, H, W)
OVRTXRenderer->>Warp: launch extract_all_depth_tiles_kernel(num_envs, H, W)
OVRTXRenderer->>OVRTXRenderData: _extraction_done = True
OVRTXRenderer->>OVRTXRenderData: _pending_frame = None
else _extraction_done == True
Note over OVRTXRenderer: no-op, already extracted
end
OVRTXRenderer->>Warp: wp.copy(output_tensor ← warp_buffer["rgba"])
Warp-->>Caller: output tensor filled
|
| ) | ||
| product_path = self._render_product_paths[0] | ||
| if product_path in products and len(products[product_path].frames) > 0: | ||
| self._process_render_frame( | ||
| render_data, | ||
| products[product_path].frames[0], | ||
| render_data.warp_buffers, | ||
| ) | ||
| # Store frame for deferred extraction instead of extracting immediately | ||
| render_data._pending_frame = products[product_path].frames[0] | ||
| render_data._pending_products = products | ||
| render_data._extraction_done = False | ||
| else: | ||
| render_data._pending_frame = None | ||
| render_data._pending_products = None | ||
| render_data._extraction_done = True |
There was a problem hiding this comment.
Silent frame drop when
render() is called before the previous extraction completes
If render() is invoked a second time while _extraction_done is still False (i.e. a pending frame has not yet been consumed by write_output()), the pending frame is silently overwritten. Because the call to _renderer.step() at the top of render() replaces _pending_frame unconditionally, the first frame is lost with no warning.
Under the expected single-step call pattern this won't happen, but a defensive guard (log-warning + eager extraction) would make it robust to mis-use:
# At the start of the try block, before calling _renderer.step():
if not getattr(render_data, "_extraction_done", True):
logger.warning("render() called before previous frame was consumed; forcing extraction.")
self._ensure_extraction(render_data)9cf03d0 to
4056829
Compare
Summary
Replaces per-environment tile extraction loops with batched GPU kernels that extract all tiles in a single launch. This eliminates N-1 redundant kernel launches per frame per data type (where N = num_envs).
Changes
extract_all_rgba_tiles_kernelandextract_all_depth_tiles_kernel— 3D-indexed kernels(env_idx, y, x)that extract all environment tiles from the tiled framebuffer in one launch_extract_rgba_tiles()and_extract_depth_tiles()now call the batched kernels instead of looping over environments with per-tile kernelsextract_tile_from_tiled_buffer_kernel,extract_depth_tile_from_tiled_buffer_kernel)Performance
Task:
Isaac-Repose-Cube-Shadow-Vision-Benchmark-Direct-v0Config: 1225 envs, 5 epochs, Newton + OVRTX renderer, L40 GPU
The improvement comes from reducing kernel launch overhead — at 1225 envs, each frame previously required 1225 separate kernel launches per data type. The batched approach does it in one.
Measured without nsys profiling to avoid instrumentation skew. Under nsys the delta appears larger (~3×) because profiling overhead disproportionately affects the per-env loop.