Inference | Per-block MoE routing storage for prefix caching #4301
lmcafee-nvidia wants to merge 3 commits into NVIDIA:main
Conversation
Convert MoE routing indices from GPU tensors to CPU numpy arrays after the forward pass, and add chunk-based accumulation infrastructure. This is the minimum subset of siddharth/support-nemo-rl-router-replay needed to support per-block routing storage.

Changes:
- `_router_record_bookkeeping` returns `Dict[int, np.ndarray]` (CPU numpy)
- `DynamicInferenceRequest.routing_indices` changed to `np.ndarray`
- `add_routing_indices`/`finalize_routing_chunks` for O(1) chunk staging
- ndarray serialization/deserialization support
- Engine uses chunk-based accumulation instead of `torch.cat`
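The chunk-based accumulation described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the internal field names (`_routing_chunks`) are assumptions, while `add_routing_indices`, `finalize_routing_chunks`, and `routing_indices` follow the PR description:

```python
import numpy as np

class DynamicInferenceRequest:
    """Minimal sketch of the chunk-staging API described in the PR."""

    def __init__(self):
        self._routing_chunks = []    # staged per-step chunks (assumed field name)
        self.routing_indices = None  # final [token_count, num_layers, topk] array

    def add_routing_indices(self, chunk: np.ndarray) -> None:
        # O(1) amortized: append to a list instead of concatenating every step
        self._routing_chunks.append(chunk)

    def finalize_routing_chunks(self) -> None:
        # Single concatenation when the request completes
        if self._routing_chunks:
            self.routing_indices = np.concatenate(self._routing_chunks, axis=0)
            self._routing_chunks = []

req = DynamicInferenceRequest()
req.add_routing_indices(np.zeros((4, 2, 2), dtype=np.int32))  # 4 prompt tokens
req.add_routing_indices(np.ones((1, 2, 2), dtype=np.int32))   # 1 decode token
req.finalize_routing_chunks()
print(req.routing_indices.shape)  # (5, 2, 2)
```

The point of staging chunks in a list is to avoid the O(n²) total copy cost of calling `torch.cat` on a growing tensor at every decode step.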
…ility

Move routing indices from per-request step-by-step accumulation to per-block storage on `KVBlockAllocator`. At request completion, routing is reconstructed by concatenating per-block routing in block order. Matched (prefix-cached) blocks retain routing from the original request, so reconstruction naturally covers all tokens, including skipped prefixes.

Key methods:
- `store_routing_per_block`: scatters routing into per-block storage
- `reconstruct_routing_from_blocks`: reassembles routing on completion
- `store_block_routing` / `get_block_routing`: low-level block storage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
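The scatter/reconstruct scheme above can be sketched as follows. The method names come from the commit message; everything else (class shape, block-id bookkeeping) is an assumption for illustration:

```python
import numpy as np

class KVBlockAllocatorSketch:
    """Hypothetical stand-in for the per-block routing store on KVBlockAllocator."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self._block_routing = {}  # block_id -> [<=block_size, num_layers, topk]

    def store_block_routing(self, block_id: int, routing: np.ndarray) -> None:
        self._block_routing[block_id] = routing

    def get_block_routing(self, block_id: int):
        return self._block_routing.get(block_id)

    def store_routing_per_block(self, block_ids, routing: np.ndarray) -> None:
        # Scatter a [token_count, num_layers, topk] array into per-block slices
        for i, bid in enumerate(block_ids):
            start = i * self.block_size
            self.store_block_routing(bid, routing[start:start + self.block_size])

    def reconstruct_routing_from_blocks(self, block_ids):
        # Matched (prefix-cached) blocks already hold routing from the original
        # request, so concatenating in block order covers skipped prefixes too.
        parts = [self.get_block_routing(bid) for bid in block_ids]
        if any(p is None for p in parts):
            return None  # a block is missing its routing
        return np.concatenate(parts, axis=0)

alloc = KVBlockAllocatorSketch(block_size=2)
routing = np.arange(4 * 3 * 2).reshape(4, 3, 2)  # 4 tokens, 3 layers, top-2
alloc.store_routing_per_block([7, 9], routing)
rebuilt = alloc.reconstruct_routing_from_blocks([7, 9])
print(np.array_equal(rebuilt, routing))  # True
```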
```python
]
if not routing_parts:
    return
flat_routing = np.concatenate(routing_parts, axis=0)  # [token_count, num_layers, topk]
```
It seems like this is the reverse of what we are doing in the text_generation_controller's `router_record_bookkeeping` function. There we get the map as [token_count, num_layers, topk] and convert it into a per-request map; here we are doing the opposite conversion. Do you think we could simply return the per-token map from `router_record_bookkeeping` and simplify things?
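The two representations the reviewer contrasts can be sketched like this. Both helper names and the `lengths` parameter are hypothetical; only the [token_count, num_layers, topk] layout comes from the discussion:

```python
import numpy as np

def flat_to_per_request(flat: np.ndarray, lengths: list) -> dict:
    """Split a flat [token_count, num_layers, topk] map into per-request arrays."""
    offsets = np.cumsum(lengths)[:-1]  # split points between requests
    return {i: part for i, part in enumerate(np.split(flat, offsets, axis=0))}

def per_request_to_flat(per_request: dict) -> np.ndarray:
    """Reverse conversion: concatenate per-request maps back into one flat map."""
    return np.concatenate([per_request[i] for i in sorted(per_request)], axis=0)

flat = np.arange(6 * 2 * 2).reshape(6, 2, 2)      # 6 tokens total
per_req = flat_to_per_request(flat, [4, 2])        # request 0: 4 tokens, request 1: 2
print(per_req[0].shape, per_req[1].shape)          # (4, 2, 2) (2, 2, 2)
print(np.array_equal(per_request_to_flat(per_req), flat))  # True
```

Since the two conversions are exact inverses, doing both (as the reviewer notes) is wasted work if the consumer wants the flat map anyway.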
We should not store the routing indices both in the block store and inside each request, given how much memory they can use. I suggest we use only the block store; once a request is finished, we reconstruct the entire data from its blocks and send it back to the coordinator.
```diff
-request.routing_indices = torch.cat(
-    [request.routing_indices, step_routing], dim=0
-)
+request.add_routing_indices(step_routing)
```
Can't we get rid of this storage now? Let the block store be the only place where we store the routing indices.
@lmcafee-nvidia this won't work with sequence parallelism
The CUDA-graph static buffer path in get_routing_indices() may return a tensor sliced to active_token_count (global unpadded), which can exceed the per-rank valid count under sequence parallelism. Truncate to padded_active_token_count // tp_size before the all-gather so only valid routing data is collected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
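The truncation described in the commit message can be sketched as follows. This is an illustration of the slicing logic only (numpy stands in for torch, and the function name is hypothetical); the parameter names `padded_active_token_count` and `tp_size` come from the commit message:

```python
import numpy as np

def valid_routing_slice(routing, padded_active_token_count: int, tp_size: int):
    """Truncate a CUDA-graph static-buffer routing slice to this rank's
    valid token count before the all-gather (hypothetical helper)."""
    per_rank_valid = padded_active_token_count // tp_size
    # The static buffer may be sliced to the global unpadded active_token_count,
    # which can exceed per_rank_valid under sequence parallelism; keep only
    # the valid rows so the all-gather collects no stale routing data.
    return routing[:per_rank_valid]

# Buffer sliced to 10 global active tokens, but with padded_active_token_count=8
# and tp_size=2 only 4 rows are valid on this rank.
buf = np.arange(10).reshape(10, 1)
print(valid_routing_slice(buf, padded_active_token_count=8, tp_size=2).shape[0])  # 4
```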
Summary
- Per-block routing storage on `KVBlockAllocator`, so routing indices are scattered into KV cache blocks and reconstructed at request completion; prefix-cached blocks retain routing from the original request
- Chunk-based accumulation (`add_routing_indices`/`finalize_routing_chunks`) and numpy serialization support on `DynamicInferenceRequest`

Test plan

- `TestPerBlockRouting` tests: store/get round-trip, cleared on allocate/reset, persists through deregister, reconstruct from blocks, missing block returns None, survives LRU prefix match

🤖 Generated with Claude Code