Inference | Per-block MoE routing storage for prefix caching#4301

Open
lmcafee-nvidia wants to merge 3 commits into NVIDIA:main from lmcafee-nvidia:prefix-caching-router-record-standalone

Conversation

@lmcafee-nvidia
Contributor

Summary

  • Convert MoE routing indices from GPU tensors to CPU numpy arrays after the forward pass, reducing GPU memory pressure at long sequences
  • Add per-block routing storage on KVBlockAllocator so routing indices are scattered into KV cache blocks and reconstructed at request completion — prefix-cached blocks retain routing from the original request
  • Add chunk-based routing accumulation (add_routing_indices / finalize_routing_chunks) and numpy serialization support on DynamicInferenceRequest
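The chunk-based accumulation named above can be sketched roughly as follows. The method names (add_routing_indices, finalize_routing_chunks) come from the PR description, but the internals here are illustrative assumptions, not the actual DynamicInferenceRequest implementation:

```python
import numpy as np

class RoutingChunks:
    """Illustrative sketch: O(1) staging of per-step routing chunks,
    with a single concatenation at request completion."""

    def __init__(self):
        self._chunks = []          # list of [tokens, num_layers, topk] arrays
        self.routing_indices = None

    def add_routing_indices(self, step_routing: np.ndarray) -> None:
        # O(1) per step: stage the chunk instead of concatenating every time.
        self._chunks.append(step_routing)

    def finalize_routing_chunks(self) -> None:
        # One concatenation over all staged chunks at completion.
        if self._chunks:
            self.routing_indices = np.concatenate(self._chunks, axis=0)
            self._chunks = []

req = RoutingChunks()
req.add_routing_indices(np.zeros((4, 2, 2), dtype=np.int32))  # prefill chunk
req.add_routing_indices(np.ones((1, 2, 2), dtype=np.int32))   # decode step
req.finalize_routing_chunks()
# req.routing_indices now has shape (5, 2, 2)
```

Staging chunks and concatenating once avoids the quadratic copying that per-step torch.cat incurs over long generations.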

Test plan

  • 7 new TestPerBlockRouting tests (store/get round-trip, cleared on allocate/reset, persists through deregister, reconstruct from blocks, missing block returns None, survives LRU prefix match)
  • All 25 prefix caching tests pass with no regressions
  • CI

🤖 Generated with Claude Code

sidsingh-nvidia and others added 2 commits April 14, 2026 08:23
Convert MoE routing indices from GPU tensors to CPU numpy arrays after
the forward pass, and add chunk-based accumulation infrastructure.
This is the minimum subset of siddharth/support-nemo-rl-router-replay
needed to support per-block routing storage.

Changes:
- _router_record_bookkeeping returns Dict[int, np.ndarray] (CPU numpy)
- DynamicInferenceRequest.routing_indices changed to np.ndarray
- add_routing_indices/finalize_routing_chunks for O(1) chunk staging
- ndarray serialization/deserialization support
- Engine uses chunk-based accumulation instead of torch.cat
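The ndarray serialization support mentioned above can be sketched with numpy's own .npy encoding. The function names and wire format here are illustrative assumptions; the PR's actual serialization code is not shown:

```python
import io
import numpy as np

def serialize_routing(arr: np.ndarray) -> bytes:
    # Encode dtype + shape + raw data using numpy's .npy container.
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def deserialize_routing(data: bytes) -> np.ndarray:
    return np.load(io.BytesIO(data), allow_pickle=False)

routing = np.arange(12, dtype=np.int32).reshape(3, 2, 2)  # [tokens, layers, topk]
restored = deserialize_routing(serialize_routing(routing))
assert np.array_equal(routing, restored)
```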
…ility

Move routing indices from per-request step-by-step accumulation to
per-block storage on KVBlockAllocator. At request completion, routing
is reconstructed by concatenating per-block routing in block order.
Matched (prefix-cached) blocks retain routing from the original request,
so reconstruction naturally covers all tokens including skipped prefixes.

Key methods:
- store_routing_per_block: scatters routing into per-block storage
- reconstruct_routing_from_blocks: reassembles routing on completion
- store_block_routing / get_block_routing: low-level block storage
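The per-block storage and reconstruction described above can be sketched as follows. The method names match the commit message; the block size, dict-based storage, and None-on-missing behavior are illustrative assumptions about KVBlockAllocator, not its actual implementation:

```python
import numpy as np

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative)

class BlockRoutingStore:
    """Illustrative sketch of per-block routing storage on the allocator."""

    def __init__(self):
        self._routing = {}  # block_id -> [<=BLOCK_SIZE, num_layers, topk]

    def store_block_routing(self, block_id, routing):
        self._routing[block_id] = routing

    def get_block_routing(self, block_id):
        return self._routing.get(block_id)  # None if this block has no routing

    def store_routing_per_block(self, block_ids, flat_routing):
        # Scatter [token_count, ...] routing into per-block slices.
        for i, bid in enumerate(block_ids):
            self.store_block_routing(
                bid, flat_routing[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])

    def reconstruct_routing_from_blocks(self, block_ids, token_count):
        # Reassemble in block order; prefix-cached blocks already hold
        # routing from the original request, so skipped prefixes are covered.
        parts = [self.get_block_routing(bid) for bid in block_ids]
        if any(p is None for p in parts):
            return None  # a missing block means routing cannot be rebuilt
        return np.concatenate(parts, axis=0)[:token_count]

store = BlockRoutingStore()
flat = np.arange(64, dtype=np.int32).reshape(8, 2, 4)  # 8 tokens, 2 layers, top-4
store.store_routing_per_block([10, 11], flat)
rebuilt = store.reconstruct_routing_from_blocks([10, 11], token_count=7)
# rebuilt equals flat[:7]; an unknown block id would yield None
```

Keeping routing keyed by block id is what lets a prefix-cache hit reuse the original request's routing without re-running the router.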

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lmcafee-nvidia lmcafee-nvidia requested review from a team as code owners April 14, 2026 15:38
@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft April 14, 2026 15:38
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@copy-pr-bot

copy-pr-bot bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lmcafee-nvidia lmcafee-nvidia self-assigned this Apr 14, 2026
@lmcafee-nvidia lmcafee-nvidia requested a review from kvareddy April 14, 2026 18:35
@lmcafee-nvidia lmcafee-nvidia marked this pull request as ready for review April 14, 2026 18:35
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 14, 2026 18:35
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Apr 14, 2026
    ]
    if not routing_parts:
        return
    flat_routing = np.concatenate(routing_parts, axis=0)  # [token_count, num_layers, topk]
Contributor


It seems like this is the reverse of what we are doing in the text_generation_controller's router_record_bookkeeping function. Over there we get the map as [token_count, num_layers, topk] and convert it into a per-request map. Here we are doing the opposite conversion. Do you think we could simply return the per-token map from router_record_bookkeeping and simplify things?

@sidsingh-nvidia
Contributor

sidsingh-nvidia commented Apr 15, 2026

We should not store the routing indices in both the block store and inside each request, given how much memory they can use. I suggest we use only the block store to hold them, and once a request is finished, we reconstruct the entire data from its blocks and send it back to the coordinator.

    request.routing_indices = torch.cat(
        [request.routing_indices, step_routing], dim=0
    )
    request.add_routing_indices(step_routing)
Contributor


Can't we get rid of this storage now? Let the block store be the only place where we store the routing indices.

@shanmugamr1992
Contributor

@lmcafee-nvidia this won't work with sequence parallelism.
You need to add the fix I suggested.

The CUDA-graph static buffer path in get_routing_indices() may return
a tensor sliced to active_token_count (global unpadded), which can
exceed the per-rank valid count under sequence parallelism. Truncate
to padded_active_token_count // tp_size before the all-gather so only
valid routing data is collected.
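The truncation described in this fix can be sketched as below. The names padded_active_token_count and tp_size follow the commit message; the slicing itself is an illustrative assumption about where the fix sits, not the actual engine code:

```python
import numpy as np

def truncate_for_all_gather(routing, padded_active_token_count, tp_size):
    """Keep only this rank's valid slice before the all-gather.

    Under sequence parallelism each rank holds
    padded_active_token_count // tp_size tokens, but the CUDA-graph
    static-buffer path may return a tensor sliced to the global
    (unpadded) active_token_count, which can be larger. Truncating
    ensures only valid routing data is gathered.
    """
    per_rank_tokens = padded_active_token_count // tp_size
    return routing[:per_rank_tokens]

# 10 active tokens globally, padded to 16, tp_size=4 -> 4 valid tokens per rank.
routing = np.zeros((10, 2, 2), dtype=np.int32)  # over-long static-buffer slice
truncated = truncate_for_all_gather(routing, padded_active_token_count=16, tp_size=4)
# truncated.shape[0] == 4
```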

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label Apr 16, 2026
4 participants