
Update version #13

Merged

BingooYang merged 1 commit into PFCCLab:0.6 from BingooYang:update_last on May 13, 2026

Conversation


@BingooYang BingooYang commented May 8, 2026

📌 Description

Upgrade the version to 0.6.11.
Currently adapted and passing:

| # | Test | Status |
|---|------|--------|
| 1 | tests/attention/test_attention_sink_blackwell.py -k test_blackwell_trtllm_gen_context_attention_sink | PASS 72/72 |
| 2 | tests/attention/test_attention_sink_blackwell.py -k test_blackwell_trtllm_gen_decode_attention_sink | PASS 72/72 |
| 3 | tests/moe/test_trtllm_gen_fused_moe.py::test_fp8_block_scale_routed_activation_type_relu2_smoke | PASS |
| 4 | tests/comm/test_trtllm_allreduce_fusion.py::test_trtllm_allreduce_fusion[True-1024-dtype0-2] | PASS |
| 5 | test_trtllm_gen_fused_moe.py::test_renormalize_routing[...FP8_Block_DeepSeek-1024-1024-8-RandomHiddenStates] | PASS |
| 6 | test_trtllm_gen_fused_moe.py::test_sigmoid_routing[...FP8_Block_DeepSeek-1024-1024-8] | PASS |
| 7 | test_trtllm_gen_fused_moe.py::test_dyn_block_kernel_routing[...FP8_Block_DeepSeek...] | PASS |
| 8 | test_trtllm_gen_fused_moe.py::test_tier_1024_experts_routing[...FP8_Block_DeepSeek...] | PASS |
| 9 | test_trtllm_gen_fused_moe.py::test_deepseek_ngroup1_block_per_token_routing[...FP8_Block_DeepSeek...] | PASS |
| 10 | test_trtllm_gen_fused_moe.py::test_routing_dtype_flexibility[...FP8_Block_DeepSeek...] | PASS |
| 11 | test_trtllm_gen_fused_moe.py::test_mxfp8_block_scale_moe_relu2_non_gated[...Shuffled E32_K4] | PASS |
| 12 | test_trtllm_gen_fused_moe.py::test_mxfp8_block_scale_moe_relu2_deepseekv3_topk22 | PASS |
| 13 | test_trtllm_gen_fused_moe.py::test_fp8_block_scale_autotune_valid_configs[...MxFp8_Relu2] | PASS |
| 14 | test_trtllm_gen_fused_moe.py::test_fp8_per_tensor_autotune_valid_configs_nonefp8[...PerTensor_Swiglu] | PASS |
| 15 | test_trtllm_gen_fused_moe.py::test_llama4_routing[...FP8_Tensor-1024-1024-8] | FAIL (kernel not available on SM100; not a paddle issue) |
| 16 | test_trtllm_gen_fused_moe.py::test_deepseekv3_routing | SKIP (guarded inside the test) |
| 17 | test_trtllm_gen_fused_moe.py::test_nvfp4_moe_gemm_bias | SKIP (guarded inside the test) |
| 18 | tests/norm/test_fused_rmsnorm_silu.py | PASS 102/152 (50 skipped: torch.float4_e2m1fn_x2 not exposed; unrelated to the paddle adaptation) |

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

@BingooYang BingooYang force-pushed the update_last branch 2 times, most recently from 9c5aeab to 9dc4281 on May 13, 2026 11:27
- enable paddle torch proxy in conftest via paddle.enable_compat(scope={"flashinfer"})
- in tests/attention/test_attention_sink_blackwell.py: prepend paddle.enable_compat(),
  replace torch.manual_seed with paddle.seed, replace torch.testing.assert_close with
  numpy.testing.assert_allclose, parametrize to a minimal shape for quick verification
- flashinfer/utils.py: access TorchVersion via torch.torch_version proxy with fallback
  for paddle compat where paddle.torch_version is not exposed
- flashinfer/cute_dsl/fp4_common.py: add "from __future__ import annotations" to
  defer evaluation of "int | torch.device | str | None" annotation which fails under
  paddle proxy (torch.device is a CallableProxyModule, not a type)
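A minimal sketch of the annotation fix above (the function and its body are illustrative, not the actual fp4_common.py code): with deferred evaluation the union annotation is never evaluated at import time, so the proxy's CallableProxyModule standing in for torch.device cannot break it.

```python
# Illustrative sketch, not the actual fp4_common.py code.
from __future__ import annotations  # defer evaluation of annotations (PEP 563)

import torch

# Without the future import, evaluating this union at import time fails under
# the paddle proxy, where torch.device is a CallableProxyModule, not a type.
def resolve_device(device: int | torch.device | str | None = None) -> torch.device:
    return torch.device("cuda" if device is None else device)
```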

adapt prefill trtllm paged attention for paddle compat

- flashinfer/prefill.py: convert workspace_size (tensor scalar from numel()*element_size())
  to Python int via .item() before passing to the tvm_ffi C++ kernel, which expects int
  but receives ffi.Tensor under paddle (doc item PFCCLab#11)
- tests/conftest.py: revert paddle.enable_compat() to global scope so that `import torch`
  at conftest module level (outside flashinfer scope) also resolves via the proxy
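For the workspace_size item above, a hedged sketch (the buffer name and size are illustrative, not the exact prefill.py code):

```python
import torch

# Illustrative stand-in for the float workspace buffer.
float_workspace_buffer = torch.empty(1024, dtype=torch.float32)

workspace_size = float_workspace_buffer.numel() * float_workspace_buffer.element_size()
if hasattr(workspace_size, "item"):
    # Under the paddle proxy this product can come back as a 0-d tensor, but
    # the tvm_ffi C++ kernel expects a plain Python int.
    workspace_size = int(workspace_size.item())
```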

paddle compat: decode workspace_size .item(), moe fp8 index via int8 view, autotuner shape tuple, moe test

support allreduce fusion

dist.group.WORLD compat

modify readme

modify format

fix env issue

fix some issue

paddle compat: fix dtype.itemsize + expand trtllm_allreduce_fusion test

- flashinfer/comm/trtllm_ar.py: paddle.dtype has no `itemsize`; add
  _DTYPE_SIZE_MAP + _dtype_itemsize() fallback used in _should_use_oneshot
  (fixes AttributeError when use_oneshot=None triggers the heuristic).
- tests/comm/test_trtllm_allreduce_fusion.py: restore the full parametrize
  scope (patterns/layouts/pdls/oneshots/trigger/fp32_acc); drop leftover
  [DBG] prints; guard the `if __name__ == "__main__"` block so mp-spawn
  children do not re-enter it under pytest (it was double-initializing the
  paddle TCPStore and triggering a SIGABRT in libuv).
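A sketch of the itemsize fallback named in the first item (the map contents here are a small illustrative subset):

```python
import torch

# Illustrative subset; the real map would cover every dtype the heuristic sees.
_DTYPE_SIZE_MAP = {
    torch.float32: 4,
    torch.float16: 2,
    torch.bfloat16: 2,
    torch.int8: 1,
}

def _dtype_itemsize(dtype) -> int:
    # paddle.dtype has no `itemsize`, so fall back to a lookup table.
    itemsize = getattr(dtype, "itemsize", None)
    return itemsize if itemsize is not None else _DTYPE_SIZE_MAP[dtype]
```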

Verified: pytest tests/comm/test_trtllm_allreduce_fusion.py::test_trtllm_allreduce_fusion[True-1024-dtype0-2] and [False-1024-dtype0-2] both pass on 2 GPUs.

add adaptation-paddle skill

paddle compat: revert over-adaptation in test_trtllm_gen_fused_moe

`torch.cuda.get_device_capability`, `tensor.device`, and `tensor.to(device)`
are fully aligned under `paddle.enable_compat()`. Revert the earlier
paddle-specific detours (`torch.device.cuda.get_device_capability`,
`paddle.device(x.place)`, `paddle.get_device()`) back to plain torch APIs.

Also record the finding in adaptation-paddle skill (§10, items 31-34) as a
"do-not-over-adapt" reference for future MoE test reviews.

Verified: `pytest tests/moe/test_trtllm_gen_fused_moe.py -k test_moe_quantization_classes`
passes (1 passed).

paddle compat: restore test_trtllm_gen_fused_moe to upstream + minimal patches

The previous adaptation commented out / trimmed ~1800 lines from upstream,
making future rebases painful and dropping valid test coverage. Reset the
file to exact upstream content (github.com/flashinfer-ai/flashinfer main)
and keep only the minimum compat patches needed to run on paddle:

test file patches:
- add `import paddle; paddle.enable_compat()` at top
- `block.aminmax()` -> `block.float().aminmax()`       (paddle missing bf16 kernel)
- fp8 slice assign via `.view(torch.int8)` on both sides (paddle missing fp8 set_value kernel)
- `expertLogits.cpu()` -> `.cpu().float()`             (paddle missing cpu-bf16 topk)
- `torch.random.manual_seed` -> `torch.manual_seed`     (paddle.random lacks manual_seed)
- `torch.device(device="cuda")` -> `torch.device("cuda")` (paddle Device rejects kwarg)

same `torch.device(...)` kwarg fix in tests/moe/utils.py.
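A hedged sketch of the fp8 slice-assign workaround from the list above (shapes and names are illustrative): paddle has no fp8 set_value kernel, so the write goes through an int8 view on both sides.

```python
import torch

dst = torch.zeros(8, 16, dtype=torch.float8_e4m3fn, device="cuda")
src = torch.randn(16, device="cuda").to(torch.float8_e4m3fn)

# Instead of `dst[0] = src`, reinterpret both sides as int8 for the assignment.
dst.view(torch.int8)[0] = src.view(torch.int8)
```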

library patch (flashinfer/autotuner.py):
- `torch.cuda.OutOfMemoryError` missing under paddle. Use a sentinel placeholder
  class (NOT `RuntimeError` - that would silently swallow real kernel errors).
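A sketch of the sentinel approach (names are illustrative, not the exact autotuner.py code): a distinct placeholder type keeps the `except` clause valid without silently swallowing real kernel RuntimeErrors.

```python
import torch

try:
    OutOfMemoryError = torch.cuda.OutOfMemoryError
except AttributeError:
    # Paddle proxy: never raised, only referenced so `except` clauses stay valid.
    class OutOfMemoryError(Exception):
        pass

def try_config(run_kernel):
    try:
        return run_kernel()
    except OutOfMemoryError:
        return None  # skip this config; real RuntimeErrors still propagate
```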

Verified: `pytest test_trtllm_gen_fused_moe.py::test_fp8_block_scale_routed_activation_type_relu2_smoke`
passes. Larger parametrized cases still need library-side fixes (e.g.
`core.py::_init_packed_topk_ids` bitwise_or dtype mismatch).

Docs (skills/adaptation-paddle): record new patches 31-36 and the
"do-not-trim-upstream" lesson.

paddle compat: fix bitwise_or dtype mismatch in _init_packed_topk_ids

torch implicitly promotes int16->int32 in `(expert_ids << 16) | expert_weights`.
Paddle's bitwise_or does not, so it raises

  ValueError: The type of data we are trying to retrieve (int16) does not
  match the type of data (int32)

Fix: explicitly `.to(torch.int32)` after the `.view(torch.int16)`; this works on both backends.
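A minimal sketch of the cast (values are illustrative): the explicit `.to(torch.int32)` reproduces the promotion torch would otherwise do implicitly, so the packed result is identical on both backends.

```python
import torch

expert_ids = torch.tensor([3, 7], dtype=torch.int32)
expert_weights = torch.tensor([0.5, 1.0], dtype=torch.bfloat16)

# torch promotes the int16 view to int32 inside `|`; paddle's bitwise_or does
# not, so make the promotion explicit.
packed = (expert_ids << 16) | expert_weights.view(torch.int16).to(torch.int32)
```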

With this fix, routing-family tests (renormalize/sigmoid/deepseekv3/topk/
llama4/dyn_block/tier_1024/deepseek_ngroup1/routing_dtype_flexibility) all
progress past the dtype check. Remaining failures on this machine are
infrastructure (cubin artifactory unreachable), not paddle-compat.

modify skill

fix some issues

paddle compat: test_fused_rmsnorm_silu zero-patch adaptation

tests/norm/test_fused_rmsnorm_silu.py runs under paddle.enable_compat()
with no source changes (conftest.py already enables compat). Full run:
102 passed, 50 skipped (all skips due to torch.float4_e2m1fn_x2 missing
from paddle torch-proxy, not a kernel adaptation issue).

- adp_test.md: add row 18 recording PASS 102/152
- adaptation_exp.md: add section XI (flashinfer-ai#37-39) documenting zero-patch
  result, rationale, reproduction command, and the methodology
  recommendation (bare-run first, consult adaptation table only on
  failure).

fix format

fix some issue

@BingooYang BingooYang merged commit 2c119af into PFCCLab:0.6 on May 13, 2026
3 checks passed
