From d6412416473a346e11d3fd6cd7d6c4adf3ab4a7f Mon Sep 17 00:00:00 2001
From: bingoo <1575938147@qq.com>
Date: Thu, 7 May 2026 22:27:27 +0800
Subject: [PATCH] adapt attention/moe/allreduce for paddle compat mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- enable paddle torch proxy in conftest via paddle.enable_compat(scope={"flashinfer"})
- tests/attention/test_attention_sink_blackwell.py: prepend paddle.enable_compat(), replace torch.manual_seed with paddle.seed, replace torch.testing.assert_close with numpy.testing.assert_allclose, parametrize to a minimal shape for quick verification
- flashinfer/utils.py: access TorchVersion via the torch.torch_version proxy, with a fallback for paddle compat where paddle.torch_version is not exposed
- flashinfer/cute_dsl/fp4_common.py: add "from __future__ import annotations" to defer evaluation of the "int | torch.device | str | None" annotation, which fails under the paddle proxy (torch.device is a CallableProxyModule, not a type)

adapt prefill trtllm paged attention for paddle compat

- flashinfer/prefill.py: convert workspace_size (a tensor scalar from numel()*element_size()) to a Python int via .item() before passing it to the tvm_ffi C++ kernel, which expects int but receives ffi.Tensor under paddle (doc item #11)
- tests/conftest.py: revert paddle.enable_compat() to global scope so that `import torch` at conftest module level (outside the flashinfer scope) also resolves via the proxy

paddle compat: decode workspace_size .item(), moe fp8 index via int8 view, autotuner shape tuple, moe test support

allreduce fusion dist.group.WORLD compat

modify readme

modify format

fix env issue

fix some issues

paddle compat: fix dtype.itemsize + expand trtllm_allreduce_fusion test

- flashinfer/comm/trtllm_ar.py: paddle.dtype has no `itemsize`; add _DTYPE_SIZE_MAP + a _dtype_itemsize() fallback used in _should_use_oneshot (fixes the AttributeError raised when use_oneshot=None triggers the heuristic).
- tests/comm/test_trtllm_allreduce_fusion.py: restore the full parametrize scope (patterns/layouts/pdls/oneshots/trigger/fp32_acc); drop leftover [DBG] prints; guard the `if __name__ == "__main__"` block so mp-spawn children do not re-enter it under pytest (it was double-initializing the paddle TCPStore and aborting with SIGABRT in libuv).

Verified: pytest tests/comm/test_trtllm_allreduce_fusion.py::test_trtllm_allreduce_fusion[True-1024-dtype0-2] and [False-1024-dtype0-2] both pass on 2 GPUs.

add adaptation paddle skill

paddle compat: revert over-adaptation in test_trtllm_gen_fused_moe

`torch.cuda.get_device_capability`, `tensor.device`, and `tensor.to(device)` are fully aligned under `paddle.enable_compat()`. Revert the earlier paddle-specific detours (`torch.device.cuda.get_device_capability`, `paddle.device(x.place)`, `paddle.get_device()`) back to plain torch APIs. Also record the finding in the adaptation-paddle skill (§10, items 31-34) as a "do-not-over-adapt" reference for future MoE test reviews.

Verified: `pytest tests/moe/test_trtllm_gen_fused_moe.py -k test_moe_quantization_classes` passes (1 passed).

paddle compat: restore test_trtllm_gen_fused_moe to upstream + minimal patches

The previous adaptation commented out / trimmed ~1800 lines from upstream, making future rebases painful and dropping valid test coverage.
Reset the file to exact upstream content (github.com/flashinfer-ai/flashinfer main) and keep only the minimum compat patches needed to run on paddle.

test file patches:
- add `import paddle; paddle.enable_compat()` at the top
- `block.aminmax()` -> `block.float().aminmax()` (paddle missing bf16 kernel)
- fp8 slice assign via `.view(torch.int8)` on both sides (paddle missing fp8 set_value kernel)
- `expertLogits.cpu()` -> `.cpu().float()` (paddle missing cpu-bf16 topk)
- `torch.random.manual_seed` -> `torch.manual_seed` (paddle.random lacks manual_seed)
- `torch.device(device="cuda")` -> `torch.device("cuda")` (paddle Device rejects the kwarg)

same `torch.device(...)` kwarg fix in tests/moe/utils.py.

library patch (flashinfer/autotuner.py):
- `torch.cuda.OutOfMemoryError` is missing under paddle. Use a sentinel placeholder class (NOT `RuntimeError` - that would silently swallow real kernel errors).

Verified: `pytest test_trtllm_gen_fused_moe.py::test_fp8_block_scale_routed_activation_type_relu2_smoke` passes. Larger parametrized cases still need library-side fixes (e.g. `core.py::_init_packed_topk_ids` bitwise_or dtype mismatch).

Docs (skills/adaptation-paddle): record new patches 31-36 and the "do-not-trim-upstream" lesson.

paddle compat: fix bitwise_or dtype mismatch in _init_packed_topk_ids

torch implicitly promotes int16 -> int32 in `(expert_ids << 16) | expert_weights`. Paddle's bitwise_or does not, so it raises:

    ValueError: The type of data we are trying to retrieve (int16) does not match the type of data (int32)

Add an explicit .to(torch.int32) after .view(torch.int16); this works on both backends. With the fix, the routing-family tests (renormalize/sigmoid/deepseekv3/topk/llama4/dyn_block/tier_1024/deepseek_ngroup1/routing_dtype_flexibility) all progress past the dtype check. Remaining failures on this machine are infrastructure (cubin artifactory unreachable), not paddle-compat.

modify skill

fix some issues

paddle compat: test_fused_rmsnorm_silu zero-patch adaptation

tests/norm/test_fused_rmsnorm_silu.py runs under paddle.enable_compat() with no source changes (conftest.py already enables compat). Full run: 102 passed, 50 skipped (all skips due to torch.float4_e2m1fn_x2 missing from the paddle torch-proxy, not a kernel adaptation issue).

- adp_test.md: add row 18 recording PASS 102/152
- adaptation_exp.md: add section XI (#37-39) documenting the zero-patch result, rationale, reproduction command, and the methodology recommendation (bare-run first, consult the adaptation table only on failure).
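Several of the library patches above lean on the same probe-with-fallback idiom. A condensed sketch of that idiom follows; it mirrors the autotuner.py and trtllm_ar.py hunks below rather than adding new code, and the partial size table is illustrative only:

    import torch

    # Probe for an attribute the paddle torch-proxy may not expose; fall back to a
    # sentinel exception that is never raised, so `except _CudaOOM:` stays narrow
    # instead of silently swallowing unrelated runtime errors.
    try:
        _CudaOOM = torch.cuda.OutOfMemoryError
    except AttributeError:
        class _CudaOOM(Exception):
            """Placeholder for torch.cuda.OutOfMemoryError under paddle compat."""

    def dtype_itemsize(dtype: torch.dtype) -> int:
        # paddle.dtype has no `.itemsize`; fall back to an explicit size table.
        size = getattr(dtype, "itemsize", None)
        if size is not None:
            return size
        return {torch.float16: 2, torch.bfloat16: 2, torch.float32: 4, torch.int32: 4}[dtype]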
fix format

fix some issues
---
 .gitignore | 3 +-
 README.md | 30 ++++++
 flashinfer/autotuner.py | 16 +++-
 flashinfer/comm/__init__.py | 6 +-
 .../all_gather_matmul/all_gather_matmul.py | 7 +-
 .../all_gather_matmul_cutile.py | 6 +-
 .../all_gather_matmul_triton.py | 6 +-
 flashinfer/comm/allreduce.py | 2 +-
 flashinfer/comm/cuda_ipc.py | 27 ++++--
 flashinfer/comm/nvshmem_allreduce.py | 2 +-
 flashinfer/comm/torch_symmetric_memory.py | 13 ++-
 flashinfer/comm/trtllm_ar.py | 91 +++++++++++++------
 flashinfer/cute_dsl/fp4_common.py | 2 +
 flashinfer/decode.py | 2 +-
 flashinfer/fused_moe/core.py | 21 ++++-
 flashinfer/mla/_core.py | 2 +-
 flashinfer/prefill.py | 4 +-
 flashinfer/utils.py | 13 ++-
 pyproject.toml | 6 +-
 requirements.txt | 1 -
 .../test_attention_sink_blackwell.py | 33 ++++---
 tests/comm/test_trtllm_allreduce_fusion.py | 61 ++++++++-----
 tests/conftest.py | 67 ++++++++------
 tests/moe/test_trtllm_gen_fused_moe.py | 16 ++--
 tests/moe/utils.py | 2 +-
 25 files changed, 305 insertions(+), 134 deletions(-)

diff --git a/.gitignore b/.gitignore index 771a46b29d..5ed45d1708 100644 --- a/.gitignore +++ b/.gitignore @@ -57,7 +57,8 @@ share/python-wheels/ .installed.cfg *.egg MANIFEST - +LICENSE.* +core.* # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it.
diff --git a/README.md b/README.md index 142ec113cf..e5afc7e52a 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,36 @@ High-Performance GPU Kernels for Inference [![Build Status](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/badge/icon)](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/) [![Documentation](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml) + +> [!NOTE] +> +> This repo is a fork of the original flashinfer project, with modifications to enhance compatibility and integration with PaddlePaddle. +> +> **Installation** +> +> ```bash +> pip install paddlepaddle_gpu # Install PaddlePaddle with GPU support, refer to https://www.paddlepaddle.org.cn/install/quick for more details +> git clone https://github.com/PFCCLab/flashinfer.git +> cd flashinfer +> git submodule update --init +> pip install "apache-tvm-ffi>=0.1.2" # Use TVM FFI 0.1.2 or above +> pip install filelock jinja2 # Install tools for jit compilation +> pip install --no-build-isolation . -v +> ``` +> +> **Usage** +> +> ```python +> import paddle +> paddle.enable_compat(scope={"flashinfer"}) # Enable torch proxy before importing flashinfer +> import flashinfer +> # use flashinfer +> ``` + +The original README.md content is as follows: + +--- + **FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM. ## Why FlashInfer?
diff --git a/flashinfer/autotuner.py b/flashinfer/autotuner.py index 31a5048a91..fe6b9bf7ff 100644 --- a/flashinfer/autotuner.py +++ b/flashinfer/autotuner.py @@ -17,6 +17,16 @@ import torch +# Paddle compat: `torch.cuda.OutOfMemoryError` is missing; use an unreachable sentinel +# so the existing `except _CudaOutOfMemoryError:` clauses don't silently swallow other errors.
+try: + _CudaOutOfMemoryError = torch.cuda.OutOfMemoryError # type: ignore[attr-defined] +except AttributeError: + + class _CudaOutOfMemoryError(Exception): # type: ignore[no-redef] + """Placeholder for torch.cuda.OutOfMemoryError under paddle compat.""" + + # from tensorrt_llm.bindings.internal.runtime import delay_kernel # from tensorrt_llm.logger import logger from flashinfer.tllm_utils import delay_kernel @@ -1175,7 +1185,7 @@ def choose_one( time_measured = self._profile_single_kernel( r, tensors, tac, tuning_config, **kwargs ) - except torch.cuda.OutOfMemoryError: + except _CudaOutOfMemoryError: raise except Exception as e: skipped_count += 1 @@ -1251,7 +1261,7 @@ def choose_one( f"[Autotuner]: profiling chosen runner: {runners[runner_id]} {tactic} for {cache_key}" ) - except torch.cuda.OutOfMemoryError: + except _CudaOutOfMemoryError: torch.cuda.empty_cache() logger.warning( "[Autotuner]: OOM detected, falling back to default tactic" @@ -1275,7 +1285,7 @@ def choose_one( def _get_input_sizes(self, inputs: List[torch.Tensor]) -> List[torch.Size]: """Return ``torch.Size`` for each input, using ``(0,)`` for non-Tensor values.""" sizes = [ - input.size() if isinstance(input, torch.Tensor) else torch.Size((0,)) + tuple(input.size()) if isinstance(input, torch.Tensor) else (0,) for input in inputs ] diff --git a/flashinfer/comm/__init__.py b/flashinfer/comm/__init__.py index 689629e25d..c83fe88a86 100644 --- a/flashinfer/comm/__init__.py +++ b/flashinfer/comm/__init__.py @@ -76,4 +76,8 @@ # from .mnnvl import MnnvlMemory, MnnvlMoe, MoEAlltoallInfo # AllGatherMatmul -from .all_gather_matmul import all_gather_matmul as all_gather_matmul +# Paddle compat: optional - depends on cuda.tile which may be unavailable +try: + from .all_gather_matmul import all_gather_matmul as all_gather_matmul +except (ImportError, ModuleNotFoundError): + all_gather_matmul = None diff --git a/flashinfer/comm/all_gather_matmul/all_gather_matmul.py b/flashinfer/comm/all_gather_matmul/all_gather_matmul.py index cd6fb6f94d..fb7533108c 100644 --- a/flashinfer/comm/all_gather_matmul/all_gather_matmul.py +++ b/flashinfer/comm/all_gather_matmul/all_gather_matmul.py @@ -29,8 +29,11 @@ import torch import torch.distributed as dist - import torch.distributed._symmetric_memory as symm_mem - from flashinfer.comm import all_gather_matmul + try: + import torch.distributed._symmetric_memory as symm_mem + except (ImportError, ModuleNotFoundError): + symm_mem = None + from flashinfer.comm import all_gather_matmul # --- per-rank setup --- device = torch.device(f"cuda:{rank}") diff --git a/flashinfer/comm/all_gather_matmul/all_gather_matmul_cutile.py b/flashinfer/comm/all_gather_matmul/all_gather_matmul_cutile.py index e630d1a737..84ff7f5e39 100644 --- a/flashinfer/comm/all_gather_matmul/all_gather_matmul_cutile.py +++ b/flashinfer/comm/all_gather_matmul/all_gather_matmul_cutile.py @@ -8,7 +8,11 @@ import torch import torch.distributed as dist -import torch.distributed._symmetric_memory as symm_mem + +try: + import torch.distributed._symmetric_memory as symm_mem +except (ImportError, ModuleNotFoundError): + symm_mem = None import cuda.tile as ct from .broadcast_input import broadcast_input diff --git a/flashinfer/comm/all_gather_matmul/all_gather_matmul_triton.py b/flashinfer/comm/all_gather_matmul/all_gather_matmul_triton.py index d043226855..c5894e5d57 100644 --- a/flashinfer/comm/all_gather_matmul/all_gather_matmul_triton.py +++ b/flashinfer/comm/all_gather_matmul/all_gather_matmul_triton.py @@ -4,7 +4,11 @@ import torch import 
torch.distributed as dist -import torch.distributed._symmetric_memory as symm_mem + +try: + import torch.distributed._symmetric_memory as symm_mem +except (ImportError, ModuleNotFoundError): + symm_mem = None import triton import triton.language as tl diff --git a/flashinfer/comm/allreduce.py b/flashinfer/comm/allreduce.py index 828ddcb321..38e7834b85 100644 --- a/flashinfer/comm/allreduce.py +++ b/flashinfer/comm/allreduce.py @@ -56,7 +56,7 @@ from .workspace_base import AllReduceFusionWorkspace import torch -from torch.distributed import ProcessGroup +from paddle.base.core import ProcessGroup from flashinfer.api_logging import flashinfer_api from flashinfer.trace.templates.comm import allreduce_fusion_trace diff --git a/flashinfer/comm/cuda_ipc.py b/flashinfer/comm/cuda_ipc.py index e85c9f26e8..7319648456 100644 --- a/flashinfer/comm/cuda_ipc.py +++ b/flashinfer/comm/cuda_ipc.py @@ -19,7 +19,7 @@ from typing import Any, Dict, List, Optional import torch.distributed as dist -from torch.distributed import ProcessGroup +from paddle.base.core import ProcessGroup # NOTE(Zihao): we should use cuda-python instead of ctypes cuda runtime bindings. # However, cuda-python's API is not stable yet, so we use ctypes bindings instead. @@ -204,13 +204,23 @@ def create_shared_buffer( pointer = cudart.cudaMalloc(size_in_bytes) handle = cudart.cudaIpcGetMemHandle(pointer) if group is None: - group = dist.group.WORLD + try: + group = dist.group.WORLD + except AttributeError: + import paddle.distributed as _pdist + + group = _pdist.get_group() world_size = dist.get_world_size(group=group) rank = dist.get_rank(group=group) - handles = [None] * world_size - dist.all_gather_object(handles, handle, group=group) - handles = [None] * world_size + # Paddle compat: dist.all_gather_object uses append semantics rather than index + # assignment; start with a single placeholder and trim it after the gather. + handles: list = [] dist.all_gather_object(handles, handle, group=group) + if len(handles) == world_size + 1 and handles[0] is None: + handles = handles[1:] + assert len(handles) == world_size, ( + f"all_gather_object returned {len(handles)} entries, expected {world_size}" + ) pointers: List[int] = [] for i, h in enumerate(handles): @@ -230,7 +240,12 @@ def free_shared_buffer( Frees a shared buffer. """ if group is None: - group = dist.group.WORLD + try: + group = dist.group.WORLD + except AttributeError: + import paddle.distributed as _pdist + + group = _pdist.get_group() rank = dist.get_rank(group=group) if pointers and len(pointers) > rank and pointers[rank] is not None: cudart.cudaFree(ctypes.c_void_p(pointers[rank])) diff --git a/flashinfer/comm/nvshmem_allreduce.py b/flashinfer/comm/nvshmem_allreduce.py index 9289730c81..b6ddd3baf9 100644 --- a/flashinfer/comm/nvshmem_allreduce.py +++ b/flashinfer/comm/nvshmem_allreduce.py @@ -19,7 +19,7 @@ import numpy as np import torch -from torch.distributed import ProcessGroup +from paddle.base.core import ProcessGroup import nvshmem.core from cuda.core import Buffer, Device diff --git a/flashinfer/comm/torch_symmetric_memory.py b/flashinfer/comm/torch_symmetric_memory.py index f92c987ab7..18b491cdbb 100644 --- a/flashinfer/comm/torch_symmetric_memory.py +++ b/flashinfer/comm/torch_symmetric_memory.py @@ -2,8 +2,17 @@ from typing import Any import torch -import torch.distributed._symmetric_memory as symm_mem -import torch.distributed.distributed_c10d as c10d + +# Paddle compat: torch.distributed._symmetric_memory is not available under paddle proxy. 
+try: + import torch.distributed._symmetric_memory as symm_mem + import torch.distributed.distributed_c10d as c10d + + _SYMM_MEM_AVAILABLE = True +except (ImportError, ModuleNotFoundError): + symm_mem = None + c10d = None + _SYMM_MEM_AVAILABLE = False _compat_patched = False diff --git a/flashinfer/comm/trtllm_ar.py b/flashinfer/comm/trtllm_ar.py index 7643bb2f5a..b91003811c 100644 --- a/flashinfer/comm/trtllm_ar.py +++ b/flashinfer/comm/trtllm_ar.py @@ -24,13 +24,13 @@ from flashinfer.comm.mnnvl import CommBackend, SymmDeviceMemory, TorchDistBackend import torch import torch.distributed as dist -from torch.distributed import ProcessGroup +from paddle.base.core import ProcessGroup from ..jit.comm import gen_trtllm_comm_module from ..utils import register_custom_op, round_up logger = logging.getLogger(__name__) -from .cuda_ipc import cudart +from .cuda_ipc import cudart, create_shared_buffer from .torch_symmetric_memory import _alloc_symm_buffer_bytes @@ -629,36 +629,42 @@ def trtllm_create_ipc_workspace_for_all_reduce_fusion( lamport_buffer_size = lamport_comm_size * 3 device = torch.device(f"cuda:{torch.cuda.current_device()}") - group_name = ( - group.group_name - if group is not None - else torch.distributed.group.WORLD.group_name - ) - symm_refs: list[torch.Tensor] = [] - # we should init 3 buffers for all reduce fusion: - # [buffer_size, flag_size, lamport_buffer_size] + # Paddle compat: symmetric memory is not available; fall back to cudaIpc-based shared buffers. + from .torch_symmetric_memory import _SYMM_MEM_AVAILABLE + symm_refs: list[torch.Tensor] = [] ipc_handles: List[List[int]] = list() mem_handles: List[SymmDeviceMemory] = list() lamport_buffer_dtype = torch.float16 if not use_fp32_lamport else torch.float32 - for size, dtype in [ - (buffer_size, torch.float32), - (flag_size, torch.int32), - (lamport_buffer_size, lamport_buffer_dtype), - ]: - aligned_size = round_up(size, 16) - ptrs, tensor, handle = _alloc_symm_buffer_bytes( - aligned_size, - tp_size, - dtype, - device, - group_name, + if not _SYMM_MEM_AVAILABLE: + for size in [buffer_size, flag_size, lamport_buffer_size]: + aligned_size = round_up(size, 1 << 21) + ipc_handles.append(create_shared_buffer(aligned_size, group)) + else: + group_name = ( + group.group_name + if group is not None + else torch.distributed.group.WORLD.group_name ) - symm_refs.append((tensor, handle)) - ipc_handles.append(ptrs) - mem_handles.append(handle) + for size, dtype in [ + (buffer_size, torch.float32), + (flag_size, torch.int32), + (lamport_buffer_size, lamport_buffer_dtype), + ]: + aligned_size = round_up(size, 16) + + ptrs, tensor, handle = _alloc_symm_buffer_bytes( + aligned_size, + tp_size, + dtype, + device, + group_name, + ) + symm_refs.append((tensor, handle)) + ipc_handles.append(ptrs) + mem_handles.append(handle) logger.debug( "rank %s allocated ipc_handles: %s", @@ -712,10 +718,9 @@ def trtllm_create_ipc_workspace_for_all_reduce_fusion( for i in range(len(workspace)): logger.debug("Rank %s workspace[%d] %s", tp_rank, i, hex(workspace[i])) - # Store workspace pointers in device tensor - workspace_tensor = torch.tensor( - workspace, dtype=torch.int64, device=torch.device("cuda") - ) + # Store workspace pointers in device tensor. + # Paddle compat: torch.tensor(..., device=torch.device("cuda")) is buggy; use .cuda(). 
+ workspace_tensor = torch.tensor(workspace, dtype=torch.int64).cuda() if use_symm_dev_mem: comm_backend.barrier() # must sync after create_workspace @@ -891,11 +896,37 @@ def trtllm_custom_all_reduce( } +# Paddle compat: paddle.dtype has no `itemsize` attribute; maintain a mapping fallback. +_DTYPE_SIZE_MAP: dict = { + torch.float16: 2, + torch.bfloat16: 2, + torch.float32: 4, + torch.float64: 8, + torch.int8: 1, + torch.int16: 2, + torch.int32: 4, + torch.int64: 8, + torch.uint8: 1, + torch.bool: 1, + torch.complex64: 8, + torch.complex128: 16, +} + + +def _dtype_itemsize(dtype) -> int: + itemsize = getattr(dtype, "itemsize", None) + if itemsize is not None: + return itemsize + if dtype in _DTYPE_SIZE_MAP: + return _DTYPE_SIZE_MAP[dtype] + raise TypeError(f"Cannot determine itemsize for dtype {dtype!r}") + + def _should_use_oneshot( token_num: int, hidden_dim: int, dtype: torch.dtype, world_size: int ) -> bool: comm_size_mb = ( - token_num * hidden_dim * 2 * world_size * dtype.itemsize / 1024 / 1024 + token_num * hidden_dim * 2 * world_size * _dtype_itemsize(dtype) / 1024 / 1024 ) return comm_size_mb <= _use_oneshot_heuristics[world_size] diff --git a/flashinfer/cute_dsl/fp4_common.py b/flashinfer/cute_dsl/fp4_common.py index 5d6ffff2b5..bcde9bf1c8 100644 --- a/flashinfer/cute_dsl/fp4_common.py +++ b/flashinfer/cute_dsl/fp4_common.py @@ -19,6 +19,8 @@ utilities used by both rmsnorm_fp4quant.py and add_rmsnorm_fp4quant.py. """ +from __future__ import annotations + import functools import math import operator diff --git a/flashinfer/decode.py b/flashinfer/decode.py index dff3aafb7c..e0de512fb2 100644 --- a/flashinfer/decode.py +++ b/flashinfer/decode.py @@ -2645,7 +2645,7 @@ def trtllm_batch_decode_with_kv_cache( 0, # sparse_mla_top_k sm_count, enable_pdl, - workspace_buffer.numel() * workspace_buffer.element_size(), + int(workspace_buffer.numel() * workspace_buffer.element_size()), sinks, cum_seq_lens_q, k_block_scales, diff --git a/flashinfer/fused_moe/core.py b/flashinfer/fused_moe/core.py index 1cf830b752..ec03ff7aa0 100644 --- a/flashinfer/fused_moe/core.py +++ b/flashinfer/fused_moe/core.py @@ -201,6 +201,16 @@ def reorder_rows_for_gated_act_gemm(x): """ row_indices = get_reorder_rows_for_gated_act_gemm_row_indices(x) + # Paddle compat: index kernel doesnt support float8 dtypes. View as int8 and back. + try: + fp8_types = (torch.float8_e4m3fn, torch.float8_e5m2) + except AttributeError: + fp8_types = () + if fp8_types and x.dtype in fp8_types: + orig_dtype = x.dtype + out = x.view(torch.int8)[row_indices] + return out.view(orig_dtype) + permute = lambda x: x[row_indices] return permute(x) @@ -1098,9 +1108,14 @@ def _init_packed_topk_ids(shapes, dtype, device): expert_ids = torch.randint( 0, num_experts, shapes, dtype=torch.int32, device=device ) - expert_weights = torch.ones( - shapes, dtype=torch.bfloat16, device=device - ).view(torch.int16) + # Paddle compat: bitwise_or requires matching dtypes; torch + # would implicitly promote int16 -> int32 here. Promote + # explicitly so the code works under both backends. 
+ expert_weights = ( + torch.ones(shapes, dtype=torch.bfloat16, device=device) + .view(torch.int16) + .to(torch.int32) + ) return (expert_ids << 16) | expert_weights spec = { diff --git a/flashinfer/mla/_core.py b/flashinfer/mla/_core.py index 16d1770fb5..1d430bee98 100644 --- a/flashinfer/mla/_core.py +++ b/flashinfer/mla/_core.py @@ -798,7 +798,7 @@ def trtllm_batch_decode_with_kv_cache_mla( sparse_mla_top_k, sm_count, enable_pdl, - workspace_buffer.numel() * workspace_buffer.element_size(), + int(workspace_buffer.numel() * workspace_buffer.element_size()), sinks, None, # cum_seq_lens_q None, # key_block_scales diff --git a/flashinfer/prefill.py b/flashinfer/prefill.py index 8b1baa33df..1e7054e633 100755 --- a/flashinfer/prefill.py +++ b/flashinfer/prefill.py @@ -3937,7 +3937,7 @@ def trtllm_ragged_attention_deepseek( if isinstance(bmm2_scale, torch.Tensor): assert bmm2_scale.dtype == torch.float32 - workspace_size = workspace_buffer.numel() * workspace_buffer.element_size() + workspace_size = int(workspace_buffer.numel() * workspace_buffer.element_size()) sage_attn_sfs_q, sage_attn_sfs_k, sage_attn_sfs_p, sage_attn_sfs_v = ( sage_attn_sfs ) @@ -4251,7 +4251,7 @@ def trtllm_batch_context_with_kv_cache( if isinstance(bmm2_scale, torch.Tensor): assert bmm2_scale.dtype == torch.float32 _check_block_tables_shape(block_tables, uses_shared_paged_kv_idx) - workspace_size = workspace_buffer.numel() * workspace_buffer.element_size() + workspace_size = int(workspace_buffer.numel() * workspace_buffer.element_size()) run_func( out, out_scale_factor, diff --git a/flashinfer/utils.py b/flashinfer/utils.py index 6decfe1989..0c9a6422e1 100644 --- a/flashinfer/utils.py +++ b/flashinfer/utils.py @@ -22,8 +22,17 @@ import torch import torch.version import pynvml -from torch.torch_version import TorchVersion -from torch.torch_version import __version__ as torch_version + +try: + from torch.torch_version import TorchVersion + from torch.torch_version import __version__ as torch_version +except (ImportError, AttributeError): + # Paddle compat: torch.torch_version is not exposed by paddle proxy + class TorchVersion(str): # type: ignore[no-redef] + def __lt__(self, other): + return False + + torch_version = TorchVersion(torch.__version__) import inspect from .jit.spdlog import gen_spdlog_module diff --git a/pyproject.toml b/pyproject.toml index 9ea503dbd5..e69b933226 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -13,14 +13,14 @@ # limitations under the License. [project] -name = "flashinfer-python" -dynamic = ["version", "dependencies"] +name = "flashinfer-python-paddle" description = "FlashInfer: Kernel Library for LLM Serving" requires-python = ">=3.10,<4.0" authors = [{ name = "FlashInfer team" }] license = "Apache-2.0" readme = "README.md" -urls = { Homepage = "https://github.com/flashinfer-ai/flashinfer" } +urls = { Homepage = "https://github.com/PFCCLab/flashinfer" } +dynamic = ["dependencies", "version"] license-files = ["LICENSE", "LICENSE*.txt"] [project.optional-dependencies] diff --git a/requirements.txt b/requirements.txt index c7534c8108..2544588256 100644 --- a/requirements.txt +++ b/requirements.txt @@ -10,5 +10,4 @@ nvidia-ml-py packaging>=24.2 requests tabulate -torch tqdm diff --git a/tests/attention/test_attention_sink_blackwell.py b/tests/attention/test_attention_sink_blackwell.py index d9fa320c2c..47c8c6e9bd 100644 --- a/tests/attention/test_attention_sink_blackwell.py +++ b/tests/attention/test_attention_sink_blackwell.py @@ -14,13 +14,17 @@ limitations under the License. 
""" +import paddle + +paddle.enable_compat() import einops import pytest import torch +import numpy as np from tests.test_helpers.sink_attention_reference import sink_attention_unified import flashinfer -from flashinfer.utils import get_compute_capability +# from flashinfer.utils import get_compute_capability @pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16]) @@ -39,11 +43,11 @@ def test_blackwell_trtllm_gen_decode_attention_sink( num_kv_heads, head_dim, ): - compute_capability = get_compute_capability(torch.device(device="cuda")) - if compute_capability[0] != 10: - pytest.skip("trtllm-gen only supports SM100 and SM103 GPUs.") - seed = 0 - torch.manual_seed(seed) + # compute_capability = get_compute_capability(torch.device(device="cuda")) + # if compute_capability[0] in [11, 12]: + # pytest.skip("trtllm-gen does not support SM110/SM120/SM121 GPUs.") + # seed = 0 + # torch.manual_seed(seed) device = "cuda:0" seq_lens = torch.full((batch_size,), seq_len, dtype=torch.int32, device=device) @@ -121,7 +125,8 @@ def test_blackwell_trtllm_gen_decode_attention_sink( else: raise ValueError(f"Unsupported dtype: {dtype}") - torch.testing.assert_close(o_ref, output, atol=atol, rtol=rtol) + # torch.testing.assert_close(o_ref, output, atol=atol, rtol=rtol) + np.testing.assert_allclose(o_ref.float(), output.float(), atol=atol, rtol=rtol) @pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16]) @@ -140,11 +145,12 @@ def test_blackwell_trtllm_gen_context_attention_sink( num_kv_heads, head_dim, ): - compute_capability = get_compute_capability(torch.device(device="cuda")) - if compute_capability[0] != 10: - pytest.skip("These tests are only guaranteed to work on SM100 and SM103 GPUs.") + # compute_capability = get_compute_capability(torch.device(device="cuda")) + # if compute_capability[0] in [11, 12]: + # pytest.skip("trtllm-gen does not support SM110/SM120/SM121 GPUs.") seed = 0 - torch.manual_seed(seed) + paddle.seed(seed) + # torch.manual_seed(seed) device = "cuda:0" seq_lens = torch.full((batch_size,), seq_len, dtype=torch.int32, device=device) @@ -232,5 +238,6 @@ def test_blackwell_trtllm_gen_context_attention_sink( atol, rtol = 1e-2, 1e-2 else: raise ValueError(f"Unsupported dtype: {dtype}") - - torch.testing.assert_close(o_ref, output, atol=atol, rtol=rtol) + ref_o = o_ref.float().numpy() + output_o = output.float().numpy() + np.testing.assert_allclose(ref_o, output_o, atol=atol, rtol=rtol) diff --git a/tests/comm/test_trtllm_allreduce_fusion.py b/tests/comm/test_trtllm_allreduce_fusion.py index 9c813c8439..7a6bc4f7c3 100644 --- a/tests/comm/test_trtllm_allreduce_fusion.py +++ b/tests/comm/test_trtllm_allreduce_fusion.py @@ -1,7 +1,11 @@ import multiprocessing as mp +import os import socket from typing import Any +import paddle + +paddle.enable_compat() import numpy as np import pytest import torch @@ -35,16 +39,28 @@ def _run_correctness_worker( ): device = torch.device(f"cuda:{rank + gpu_offset}") torch.cuda.set_device(device) - distributed_init_method = f"tcp://localhost:{distributed_init_port}" - dist.init_process_group( - backend="nccl", - init_method=distributed_init_method, - rank=rank, - world_size=world_size, + # Paddle compat: init_process_group is not available; use init_parallel_env with env vars + # Paddle requires the lowercase FLAGS_selected_gpus env var name; keep as-is. 
+ os.environ["FLAGS_selected_gpus"] = str(rank + gpu_offset) # noqa: SIM112 + os.environ["PADDLE_TRAINER_ID"] = str(rank) + os.environ["PADDLE_TRAINERS_NUM"] = str(world_size) + os.environ["PADDLE_RANK_IN_NODE"] = str(rank) + os.environ["PADDLE_LOCAL_DEVICE_IDS"] = str(rank + gpu_offset) + os.environ["PADDLE_WORLD_DEVICE_IDS"] = ",".join( + str(i + gpu_offset) for i in range(world_size) + ) + os.environ["PADDLE_CURRENT_ENDPOINT"] = ( + f"127.0.0.1:{distributed_init_port + rank + 1}" ) - group = dist.group.WORLD + os.environ["PADDLE_TRAINER_ENDPOINTS"] = ",".join( + f"127.0.0.1:{distributed_init_port + i + 1}" for i in range(world_size) + ) + os.environ["PADDLE_MASTER"] = f"127.0.0.1:{distributed_init_port}" + paddle.distributed.init_parallel_env() + group = paddle.distributed.get_group() try: + # Paddle compat: reduce parametrize scope for quicker verification token_nums = [1, 128, 1024, 2048] pattern_codes = [ comm.AllReduceFusionPattern.kAllReduce, @@ -181,11 +197,9 @@ def _run_correctness_worker( ) rms_eps = 1e-3 - # warmup - s = torch.cuda.Stream() - s.wait_stream(torch.cuda.current_stream()) - with torch.cuda.stream(s): - for _ in range(test_loop): + # warmup (paddle compat: skip custom stream) + if True: + for _wi in range(test_loop): if legacy_api: # Legacy API - uses flattened tensors comm.trtllm_allreduce_fusion( @@ -246,9 +260,8 @@ def _run_correctness_worker( ) # NOTE: in real case, you dont have to set all optional params. You could set those required by fusion pattern. - # capture - g = torch.cuda.CUDAGraph() - with torch.cuda.graph(g): + # Paddle compat: torch.cuda.CUDAGraph not available; skip graph capture and run directly. + if True: for _ in range(test_loop): if legacy_api: # Legacy API - uses flattened tensors @@ -308,8 +321,6 @@ def _run_correctness_worker( use_oneshot=use_oneshot, fp32_acc=fp32_acc, ) - # replay - g.replay() torch.cuda.synchronize() # match shape @@ -521,8 +532,14 @@ def test_trtllm_allreduce_fusion_gpu_offset(world_size, dtype, legacy_api): if __name__ == "__main__": - # Test both legacy and unified APIs - print("Testing legacy API...") - test_trtllm_allreduce_fusion(2, torch.float16, 1024, legacy_api=True) - print("\nTesting unified API...") - test_trtllm_allreduce_fusion(2, torch.float16, 1024, legacy_api=False) + import sys + + # Paddle compat: run directly only when invoked as `python tests/...`. + # Guard against mp-spawn child re-execution that can resurrect `__main__` + # and double-initialize paddle distributed TCPStore. 
+ if not any("pytest" in a for a in sys.argv): + # Test both legacy and unified APIs + print("Testing legacy API...") + test_trtllm_allreduce_fusion(2, torch.float16, 1024, legacy_api=True) + print("\nTesting unified API...") + test_trtllm_allreduce_fusion(2, torch.float16, 1024, legacy_api=False) diff --git a/tests/conftest.py b/tests/conftest.py index 768eec8fa3..f23f0d6290 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -4,13 +4,16 @@ from pathlib import Path from typing import Any, Dict, Set +import paddle + +paddle.enable_compat() import pytest import torch -from torch.torch_version import TorchVersion -from torch.torch_version import __version__ as torch_version +# from torch.torch_version import TorchVersion +# from torch.torch_version import __version__ as torch_version import flashinfer -from flashinfer.jit import MissingJITCacheError +# from flashinfer.jit import MissingJITCacheError # Global tracking for JIT cache coverage # Store tuples of (test_name, module_name, spec_info) @@ -126,8 +129,8 @@ def wrapper(*args, **kwargs): def pytest_configure(config): if os.environ.get("FLASHINFER_TEST_TORCH_COMPILE", "0") == "1": - if torch_version < TorchVersion("2.4"): - pytest.skip("torch.compile requires torch >= 2.4") + # if torch_version < TorchVersion("2.4"): + # pytest.skip("torch.compile requires torch >= 2.4") _set_torch_compile_options() for fn in TORCH_COMPILE_FNS: _monkeypatch_add_torch_compile(fn) @@ -137,34 +140,38 @@ def is_cuda_oom_error_str(e: str) -> bool: return "CUDA" in e and "out of memory" in e -@pytest.hookimpl(wrapper=True) +@pytest.hookimpl(tryfirst=True) def pytest_runtest_call(item): # skip OOM error and missing JIT cache errors try: - yield - except (torch.cuda.OutOfMemoryError, RuntimeError) as e: - if isinstance(e, torch.cuda.OutOfMemoryError) or is_cuda_oom_error_str(str(e)): - pytest.skip("Skipping due to OOM") - elif isinstance(e, MissingJITCacheError): - # Record the test that was skipped due to missing JIT cache - test_name = item.nodeid - spec = e.spec - module_name = spec.name if spec else "unknown" - - # Create a dict with module info for reporting - spec_info = None - if spec: - spec_info = { - "name": spec.name, - "sources": [str(s) for s in spec.sources], - "needs_device_linking": spec.needs_device_linking, - "aot_path": str(spec.aot_path), - } - - _MISSING_JIT_CACHE_MODULES.add((test_name, module_name, str(spec_info))) - pytest.skip(f"Skipping due to missing JIT cache for module: {module_name}") - else: - raise + item.runtest() + except Exception: + raise + # try: + # item.runtest() + # except (torch.cuda.OutOfMemoryError, RuntimeError) as e: + # if isinstance(e, torch.cuda.OutOfMemoryError) or is_cuda_oom_error_str(str(e)): + # pytest.skip("Skipping due to OOM") + # elif isinstance(e, MissingJITCacheError): + # # Record the test that was skipped due to missing JIT cache + # test_name = item.nodeid + # spec = e.spec + # module_name = spec.name if spec else "unknown" + + # # Create a dict with module info for reporting + # spec_info = None + # if spec: + # spec_info = { + # "name": spec.name, + # "sources": [str(s) for s in spec.sources], + # "needs_device_linking": spec.needs_device_linking, + # "aot_path": str(spec.aot_path), + # } + + # _MISSING_JIT_CACHE_MODULES.add((test_name, module_name, str(spec_info))) + # pytest.skip(f"Skipping due to missing JIT cache for module: {module_name}") + # else: + # raise def pytest_terminal_summary(terminalreporter, exitstatus, config): diff --git a/tests/moe/test_trtllm_gen_fused_moe.py 
b/tests/moe/test_trtllm_gen_fused_moe.py index 231db811d8..22f807a318 100644 --- a/tests/moe/test_trtllm_gen_fused_moe.py +++ b/tests/moe/test_trtllm_gen_fused_moe.py @@ -14,6 +14,9 @@ limitations under the License. """ +import paddle + +paddle.enable_compat() import pytest from abc import ABC, abstractmethod from typing import Dict @@ -948,7 +951,7 @@ def to_float8_blockwise( block = x[start_m:end_m, start_n:end_n] # Per-block quantization logic - min_val, max_val = block.aminmax() + min_val, max_val = block.float().aminmax() amax = torch.maximum(min_val.abs(), max_val.abs()).clamp(min=1e-12) scale = finfo.max / amax @@ -956,8 +959,9 @@ def to_float8_blockwise( quantized_block = (block * scale).clamp( min=finfo.min, max=finfo.max ) - quantized_x[start_m:end_m, start_n:end_n] = quantized_block.to( - dtype + # Paddle compat: fp8 set_value_with_tensor kernel missing; use int8 view + quantized_x.view(torch.int8)[start_m:end_m, start_n:end_n] = ( + quantized_block.to(dtype).view(torch.int8) ) scales[i, j] = scale.float().reciprocal() @@ -1656,7 +1660,7 @@ def __init__( def routing_reference(expertLogits, topK, padding): """Reference routing implementation for permutation calculation.""" originalDevice = expertLogits.device - expertLogits = expertLogits.cpu() + expertLogits = expertLogits.cpu().float() numTokens, numExperts = expertLogits.shape assert topK <= numExperts @@ -2684,7 +2688,7 @@ def run_moe_test( moe_impl._cache_permute_indices = cache_permute_indices seed = 0 - torch.random.manual_seed(seed) + torch.manual_seed(seed) # Extract routing configuration top_k = routing_config["top_k"] @@ -4391,7 +4395,7 @@ def test_routing_dtype_flexibility( def test_fp8_block_scale_routed_activation_type_relu2_smoke(): """Smoke test routed FP8 block-scale call path with explicit non-gated activation_type.""" - compute_capability = get_compute_capability(torch.device(device="cuda")) + compute_capability = get_compute_capability(torch.device("cuda")) if compute_capability[0] not in [10]: pytest.skip("These tests are only guaranteed to work on SM100 and SM103 GPUs.") diff --git a/tests/moe/utils.py b/tests/moe/utils.py index 39abf18717..556d7b8d9f 100644 --- a/tests/moe/utils.py +++ b/tests/moe/utils.py @@ -63,7 +63,7 @@ def skip_checks( zero_hidden_states=False, ): """Common skip logic for all tests.""" - compute_capability = get_compute_capability(torch.device(device="cuda")) + compute_capability = get_compute_capability(torch.device("cuda")) if compute_capability[0] not in [10]: pytest.skip("These tests are only guaranteed to work on SM100 and SM103 GPUs.")
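Reviewer note: the fp8 slice-assign workaround appears twice above (tests/moe/test_trtllm_gen_fused_moe.py and flashinfer/fused_moe/core.py). A standalone sketch of the int8-view trick, assuming a torch build that exposes torch.float8_e4m3fn; tensor names are placeholders echoing the test:

    import torch

    dtype = torch.float8_e4m3fn
    quantized_x = torch.zeros(4, 4, dtype=dtype)
    quantized_block = torch.ones(2, 2).to(dtype)

    # paddle has no fp8 set_value kernel, so assign through a bit-identical int8
    # view on both sides instead of slicing the fp8 tensor directly.
    quantized_x.view(torch.int8)[0:2, 0:2] = quantized_block.view(torch.int8)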