Conversation

@PanZezhong1725 (Collaborator)
No description provided.

wooway777 and others added 24 commits January 27, 2026 10:36
…graph recording

- Ensure embedding tensors are on the same device. Change format.
- Optimize embedding kernel with vectorized memory access and __ldg
- Add vectorized memory access using float4/float2, half2, and bfloat162
- Use __ldg instruction for read-only weight and indices access
- Add memory alignment checks to enable vectorized paths
- Add __restrict__ keywords for better compiler optimization
- Implement dynamic block size selection based on embedding_dim
Add a C++-level wrapper for `NineToothedTensor`

Add a way to create a `ninetoothed::Tensor` using arrays as `shape` and `strides`

Wire up the NineToothed ReLU operator using `ninetoothed::Tensor`

Add an include guard to `ninetoothed/utils.h`
PanZezhong1725 and others added 7 commits January 27, 2026 11:30
issue/811: use relaxed graph capture mode
demo131 - multiple issues regarding quantization, qy, and so forth

* issue/843: success per_channel_quant_int8

* issue/843: success qy quant

* issue/843: modified quant

* Add w8a8int8 performance tests

* add infinicore op linear_w8a8i8

* w8a8 linear module functional nn

* issue/843: QY-GPU Support Int8 scale_mm (#68)

* issue/843: success qy scaled_mm

* issue/843: modified kernel.cuh as per_channel_dequant_int8.cuh

* fix parallel slice in w8

* w8: support multiple batch size

* temp: modify quantconfig handling

* fix format and delete redundant code

* fix format

* fix format

* fix format

* Refactor: add new API alongside legacy interfaces with deprecation warnings

* Add w4 InfiniCore-related content, and move the quantization config into InfiniCore

* Support graph capture in quantization operators

* solve cub version problem and fix code structure

* fix format

* demo131 - remove commented lines

---------

Co-authored-by: xgqdut2016 <kenan_gewei@163.com>
Co-authored-by: xgqdut2016 <140036308+xgqdut2016@users.noreply.github.com>
Co-authored-by: wooway777 <wooway777@gmail.com>
