
[Speculative Decoding] Refine ngram kernel signature and adapt ngram proposer#7774

Open
NKNaN wants to merge 2 commits into
PaddlePaddle:developfrom
NKNaN:spec-ngram

Conversation


@NKNaN NKNaN commented May 11, 2026

Motivation

End-to-end result verification of the speculative-decoding ngram method.

Modifications

  1. Test script (runs successfully on a single-A800 AI Studio environment):

    # test.py
    from fastdeploy import LLM, SamplingParams
    
    # Scenario 1: code generation -- variable names, keywords, and structure repeat heavily, so the ngram hit rate is high
    msg1 = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": (
            "用 Python 写一个 Student 类,包含以下方法:\n"
            "1. __init__(self, name, age, score)\n"
            "2. get_name(self) 返回 self.name\n"
            "3. get_age(self) 返回 self.age\n"
            "4. get_score(self) 返回 self.score\n"
            "5. set_name(self, name) 设置 self.name\n"
            "6. set_age(self, age) 设置 self.age\n"
            "7. set_score(self, score) 设置 self.score\n"
            "8. __repr__(self) 返回 f'Student(name={self.name}, age={self.age}, score={self.score})'\n"
            "请完整实现所有方法。"
        )},
    ]
    
    # Scenario 2: structured list -- every entry shares the same format, so prefix n-grams repeat heavily during generation
    msg2 = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": (
            "请列出20个中国城市,每条格式为:\n"
            "城市名:xxx,省份:xxx,人口:约xxx万,著名景点:xxx\n"
            "请严格按照这个格式输出全部20条,不要省略。"
        )},
    ]
    
    messages = [msg1, msg2]
    
    # sampling parameters
    sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)
    
    # load the model
    llm = LLM(
        model="baidu/ERNIE-4.5-0.3B-Paddle",
        tensor_parallel_size=1,
        max_model_len=8192,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,   # propose at most 5 draft tokens per round, range [1, 5]
            "max_ngram_size": 5,           # maximum n-gram window, default 5
        },
       # enable_overlap_schedule=True,
    )
    
    outputs = llm.chat(messages, sampling_params)
    
    # print the results
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs.text
        print(prompt)
        print(generated_text)
  2. Modify the ngram kernel interface:
    Since input_ids and pre_ids are now both merged into token_ids_all, the input_ids argument is removed from the original interface; prompt tokens and predicted tokens are recorded entirely in token_ids_all.
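    The single-buffer lookup the refactored kernel performs can be sketched in pure Python. This is an illustrative sketch only: the function name `ngram_propose`, its argument names, and the longest-suffix-first search order are assumptions, not the actual CUDA kernel.

    ```python
    def ngram_propose(token_ids, cur_len, max_ngram_size=5, num_speculative_tokens=5):
        """Sketch of n-gram draft proposal over one combined token buffer.

        token_ids holds prompt tokens followed by generated tokens (as in
        token_ids_all); cur_len is the number of valid tokens so far.
        Try the longest suffix n-gram first; on a match, return the tokens
        that followed the earlier occurrence as the draft tokens.
        """
        history = token_ids[:cur_len]
        for n in range(min(max_ngram_size, cur_len - 1), 0, -1):
            suffix = history[cur_len - n:]
            # scan earlier positions (most recent first) for the same n-gram
            for start in range(cur_len - n - 1, -1, -1):
                if history[start:start + n] == suffix:
                    follow = history[start + n:start + n + num_speculative_tokens]
                    if follow:
                        return follow
        return []
    ```

    With a single buffer there is no prompt/decode boundary to handle inside the kernel, which is what makes dropping the input_ids argument possible.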

  3. Confirm that the modified ngram match kernel executes correctly end to end:

    1. Initialization of token_ids_all and input_ids_cpu:
    # fastdeploy/worker/input_batch.py: 114-115
    self.token_ids_all = paddle.full(
        [max_num_seqs, self.model_config.max_model_len], ...
    )
    # fastdeploy/worker/input_batch.py: 280-281
    self.input_ids_cpu = paddle.full(
        shape=[max_num_seqs, self.model_config.max_model_len], ...
    )
    2. Verify that the prompt portion of token_ids_all written in gpu_model_runner matches what NgramProposer._run_impl reads back (checked by printing logs):
    # fastdeploy/worker/gpu_model_runner.py: 916-919
    # prompt_tokens
    async_set_value(self.share_inputs["token_ids_all"][idx : idx + 1, :prompt_len], prompt_token_ids)
    # generated_token_ids fill -1
    self.share_inputs["token_ids_all"][idx : idx + 1, prompt_len:] = -1
    
    ## log token_ids_all[i, 0:20] and token_ids_all[i, prompt_len-3:prompt_len+3] here
    logger.info(f"[NGRAM][VERIFY-WRITE] idx={idx} prompt_len={prompt_len} "
            f"token_ids_all[0:20]={self.share_inputs['token_ids_all'][idx, :20].tolist()} "
            f"token_ids_all[pl-3:pl+3]={self.share_inputs['token_ids_all'][idx, prompt_len-3:prompt_len+3].tolist()}")
    # added at the beginning of ngram.py _run_impl
    def _run_impl(self, share_inputs):
        """
        run
        """
        if not hasattr(self, '_debug_call_count'):
            self._debug_call_count = 0
        if self._debug_call_count < 3:
            pl = share_inputs["prompt_lens"]
            tia = share_inputs["token_ids_all"]
            si = share_inputs["step_idx"]
            for bid in range(pl.shape[0]):
                plen = int(pl[bid].item())
                if plen > 0:
                    logger.info(f"[NGRAM][VERIFY-READ] call={self._debug_call_count} bid={bid} "
                                f"step_idx={int(si[bid].item())} prompt_len={plen} "
                                f"token_ids_all[0:20]={tia[bid, :20].tolist()} "
                                f"token_ids_all[pl-3:pl]={tia[bid, plen-3:plen].tolist()}"
                                f"seq_lens_dec={int(share_inputs['seq_lens_decoder'][bid].item())} ")
            self._debug_call_count += 1

        ngram_match(...)
    # inspect the log
    (base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM\]' log/paddle/workerlog.0
    INFO     2026-05-10 13:31:12,504 684126 gpu_model_runner.py[line:920] [NGRAM][VERIFY-WRITE] idx=0 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl+3]=[92267, 93963, 93919, -1, -1, -1]
    INFO     2026-05-10 13:31:12,514 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=0 step_idx=1 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=168 
    INFO     2026-05-10 13:31:12,516 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=1 step_idx=13 prompt_len=4096 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,516 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,517 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,518 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,519 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=0 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,522 684126 gpu_model_runner.py[line:920] [NGRAM][VERIFY-WRITE] idx=1 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl+3]=[92267, 93963, 93919, -1, -1, -1]
    INFO     2026-05-10 13:31:12,533 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=0 step_idx=2 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=169 
    INFO     2026-05-10 13:31:12,533 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=1 step_idx=1 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=54 
    INFO     2026-05-10 13:31:12,535 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,535 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,536 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=1 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,541 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=0 step_idx=3 prompt_len=168 token_ids_all[0:20]=[100273, 2520, 524, 274, 20472, 17461, 27963, 93937, 23, 2969, 93963, 16816, 12199, 93919, 94667, 748, 36619, 69716, 93956, 10553] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=170 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=1 step_idx=2 prompt_len=54 token_ids_all[0:20]=[100273, 2969, 93963, 69157, 63191, 5, 3, 94016, 1358, 3671, 93956, 94405, 94525, 14246, 94022, 94035, 23, 3671, 94312, 94035] token_ids_all[pl-3:pl]=[92267, 93963, 93919]seq_lens_dec=55 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=2 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=3 step_idx=13 prompt_len=2048 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,542 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=4 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=5 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=6 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0 
    INFO     2026-05-10 13:31:12,543 684126 ngram.py[line:52] [NGRAM][VERIFY-READ] call=2 bid=7 step_idx=13 prompt_len=1024 token_ids_all[0:20]=[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5] token_ids_all[pl-3:pl]=[5, 5, 5]seq_lens_dec=0

    Batches whose token_ids_all is filled with 5 are dummy batches (seq_lens_decoder=0). Apart from those, the prompt portion of token_ids_all reads back exactly the content that was written.
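    The dummy-batch filter implied by the logs above can be expressed as a one-liner. The helper name `active_batch_ids` is mine; the actual code checks seq_lens_decoder inline.

    ```python
    def active_batch_ids(seq_lens_decoder):
        """Return indices of real (non-dummy) batch slots.

        Dummy batches have seq_lens_decoder == 0 and a token_ids_all
        buffer filled with the padding token (5 in the logs above);
        real requests have seq_lens_decoder > 0.
        """
        return [bid for bid, n in enumerate(seq_lens_decoder) if n > 0]
    ```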

    3. Verify that whenever the ngram_match call in ngram.py finds a match, the proposed draft_token[i, 1:proposed_length] occurs in token_ids_all[:prompt_len+step_idx[i]] (i.e. found_in_context == True):
    ngram_match(...)
    
    # added at the end of ngram.py _run_impl
    if not hasattr(self, '_debug_call_count'):
        self._debug_call_count = 0
    if self._debug_call_count < 50:
        tia = share_inputs["token_ids_all"]
        pl  = share_inputs["prompt_lens"] 
        si  = share_inputs["step_idx"]
        dt  = share_inputs["draft_tokens"]
        slt = share_inputs["seq_lens_this_time"]
        print(f"[NGRAM-DEBUG] call={self._debug_call_count} "
            f"slt={slt.tolist()} "
            f"step_idx={si.tolist()} "
            f"prompt_lens={pl.tolist()} "
            f"draft_token_num={share_inputs['actual_draft_token_num'].tolist()} "
            f"seq_dec={share_inputs['seq_lens_decoder'].tolist()}")
        for bid in range(slt.shape[0]):
            n_proposed = int(slt[bid].item()) - 1
            if n_proposed <= 0:
                continue
            step = int(si[bid].item())
            plen = int(pl[bid].item())
            context = tia[bid, :plen + step].tolist()
            proposed = dt[bid, 1:1 + n_proposed].tolist()
    
            # search for the proposed sequence within the context
            found = any(
                context[i:i + n_proposed] == proposed
                for i in range(len(context) - n_proposed + 1)
            )
            logger.info(f"[NGRAM][E2E] call={self._debug_call_count} bid={bid} step_idx={step}"
                        f"proposed={proposed} found_in_context={found}")
        self._debug_call_count += 1
    # inspect [NGRAM-DEBUG] entries
    (base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM-DEBUG\]' log/paddle/workerlog.0
    [NGRAM-DEBUG] call=0 slt=[1] step_idx=[[1], [13], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [4096], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[168, 0, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=1 slt=[1, 1] step_idx=[[2], [1], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[169, 54, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=2 slt=[1, 1] step_idx=[[3], [2], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[170, 55, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=3 slt=[6, 1] step_idx=[[4], [3], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[171, 56, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=4 slt=[1, 6] step_idx=[[5], [4], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[172, 57, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=5 slt=[6, 1] step_idx=[[6], [5], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[173, 58, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=6 slt=[6, 6] step_idx=[[7], [6], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[174, 59, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=7 slt=[6, 1] step_idx=[[8], [7], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[175, 60, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=8 slt=[6, 6] step_idx=[[10], [8], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[177, 61, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=9 slt=[1, 6] step_idx=[[11], [9], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[178, 62, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=10 slt=[1, 6] step_idx=[[12], [10], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[179, 63, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=11 slt=[6, 6] step_idx=[[13], [11], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[180, 64, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=12 slt=[6, 6] step_idx=[[15], [13], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[182, 66, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=13 slt=[1, 6] step_idx=[[16], [14], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[183, 67, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=14 slt=[1, 1] step_idx=[[17], [16], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[184, 69, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=15 slt=[6, 6] step_idx=[[18], [17], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[185, 70, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=16 slt=[1, 1] step_idx=[[19], [18], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[186, 71, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=17 slt=[6, 1] step_idx=[[20], [19], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[187, 72, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=18 slt=[6, 6] step_idx=[[21], [20], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[188, 73, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=19 slt=[6, 1] step_idx=[[22], [21], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[189, 74, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=20 slt=[1, 1] step_idx=[[23], [22], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[190, 75, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=21 slt=[1, 6] step_idx=[[24], [23], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[191, 76, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=22 slt=[6, 6] step_idx=[[25], [24], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[192, 77, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=23 slt=[6, 6] step_idx=[[31], [25], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[198, 78, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=24 slt=[1, 1] step_idx=[[35], [26], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[202, 79, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=25 slt=[6, 6] step_idx=[[36], [27], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[203, 80, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=26 slt=[1, 1] step_idx=[[37], [28], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[204, 81, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=27 slt=[1, 6] step_idx=[[38], [29], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[205, 82, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=28 slt=[6, 1] step_idx=[[39], [30], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[206, 83, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=29 slt=[4, 6] step_idx=[[40], [31], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[207, 84, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=30 slt=[1, 6] step_idx=[[41], [32], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[208, 85, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=31 slt=[1, 1] step_idx=[[42], [33], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[209, 86, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=32 slt=[1, 1] step_idx=[[43], [34], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[210, 87, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=33 slt=[6, 6] step_idx=[[44], [35], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[211, 88, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=34 slt=[6, 6] step_idx=[[45], [36], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[212, 89, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=35 slt=[6, 1] step_idx=[[46], [38], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[213, 91, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=36 slt=[1, 1] step_idx=[[47], [39], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[214, 92, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=37 slt=[6, 6] step_idx=[[48], [40], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[215, 93, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=38 slt=[6, 1] step_idx=[[49], [41], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[216, 94, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=39 slt=[1, 1] step_idx=[[50], [42], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[217, 95, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=40 slt=[1, 6] step_idx=[[51], [43], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[218, 96, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=41 slt=[6, 4] step_idx=[[52], [44], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[219, 97, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=42 slt=[6, 1] step_idx=[[53], [46], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[220, 99, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=43 slt=[6, 6] step_idx=[[54], [47], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[221, 100, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=44 slt=[6, 1] step_idx=[[56], [48], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[223, 101, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=45 slt=[6, 6] step_idx=[[57], [49], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[224, 102, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=46 slt=[6, 1] step_idx=[[58], [50], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[225, 103, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=47 slt=[1, 1] step_idx=[[59], [51], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[226, 104, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=48 slt=[6, 6] step_idx=[[60], [52], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[227, 105, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=49 slt=[6, 3] step_idx=[[61], [53], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[228, 106, 0, 0, 0, 0, 0, 0]
    # inspect [NGRAM] entries
    (base) aistudio@ssh-5453289-10284016-bf48d89cf-ph9f8:~/FastDeploy$ grep '\[NGRAM\]' log/paddle/workerlog.0
    INFO     2026-05-10 15:48:31,030 726574 ngram.py[line:84] [NGRAM][E2E] call=3 bid=0 step_idx=4proposed=[93949, 695, 7858, 804, 93937] found_in_context=True
    INFO     2026-05-10 15:48:31,033 726574 ngram.py[line:84] [NGRAM][E2E] call=4 bid=1 step_idx=4proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,037 726574 ngram.py[line:84] [NGRAM][E2E] call=5 bid=0 step_idx=6proposed=[93956, 10553, 4923, 1919, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,041 726574 ngram.py[line:84] [NGRAM][E2E] call=6 bid=0 step_idx=7proposed=[4162, 1919, 93977, 23, 92267] found_in_context=True
    INFO     2026-05-10 15:48:31,043 726574 ngram.py[line:84] [NGRAM][E2E] call=6 bid=1 step_idx=6proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,046 726574 ngram.py[line:84] [NGRAM][E2E] call=7 bid=0 step_idx=8proposed=[10553, 4923, 1919, 94035, 23] found_in_context=True
    INFO     2026-05-10 15:48:31,050 726574 ngram.py[line:84] [NGRAM][E2E] call=8 bid=0 step_idx=10proposed=[1919, 93977, 23, 92267, 93963] found_in_context=True
    INFO     2026-05-10 15:48:31,050 726574 ngram.py[line:84] [NGRAM][E2E] call=8 bid=1 step_idx=8proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,053 726574 ngram.py[line:84] [NGRAM][E2E] call=9 bid=1 step_idx=9proposed=[14045, 94466, 93956, 17340, 33015] found_in_context=True
    INFO     2026-05-10 15:48:31,057 726574 ngram.py[line:84] [NGRAM][E2E] call=10 bid=1 step_idx=10proposed=[93937, 42854, 94035, 3991, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,060 726574 ngram.py[line:84] [NGRAM][E2E] call=11 bid=0 step_idx=13proposed=[23, 4, 93937, 1377, 1472] found_in_context=True
    INFO     2026-05-10 15:48:31,060 726574 ngram.py[line:84] [NGRAM][E2E] call=11 bid=1 step_idx=11proposed=[3, 94016, 1358, 3671, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,064 726574 ngram.py[line:84] [NGRAM][E2E] call=12 bid=0 step_idx=15proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,064 726574 ngram.py[line:84] [NGRAM][E2E] call=12 bid=1 step_idx=13proposed=[94016, 1358, 3671, 93956, 94405] found_in_context=True
    INFO     2026-05-10 15:48:31,067 726574 ngram.py[line:84] [NGRAM][E2E] call=13 bid=1 step_idx=14proposed=[93956, 17340, 33015, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,074 726574 ngram.py[line:84] [NGRAM][E2E] call=15 bid=0 step_idx=18proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,074 726574 ngram.py[line:84] [NGRAM][E2E] call=15 bid=1 step_idx=17proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,081 726574 ngram.py[line:84] [NGRAM][E2E] call=17 bid=0 step_idx=20proposed=[69716, 93956, 10553, 4923, 1919] found_in_context=True
    INFO     2026-05-10 15:48:31,085 726574 ngram.py[line:84] [NGRAM][E2E] call=18 bid=0 step_idx=21proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,085 726574 ngram.py[line:84] [NGRAM][E2E] call=18 bid=1 step_idx=20proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,090 726574 ngram.py[line:84] [NGRAM][E2E] call=19 bid=0 step_idx=22proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,098 726574 ngram.py[line:84] [NGRAM][E2E] call=21 bid=1 step_idx=23proposed=[3671, 94312, 94035, 14045, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,102 726574 ngram.py[line:84] [NGRAM][E2E] call=22 bid=0 step_idx=25proposed=[1472, 6946, 804, 93938, 853] found_in_context=True
    INFO     2026-05-10 15:48:31,102 726574 ngram.py[line:84] [NGRAM][E2E] call=22 bid=1 step_idx=24proposed=[3, 94016, 1358, 3671, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,106 726574 ngram.py[line:84] [NGRAM][E2E] call=23 bid=0 step_idx=31proposed=[4816, 93938, 10714, 93948, 23] found_in_context=True
    INFO     2026-05-10 15:48:31,106 726574 ngram.py[line:84] [NGRAM][E2E] call=23 bid=1 step_idx=25proposed=[42854, 94035, 3991, 93956, 20932] found_in_context=True
    INFO     2026-05-10 15:48:31,112 726574 ngram.py[line:84] [NGRAM][E2E] call=25 bid=0 step_idx=36proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,113 726574 ngram.py[line:84] [NGRAM][E2E] call=25 bid=1 step_idx=27proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,120 726574 ngram.py[line:84] [NGRAM][E2E] call=27 bid=1 step_idx=29proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,123 726574 ngram.py[line:84] [NGRAM][E2E] call=28 bid=0 step_idx=39proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,127 726574 ngram.py[line:84] [NGRAM][E2E] call=29 bid=0 step_idx=40proposed=[3099, 23, 283] found_in_context=True
    INFO     2026-05-10 15:48:31,127 726574 ngram.py[line:84] [NGRAM][E2E] call=29 bid=1 step_idx=31proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,132 726574 ngram.py[line:84] [NGRAM][E2E] call=30 bid=1 step_idx=32proposed=[4, 5, 3, 3, 94466] found_in_context=True
    INFO     2026-05-10 15:48:31,142 726574 ngram.py[line:84] [NGRAM][E2E] call=33 bid=0 step_idx=44proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,142 726574 ngram.py[line:84] [NGRAM][E2E] call=33 bid=1 step_idx=35proposed=[94016, 1358, 3671, 93956, 94405] found_in_context=True
    INFO     2026-05-10 15:48:31,146 726574 ngram.py[line:84] [NGRAM][E2E] call=34 bid=0 step_idx=45proposed=[3099, 23, 283, 44055, 934] found_in_context=True
    INFO     2026-05-10 15:48:31,147 726574 ngram.py[line:84] [NGRAM][E2E] call=34 bid=1 step_idx=36proposed=[93956, 73776, 93956, 94112, 96674] found_in_context=True
    INFO     2026-05-10 15:48:31,154 726574 ngram.py[line:84] [NGRAM][E2E] call=35 bid=0 step_idx=46proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,160 726574 ngram.py[line:84] [NGRAM][E2E] call=37 bid=0 step_idx=48proposed=[93938, 4816, 93938, 10714, 93948] found_in_context=True
    INFO     2026-05-10 15:48:31,161 726574 ngram.py[line:84] [NGRAM][E2E] call=37 bid=1 step_idx=40proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,166 726574 ngram.py[line:84] [NGRAM][E2E] call=38 bid=0 step_idx=49proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,173 726574 ngram.py[line:84] [NGRAM][E2E] call=40 bid=1 step_idx=43proposed=[94405, 94525, 14246, 94022, 94035] found_in_context=True
    INFO     2026-05-10 15:48:31,176 726574 ngram.py[line:84] [NGRAM][E2E] call=41 bid=0 step_idx=52proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,176 726574 ngram.py[line:84] [NGRAM][E2E] call=41 bid=1 step_idx=44proposed=[94822, 93956, 97249] found_in_context=True
    INFO     2026-05-10 15:48:31,179 726574 ngram.py[line:84] [NGRAM][E2E] call=42 bid=0 step_idx=53proposed=[3099, 23, 283, 44055, 934] found_in_context=True
    INFO     2026-05-10 15:48:31,183 726574 ngram.py[line:84] [NGRAM][E2E] call=43 bid=0 step_idx=54proposed=[920, 853, 93963, 37993, 28685] found_in_context=True
    INFO     2026-05-10 15:48:31,183 726574 ngram.py[line:84] [NGRAM][E2E] call=43 bid=1 step_idx=47proposed=[3671, 94312, 94035, 14045, 93956] found_in_context=True
    INFO     2026-05-10 15:48:31,186 726574 ngram.py[line:84] [NGRAM][E2E] call=44 bid=0 step_idx=56proposed=[93938, 10714, 93948, 23, 5] found_in_context=True
    INFO     2026-05-10 15:48:31,189 726574 ngram.py[line:84] [NGRAM][E2E] call=45 bid=0 step_idx=57proposed=[16816, 12199, 93919, 94667, 748] found_in_context=True
    INFO     2026-05-10 15:48:31,190 726574 ngram.py[line:84] [NGRAM][E2E] call=45 bid=1 step_idx=49proposed=[42854, 94035, 3991, 93956, 20932] found_in_context=True
    INFO     2026-05-10 15:48:31,193 726574 ngram.py[line:84] [NGRAM][E2E] call=46 bid=0 step_idx=58proposed=[28685, 23, 283, 93963, 920] found_in_context=True
    INFO     2026-05-10 15:48:31,200 726574 ngram.py[line:84] [NGRAM][E2E] call=48 bid=0 step_idx=60proposed=[2969, 93963, 16816, 12199, 93919] found_in_context=True
    INFO     2026-05-10 15:48:31,200 726574 ngram.py[line:84] [NGRAM][E2E] call=48 bid=1 step_idx=52proposed=[23, 3671, 94312, 94035, 14045] found_in_context=True
    INFO     2026-05-10 15:48:31,204 726574 ngram.py[line:84] [NGRAM][E2E] call=49 bid=0 step_idx=61proposed=[3099, 23, 283, 44055, 934] found_in_context=True
    INFO     2026-05-10 15:48:31,204 726574 ngram.py[line:84] [NGRAM][E2E] call=49 bid=1 step_idx=53proposed=[94035, 10985] found_in_context=True

    After the ngram address-computation bug in the kernel was fixed, the log output shows that matches are found, and that after verify multiple tokens can be accepted in a single decode step, e.g.:
    [NGRAM-DEBUG] call=22 slt=[6, 6] step_idx=[[25], [24], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[192, 77, 0, 0, 0, 0, 0, 0]
    [NGRAM-DEBUG] call=23 slt=[6, 6] step_idx=[[31], [25], [13], [13], [13], [13], [13], [13]] prompt_lens=[[168], [54], [2048], [2048], [1024], [1024], [1024], [1024]] draft_token_num=[5, 5, 5, 5, 5, 5, 5, 5] seq_dec=[198, 78, 0, 0, 0, 0, 0, 0]
    Between two consecutive proposer.run() calls, step_idx[0] increases from 25 to 31 and seq_len_decoder[0] from 192 to 198.

  4. CUDAGraph adaptation

    1. proposer.run() executes in gpu_runner._postprocess(), which is not captured by CUDAGraph.
    2. Verifying draft tokens feeds multiple tokens in a single pass, which changes the expected_decode_len and batch_size recorded for decode capture; the shapes that will change therefore have to be captured ahead of time during the gpu worker's warmup phase, requiring changes to gpu_runner.capture_model() and the corresponding parts of FDConfig.
    3. The FDConfig in the test script already enables CUDAGraph by default.
  5. Overlap Schedule adaptation

    1. input_ids_cpu is initialized in input_batch.py without pin_memory set, so it does not participate in overlap.
    2. With enable_overlap_schedule=True enabled in the test script, the log still shows correct matches and cases where the previous decode step accepted multiple tokens.
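As an illustration of the ngram proposal step validated above, here is a minimal stdlib-only Python sketch (not the FastDeploy kernel; the function name, defaults, and list-based buffers are assumptions for illustration): match the last `ngram_size` generated tokens against the full context and propose the tokens that followed the most recent earlier occurrence.

```python
# Minimal sketch of ngram draft-token proposal (illustrative only, not
# the FastDeploy kernel): token_ids_all holds prompt tokens in
# [:prompt_len] and generated tokens in [prompt_len:prompt_len+step_idx],
# where step_idx counts generated tokens.
def propose_ngram_drafts(token_ids_all, prompt_len, step_idx,
                         ngram_size=2, num_speculative_tokens=5):
    total_len = prompt_len + step_idx
    context = token_ids_all[:total_len]
    if step_idx < ngram_size:
        return []  # not enough generated tokens to form the query ngram
    query = context[total_len - ngram_size:total_len]
    # Search backwards for the most recent earlier occurrence of the query.
    for start in range(total_len - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == query:
            follow = start + ngram_size
            # Propose up to num_speculative_tokens tokens that followed it.
            return context[follow:follow + num_speculative_tokens]
    return []  # no match: fall back to plain decoding
```

For example, with `token_ids_all=[1, 2, 3, 4, 2, 3]`, `prompt_len=4`, and `step_idx=2`, the query `[2, 3]` matches at index 1 and `[4, 2, 3]` is proposed.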

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 11, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 11, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 11, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-11 22:47:36

CI report generated from the following code (refreshed every 30 minutes):


1 Task Overview

There is 1 failed Required task (Approval pending), which must be resolved before merging.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Waiting Skipped
19(0) 19 12 2 3 2 0

2 Task Status Summary

2.1 Required tasks: 1/2 passed

Required tasks block merging; failures must be handled first.

Status Task Duration Root cause Suggested fix Log Rerun
Approval 9s PR issue: the PR modifies the spec_decode directory and lacks FastDeploy RD approval Ask freeliuzc or Deleter-D to approve this PR Job -
The remaining 1 required task passed - - - - -

2.2 Optional tasks — 11/17 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Check PR Template 13s Job -
xpu_build_test / xpu-build-test - Job -
FD-Build-Linux / fd-build - Job -
Trigger Jenkins for PR - Job -
⏸️ Run iluvatar Tests / run_iluvatar_cases - - -
⏸️ CI_HPU - - -
The remaining 11 optional tasks passed - - -

3 Failure Details (required only)

Approval — code approval (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code approval
  • Confidence: high
  • Root-cause summary: the PR modifies the spec_decode directory without approval from a FastDeploy RD member
  • Analyzer: generic analyzer (fallback)

Root-cause details:
The PR modifies the fastdeploy/spec_decode and custom_ops/gpu_ops/speculate_decoding directories. Per the FastDeploy code-approval rules, at least one FastDeploy RD member (freeliuzc(liuzichang01) or Deleter-D(wangyanpeng04)) must approve before the check can pass. 1 approval error was detected; exit code 6.

Key log:

0. You must have one FastDeploy RD (freeliuzc(liuzichang01), Deleter-D(wangyanpeng04)) approval for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

  1. Have freeliuzc(liuzichang01) or Deleter-D(wangyanpeng04) review and approve this PR

Fix summary: ask freeliuzc or Deleter-D to approve this PR

Link: view log

@codecov-commenter

codecov-commenter commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@d70f33d). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7774   +/-   ##
==========================================
  Coverage           ?   71.53%           
==========================================
  Files              ?      396           
  Lines              ?    55822           
  Branches           ?     8724           
==========================================
  Hits               ?    39935           
  Misses             ?    13136           
  Partials           ?     2751           
Flag Coverage Δ
GPU 71.53% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-11 22:49:23

📋 Review Summary

PR overview: streamline the ngram match kernel interface (merging input_ids/input_ids_len into token_ids_all), fix an ngram pointer-offset bug, and complete end-to-end verification.
Scope of change: custom_ops/gpu_ops/speculate_decoding/, fastdeploy/spec_decode/ngram.py, fastdeploy/config.py, fastdeploy/worker/gpu_model_runner.py, test files
Impact tags: [Speculative Decoding] [OP] [FDConfig]

📝 PR Template Check

The title carries the official tag [Speculative Decoding] ✓; however, the PR body is missing the required ## Usage or Command and ## Accuracy Tests sections, so its structure does not follow the description template.

Suggested PR description (copy-paste ready):

## Motivation
Streamline the ngram match kernel interface: merge the previously separate `input_ids`/`input_ids_len` parameters into `token_ids_all` (with `prompt_lens` marking the boundary between prompt and generated tokens), and fix an ngram pointer-offset bug (the semantics of `step_idx` are unified from a 0-based last-position index to a token count). End-to-end verification on a single A800 confirms the correctness of the ngram speculative-decoding method.

## Modifications
1. **`custom_ops/gpu_ops/speculate_decoding/ngram_match.cu` / `cpp_extensions.cc`**: remove the `input_ids`, `input_ids_len`, and `input_ids_stride` parameters; both the GPU kernel and the CPU fallback now read the prompt (search domain) directly from `token_ids_all[:, :prompt_len]` and pre_ids (the ngram source) from `token_ids_all[:, prompt_len:]`; fix the ngram pointer-offset bug by changing `cur_step_idx + 1 - ngram_size` to `cur_step_idx - ngram_size`.
2. **`fastdeploy/spec_decode/ngram.py`**: remove the `input_ids_len` tensors and the `update()` method; align the `_run_impl` call signature with the new kernel interface.
3. **`fastdeploy/config.py`**: include `SpecMethod.NGRAM` in the expected_decode_len computation for CUDAGraph capture.
4. **`fastdeploy/worker/gpu_model_runner.py`**: add a warmup path for the NGRAM method in `capture_model()`, consistent with MTP/SUFFIX.
5. **Tests**: update `tests/operators/test_ngram_match.py`, `tests/spec_decode/test_benchmark_ngram_kernel.py`, and `tests/spec_decode/test_ngram_gpu_kernel.py`; add `tests/spec_decode/test_ngram_proposer.py`.

## Usage or Command
N/A

## Accuracy Tests
End-to-end verification (single A800): `token_ids_all` read/write logs confirm that prompt writes and reads are consistent; all `draft_tokens` are verified to hit within `token_ids_all[:prompt_len + step_idx]`; the cross-step increment of `step_idx` (e.g. 25→31) confirms that a single decode step accepted multiple speculative tokens.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level File Summary
❓ Question fastdeploy/spec_decode/ngram.py:38 update() has been removed; confirm no leftover calls in gpu_model_runner.py

Overall Assessment

The interface simplification is clean, the bug fix (ngram pointer offset) is thoroughly validated by extensive e2e logs, and test coverage is complete. The PR description still needs the ## Usage or Command and ## Accuracy Tests sections filled in.
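To make the merged-buffer semantics discussed above concrete, here is a stdlib-only sketch (the function and variable names are mine, not the kernel's) of how `prompt_lens` splits `token_ids_all` into the search domain and the ngram source, with `step_idx` as a count of generated tokens:

```python
# Illustrative only: mirrors the described token_ids_all layout, where
# prompt tokens occupy [:prompt_len] and generated tokens occupy
# [prompt_len:prompt_len + step_idx].
def split_views(token_ids_all, prompt_len, step_idx, ngram_size):
    prompt = token_ids_all[:prompt_len]                        # search domain
    pre_ids = token_ids_all[prompt_len:prompt_len + step_idx]  # ngram source
    # With step_idx as a token count, the query window is simply the last
    # ngram_size generated tokens: indices [step_idx - ngram_size, step_idx).
    query = pre_ids[step_idx - ngram_size:step_idx]
    return prompt, pre_ids, query
```

Note the corrected offset: with `step_idx` as a count the query window starts at `step_idx - ngram_size`; under the previous 0-based last-position semantics the equivalent expression was `step_idx + 1 - ngram_size`.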

self.input_ids_len[bid] = seq_len
self.input_ids_len_gpu[bid] = seq_len

def _run_impl(self, share_inputs):

❓ Question: the update() method is removed in this PR. Please confirm that fastdeploy/worker/gpu_model_runner.py (e.g. in _postprocess) has no leftover proposer.update(bid, seq_len) calls; otherwise NGRAM mode will raise an AttributeError.


Labels

contributor External developers


3 participants