Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17271
Note: Links to docs will display an error until the docs builds have been completed.
❌ 40 New Failures, 1 Pending, 3 Unrelated Failures as of commit 42e5e7a with merge base 1eb3f9d.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@larryliu0820 I think it's not totally adhd after 1000+ tokens
Pull request overview
This PR adds a ring-buffer–based Attention Sink implementation for Llama exports and attempts to wire up runner-side support so generation can continue beyond the KV-cache capacity while maintaining correct masking/positioning.
Changes:
- Implement ring-buffer Attention Sink KV-cache + masking in Llama source transformation and update attention execution to use dynamic masks for ring-buffer caches (a conceptual sketch of the ring-buffer indexing follows this list).
- Add a new runner IOManager (AttentionSinkIOManager) and plumb it into runner build targets, plus update prompt-length handling in the text runner.
- Add export updates and new configs/tests for Attention Sink (including mutable buffer initialization for cache_positions).
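
To make the ring-buffer scheme concrete, here is a minimal, self-contained sketch of the indexing and masking idea described above. It assumes a `cache_positions` buffer that records which logical position occupies each physical slot (with -1 marking empty slots); the function names and shapes are illustrative, not the PR's actual API.

    # Illustrative sketch only, not the PR's API. Assumes a 1-D `cache_positions`
    # tensor of length `cache_size`, initialized to -1 (empty slots).
    import torch

    def write_slots(start_pos: int, seq_len: int, sink_size: int, cache_size: int) -> torch.Tensor:
        """Map logical positions to physical slots: the first `sink_size` positions
        stay pinned; later positions wrap around the remaining ring region."""
        pos = start_pos + torch.arange(seq_len)
        ring = sink_size + (pos - sink_size) % (cache_size - sink_size)
        return torch.where(pos < sink_size, pos, ring)

    def dynamic_mask(cache_positions: torch.Tensor, query_pos: int) -> torch.Tensor:
        """A slot is attendable if it holds a real token (>= 0) at or before the
        query's logical position (causal); everything else is masked out."""
        valid = (cache_positions >= 0) & (cache_positions <= query_pos)
        zeros = torch.zeros_like(cache_positions, dtype=torch.float)
        return torch.where(valid, zeros, torch.full_like(zeros, float("-inf")))

For example, with sink_size=4 and cache_size=128, logical position 130 maps to slot 4 + (130 - 4) % 124 = 6, while slots 0-3 keep the original sink tokens.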
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| test_attention_sink.md | Adds local build/export/run notes for attention sink testing. |
| extension/llm/runner/text_llm_runner.cpp | Changes prompt validation to use max_seq_len and adjusts max-new-token resolution behavior. |
| extension/llm/runner/targets.bzl | Links runner to the new attention-sink IOManager target. |
| extension/llm/runner/llm_runner_helper.cpp | Selects AttentionSinkIOManager based on module metadata keys. |
| extension/llm/runner/io_manager/targets.bzl | Adds Bazel target for attention_sink_io_manager. |
| extension/llm/runner/io_manager/attention_sink_io_manager.h | Introduces IOManager subclass for attention-sink / “infinite context” bookkeeping. |
| extension/llm/runner/io_manager/attention_sink_io_manager.cpp | Implements pass-through prefill/decode input preparation while tracking logical position. |
| extension/llm/runner/constants.h | Adds metadata keys for attention sink configuration. |
| examples/models/llama/source_transformation/test_attention_sink_ring_buffer.py | Adds comprehensive ring-buffer attention sink unit tests. |
| examples/models/llama/source_transformation/test_attention_sink.py | Removes the previous attention sink test suite. |
| examples/models/llama/source_transformation/sdpa.py | Forces custom_sdpa to use start_pos=0 when an attention mask is provided. |
| examples/models/llama/source_transformation/custom_kv_cache.py | Prevents custom KV-cache replacement from clobbering KVCacheWithAttentionSink. |
| examples/models/llama/source_transformation/attention_sink.py | Reworks attention sink to a torch.export-compatible ring-buffer approach (masking, cache position tracking, cache update). |
| examples/models/llama/model.py | Loosens constraints and increases RoPE table length for attention sink generation. |
| examples/models/llama/export_llama_lib.py | Enables attention masks for custom SDPA when attention sink is enabled; initializes cache_positions mutable buffer. |
| examples/models/llama/eval_llama_lib.py | Updates attention-sink eval assumptions for ring-buffer/cache-size vs RoPE-length. |
| examples/models/llama/config/*.yaml | Adds new attention sink configs (runner-side + xnnpack/noxnn variants). |
| examples/models/llama/attention.py | Uses dynamic ring-buffer masking when KV cache exposes is_ring_buffer. |
| examples/models/llama/BUCK | Registers the new attention sink ring-buffer test. |

    )
    or bool(llm_config.model.use_attention_sink),
    use_sdpa_with_kv_cache=llm_config.model.use_sdpa_with_kv_cache,
    quantize_kv_cache=llm_config.model.quantize_kv_cache,
This block enables attention-mask mode for custom SDPA when attention sink is on, but attention-sink configs also enable KV-cache quantization in some cases. In the quantized path, SDPACustom is replaced by QuantizedSDPA, which (per sdpa.py) still doesn’t support attention masks and continues to use start_pos for causal masking. That means attention sink + quantized KV cache may still hit the same start_pos >= cache_size validation failure. Consider propagating the use_attention_mask flag into QuantizedSDPA and making its mask path ignore start_pos similarly (or preventing this combination).
Suggested change:

    - quantize_kv_cache=llm_config.model.quantize_kv_cache,
    + # Quantized KV cache currently does not support attention-mask-based
    + # custom SDPA (QuantizedSDPA still relies on start_pos for masking).
    + # To avoid incompatible behavior, disable KV-cache quantization when
    + # attention sink is enabled.
    + quantize_kv_cache=(
    +     False
    +     if getattr(llm_config.model, "use_attention_sink", False)
    +     else llm_config.model.quantize_kv_cache
    + ),

    # Export model
    Take a look at examples/models/llama/README.md

    Check point is in ~/executorch/
Spelling: “Check point” should be “Checkpoint”.
Suggested change:

    - Check point is in ~/executorch/
    + Checkpoint is in ~/executorch/

      // Get max_seq_len for single prefill chunk limit
      int64_t max_seq_len = metadata_.at(kMaxSeqLen);
      int64_t max_context_len = metadata_.at(kMaxContextLen);

      ET_CHECK_OR_RETURN_ERROR(
          num_prompt_tokens >= 1,
          InvalidArgument,
          "Expected at least 1 prompt token");

      ET_CHECK_OR_RETURN_ERROR(
    -     num_prompt_tokens < max_context_len,
    +     num_prompt_tokens <= max_seq_len,
          InvalidArgument,
    -     "num_prompt_tokens %d >= max_context_len %" PRId64
    -     ", Max seq length exceeded - please increase max seq len value in your export script",
    +     "num_prompt_tokens %d > max_seq_len %" PRId64
    +     ", Single prefill chunk too large",
          num_prompt_tokens,
    -     max_context_len);
    +     max_seq_len);
The new prompt-length validation only checks num_prompt_tokens <= max_seq_len and no longer accounts for pos_ / remaining context. If generate() is called multiple times without reset(), pos_ can be non-zero and text_prefiller_->prefill(prompt_tokens, pos_) may exceed the model’s usable context for non-ring-buffer models. Consider keeping the remaining-context check for non-ring-buffer models (e.g., num_prompt_tokens < (max_context_len - pos_)) while retaining the per-prefill-chunk max_seq_len check for ring/attention-sink exports.

    int max_new_tokens =
        config.resolve_max_new_tokens(max_context_len, num_prompt_tokens);
resolve_max_new_tokens(max_context_len, num_prompt_tokens) currently ignores the existing pos_, so for non-ring-buffer models it can allow generating past the remaining context budget when pos_ > 0. Previously this was handled by reducing max_context_len by pos_. Consider passing max_context_len - pos_ for non-ring-buffer models (or extending resolve_max_new_tokens to take start_pos).
Suggested change:

    - int max_new_tokens =
    -     config.resolve_max_new_tokens(max_context_len, num_prompt_tokens);
    + // For non-ring-buffer models starting at a non-zero position, only the
    + // remaining context window (max_context_len - pos_) is available.
    + int64_t remaining_context_len = max_context_len - pos_;
    + if (remaining_context_len < 0) {
    +   remaining_context_len = 0;
    + }
    + int max_new_tokens =
    +     config.resolve_max_new_tokens(remaining_context_len, num_prompt_tokens);

      // Get method names to check for attention sink metadata
      auto method_names_result = module->method_names();
      if (method_names_result.error() != Error::Ok) {
        ET_LOG(Error, "Failed reading method names for IOManager selection");
        return nullptr;
      }
      const auto& method_names = method_names_result.get();

      // Check if attention sink is enabled via metadata
      bool use_attention_sink = false;
      int64_t sink_size = 4; // Default values
      int64_t window_size = 124;

      if (method_names.count(kUseAttentionSink)) {
        auto get_result = module->get(kUseAttentionSink);
        use_attention_sink = get_result.get().toScalar().to<bool>();
      }
AttentionSinkIOManager selection relies on module methods named use_attention_sink, attention_sink_size, and attention_sink_window_size, but the Llama export metadata loader currently only emits get_max_seq_len, get_max_context_len, etc. (no attention-sink keys). As a result, this branch will never trigger and AttentionSinkIOManager won’t be used. Suggest exporting these metadata methods when llm_config.model.use_attention_sink is set (or using an existing reliable signal) so runner-side selection works.
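
One possible shape of that export-side fix, sketched as a standalone helper rather than the PR's actual exporter code; the function name `add_attention_sink_metadata` is hypothetical, and the key strings mirror the runner constants discussed in this comment. It assumes `metadata` is the dict that export_to_edge turns into constant methods and that the config value is the "sink_size,window_size,eviction_batch_size" string used by these configs.

    # Hypothetical sketch, not the PR's exporter code.
    def add_attention_sink_metadata(metadata: dict, use_attention_sink: str) -> None:
        """Emit the metadata keys the runner-side IOManager selection reads.

        `use_attention_sink` is the config string "sink_size,window_size,eviction_batch_size"
        (e.g. "4,124,1"), or empty when the feature is off.
        """
        if not use_attention_sink:
            return
        sink_size, window_size, _ = (int(x) for x in use_attention_sink.split(","))
        metadata["use_attention_sink"] = True
        metadata["attention_sink_size"] = sink_size
        metadata["attention_sink_window_size"] = window_size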
| """ | ||
| Transform the model to be able to run inference with Attention Sink. | ||
| There mainly three steps: | ||
| - Replace Rope with RopeWithAttentionSink | ||
| - Replace Attention's KVCache with KVCacheWithAttentionSink, forward with attention_sink_forward | ||
| - Replace Attention's KVCache with KVCacheWithAttentionSink | ||
| - Replace Attention's forward with attention_sink_forward |
The docstring for enable_attention_sink says it replaces Attention’s forward with attention_sink_forward, but _replace_attention explicitly does not replace forward anymore. Update the docstring (and/or remove attention_sink_forward if it’s intentionally unused) to match actual behavior.
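
One illustrative way the docstring could be reworded to match the described behavior (wording is a suggestion, not the PR's text):

    # Illustrative docstring wording only.
    """
    Transform the model to run inference with Attention Sink. Two main steps:
    - Replace Rope with RopeWithAttentionSink
    - Replace Attention's KVCache with KVCacheWithAttentionSink (ring buffer);
      Attention.forward is left unchanged.
    """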

    k_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
    v_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
    input_pos = torch.tensor([20 + i], dtype=torch.long)
    k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
Variable k_out is not used.
Suggested change:

    - k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
    + self.kv_cache.update(input_pos, k_decode, v_decode)

    k_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
    v_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
    input_pos = torch.tensor([20 + i], dtype=torch.long)
    k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
Variable v_out is not used.
Suggested change:

    - k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
    + _, _ = self.kv_cache.update(input_pos, k_decode, v_decode)

    def test_cache_positions_consistency(self):
        """Test that cache positions remain consistent during generation."""
        cache_size = 32
Variable cache_size is not used.

    CachePositionsManager,
    KVCache,
    RingKVCache,
Import of 'CachePositionsManager' is not used.
Import of 'RingKVCache' is not used.
Suggested change:

    - CachePositionsManager,
    - KVCache,
    - RingKVCache,
    + KVCache,

        logical_pos_,
        is_cache_full() ? "true" : "false");

    // Pass through to model as-is. The model's KVCacheWithAttentionSink

    k,
    v,
    input_pos[0].item(),
    0,  # start_pos: not used when mask is provided
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

    # IOManager expects these methods to exist returning int.
    # By adding them to metadata, export_to_edge will generate constant methods.
    metadata["get_sink_size"] = sink_params[0]
    metadata["get_window_size"] = sink_params[1]
The metadata added for attention sink uses keys get_sink_size/get_window_size, but the runtime runner expects attention sink metadata under use_attention_sink/attention_sink_size/attention_sink_window_size (see extension/llm/runner/constants.h and llm_runner_helper.cpp). This mismatch prevents the runner from detecting attention sink models. Update the metadata keys (and add an explicit enable flag) to match what the runner reads, or adjust the runner to match these exported keys.
Suggested change:

    - # IOManager expects these methods to exist returning int.
    - # By adding them to metadata, export_to_edge will generate constant methods.
    - metadata["get_sink_size"] = sink_params[0]
    - metadata["get_window_size"] = sink_params[1]
    + # Runtime runner expects these metadata keys:
    + # - "use_attention_sink": bool flag to enable attention sink
    + # - "attention_sink_size": sink size (int)
    + # - "attention_sink_window_size": window size (int)
    + metadata["use_attention_sink"] = True
    + metadata["attention_sink_size"] = sink_params[0]
    + metadata["attention_sink_window_size"] = sink_params[1]

    # Max Context (Buffer) = 4 + 1 * 124 = 128
    use_attention_sink: "4,124,1"

    export:
      # max_seq_length for single prefill chunk
      max_context_len: 128
Export config uses max_context_len, but LlmConfig’s ExportConfig field is max_context_length (see extension/llm/export/config/llm_config.py). As written, this config likely won’t set the intended context length. Also the comment on line 25 (“Max Context (Buffer) = 4 + 1 * 124 = 128”) doesn’t match the earlier cache-size note (sink_size + 2*window_size = 252) and is confusing—please correct it to reflect the actual meaning (RoPE table vs KV cache size).
Suggested change:

    - # Max Context (Buffer) = 4 + 1 * 124 = 128
    - use_attention_sink: "4,124,1"
    - export:
    -   # max_seq_length for single prefill chunk
    -   max_context_len: 128
    + # RoPE/logical max context per step = sink_size + 1 * window_size = 4 + 1 * 124 = 128 tokens
    + use_attention_sink: "4,124,1"
    + export:
    +   # max_seq_length for single prefill chunk
    +   max_context_length: 128

    if indices is None:
        # Calculate write indices
        indices = self.cache_positions_manager.calculate_positions_and_update_indices(
            input_pos, seq_len
        )
KVCacheWithAttentionSink.update() only updates cache_positions when it computes indices internally. When indices are provided (e.g., runner-supplied cache_indices), cache_positions is never updated, so attention masks derived from cache_positions can be stale/incorrect (and may remain at the -1 sentinel). Update cache_positions_manager.cache_positions for the provided indices as well (e.g., index_copy with orig positions derived from input_pos and seq_len).
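
One way that fix could look, sketched as a standalone helper rather than the actual KVCacheWithAttentionSink.update(); the helper name and exact signatures are assumptions, but the idea mirrors the internal path: record which logical positions now occupy the caller-supplied slots so masks derived from cache_positions stay in sync.

    # Hypothetical helper, not the PR's code: refresh cache_positions when write
    # indices are supplied externally (e.g. runner-provided cache_indices).
    import torch

    def refresh_cache_positions(
        cache_positions: torch.Tensor,  # 1-D long tensor of length cache_size
        indices: torch.Tensor,          # physical slots being written, length seq_len
        input_pos: torch.Tensor,        # 1-element tensor: starting logical position
        seq_len: int,
    ) -> None:
        """Record the logical positions that now occupy `indices`, mirroring what
        calculate_positions_and_update_indices does on the internal path."""
        logical = input_pos[0].item() + torch.arange(
            seq_len, dtype=cache_positions.dtype, device=cache_positions.device
        )
        cache_positions.index_copy_(0, indices.view(-1).to(torch.long), logical)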

    max_context_len = sink_size + window_size * 2

    # We update params.max_context_len to reflect the actual buffer size
    # This ensures export captures the correct cache size in metadata
    params.max_context_len = max_context_len
enable_attention_sink() overwrites params.max_context_len with the KV cache buffer size (sink_size + 2*window_size). Rope.get_freqs() enforces input_pos < params.max_context_len under dynamic shape, so this change will cap generation to the cache size and break the intended “logical position can exceed cache size” behavior (and also contradict the earlier logic in model.py that tries to enlarge max_context_len for RoPE). Consider keeping params.max_context_len as the RoPE table length and passing the cache size via a separate field/argument (or only adjusting metadata without shrinking RoPE capacity).
Suggested change:

    - max_context_len = sink_size + window_size * 2
    - # We update params.max_context_len to reflect the actual buffer size
    - # This ensures export captures the correct cache size in metadata
    - params.max_context_len = max_context_len
    + # Default KV cache buffer size: sink tokens + sliding window on both sides
    + max_context_len = sink_size + window_size * 2

      if (method_names.count(kUseAttentionSink)) {
        auto get_result = module->get(kUseAttentionSink);
        use_attention_sink = get_result.get().toScalar().to<bool>();
      }

      if (use_attention_sink) {
        // Get attention sink configuration from metadata
        if (method_names.count(kAttentionSinkSize)) {
          auto get_result = module->get(kAttentionSinkSize);
          sink_size = get_result.get().toScalar().to<int64_t>();
        }
        if (method_names.count(kAttentionSinkWindowSize)) {
          auto get_result = module->get(kAttentionSinkWindowSize);
          window_size = get_result.get().toScalar().to<int64_t>();
        }
IOManager selection is keyed off module methods named use_attention_sink / attention_sink_size / attention_sink_window_size, but the exporter code in this PR adds attention sink metadata as get_sink_size/get_window_size (and does not add use_attention_sink). As a result, use_attention_sink will remain false here and AttentionSinkIOManager will never be constructed. Align the exporter’s metadata keys with the runner constants (or update the runner to look for the exported keys), and include a clear boolean enable flag plus the numeric parameters.

    def test_cache_positions_consistency(self):
        """Test that cache positions remain consistent during generation."""
        cache_size = 32
Variable cache_size is not used.
Suggested change:

    - cache_size = 32