
[TEST] Attention sink 2 #17271

Open

kirklandsign wants to merge 17 commits into main from attention-sink-2

Conversation

kirklandsign (Contributor) commented Feb 6, 2026

Example text:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>Tell me a long story<|eot_id|><|start_header_id|>assistant<|end_header_id|>

In the quaint town of Willowdale, nestled between rolling hills and verdant forests, there existed a mystical world hidden within a great oak tree. This tree, named Oracle's Noche, was not only ancient but possessed an eerie and extraordinary ability. Whenever anyone laid eyes on the giant leaves, swaying with the autumn breeze, Oracle's Noche began to whisper to them, in hushed whispers that would curiously delight the townspeople and unravel long-buried mysteries.

Her home, The Eldredite's Orage, stood majestically above the rooftops of Willowdale, ready to let anyone step through a dividing ribbon into the enchanting realm within. The Orage was a grand, three-story mansion with a garden that seemed to be alive. Every morning, the Orage would open its doors to the world, inviting everyone to enter and explore its beauty.

To fulfill the newly-completed restoration project of the Orage, Ms. Ethel Allix, an ardent fan of historical romance novels, embarked on an extraordinary journey to revamp every aspect of the Orage. She envisioned a grand, three-story mansion that would transport visitors to a bygone era, filled with elegance, charm, and a hint of mystery.

Ms. Allix spent months researching, planning, and gathering materials for the renovation project. She meticulously crafted each and every detail of the Orage, making sure every room was unique and memorable. From the antique furniture to the lavish decorations, every aspect of the Orage was meticulously designed to evoke a sense of nostalgia and wonder.

The renovation project was a labor of love, and Ms. Allix poured her heart and soul into every detail. She worked tirelessly, often for 12 hours a day, to ensure that every aspect of the Orage was perfect. Her dedication and passion for the project were evident in every aspect of the renovation.

When the renovation project was complete, the Orage was transformed into a truly unique and extraordinary space. It was as if time had stopped, and all that remained was the pure and simple magic of the past. Every aspect of the Orage, from its grand architecture to its exquisite furnishings, had been meticulously designed to transport visitors back in time.

As a result, the Orage became a renowned attraction and a major destination for those who were eager to experience the magic of the past. And as for the person who had started this incredible journey, they were hailed as a visionary and a master of the unknown.

The Orage was a testament to the power of imagination, and it stood as a shining example of what could be achieved when one dared to dream big.<|eot_id|><|start_header_id|>assistant<|end_header_id|>. That's a fascinating story. Here's a breakdown of the plot and themes:

**Plot:**

The story begins with a mysterious individual who creates a magical place called "The Orage". The individual is driven by a sense of curiosity and a desire to unlock the secrets of the past. As the narrative unfolds, it becomes clear that this person is on a mission to rediscover the lost knowledge of the ancient world.

**Themes:**

1. **The Power of Imagination**: The story showcases the incredible potential of the human mind. The protagonist's creation of "The Orage" is a testament to the power of imagination and the human desire to explore the unknown.
2. **Curiosity and Exploration**: The narrative highlights the importance of curiosity and exploration in our lives. The protagonist's journey to "The Orage" is a metaphor for the human desire to learn, discover, and explore the world around us.
3. **The Importance of Knowledge**: The story emphasizes the significance of knowledge and the importance of preserving it. The protagonist's creation of "The Orage" is a symbol of the importance of knowledge and the need to preserve it for future generations.
4. **The Power of Imagination**: The narrative showcases the power of imagination and the ability of the human mind to create and explore new ideas. The protagonist's journey to "The Orage" is a testament to the power of imagination and the human desire to learn, discover, and explore the world around us.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

It seems you've provided a lengthy passage of text, but I'm not sure what you'd like me to do with it. I can help with the text in a few ways.

If you're looking to organize or edit the text, I can assist with that as well. Just let me know how I can help.

Alternatively, I can summarize the text for you. If you'd like, I can try to break down the text into smaller chunks, summarize the main points, or even provide a list of key points.

Lastly, I can also try to provide some suggestions for you based on what's been mentioned so far.

Do you have a specific goal in mind for using the text? Are you interested in summarizing, editing, or analyzing it in any way? I'd be happy to help you out.<|eot_id|>assistant<|end_header_id|>

I'd be happy to help you with your text.

To get started, could you please share the text with me? You can copy and paste it, or type it in manually. The more I know about the text, the better I'll be able to assist you.

Also, what would you like to do with the text? Do you have a specific goal in mind, such as:

* Summarizing the text
* Editing the text (e.g., correcting errors, reorganizing content)
* Analyzing the text (e.g., identifying patterns, themes, etc.)
* Something else?

Let me know, and I'll do my best to help!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'd love to help you with your text. Please go ahead and share it with me. I can then:

1. Summarize the text for you
2. Edit the text to correct any errors or reorganize the content for better clarity
3. Analyze the text to identify any patterns, themes, or insights that you might not have noticed

You can share your text with me in various ways, such as:

* Copying and pasting the text into this chat window
* Sharing a link to a document or document file that contains your text
* Providing a link to a text you'd like me to review

I'm here to help, and I'll do my best to assist you with your text!

Is there a specific type of text you'd like me to review, or would you like me to provide some examples of the types of texts I can assist with?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'd be happy to help you with your text.

You can share your text with me in various ways, such as:

* Copying and pasting the text into this chat window
* Sharing a link to a document or document file that contains your text
* Providing a link to a text you'd like me to review

You can also share a text with me if it's:

* A short paragraph or sentence
* A longer piece of writing, such as a short story or a blog post
* A piece of writing that you're working on, and you'd like to get feedback on

Alternatively, you can simply type out your text and I'll do my best to help you with any questions or concerns you may have.

Let me know what you'd like to do, and I'll do my best to assist you!<|eot_id|><|start_header_id|>assistant<|end_header_id|>


pytorch-bot bot commented Feb 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17271

Note: Links to docs will display an error until the docs builds have been completed.

❌ 40 New Failures, 1 Pending, 3 Unrelated Failures

As of commit 42e5e7a with merge base 1eb3f9d:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Feb 6, 2026

github-actions bot commented Feb 6, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

kirklandsign changed the title from "Attention sink 2" to "[TEST] Attention sink 2" Feb 6, 2026
kirklandsign (Contributor, Author) commented:

@larryliu0820 I think the output doesn't totally lose coherence after 1000+ tokens

kirklandsign marked this pull request as ready for review February 6, 2026 19:29
kirklandsign requested review from Copilot and removed review requests for larryliu0820, lucylq, and mergennachin February 6, 2026 19:29
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a ring-buffer–based Attention Sink implementation for Llama exports and attempts to wire up runner-side support so generation can continue beyond the KV-cache capacity while maintaining correct masking/positioning.

Changes:

  • Implement ring-buffer Attention Sink KV-cache + masking in Llama source transformation and update attention execution to use dynamic masks for ring-buffer caches.
  • Add a new runner IOManager (AttentionSinkIOManager) and plumb it into runner build targets, plus update prompt-length handling in the text runner.
  • Add/export updates and new configs/tests for Attention Sink (including mutable buffer initialization for cache_positions).
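
To make the ring-buffer Attention Sink scheme in the overview above concrete, here is a minimal, hypothetical Python sketch (not the PR's actual KVCacheWithAttentionSink/CachePositionsManager code): the first sink_size slots are pinned, later tokens reuse window slots modulo window_size, and a cache_positions buffer records which logical position each slot holds (-1 marks an empty slot). The buffer sizing is simplified here to sink_size + window_size, whereas this PR sizes the cache as sink_size + 2 * window_size.

```python
import torch


class RingBufferSinkCache:
    """Illustrative ring-buffer KV cache with attention-sink slots (names are assumptions)."""

    def __init__(self, n_heads: int, head_dim: int, sink_size: int = 4, window_size: int = 124):
        self.sink_size = sink_size
        self.window_size = window_size
        self.cache_size = sink_size + window_size
        self.k = torch.zeros(1, n_heads, self.cache_size, head_dim)
        self.v = torch.zeros(1, n_heads, self.cache_size, head_dim)
        # cache_positions[slot] = logical position stored in that slot, or -1 if empty.
        self.cache_positions = torch.full((self.cache_size,), -1, dtype=torch.long)

    def _write_indices(self, input_pos: torch.Tensor, seq_len: int) -> torch.Tensor:
        pos = input_pos.item() + torch.arange(seq_len)
        # Sink tokens keep their slots forever; everything else wraps inside the window.
        return torch.where(
            pos < self.sink_size,
            pos,
            self.sink_size + (pos - self.sink_size) % self.window_size,
        )

    def update(self, input_pos: torch.Tensor, k_new: torch.Tensor, v_new: torch.Tensor):
        seq_len = k_new.shape[2]
        idx = self._write_indices(input_pos, seq_len)
        self.k.index_copy_(2, idx, k_new)
        self.v.index_copy_(2, idx, v_new)
        # Track which logical positions the written slots now hold, for mask construction.
        self.cache_positions.index_copy_(0, idx, input_pos.item() + torch.arange(seq_len))
        return self.k, self.v
```

Once the logical position passes sink_size + window_size, new tokens overwrite the oldest window slots while the sink slots stay intact, which is what lets generation continue past the physical cache capacity.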

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 12 comments.

Summary per file:

| File | Description |
| --- | --- |
| test_attention_sink.md | Adds local build/export/run notes for attention sink testing. |
| extension/llm/runner/text_llm_runner.cpp | Changes prompt validation to use max_seq_len and adjusts max-new-token resolution behavior. |
| extension/llm/runner/targets.bzl | Links runner to the new attention-sink IOManager target. |
| extension/llm/runner/llm_runner_helper.cpp | Selects AttentionSinkIOManager based on module metadata keys. |
| extension/llm/runner/io_manager/targets.bzl | Adds Bazel target for attention_sink_io_manager. |
| extension/llm/runner/io_manager/attention_sink_io_manager.h | Introduces IOManager subclass for attention-sink / “infinite context” bookkeeping. |
| extension/llm/runner/io_manager/attention_sink_io_manager.cpp | Implements pass-through prefill/decode input preparation while tracking logical position. |
| extension/llm/runner/constants.h | Adds metadata keys for attention sink configuration. |
| examples/models/llama/source_transformation/test_attention_sink_ring_buffer.py | Adds comprehensive ring-buffer attention sink unit tests. |
| examples/models/llama/source_transformation/test_attention_sink.py | Removes the previous attention sink test suite. |
| examples/models/llama/source_transformation/sdpa.py | Forces custom_sdpa to use start_pos=0 when an attention mask is provided. |
| examples/models/llama/source_transformation/custom_kv_cache.py | Prevents custom KV-cache replacement from clobbering KVCacheWithAttentionSink. |
| examples/models/llama/source_transformation/attention_sink.py | Reworks attention sink to a torch.export-compatible ring-buffer approach (masking, cache position tracking, cache update). |
| examples/models/llama/model.py | Loosens constraints and increases RoPE table length for attention sink generation. |
| examples/models/llama/export_llama_lib.py | Enables attention masks for custom SDPA when attention sink is enabled; initializes cache_positions mutable buffer. |
| examples/models/llama/eval_llama_lib.py | Updates attention-sink eval assumptions for ring-buffer/cache-size vs RoPE-length. |
| examples/models/llama/config/*.yaml | Adds new attention sink configs (runner-side + xnnpack/noxnn variants). |
| examples/models/llama/attention.py | Uses dynamic ring-buffer masking when KV cache exposes is_ring_buffer. |
| examples/models/llama/BUCK | Registers the new attention sink ring-buffer test. |
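
The attention.py entry above mentions dynamic ring-buffer masking whenever the KV cache exposes is_ring_buffer. As a hedged illustration (not this PR's code), one way such a mask can be derived from a cache_positions buffer like the one sketched earlier is to let each query attend only to slots that are filled and hold a logical position no later than the query's own:

```python
import torch


def ring_buffer_mask(cache_positions: torch.Tensor, input_pos: int, seq_len: int) -> torch.Tensor:
    """Boolean [seq_len, cache_size] mask built from logical positions, not slot indices."""
    q_pos = input_pos + torch.arange(seq_len).unsqueeze(1)  # [seq_len, 1] query positions
    slot_pos = cache_positions.unsqueeze(0)                 # [1, cache_size] position held by each slot
    return (slot_pos >= 0) & (slot_pos <= q_pos)            # filled and causally visible


def to_additive(mask: torch.Tensor) -> torch.Tensor:
    """Convert to the additive float mask typical SDPA kernels expect (0 keep, -inf drop)."""
    out = torch.zeros(mask.shape, dtype=torch.float32)
    return out.masked_fill_(~mask, float("-inf"))
```

Because the mask is recomputed from cache_positions at each step, it stays correct after the ring buffer wraps; that is consistent with sdpa.py forcing start_pos=0 for custom_sdpa whenever an explicit mask is supplied.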


)
or bool(llm_config.model.use_attention_sink),
use_sdpa_with_kv_cache=llm_config.model.use_sdpa_with_kv_cache,
quantize_kv_cache=llm_config.model.quantize_kv_cache,
Copilot AI Feb 6, 2026

This block enables attention-mask mode for custom SDPA when attention sink is on, but attention-sink configs also enable KV-cache quantization in some cases. In the quantized path, SDPACustom is replaced by QuantizedSDPA, which (per sdpa.py) still doesn’t support attention masks and continues to use start_pos for causal masking. That means attention sink + quantized KV cache may still hit the same start_pos >= cache_size validation failure. Consider propagating the use_attention_mask flag into QuantizedSDPA and making its mask path ignore start_pos similarly (or preventing this combination).

Suggested change:
-    quantize_kv_cache=llm_config.model.quantize_kv_cache,
+    # Quantized KV cache currently does not support attention-mask-based
+    # custom SDPA (QuantizedSDPA still relies on start_pos for masking).
+    # To avoid incompatible behavior, disable KV-cache quantization when
+    # attention sink is enabled.
+    quantize_kv_cache=(
+        False
+        if getattr(llm_config.model, "use_attention_sink", False)
+        else llm_config.model.quantize_kv_cache
+    ),

# Export model
Take a look at examples/models/llama/README.md

Check point is in ~/executorch/
Copilot AI Feb 6, 2026

Spelling: “Check point” should be “Checkpoint”.

Suggested change:
-Check point is in ~/executorch/
+Checkpoint is in ~/executorch/

Comment on lines +132 to +147:

   // Get max_seq_len for single prefill chunk limit
   int64_t max_seq_len = metadata_.at(kMaxSeqLen);
   int64_t max_context_len = metadata_.at(kMaxContextLen);

   ET_CHECK_OR_RETURN_ERROR(
       num_prompt_tokens >= 1,
       InvalidArgument,
       "Expected at least 1 prompt token");

   ET_CHECK_OR_RETURN_ERROR(
-      num_prompt_tokens < max_context_len,
+      num_prompt_tokens <= max_seq_len,
       InvalidArgument,
-      "num_prompt_tokens %d >= max_context_len %" PRId64
-      ", Max seq length exceeded - please increase max seq len value in your export script",
+      "num_prompt_tokens %d > max_seq_len %" PRId64
+      ", Single prefill chunk too large",
       num_prompt_tokens,
-      max_context_len);
+      max_seq_len);
Copilot AI Feb 6, 2026

The new prompt-length validation only checks num_prompt_tokens <= max_seq_len and no longer accounts for pos_ / remaining context. If generate() is called multiple times without reset(), pos_ can be non-zero and text_prefiller_->prefill(prompt_tokens, pos_) may exceed the model’s usable context for non-ring-buffer models. Consider keeping the remaining-context check for non-ring-buffer models (e.g., num_prompt_tokens < (max_context_len - pos_)) while retaining the per-prefill-chunk max_seq_len check for ring/attention-sink exports.

Comment on lines 153 to 154
int max_new_tokens =
config.resolve_max_new_tokens(max_context_len, num_prompt_tokens);
Copilot AI Feb 6, 2026

resolve_max_new_tokens(max_context_len, num_prompt_tokens) currently ignores the existing pos_, so for non-ring-buffer models it can allow generating past the remaining context budget when pos_ > 0. Previously this was handled by reducing max_context_len by pos_. Consider passing max_context_len - pos_ for non-ring-buffer models (or extending resolve_max_new_tokens to take start_pos).

Suggested change:
-  int max_new_tokens =
-      config.resolve_max_new_tokens(max_context_len, num_prompt_tokens);
+  // For non-ring-buffer models starting at a non-zero position, only the
+  // remaining context window (max_context_len - pos_) is available.
+  int64_t remaining_context_len = max_context_len - pos_;
+  if (remaining_context_len < 0) {
+    remaining_context_len = 0;
+  }
+  int max_new_tokens =
+      config.resolve_max_new_tokens(remaining_context_len, num_prompt_tokens);

Comment on lines 240 to 256
// Get method names to check for attention sink metadata
auto method_names_result = module->method_names();
if (method_names_result.error() != Error::Ok) {
ET_LOG(Error, "Failed reading method names for IOManager selection");
return nullptr;
}
const auto& method_names = method_names_result.get();

// Check if attention sink is enabled via metadata
bool use_attention_sink = false;
int64_t sink_size = 4; // Default values
int64_t window_size = 124;

if (method_names.count(kUseAttentionSink)) {
auto get_result = module->get(kUseAttentionSink);
use_attention_sink = get_result.get().toScalar().to<bool>();
}
Copilot AI Feb 6, 2026

AttentionSinkIOManager selection relies on module methods named use_attention_sink, attention_sink_size, and attention_sink_window_size, but the Llama export metadata loader currently only emits get_max_seq_len, get_max_context_len, etc. (no attention-sink keys). As a result, this branch will never trigger and AttentionSinkIOManager won’t be used. Suggest exporting these metadata methods when llm_config.model.use_attention_sink is set (or using an existing reliable signal) so runner-side selection works.

Comment on lines +405 to +410:

    """
    Transform the model to be able to run inference with Attention Sink.
    There mainly three steps:
    - Replace Rope with RopeWithAttentionSink
-   - Replace Attention's KVCache with KVCacheWithAttentionSink, forward with attention_sink_forward
+   - Replace Attention's KVCache with KVCacheWithAttentionSink
+   - Replace Attention's forward with attention_sink_forward
Copilot AI Feb 6, 2026

The docstring for enable_attention_sink says it replaces Attention’s forward with attention_sink_forward, but _replace_attention explicitly does not replace forward anymore. Update the docstring (and/or remove attention_sink_forward if it’s intentionally unused) to match actual behavior.

k_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
v_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
input_pos = torch.tensor([20 + i], dtype=torch.long)
k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
Copilot AI Feb 6, 2026

Variable k_out is not used.

Suggested change:
-k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
+self.kv_cache.update(input_pos, k_decode, v_decode)

k_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
v_decode = torch.randn(1, self.n_heads, 1, self.head_dim)
input_pos = torch.tensor([20 + i], dtype=torch.long)
k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
Copilot AI Feb 6, 2026

Variable v_out is not used.

Suggested change:
-k_out, v_out = self.kv_cache.update(input_pos, k_decode, v_decode)
+_, _ = self.kv_cache.update(input_pos, k_decode, v_decode)


def test_cache_positions_consistency(self):
"""Test that cache positions remain consistent during generation."""
cache_size = 32
Copilot AI Feb 6, 2026

Variable cache_size is not used.

Comment on lines +21 to +23
CachePositionsManager,
KVCache,
RingKVCache,
Copilot AI Feb 6, 2026

Import of 'CachePositionsManager' is not used.
Import of 'RingKVCache' is not used.

Suggested change:
-    CachePositionsManager,
-    KVCache,
-    RingKVCache,
+    KVCache,
logical_pos_,
is_cache_full() ? "true" : "false");

// Pass through to model as-is. The model's KVCacheWithAttentionSink
kirklandsign (Contributor, Author)

No!!

k,
v,
input_pos[0].item(),
0, # start_pos: not used when mask is provided
kirklandsign (Contributor, Author)

really?

Copilot AI review requested due to automatic review settings February 7, 2026 02:30
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.



Comment on lines +1356 to +1359
# IOManager expects these methods to exist returning int.
# By adding them to metadata, export_to_edge will generate constant methods.
metadata["get_sink_size"] = sink_params[0]
metadata["get_window_size"] = sink_params[1]
Copilot AI Feb 7, 2026

The metadata added for attention sink uses keys get_sink_size/get_window_size, but the runtime runner expects attention sink metadata under use_attention_sink/attention_sink_size/attention_sink_window_size (see extension/llm/runner/constants.h and llm_runner_helper.cpp). This mismatch prevents the runner from detecting attention sink models. Update the metadata keys (and add an explicit enable flag) to match what the runner reads, or adjust the runner to match these exported keys.

Suggested change:
-    # IOManager expects these methods to exist returning int.
-    # By adding them to metadata, export_to_edge will generate constant methods.
-    metadata["get_sink_size"] = sink_params[0]
-    metadata["get_window_size"] = sink_params[1]
+    # Runtime runner expects these metadata keys:
+    # - "use_attention_sink": bool flag to enable attention sink
+    # - "attention_sink_size": sink size (int)
+    # - "attention_sink_window_size": window size (int)
+    metadata["use_attention_sink"] = True
+    metadata["attention_sink_size"] = sink_params[0]
+    metadata["attention_sink_window_size"] = sink_params[1]

Comment on lines 25 to 30
  # Max Context (Buffer) = 4 + 1 * 124 = 128
  use_attention_sink: "4,124,1"

export:
  # max_seq_length for single prefill chunk
  max_context_len: 128
Copilot AI Feb 7, 2026

Export config uses max_context_len, but LlmConfig’s ExportConfig field is max_context_length (see extension/llm/export/config/llm_config.py). As written, this config likely won’t set the intended context length. Also the comment on line 25 (“Max Context (Buffer) = 4 + 1 * 124 = 128”) doesn’t match the earlier cache-size note (sink_size + 2*window_size = 252) and is confusing—please correct it to reflect the actual meaning (RoPE table vs KV cache size).

Suggested change:
-  # Max Context (Buffer) = 4 + 1 * 124 = 128
+  # RoPE/logical max context per step = sink_size + 1 * window_size = 4 + 1 * 124 = 128 tokens
   use_attention_sink: "4,124,1"

 export:
   # max_seq_length for single prefill chunk
-  max_context_len: 128
+  max_context_length: 128

Comment on lines +283 to 287
    if indices is None:
        # Calculate write indices
        indices = self.cache_positions_manager.calculate_positions_and_update_indices(
            input_pos, seq_len
        )
Copilot AI Feb 7, 2026

KVCacheWithAttentionSink.update() only updates cache_positions when it computes indices internally. When indices are provided (e.g., runner-supplied cache_indices), cache_positions is never updated, so attention masks derived from cache_positions can be stale/incorrect (and may remain at the -1 sentinel). Update cache_positions_manager.cache_positions for the provided indices as well (e.g., index_copy with orig positions derived from input_pos and seq_len).
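
A hedged sketch of the fix this comment suggests, assuming attribute names like the excerpt above (cache_positions_manager.cache_positions); the helper name is hypothetical, not part of the PR:

```python
import torch


def sync_cache_positions(cache_positions: torch.Tensor,
                         indices: torch.Tensor,
                         input_pos: torch.Tensor,
                         seq_len: int) -> None:
    """Record which logical positions now occupy externally supplied slots, so masks
    derived from cache_positions do not stay at the -1 sentinel or go stale."""
    orig_positions = input_pos[0] + torch.arange(seq_len, dtype=torch.long)
    cache_positions.index_copy_(0, indices, orig_positions)
```

In KVCacheWithAttentionSink.update(), this would run in the branch where indices is provided by the caller, mirroring what calculate_positions_and_update_indices presumably does when the indices are computed internally.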

Comment on lines +431 to +435
max_context_len = sink_size + window_size * 2

# We update params.max_context_len to reflect the actual buffer size
# This ensures export captures the correct cache size in metadata
params.max_context_len = max_context_len
Copilot AI Feb 7, 2026

enable_attention_sink() overwrites params.max_context_len with the KV cache buffer size (sink_size + 2*window_size). Rope.get_freqs() enforces input_pos < params.max_context_len under dynamic shape, so this change will cap generation to the cache size and break the intended “logical position can exceed cache size” behavior (and also contradict the earlier logic in model.py that tries to enlarge max_context_len for RoPE). Consider keeping params.max_context_len as the RoPE table length and passing the cache size via a separate field/argument (or only adjusting metadata without shrinking RoPE capacity).

Suggested change:
-    max_context_len = sink_size + window_size * 2
-    # We update params.max_context_len to reflect the actual buffer size
-    # This ensures export captures the correct cache size in metadata
-    params.max_context_len = max_context_len
+    # Default KV cache buffer size: sink tokens + sliding window on both sides
+    max_context_len = sink_size + window_size * 2

Comment on lines +253 to +267
if (method_names.count(kUseAttentionSink)) {
auto get_result = module->get(kUseAttentionSink);
use_attention_sink = get_result.get().toScalar().to<bool>();
}

if (use_attention_sink) {
// Get attention sink configuration from metadata
if (method_names.count(kAttentionSinkSize)) {
auto get_result = module->get(kAttentionSinkSize);
sink_size = get_result.get().toScalar().to<int64_t>();
}
if (method_names.count(kAttentionSinkWindowSize)) {
auto get_result = module->get(kAttentionSinkWindowSize);
window_size = get_result.get().toScalar().to<int64_t>();
}
Copilot AI Feb 7, 2026

IOManager selection is keyed off module methods named use_attention_sink / attention_sink_size / attention_sink_window_size, but the exporter code in this PR adds attention sink metadata as get_sink_size/get_window_size (and does not add use_attention_sink). As a result, use_attention_sink will remain false here and AttentionSinkIOManager will never be constructed. Align the exporter’s metadata keys with the runner constants (or update the runner to look for the exported keys), and include a clear boolean enable flag plus the numeric parameters.


def test_cache_positions_consistency(self):
"""Test that cache positions remain consistent during generation."""
cache_size = 32
Copilot AI Feb 7, 2026

Variable cache_size is not used.

Suggested change:
-        cache_size = 32

