Added Causal Mask Pattern Fusion for LongRoPe Models and Cache Insertion for Phi4-mini-reasoning #2461
tadani3 wants to merge 12 commits into microsoft:main
Conversation
mask_key = self._get_mask_key(attention_mask)

if mask_key in self._mask_cache:
    total_seq_length_int32, seqlens_k_int32 = self._mask_cache[mask_key]

Check notice (Code scanning / CodeQL): Unused local variable
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             main    #2461      +/-   ##
==========================================
- Coverage   69.81%   69.03%   -0.78%
==========================================
  Files         209      210       +1
  Lines       25313    25790     +477
  Branches     2525     2603      +78
==========================================
+ Hits        17673    17805     +132
- Misses       6762     7110     +348
+ Partials      878      875       -3

☔ View full report in Codecov by Sentry.
lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Pull Request Overview
This PR adds causal mask pattern fusion support specifically for LongRoPe models such as Phi-4-mini-reasoning. The implementation extends the existing GQA (Group Query Attention) fusion rules to handle the complex attention mask patterns used by LongRoPe models, optimizing the mask computation process while maintaining compatibility with ModelBuilder optimizations.
Key changes:
- Addition of a new LongRoPeGQACausalMask class that implements specialized mask pattern matching and fusion
- Extension of the GQA rewrite rules to include LongRoPe-specific optimizations
- Implementation of a mask caching mechanism to avoid recomputation
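For readers unfamiliar with the rewriter, here is a rough structural sketch of how such a rule class plugs into the onnxscript pattern API, based only on the identifiers visible in this diff; the method bodies are elided and the parameter names are illustrative, so this is not the PR's actual implementation:

```python
from onnxscript.rewriter import pattern


class LongRoPeGQACausalMask(pattern.RewriteRuleClassBase):
    """Sketch: match the LongRoPe causal-mask subgraph and fuse it for GQA."""

    def __init__(self):
        # remove_nodes=False keeps the matched nodes in place (as in the diff),
        # leaving cleanup to later passes.
        super().__init__("LongRoPeGQACausalMask", remove_nodes=False)

    def pattern(self, op, input_ids, attention_mask, total_seq_length):
        ...  # describe the mask-construction subgraph (Shape/Expand/And/Slice/...)

    def rewrite(self, op, input_ids, attention_mask, total_seq_length):
        ...  # emit the replacement that feeds the fused GroupQueryAttention op


# As in the diff, LongRoPeGQACausalMask.rule() would turn the class into a
# rewrite rule that can be grouped into a pattern.RewriteRuleSet.
```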
_basic_gqa_rule = GroupQueryAttention.rule()
_longrope_gqa_causal_mask_rule = LongRoPeGQACausalMask.rule()

gqa_rules = pattern.RewriteRuleSet([_basic_gqa_rule])
The gqa_rules variable is being reassigned, which overwrites the previous assignment on line 514. This means the first assignment gqa_rules = pattern.RewriteRuleSet([_basic_gqa_rule]) is completely ignored.
gqa_rules = pattern.RewriteRuleSet([_basic_gqa_rule])
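One way to resolve this, assuming both rules are meant to be active (a sketch, not necessarily the PR's final code), is to register them in a single rule set:

```python
gqa_rules = pattern.RewriteRuleSet([_basic_gqa_rule, _longrope_gqa_causal_mask_rule])
```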
# Propagation to GQA
mask_sliced = op.Slice(mask_A_B_C_scaled, [0], pattern.ANY_VALUE, [3], [1], _outputs=["mask_sliced"])

#mask_where = op.Where(mask_sliced, pattern.ANY_VALUE, pattern.ANY_VALUE, _outputs=["mask_where"])
This commented-out code should be removed if it's not needed, or properly implemented if it serves a purpose. Dead code reduces maintainability.
Suggested change:
- #mask_where = op.Where(mask_sliced, pattern.ANY_VALUE, pattern.ANY_VALUE, _outputs=["mask_where"])
mask_expanded_C = op.Expand(reshaped_range_C, mask_shape_C_abs, _outputs=["mask_expanded_C"])

# EXPAND A/B TO AND
mask_expanded_A_sub = op.Sub(mask_expanded_A, 262144, _outputs=["mask_expanded_A_sub"])
The magic number 262144 should be defined as a named constant to improve code readability and maintainability. Consider defining it as a class constant with a descriptive name.
Suggested change:
- mask_expanded_A_sub = op.Sub(mask_expanded_A, 262144, _outputs=["mask_expanded_A_sub"])
+ mask_expanded_A_sub = op.Sub(mask_expanded_A, MASK_OFFSET, _outputs=["mask_expanded_A_sub"])
Better to make it a pattern-variable, I think ... if I understand right, this is actually a magic sequence-length constant? Perhaps model-specific?
On second thoughts, I am guessing this is the window_size, which should become an attribute-parameter to the GQA op.
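If the value does end up staying as a literal rather than becoming a GQA attribute, the named-constant suggestion amounts to something like this (the constant name is illustrative):

```python
# Illustrative name; per the discussion above, 262144 may in fact be the
# model's sliding-window size and belong on the GQA op as an attribute.
_LONGROPE_MASK_OFFSET = 262144

mask_expanded_A_sub = op.Sub(mask_expanded_A, _LONGROPE_MASK_OFFSET, _outputs=["mask_expanded_A_sub"])
```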
Generate a unique key for the mask based on input_ids and past_kv_cache.
This is used to cache the mask to avoid recomputation.
"""
return (id(attention_mask))
Using id() for cache keys is fragile because object ids can be reused after garbage collection. This could lead to incorrect cache hits with different attention_mask objects that happen to have the same id.
Suggested change:
- Generate a unique key for the mask based on input_ids and past_kv_cache.
+ Generate a unique key for the mask based on the content of attention_mask.
  This is used to cache the mask to avoid recomputation.
  """
- return (id(attention_mask))
+ if isinstance(attention_mask, np.ndarray):
+     return hash(attention_mask.tobytes())
+ elif isinstance(attention_mask, (list, tuple)):
+     return hash(tuple(attention_mask))
+ else:
+     raise TypeError("Unsupported type for attention_mask: {}".format(type(attention_mask)))
If a cache is used, it should be cleaned up like in this example so that it is not carried over from one graph/model to another
And I am not sure if we need to handle np arrays? If the key is either one or two ir.Values, that should be fine ... ir.Values can be used as keys in dictionaries directly, and that should avoid the garbage-collection problem.
I agree _get_mask_key seems unnecessary. We can use the Value objects directly as keys.
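A minimal sketch of that suggestion (assuming attention_mask is the matched ir.Value; _build_seqlens is a hypothetical stand-in for the PR's mask-computation logic):

```python
# ir.Value objects can be used as dict keys directly, so no _get_mask_key is needed.
cached = self._mask_cache.get(attention_mask)
if cached is None:
    # Hypothetical helper standing in for the compute_mask logic that emits the
    # total_seq_length / seqlens_k values consumed by the fused GQA op.
    cached = self._build_seqlens(op, attention_mask)
    self._mask_cache[attention_mask] = cached
total_seq_length_int32, seqlens_k_int32 = cached
```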
| """ | ||
| return (id(attention_mask)) | ||
|
|
||
| def compute_mask(self, op, attention_mask : _onnx_types.INT64['batch', 'seq_len']): |
The rewriter doesn't use onnxscript types (yet). Could you instead use a comment to document the shape of attention_mask?
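That is, something along these lines (sketch):

```python
def compute_mask(self, op, attention_mask):
    # attention_mask: INT64 tensor of shape [batch, seq_len]
    ...
```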
    _outputs=3,
)

class LongRoPeGQACausalMask(pattern.RewriteRuleClassBase):
Could you use the docstring to document the pattern and its replacement? For the branches A, B, and C, I would consider giving them descriptive names.
The following is my understanding; if this is correct, maybe they can be renamed appropriately: I believe that A constructs the kv_range, B constructs the query_range, and C constructs the batch_range. Each constructs the corresponding range as a 4D tensor with 1s in the other positions (for constructing a final attention-mask of shape [Batch, NumHeads, QueryRange, KVRange] via broadcast).
I am a bit puzzled that query_range and kv_range look to be the same here; it might be an artifact of this model usage, I guess.
I wasn't sure what the branches referred to but I'll make changes following what Rama is suggesting.
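Based on that reading, the class docstring might look roughly like this (a sketch reusing the class declaration from the diff; the branch interpretation above is a reviewer's guess, not confirmed by the PR):

```python
class LongRoPeGQACausalMask(pattern.RewriteRuleClassBase):
    """Fuses the LongRoPe causal-mask construction that feeds GroupQueryAttention.

    The matched subgraph builds an attention mask of shape
    [batch, num_heads, query_range, kv_range] by broadcasting three branches:
      * kv_range branch    (formerly "A"): range over key/value positions
      * query_range branch (formerly "B"): range over query positions
      * batch_range branch (formerly "C"): range over the batch dimension
    Each branch is expanded to a 4D tensor with size 1 in the remaining axes
    and combined via And/Where into the final causal mask. The replacement
    feeds the equivalent seqlens_k / total_sequence_length values to the fused
    GQA op instead of materializing this mask.
    """
```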
    total_seq_length,
):
    seq_len = op.Shape(input_ids, end=2, start=1, _outputs=["seq_len"])
    seq_len_0D = op.Squeeze(seq_len, _outputs=["seq_len_0D"])
Suggested change:
- seq_len_0D = op.Squeeze(seq_len, _outputs=["seq_len_0D"])
+ seq_len_0d = op.Squeeze(seq_len, _outputs=["seq_len_0d"])

Prefer snake_case for variable names when possible.
mask_A_B_combined = op.And(mask_A_B_greater_bitwise, mask_A_B_less, _outputs=["mask_A_B_combined"])
mask_A_B_combined_bitwise = op.And(True, mask_A_B_combined, _outputs=["mask_A_B_combined_bitwise"])

# EXPAND B/C TO AND
I would document the branches in plain English for readers
class LongRoPeGQACausalMask(pattern.RewriteRuleClassBase):
    def __init__(self):
        super().__init__("LongRoPeGQACausalMask", remove_nodes=False)
        self._mask_cache = {}
The copilot review is reasonable: the rewrite rule class should be stateless. Is there a different way to do this other than keeping a self._mask_cache?
I think use of state for this purpose is okay? It has been used before for a similar purpose, namely to introduce values that are reused across multiple rewrites. (Now that we have CSE, there is an alternative path, which is to create duplicate copies and then eliminate them via CSE ... but I am not sure it is worth the bother.)
BTW: my GQA fusion doesn't use state, and produces multiple copies (as described above).
My concern is that the state will carry over from one model to another if we are not careful, which is probably not a good idea. Maybe we can have a class-managed state dict that is cleared by the class?
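A minimal sketch of that idea; the clear_cache hook and the point at which the driver calls it are assumptions, not an existing rewriter API:

```python
class LongRoPeGQACausalMask(pattern.RewriteRuleClassBase):
    def __init__(self):
        super().__init__("LongRoPeGQACausalMask", remove_nodes=False)
        self._mask_cache = {}

    def clear_cache(self):
        # Hypothetical hook: invoked once per model (e.g., by the fusion driver
        # before/after rewriting) so cached ir.Values never leak across graphs.
        self._mask_cache.clear()
```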
Hi @tadani3, sorry about the concurrent changes I had merged into GQA fusion recently, which might impact some of your changes ... but I am a bit confused by the diffs shown, which don't seem to reflect the changes I had made. Briefly, the earlier version did the fusion in two steps: the first rule ignored the attention-mask and focused on the rest of the computation, and the second rule explicitly handled the attention-mask. The more recent version merged the two into one, for various reasons. I think it shouldn't impact your changes much, except that you will have to make the changes in rule 1 instead of rule 2. But, as I said, I am a bit confused why I am not seeing those changes in the diffs.
        super().__init__("LongRoPeGQACausalMask", remove_nodes=False)
        self._mask_cache = {}

    def _get_mask_key(self, attention_mask):
In general, avoid creating class methods that do not require state from self; instead make them module-level private functions for testability and clarity.
import numpy as np
import onnx_ir as ir

import onnxscript.onnx_types as _onnx_types
Suggested change:
- import onnxscript.onnx_types as _onnx_types
_onnx_types is not compatible with the rewriter (yet)
@microsoft-github-policy-service agree company="Microsoft" |
# Licensed under the MIT License. See License.txt in the project root for
# license information.
# --------------------------------------------------------------------------
import onnx

Check notice (Code scanning / CodeQL): Unused import

# --------------------------------------------------------------------------
import onnx
from onnxscript import ir
import onnx.helper

Check notice (Code scanning / CodeQL): Unused import
cache_length = self.rotemb_attrs["cache_length"]
position_ids = torch.arange(cache_length, dtype=torch.int64).unsqueeze(0)  # Shape: (1, cache_length)

inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)  # (1, dim//2, 1)

Check failure (Code scanning / CodeQL): Potentially uninitialized local variable

with torch.autocast(device_type=device_type, enabled=False):
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)  # (1, cache_length, dim//2)
    emb = torch.cat((freqs, freqs), dim=-1)  # (1, cache_length, dim)
    cos_cache = emb.cos() * attention_factor  # (1, cache_length, dim)

Check failure (Code scanning / CodeQL): Potentially uninitialized local variable

attention_factor = self.rotemb_attrs["multi_cache"]["short_mscale"]

inv_freq_shape = torch.arange(0, dim, 2, dtype=torch.int64, device="cpu").float() / dim
inv_freq = 1.0 / (ext_factors * base**inv_freq_shape)

Check failure (Code scanning / CodeQL): Potentially uninitialized local variable

if "rescale_inv_freq" in self.rotemb_attrs:
    inv_freq = self.make_inv_freq_rescaled(inv_freq)

return inv_freq, attention_factor

Check failure (Code scanning / CodeQL): Potentially uninitialized local variable
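These CodeQL errors typically mean inv_freq and attention_factor are only assigned inside one branch of a conditional. A minimal sketch of one way to quiet them, as a fragment of the surrounding method; the branch condition shown is an assumption about the rope-scaling configuration, not a quote of the PR:

```python
# Initialize up front so every code path defines both values, then fail loudly
# if the configuration does not populate them.
inv_freq = None
attention_factor = None

if "multi_cache" in self.rotemb_attrs:  # assumed LongRoPe / multi-cache branch
    attention_factor = self.rotemb_attrs["multi_cache"]["short_mscale"]
    inv_freq_shape = torch.arange(0, dim, 2, dtype=torch.int64, device="cpu").float() / dim
    inv_freq = 1.0 / (ext_factors * base**inv_freq_shape)

if inv_freq is None or attention_factor is None:
    raise ValueError("rotemb_attrs did not yield inv_freq/attention_factor")
```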
I added a class called Phi4MiniReasoningPostProcessor which uses the ONNX IR to fulfill two tasks:
Reasoning and Motivation
…xscript into longrope_causal_mask
If you can move this file to a separate PR we can merge the fusion rules. Thanks
Modification of the GQA causal mask fusion rule to handle the attention mask fusion for LongRoPe models such as Phi-4-mini-reasoning. The causal mask modification leads to a result that matches the optimizations made in ModelBuilder.