Skip to content

Conversation

@damien-lejeune
Copy link
Contributor

@damien-lejeune damien-lejeune commented Jan 19, 2026

Proposed changes

Some performance improvement have been implemented:

  • Swap of the reduce loop, from {num_op {n_iter}} to {n_iter { num_op } }, now reading the data only once
  • Improvement in register pressure
  • Improvement in vectorization

Features change:

  • As we read the data once we need all reduce operators to have the same identity elements. Constraints have been added to isSupportedArgument and unit test have been updated to reflect the new behavior. See the discussion for an alternative design.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

  • Performance-wise: I also tried to experiments with adding a data prefetching step did not lead to much improvements as it increased the register pressure and negated the positive effect.
  • Design-wise: reading the date once implies that we need the reduce operations to have the same the same identity element, as we specify the padding value to be this element at reading time. This PR just added the check in the isSupportedArgument method allowing multiple different operations (like addition or bitwise OR). Another way to enforce this would be to remove the possibility to have multiple reduce operations, but keeping multiple different elementwise and accumulator operations, as in old CK.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request optimizes the multi-reduction operation by reorganizing the reduction loop structure and improving memory access patterns. The key insight is swapping the loop order to read data once while performing multiple reduction operations, rather than reading data multiple times.

Changes:

  • Swapped reduction loop from {n_iter { num_op }} to {num_op {n_iter}} to minimize data reads
  • Improved vectorization logic to handle both contiguous and non-contiguous reduction dimensions
  • Added validation to ensure all reduction operations share the same identity value (required for the new implementation)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
include/ck_tile/ops/reduce/kernel/multi_reduce2d_kernel.hpp Restructured reduction loop to process all operations per iteration, added identity value validation, and improved vectorization handling
test/ck_tile/reduce/test_multi_reduce2d_threadwise.cpp Updated test configuration from Add-Max-Add to Add-SumSquare to align with new identity value requirement

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@aosewski aosewski changed the title Dlejeune/ck tile multi reduce improvements Ck tile multi reduce improvements Jan 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants