Remove code duplications in batched gemm (multi D) gemm (multi D) wmma #3617
Conversation
Hey guys, please create branches in the ROCm repo directly, otherwise the CI builds won't run automatically.
Pull request overview
This PR refactors batched GEMM-GEMM implementations to eliminate code duplication by extracting common functionality into a shared header file (device_batched_gemm_gemm_wmma_cshuffle_v3_common.hpp). The refactoring consolidates kernel launch logic, argument structures, and validation code that was previously duplicated across multiple device implementations.
Changes:
- Created a new common header file containing shared kernel functions, argument structures, and validation logic
- Refactored `device_batched_gemm_multiple_d_gemm_multiple_d_wmma_cshuffle_v3.hpp` to use the common implementation
- Refactored `device_batched_gemm_gemm_wmma_cshuffle_v3.hpp` to use the common implementation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `device_batched_gemm_gemm_wmma_cshuffle_v3_common.hpp` | New file containing extracted common kernel, argument structures, invoker, and validation logic with support for both MultiD and non-MultiD variants |
| `device_batched_gemm_multiple_d_gemm_multiple_d_wmma_cshuffle_v3.hpp` | Refactored to remove duplicated code by using common base classes and replacing `RawArg` with `Argument` from the common implementation |
| `device_batched_gemm_gemm_wmma_cshuffle_v3.hpp` | Refactored to remove duplicated code by using common base classes and replacing `RawArg` with `Argument`, with empty arrays for D0/D1 tensors |
Copilot (AI) commented on Jan 20, 2026:

```cpp
const std::array<std::array<index_t, 3>, NumD0Tensor>& d1_g_m_o_lengths_vec,
const std::array<std::array<index_t, 3>, NumD0Tensor>& d1_g_m_o_strides_vec)
```

The parameter names suggest these are D1 tensors (`d1_g_m_o_lengths_vec`, `d1_g_m_o_strides_vec`), but the template parameter specifies `NumD0Tensor`. This should be `NumD1Tensor` to match the actual tensor type being described.

Suggested change:

```cpp
const std::array<std::array<index_t, 3>, NumD1Tensor>& d1_g_m_o_lengths_vec,
const std::array<std::array<index_t, 3>, NumD1Tensor>& d1_g_m_o_strides_vec)
```
Copilot (AI) commented on Jan 20, 2026:

```cpp
                Number<I> d0_idx) const
{
    return g_idx * static_cast<long_index_t>(BatchStrideD0s_[d0_idx]);
```

The parameter is named `d0_idx` but should be `d1_idx` in the `GetD0BasePtr` method to correctly reflect that it's indexing into D0 tensors, not D1.

Suggested change:

```cpp
                Number<I> d1_idx) const
{
    return g_idx * static_cast<long_index_t>(BatchStrideD0s_[d1_idx]);
```
Copilot (AI) commented on Jan 20, 2026:

```cpp
constexpr TailNumber tn = decltype(tail_number)::value;

const auto kernel = kernel_batched_gemm_gemm_wmma_cshuffle_v3<DeviceOp,
                                                              GridwiseGemm,
                                                              has_loop,
                                                              tn,
```

The variable name `tn` is ambiguous. It should be renamed to `tail_num` or `tail_number_value` for clarity.

Suggested change:

```cpp
constexpr TailNumber tail_num = decltype(tail_number)::value;

const auto kernel = kernel_batched_gemm_gemm_wmma_cshuffle_v3<DeviceOp,
                                                              GridwiseGemm,
                                                              has_loop,
                                                              tail_num,
```
Commits:
- …lti_d gemm multi_d wmma implementation
  This file includes all components shared between the two implementations: the kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct, and the argument struct.
  Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
- …ementation
  Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
- …shuffle v3 implementation
  Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
- …ltiple D wmma cshuffle v3 implementations
  Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Proposed changes
Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.
Checklist
Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- Ran `clang-format` on all changed files

Discussion
The duplication removed is not large in raw LoC, but that is mainly because clang-format splits template parameter lists across many lines. The actual value of this PR is that the implementation code is shared and the details are nicely hidden from the device structs.