Add basic HVX reduce kernels. by copybara-service[bot] · Pull Request #9506 · google/XNNPACK

copybara-service · 2026-02-07T01:40:49Z

Add basic HVX reduce kernels.

Kernels that should be reasonably good:

- min, max, min_max for all types
- sum, sum_squared for int8 and uint8 for k1 > 1

Kernels that are not good and need work:

- sum, sum_squared for int8 and uint8 for k1 = 1. These are currently implemented naively, with conversions and wide arithmetic (instead of widening arithmetic).
- In general, k1 = 1 is not good because we unroll the accumulator by 2x/4x so that we can load whole vectors, which makes the accumulators really large (e.g. 128). This means that we're very likely to hit tail case code.
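To illustrate why a large accumulator tile makes the tail path likely, here is a minimal sketch. It assumes that the "128" above is the number of output elements covered by one unrolled accumulator tile; `tile_split` and `split_into_tiles` are hypothetical names for illustration, not part of XNNPACK.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper type: how many outputs are handled by full accumulator
// tiles and how many are left for the tail path.
typedef struct {
  size_t full_tiles;  // number of complete tiles of `tile` outputs
  size_t tail;        // leftover outputs, handled by tail case code
} tile_split;

// Splits `n` outputs into full tiles of `tile` outputs plus a remainder.
// Assumption: `tile` is 128 in the case described above.
static tile_split split_into_tiles(size_t n, size_t tile) {
  assert(tile > 0);
  tile_split s = {n / tile, n % tile};
  return s;
}
```

Under this assumption, a reduction with 100 outputs and a 128-wide tile never runs the full-tile loop at all, and any output count that is not a multiple of 128 runs the tail path at least once.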

Example inner loop (k1 > 1 uint8 sum, sum_squared is almost identical):

```
.LBB30_116:                             // %while.body14.i
                                        //   Parent Loop BB30_110 Depth=1
                                        //     Parent Loop BB30_112 Depth=2
                                        //       Parent Loop BB30_114 Depth=3
                                        // =>      This Inner Loop Header: Depth=4
        {
                v27 = vmemu(r5++#1)
        }
        {
                v28 = vmemu(r6++#1)
        }
        {
                v10.w += vrmpy(v27.ub,r9.b)
                v23 = vmemu(r0++#1)
        }
        {
                v9.w += vrmpy(v28.ub,r9.b)
                v15 = vmemu(r7++#1)
        }
        {
                v5.w += vrmpy(v23.ub,r9.b)
        }
        {
                v6.w += vrmpy(v15.ub,r9.b)
                r3 = add(r3,#-128)
        }
        {
                p3 = cmp.gtu(r3,#127)
                if (p3.new) jump:t .LBB30_116
        }
```
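
The accumulate step in the loop above can be modelled in scalar C. This is only a sketch under stated assumptions: the vector is 128 bytes wide (consistent with the loop counter stepping by 128), and `r9` holds four byte coefficients equal to 1, which the listing does not show. `vrmpy_acc_model` and `sum_u8_model` are hypothetical names, not XNNPACK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Assumed vector geometry: 128 bytes, i.e. 32 lanes of 32 bits.
enum { kVectorBytes = 128, kWordLanes = kVectorBytes / 4 };

// Scalar model of `acc.w += vrmpy(x.ub, coeff.b)`: each 32-bit lane
// accumulates the dot product of 4 adjacent unsigned bytes with 4 signed
// byte coefficients, so the widening happens inside the multiply-accumulate.
static void vrmpy_acc_model(int32_t acc[kWordLanes],
                            const uint8_t x[kVectorBytes],
                            const int8_t coeff[4]) {
  for (int lane = 0; lane < kWordLanes; ++lane) {
    for (int i = 0; i < 4; ++i) {
      acc[lane] += (int32_t)x[4 * lane + i] * (int32_t)coeff[i];
    }
  }
}

// Hypothetical single-stream uint8 sum built on the model above. It only
// handles sizes that are a multiple of the vector width (no tail handling).
static int32_t sum_u8_model(const uint8_t* x, size_t size) {
  assert(size % kVectorBytes == 0);
  const int8_t ones[4] = {1, 1, 1, 1};  // assumed contents of r9
  int32_t acc[kWordLanes] = {0};
  for (; size >= kVectorBytes; size -= kVectorBytes, x += kVectorBytes) {
    vrmpy_acc_model(acc, x, ones);
  }
  // Final horizontal reduction of the 32 lanes to one scalar.
  int32_t total = 0;
  for (int lane = 0; lane < kWordLanes; ++lane) {
    total += acc[lane];
  }
  return total;
}
```

With coefficients of 1, one such step adds four input bytes into each 32-bit lane without a separate conversion of the uint8 data to 32 bits. The listing runs four of these steps per iteration, on four independent input pointers and accumulators.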

PiperOrigin-RevId: 874380564

copybara-service[bot] force-pushed the test_866681443 branch 2 times, most recently from 1eea957 to 7f2b311 (February 24, 2026 02:09)

copybara-service[bot] force-pushed the test_866681443 branch from 7f2b311 to c28ce78 (February 24, 2026 06:27)

copybara-service[bot] merged commit c28ce78 into master (Feb 24, 2026); 1 check passed

copybara-service[bot] deleted the test_866681443 branch (February 24, 2026 06:27)
