Add basic HVX reduce kernels.#9506

Merged: copybara-service[bot] merged 1 commit into master from test_866681443 on Feb 24, 2026
copybara-service[bot] commented on Feb 7, 2026

Add basic HVX reduce kernels.

Kernels that should be reasonably good:

  • min, max, min_max for all types
  • sum, sum_squared for int8 and uint8 for k1 > 1

Kernels that are not good and need work:

  • sum, sum_squared for int8 and uint8 for k1 = 1. These are currently implemented naively, with conversions and wide arithmetic (instead of widening arithmetic).
  • In general, k1 = 1 is not good: to load whole vectors, we unroll the accumulator by 2x/4x, which makes the accumulators really large (e.g. 128). This means that we're very likely to hit tail case code.
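
To make the tail-case concern concrete, here is a minimal sketch of how an extent splits into full accumulator tiles and leftover tail elements. The names `WorkSplit` and `split_work` are hypothetical and not part of this change, and treating the accumulator size as a tile width over a single extent `n` is an assumption made for illustration.

```cpp
#include <cstddef>

// Hypothetical helper, not from this PR: how many whole accumulator tiles
// fit in an extent of n elements, and how many elements are left for the
// tail-case code.
struct WorkSplit {
  std::size_t full_tiles;     // tiles handled by the main (vector) loop
  std::size_t tail_elements;  // elements left over for the tail case
};

constexpr WorkSplit split_work(std::size_t n, std::size_t tile) {
  return WorkSplit{n / tile, n % tile};
}
```

Under these assumptions, a 128-element accumulator leaves an extent of 100 entirely in the tail (`split_work(100, 128)` gives 0 full tiles and 100 tail elements), while a 32-element accumulator would cover 96 of those elements in the main loop.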

Example inner loop (uint8 sum for k1 > 1; sum_squared is almost identical):

.LBB30_116:                             // %while.body14.i
                                        //   Parent Loop BB30_110 Depth=1
                                        //     Parent Loop BB30_112 Depth=2
                                        //       Parent Loop BB30_114 Depth=3
                                        // =>      This Inner Loop Header: Depth=4
        {
                v27 = vmemu(r5++#1)
        }
        {
                v28 = vmemu(r6++#1)
        }
        {
                v10.w += vrmpy(v27.ub,r9.b)
                v23 = vmemu(r0++#1)
        }
        {
                v9.w += vrmpy(v28.ub,r9.b)
                v15 = vmemu(r7++#1)
        }
        {
                v5.w += vrmpy(v23.ub,r9.b)
        }
        {
                v6.w += vrmpy(v15.ub,r9.b)
                r3 = add(r3,#-128)
        }
        {
                p3 = cmp.gtu(r3,#127)
                if (p3.new) jump:t .LBB30_116
        }
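
For readers unfamiliar with `vrmpy`, the following is a scalar C++ sketch of what each `vN.w += vrmpy(vM.ub,r9.b)` step in the loop above computes: every 32-bit lane accumulates the dot product of four adjacent unsigned bytes with the four signed bytes of the scalar register. It assumes 128-byte vectors (consistent with `r3` decreasing by 128 per iteration) and that `r9` holds four bytes equal to 1 for the sum kernel, which the listing does not show. The names `vrmpy_acc` and `sum_u8` are hypothetical, and the single-stream reduction is only an illustration, not the kernel in this change.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorBytes = 128;               // assumed HVX vector size
constexpr std::size_t kVectorWords = kVectorBytes / 4;  // 32-bit lanes per vector

using WordVector = std::array<std::int32_t, kVectorWords>;
using ByteQuad = std::array<std::int8_t, 4>;

// Scalar model of `acc.w += vrmpy(v.ub, r.b)`: each 32-bit lane i adds the
// dot product of unsigned bytes v[4i..4i+3] with the four signed bytes of r.
inline void vrmpy_acc(WordVector& acc, const std::uint8_t* v, const ByteQuad& r) {
  for (std::size_t i = 0; i < kVectorWords; ++i) {
    for (std::size_t j = 0; j < 4; ++j) {
      acc[i] += static_cast<std::int32_t>(v[4 * i + j]) *
                static_cast<std::int32_t>(r[j]);
    }
  }
}

// Illustrative single-stream sum of n bytes (n must be a multiple of the
// vector size; a real kernel would need tail handling).
inline std::int32_t sum_u8(const std::uint8_t* data, std::size_t n) {
  assert(n % kVectorBytes == 0);
  const ByteQuad ones = {1, 1, 1, 1};  // assumed contents of r9
  WordVector acc{};
  for (std::size_t i = 0; i < n; i += kVectorBytes) {
    vrmpy_acc(acc, data + i, ones);
  }
  // Final horizontal reduction of the 32 lanes.
  std::int32_t total = 0;
  for (std::int32_t w : acc) total += w;
  return total;
}
```

With a multiplier of all ones, the widening from 8-bit inputs to 32-bit accumulators happens inside the multiply-accumulate itself, so no separate conversion step is needed. The loop in the listing issues four such steps per iteration, each with its own load pointer and accumulator register.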

copybara-service[bot] force-pushed the test_866681443 branch 2 times, most recently from 1eea957 to 7f2b311, on February 24, 2026 02:09

PiperOrigin-RevId: 874380564
copybara-service[bot] merged commit c28ce78 into master on Feb 24, 2026
1 check passed
copybara-service[bot] deleted the test_866681443 branch on February 24, 2026 06:27