@tenpercent tenpercent commented Jan 16, 2026

Summary

  • Replace lambdas with named functors in container_concat
  • Add make_uniform_tuple helper for repeated value patterns
  • Reduces container_concat instantiations from 186 to 93 (50% reduction)

Test Plan

  • Waiting for full CI

PR Stack

| # | PR    | Description                                        |
|---|-------|----------------------------------------------------|
| 1 | #3585 | sequence_gen with __make_integer_seq               |
| 2 | #3588 | generate_identity_sequences helper                 |
| 3 | #3589 | Named functors in transform_tensor_descriptor      |
| 4 | #3590 | container_concat optimization                      |
| 5 | #3596 | O(1) pack expansion rewrites                       |
| 6 | #3600 | TensorDescriptor/TensorAdaptor lambda elimination  |

Tracking issue: #3575

Add make_uniform_tuple<N>(value) helper to replace common pattern:
  generate_tuple([&](auto) { return value; }, Number<N>{})

This avoids unique lambda instantiations when creating tuples with
repeated values. Applied to device_grouped_conv_fwd_multiple_abd.
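
A minimal sketch of the idea behind make_uniform_tuple, written against std::tuple rather than the library's own tuple type so that it compiles on its own. The `sketch` namespace and the `make_uniform_tuple_impl` helper are illustrative assumptions, not the code in this PR:

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

namespace sketch {

// Expands `value` once per index in the pack; the index itself is discarded.
template <typename T, std::size_t... Is>
constexpr auto make_uniform_tuple_impl(const T& value, std::index_sequence<Is...>)
{
    return std::make_tuple(((void)Is, value)...);
}

// Builds a tuple of N copies of `value` without taking a callable,
// so no per-call-site closure type is involved.
template <std::size_t N, typename T>
constexpr auto make_uniform_tuple(const T& value)
{
    return make_uniform_tuple_impl(value, std::make_index_sequence<N>{});
}

} // namespace sketch

// Compile-time checks of the sketch.
static_assert(std::tuple_size_v<decltype(sketch::make_uniform_tuple<3>(7))> == 3);
static_assert(std::get<2>(sketch::make_uniform_tuple<3>(7)) == 7);
```

Because the helper is parameterized only on N and the value type, two call sites that request the same N and type would reuse one instantiation, whereas a generator lambda would give each call site its own.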

Replace O(N) recursive container_reduce with O(1) fold expression
for computing products of container elements. This reduces template
instantiation depth from 26 to 23 levels.

- Add container_product() using unpack + fold expression
- Migrate 10 call sites from container_reduce(x, multiplies{}, 1) to container_product()
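
The unpack + fold approach can be sketched roughly as follows, using std::tuple and std::apply as stand-ins for the library's container and unpack utilities. The names `product_of` and the `sketch` namespace are assumptions made for illustration, and the actual container_product may differ:

```cpp
#include <tuple>

namespace sketch {

// Named functor: multiplies all arguments with a single fold expression.
// The trailing `* 1` makes the empty pack well-formed (product is 1).
struct product_of
{
    template <typename... Ts>
    constexpr auto operator()(const Ts&... xs) const
    {
        return (xs * ... * 1);
    }
};

// Unpacks the tuple elements into the functor in one step, instead of
// peeling off one element per recursive template instantiation.
template <typename Tuple>
constexpr auto container_product(const Tuple& t)
{
    return std::apply(product_of{}, t);
}

} // namespace sketch

// Compile-time checks of the sketch.
static_assert(sketch::container_product(std::make_tuple(2, 3, 4)) == 24);
static_assert(sketch::container_product(std::make_tuple()) == 1);
```

The fold keeps the nesting of template instantiations independent of the number of elements, which is the sense in which the commit message calls it O(1).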

Lambdas create unique types per call site, causing duplicate template
instantiations. Named functors are shared across call sites.

Results:
- container_concat: 186 → 93 instantiations (50% reduction)
- Wall-clock: 518ms → 309ms (40% reduction)
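
The difference between lambdas and named functors can be checked directly in standard C++. In the sketch below, `instantiation` is a hypothetical stand-in for any template parameterized on a callable type, and `identity_fn` is an illustrative functor; neither is taken from this PR:

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical stand-in for a template that takes a callable type.
template <typename F>
struct instantiation
{
};

// Two textually identical lambdas, as if written at two call sites.
// Each lambda expression has its own distinct closure type.
constexpr auto lambda_site_one = [](auto x) { return x; };
constexpr auto lambda_site_two = [](auto x) { return x; };

// A named functor with the same behavior; its type is the same everywhere.
struct identity_fn
{
    template <typename T>
    constexpr T operator()(T x) const
    {
        return x;
    }
};

// The lambdas produce two different specializations of `instantiation`...
constexpr bool lambdas_share_instantiation =
    std::is_same_v<instantiation<decltype(lambda_site_one)>,
                   instantiation<decltype(lambda_site_two)>>;

// ...while every use of the named functor maps to the same specialization.
constexpr bool functors_share_instantiation =
    std::is_same_v<instantiation<identity_fn>, instantiation<identity_fn>>;

static_assert(!lambdas_share_instantiation);
static_assert(functors_share_instantiation);

} // namespace sketch
```

This is consistent with the halving reported above if each lambda-based call site previously contributed its own specialization, though the exact breakdown of the 186 instantiations is not given here.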
@tenpercent tenpercent force-pushed the mpodkory/transform-tensor-descriptor-optimization branch from b26ed88 to 00849ac Compare January 17, 2026 03:51
@tenpercent tenpercent force-pushed the mpodkory/generate-tuple-optimizations branch from 887bdf2 to 02e42dc Compare January 17, 2026 03:51