Skip to content

Conversation

@Brijesh-Thakkar
Copy link

Which issue does this PR close?

This PR contributes to the performance epic tracking expressions that are
slower with Comet enabled:

Rationale for this change

Benchmarks in the Comet performance epic show that the bit_length string
expression is slower when executed via Comet. The existing implementation
relies on the generic Arrow bit_length kernel, which introduces additional
overhead for common string array types.

This change specializes the implementation for StringArray and
LargeStringArray to reduce per-row overhead while preserving existing
behavior for all other array types.

What changes are included in this PR?

  • Added a specialized implementation of bit_length for StringArray and
    LargeStringArray
  • Avoided the generic Arrow length kernel for these array types
  • Retained the existing Arrow kernel as a fallback for other array types
    (e.g. Utf8View, Dictionary, Binary)

Are these changes tested?

Yes. Existing unit tests and SQL logic tests already cover bit_length
behavior across string types, including UTF-8, LargeUtf8, Utf8View, and
dictionary-encoded strings. All existing tests pass without modification.

Are there any user-facing changes?

No. This change is an internal performance optimization and does not affect
user-facing behavior or semantics.

Copilot AI review requested due to automatic review settings January 1, 2026 23:15
@github-actions github-actions bot added the functions Changes to functions implementation label Jan 1, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the bit_length function for string arrays by adding specialized implementations for StringArray and LargeStringArray that avoid the overhead of the generic Arrow kernel. The optimization aims to improve performance when executing bit_length via Apache Comet.

Key Changes:

  • Added specialized bit_length implementations for StringArray and LargeStringArray using manual iteration
  • Preserved fallback to Arrow's bit_length kernel for other array types (Utf8View, Dictionary, Binary)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +110 to +120
} else if let Some(arr) = v.as_any().downcast_ref::<LargeStringArray>() {
let mut builder = Int32Builder::with_capacity(arr.len());
for i in 0..arr.len() {
if arr.is_null(i) {
builder.append_null();
} else {
let byte_len = arr.value(i).as_bytes().len();
builder.append_value((byte_len * 8) as i32);
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
Copy link

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type for LargeStringArray should be Int64, not Int32. The return_type method returns Int64 for LargeUtf8 (via utf8_to_int_type), and the scalar handling for LargeUtf8 also returns Int64 (line 130-132). This implementation should use Int64Builder to maintain consistency with the declared return type and scalar handling.

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +120
if let Some(arr) = v.as_any().downcast_ref::<StringArray>() {
let mut builder = Int32Builder::with_capacity(arr.len());
for i in 0..arr.len() {
if arr.is_null(i) {
builder.append_null();
} else {
let byte_len = arr.value(i).as_bytes().len();
builder.append_value((byte_len * 8) as i32);
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
} else if let Some(arr) = v.as_any().downcast_ref::<LargeStringArray>() {
let mut builder = Int32Builder::with_capacity(arr.len());
for i in 0..arr.len() {
if arr.is_null(i) {
builder.append_null();
} else {
let byte_len = arr.value(i).as_bytes().len();
builder.append_value((byte_len * 8) as i32);
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
Copy link

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is significant code duplication between the StringArray and LargeStringArray implementations. Consider extracting the common logic into a generic helper function or using a macro to reduce duplication and improve maintainability. The only differences are the array types being downcast and the builder types used.

Copilot uses AI. Check for mistakes.
Comment on lines +105 to +106
let byte_len = arr.value(i).as_bytes().len();
builder.append_value((byte_len * 8) as i32);
Copy link

Copilot AI Jan 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast (byte_len * 8) as i32 could overflow if the string is longer than i32::MAX / 8 bytes (approximately 268 MB). Consider adding overflow checking or using checked_mul and returning an error for excessively large strings to prevent silent overflow.

Copilot uses AI. Check for mistakes.
@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 2, 2026

The existing implementation relies on the generic Arrow bit_length kernel, which introduces additional
overhead for common string array types.

I'm a bit curious about this claim, would be interested to see some benchmarks here 🤔

@Brijesh-Thakkar
Copy link
Author

The existing implementation relies on the generic Arrow bit_length kernel, which introduces additional
overhead for common string array types.

I'm a bit curious about this claim, would be interested to see some benchmarks here 🤔

sure, i will add benchmarks for this also

Array, StringArray, LargeStringArray, Int32Builder,
};
use std::sync::Arc;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

if arr.is_null(i) {
builder.append_null();
} else {
let byte_len = arr.value(i).as_bytes().len();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is .as_bytes() needed here ? &str::len() returns the same

@Jefffrey Jefffrey marked this pull request as draft January 3, 2026 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants