-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Unify Spark TrySum implementation with DataFusion Sum implementation #19593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| DataType::Int64 => Ok(DataType::Int64), | ||
| DataType::UInt64 => Ok(DataType::UInt64), | ||
| DataType::Float64 => Ok(DataType::Float64), | ||
| DataType::Null => Ok(DataType::Float64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark try_sum had test case where null inputs will return a null as double, so adding that here
| if args.expr_fields[0].data_type() == &DataType::Null { | ||
| return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None)))); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
| // See COMMENTS.md to understand why nullable is set to true | ||
| Field::new_list_field(args.return_type().clone(), true), | ||
| false, | ||
| match (args.is_distinct, self.try_sum_mode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit ugly here, would be nice if we manage to make accumulators own their state fields, see: #14701 (comment)
| // Can overflow into null | ||
| if self.try_sum_mode { | ||
| return SetMonotonicity::NotMonotonic; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain on this 🤔
| Err(e) => { | ||
| return Err(e.into()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically these errs are unreachable as I think only arithmeticoverflow error variant is returned for add_checked but keeping this in case
| // Only difference from TrySumAccumulator is that it verifies the resulting sum | ||
| // can fit within the decimals precision; if Rust had specialization we could unify | ||
| // the two types (╥﹏╥) | ||
| #[derive(Debug)] | ||
| struct TrySumDecimalAccumulator<T: DecimalType + std::fmt::Debug> { | ||
| inner: TrySumAccumulator<T>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to any ideas to unify these structs
| /// Thin wrapper over DataFusion native [`Sum`] which is configurable into a try | ||
| /// sum mode to return `null` on overflows. We need this thin wrapper to provide | ||
| /// the `try_sum` named function for use in Spark. | ||
| #[derive(PartialEq, Eq, Hash, Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if we had a simpler way to do these types of thin wrappers 🤔
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests seemed to already be covered by SLTs so removed them
Which issue does this PR close?
Rationale for this change
Refactor try_sum to reuse existing sum code from native Datafusion version.
What changes are included in this PR?
Add new mode to Sum which allows toggling into try_sum mode, affecting which accumulator is used.
Add new try_sum accumulator for sum which returns null if sum at any point overflows.
Are these changes tested?
Existing tests.
Are there any user-facing changes?
No