[SPARK-55787][SQL] Add is_struct_empty and is_struct_non_empty built-in functions #55314
Open
Kino1994 wants to merge 1 commit into apache:master from
Add two new Catalyst expressions for efficiently detecting structs where
all fields are null, without resorting to serialization (e.g. to_json).
This addresses a common pain point in Structured Streaming pipelines
using Kafka + Avro (via ABRiS), where PermissiveRecordExceptionHandler
produces non-null structs with all-null fields on deserialization failure.
The current workaround `to_json(col) =!= "{}"` forces full JSON
serialization per row. The new functions operate directly on the
InternalRow null bitmap (a single bitwise AND per field on UnsafeRow),
achieving zero allocations and short-circuit evaluation.
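As a rough illustration of the null-bitmap check described above, the following plain-Scala sketch models a row's null bitmap as an array of 64-bit words with one bit per field, in the style of UnsafeRow. The names NullBitmapModel, isNullAt, and allFieldsNull are hypothetical and this is not the code in the PR; it only shows why such a check needs no allocation and can stop at the first non-null field.

```scala
// Illustrative model only; not the PR's implementation.
object NullBitmapModel {
  // Bit i of the bitmap is set when field i is null (64 fields per word).
  def isNullAt(bitmap: Array[Long], ordinal: Int): Boolean =
    (bitmap(ordinal >> 6) & (1L << (ordinal & 63))) != 0L

  // True when each of the first numFields fields is null.
  // Returns as soon as a non-null field is found (short circuit).
  def allFieldsNull(bitmap: Array[Long], numFields: Int): Boolean = {
    var i = 0
    while (i < numFields) {
      if (!isNullAt(bitmap, i)) return false
      i += 1
    }
    true
  }
}
```

Each field costs one shift, one AND, and one comparison, and no intermediate object such as a JSON string is created.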
Semantics:
- is_struct_empty(NULL) -> NULL
- is_struct_empty(struct(null, null)) -> TRUE
- is_struct_empty(struct(1, null)) -> FALSE
- is_struct_non_empty is the logical complement
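The semantics listed above can be modeled without Spark. The sketch below is a hedged illustration, not the PR's code: it assumes a struct is represented as Option[Seq[Option[Any]]], where an outer None stands for a NULL struct and an inner None for a NULL field. The object name StructEmptinessModel and both method names are hypothetical.

```scala
// Illustrative model of the documented semantics; names are hypothetical.
object StructEmptinessModel {
  // None is a NULL struct; each inner None is a NULL field.
  type Struct = Option[Seq[Option[Any]]]

  // NULL struct -> None (SQL NULL); otherwise Some(all fields are null).
  def isStructEmpty(s: Struct): Option[Boolean] =
    s.map(fields => fields.forall(_.isEmpty))

  // Logical complement; a NULL input still yields NULL.
  def isStructNonEmpty(s: Struct): Option[Boolean] =
    isStructEmpty(s).map(!_)

  def main(args: Array[String]): Unit = {
    println(isStructEmpty(None))                        // None
    println(isStructEmpty(Some(Seq(None, None))))       // Some(true)
    println(isStructEmpty(Some(Seq(Some(1), None))))    // Some(false)
    println(isStructNonEmpty(Some(Seq(Some(1), None)))) // Some(true)
  }
}
```

Modeling the result as Option[Boolean] makes the three-valued behavior explicit: the complement flips TRUE and FALSE but leaves NULL untouched.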
Implementation details:
- Whole-stage codegen with two strategies: unrolled AND/OR chain for
narrow structs (<=8 fields), loop with break for wide structs
- Type checking via ExpectsInputTypes with StructType
- Shallow check only (nested non-null structs count as non-null)
- nullIntolerant=true for optimizer IsNotNull inference
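To make the two codegen strategies concrete, here is a toy generator that emits Java-like source text for the "all fields null" test. It is a sketch under stated assumptions, not the PR's codegen: the object NullCheckCodegenSketch, the method genAllNullCheck, and the variable name allNull are hypothetical, the zero-field branch is an assumption (vacuous truth), and only the 8-field threshold and the isNullAt accessor come from the description and from Spark's row interface. The non-empty variant would use an OR chain over negated checks instead.

```scala
// Toy generator for illustration; not the PR's actual codegen.
object NullCheckCodegenSketch {
  // From the description: narrow structs have at most 8 fields.
  val UnrollThreshold = 8

  // Returns Java-like source that computes `allNull` for the struct in `row`.
  def genAllNullCheck(row: String, numFields: Int): String =
    if (numFields == 0) {
      // Assumption: a zero-field struct is treated as vacuously all-null.
      "boolean allNull = true;"
    } else if (numFields <= UnrollThreshold) {
      // Narrow struct: one unrolled && chain, which short-circuits in Java.
      val chain = (0 until numFields).map(i => s"$row.isNullAt($i)").mkString(" && ")
      s"boolean allNull = $chain;"
    } else {
      // Wide struct: a loop that breaks at the first non-null field.
      s"""boolean allNull = true;
         |for (int i = 0; i < $numFields; i++) {
         |  if (!$row.isNullAt(i)) { allNull = false; break; }
         |}""".stripMargin
    }
}
```

Unrolling keeps the generated code branch-light for small structs, while the loop keeps the generated method size bounded for wide ones.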
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What changes were proposed in this pull request?
This PR introduces two new built-in SQL functions for detecting structs where all fields are null:
- is_struct_empty(struct): Returns true if the struct is non-null and all of its fields are null. Returns null if the struct itself is null.
- is_struct_non_empty(struct): Returns true if the struct is non-null and at least one field is non-null. Returns null if the struct itself is null.
Both functions perform a shallow check only: nested structs that are themselves non-null (even if all their children are null) count as non-null at the parent level.
The implementation includes:
- Two Catalyst expressions (IsStructEmpty, IsStructNonEmpty) with full codegen support in complexTypeCreator.scala, including an unrolled AND/OR chain for narrow structs (≤8 fields) and a loop with early exit for wider structs.
- Registration of both functions in FunctionRegistry.
- Corresponding entries in functions.scala.
Why are the changes needed?
In Structured Streaming pipelines with Kafka + Avro deserialization, permissive failure handlers (e.g. PermissiveRecordExceptionHandler) replace malformed records with structs that are non-null but have all fields set to null. These "empty" structs can propagate downstream and cause runtime failures in sinks like Apache Kudu when struct fields map to primary keys that cannot be NULL.
Currently the only workaround is to_json(col) != "{}", which forces full JSON serialization of every row.
The proposed functions replace this pattern with an efficient, schema-agnostic null check that requires zero serialization, supports short-circuit evaluation, and participates fully in whole-stage codegen.
Does this PR introduce any user-facing change?
Yes. Two new built-in SQL functions are added: is_struct_empty and is_struct_non_empty.
Previously, users had to rely on to_json, e.g. to_json(col) != "{}".
How was this patch tested?
- Tests in ComplexTypeSuite covering: non-null structs, all-null structs, null struct input, mixed nulls, single-field structs, zero-field (empty schema) structs, nested structs, structs with array/map fields, type validation, and nullable semantics.
- Tests in DataFrameComplexTypeSuite simulating the Avro permissive handler use case end-to-end (both DataFrame API and SQL syntax), including an equivalence check against the to_json workaround.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.6)