[SPARK-55444][SQL] Types Framework - Phase 3a - Storage Formats (Parquet) #55326
Draft

davidm-db wants to merge 1 commit into apache:master
### What changes were proposed in this pull request?
This PR implements Phase 3a (Storage Formats - Parquet) of the Spark Types Framework (SPARK-55444, parent: SPARK-53504). It adds a new optional `ParquetTypeOps` trait that enables framework-managed types to participate in all Parquet read/write/filter paths with zero per-type changes to Parquet infrastructure files.

**New trait:** `ParquetTypeOps` in `sql/core` (package `o.a.s.sql.execution.datasources.parquet.types.ops`), following the Phase 1c pattern (`ConnectArrowTypeOps`: separate module, separate factory). The trait covers all of these paths.

**Reference implementation:** `TimeTypeParquetOps` validates all paths (schema, write, row-read, vectorized-read, filter pushdown, type gates).

**Dispatch pattern:** framework first at all 9 integration sites, with the entire original code extracted to `*Default` methods as fallback: the same `Ops(dt).map(_.method).getOrElse(methodDefault(dt))` pattern established in Phase 1a (PR #54223) and Phase 1c (PR #54905). TimeType is always tested through the framework path when the feature flag is ON.

**Integration sites (9 existing files modified):**

- `ParquetSchemaConverter`: framework-first field conversion (original code in `convertFieldDefault`) + read-path reverse lookup
- `ParquetWriteSupport`: shared write helpers (`consumeGroup`, `writeFields`) + framework-first `makeWriter`
- `ParquetRowConverter`: `newConverter` with method overloading (simple for primitive, extended for struct-backed)
- `ParquetVectorUpdaterFactory` (Java): `getUpdater` via Java-friendly `getVectorUpdaterOrNull`
- `VectorizedColumnReader` (Java): `isLazyDecodingSupported` via Java-friendly `isLazyDecodingSupportedFor`
- `ParquetFileFormat`: `supportDataType`
- `ParquetUtils`: `isBatchReadSupported`
- `ParquetFilters`: `FrameworkFilterOps` custom extractor + `orElse` on 7 PartialFunctions + framework-first `valueCanMakeFilterOn`
- `ParquetReadSupport`: `clipParquetType` for struct-backed types via `parquetStructSchema`

**Design decisions:**

- `ParquetTypeOps` is a separate trait in `sql/core` (not on `TypeOps` in `sql/catalyst`) because Parquet types (`RecordConsumer`, `ParquetVectorUpdater`, etc.) live in `sql/core` and catalyst cannot reference them.
- `rowRepresentationType` (Phase 1b) is NOT used for Parquet; it is scoped to row infrastructure only. Using it would erase type identity in Parquet value paths, create dispatch asymmetry between struct-backed and primitive types, and extend it beyond its designed scope.
- `parquetStructSchema` is independent of `PhysicalDataType`: the Parquet storage representation may differ from the internal row representation.
- `recordConsumer` is passed as `() => RecordConsumer` (a lazy supplier) because `makeWriter` is called during `init()`, when `recordConsumer` is still null (it is set later in `prepareForWrite()`).
- The shared write helpers (`consumeGroup`, `writeFields`) are extracted to the `ParquetWriteSupport` companion as `private[parquet]` static methods, shared by both the existing infrastructure and the framework ops.
- `FrameworkFilterOps` is a custom extractor (single lookup) inside `ParquetFilters` because `ParquetSchemaType` is a private inner class that cannot be referenced from outside.

### Why are the changes needed?
Adding a new data type to Spark currently requires modifying 8+ Parquet files with scattered, type-specific logic. This PR enables the framework to handle all Parquet concerns for new types: a new type implements `ParquetTypeOps` and registers in the companion's `apply()`, and all 9 Parquet infrastructure files dispatch through it automatically.

This is the first storage format integration (Phase 3a). TimeType serves as the reference implementation validating all paths in OSS.
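To make the shape of that contract concrete, here is a hedged sketch of the pattern described above: a type implements the ops trait, registers in the companion's `apply()`, and an integration site dispatches framework-first with the original code as the `*Default` fallback. Every name and signature below is a simplified stand-in, not the PR's actual code; the `() => RecordConsumer` supplier mirrors the lazy-initialization design decision noted earlier.

```scala
// Simplified, self-contained sketch; not Spark's actual classes.
object ParquetOpsSketch {
  // Stand-ins for Spark / Parquet types.
  sealed trait DataType
  case object TimeType extends DataType
  case object IntegerType extends DataType
  type RecordConsumer = StringBuilder // stand-in for Parquet's RecordConsumer

  // The optional per-type ops trait (illustrative subset of methods).
  trait ParquetTypeOps {
    def convertField: String
    // Lazy supplier: makeWriter is called during init(), before the real
    // RecordConsumer exists (it is only set in prepareForWrite()).
    def makeWriter(recordConsumer: () => RecordConsumer): Long => Unit
  }

  // Reference implementation for the new type.
  object TimeTypeParquetOps extends ParquetTypeOps {
    def convertField: String = "INT64 (TIME)"
    def makeWriter(recordConsumer: () => RecordConsumer): Long => Unit =
      micros => { recordConsumer().append(micros).append(';'); () }
  }

  // Companion-style registry: a new type registers itself here. None means
  // "not framework-managed", so callers fall through to the default path.
  object ParquetTypeOps {
    def apply(dt: DataType): Option[ParquetTypeOps] = dt match {
      case TimeType => Some(TimeTypeParquetOps)
      case _        => None
    }
  }

  // One integration site: framework first, with the entire original body
  // extracted into the *Default method as the fallback.
  def convertField(dt: DataType): String =
    ParquetTypeOps(dt).map(_.convertField).getOrElse(convertFieldDefault(dt))

  private def convertFieldDefault(dt: DataType): String = dt match {
    case IntegerType => "INT32"
    case other       => sys.error(s"unsupported: $other")
  }
}
```

With this shape, adding another type touches only its own ops object and the registry; the integration sites themselves stay unchanged.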
### Does this PR introduce any user-facing change?
No. This is a refactoring that adds framework dispatch to Parquet infrastructure. When the framework flag (`spark.sql.types.framework.enabled`) is ON (default in tests), TimeType's Parquet handling goes through the framework. When OFF, the original `*Default` code paths execute unchanged. Behavior is identical in both cases.

### How was this patch tested?
All existing Parquet test suites pass with the framework enabled (default in tests):

- `ParquetSchemaSuite`: 131 tests passed
- `ParquetIOSuite`: 88 tests passed (including "Read TimeType for the logical TIME type")
- `ParquetVectorizedSuite`: 25 tests passed
- `ParquetV1FilterSuite` + `ParquetV2FilterSuite`: 101 tests passed (including "SPARK-51687: filter pushdown - time")
- Framework ON/OFF equivalence: all tests pass identically with `spark.sql.types.framework.enabled = true` and `false`.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.6)