[SPARK-55444][SQL] Types Framework - Phase 3a - Storage Formats (Parquet)#55326

Draft
davidm-db wants to merge 1 commit into apache:master from davidm-db:davidm-db/types_framework_3a

Conversation


@davidm-db davidm-db commented Apr 13, 2026

What changes were proposed in this pull request?

This PR implements Phase 3a (Storage Formats - Parquet) of the Spark Types Framework (SPARK-55444, parent: SPARK-53504). It adds a new optional ParquetTypeOps trait that enables framework-managed types to participate in all Parquet read/write/filter paths with zero per-type changes to Parquet infrastructure files.

New trait: ParquetTypeOps in sql/core (package o.a.s.sql.execution.datasources.parquet.types.ops) following the Phase 1c pattern (ConnectArrowTypeOps — separate module, separate factory). The trait covers:

  • Schema conversion (Spark DataType ↔ Parquet schema type)
  • Value write (writing values to Parquet RecordConsumer)
  • Row-based read (creating Parquet converters for reading into InternalRow)
  • Vectorized read (batch updaters for columnar reading)
  • Filter pushdown (Parquet filter predicates for predicate pushdown)
  • Type gates (supportDataType, isBatchReadSupported)
  • Schema clipping (column pruning for struct-backed types)

Reference implementation: TimeTypeParquetOps validates all paths (schema, write, row-read, vectorized-read, filter pushdown, type gates).

Dispatch pattern: Framework FIRST at all 9 integration sites, with the entire original code extracted to *Default methods as fallback — the same Ops(dt).map(_.method).getOrElse(methodDefault(dt)) pattern established in Phase 1a (PR #54223) and Phase 1c (PR #54905). TimeType is always tested through the framework path when the feature flag is ON.
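The dispatch pattern can be sketched as follows. The simplified DataType hierarchy and the string-returning convertField are illustrative stand-ins for the real signatures, but the Ops(dt).map(_.method).getOrElse(methodDefault(dt)) shape is the one named above:

```scala
// Minimal, self-contained sketch of the framework-first dispatch pattern:
// the registry is consulted FIRST, and the entire original logic survives
// as the *Default fallback.
object DispatchSketch {
  sealed trait DataType
  case object TimeType extends DataType
  case object StringType extends DataType

  trait ParquetTypeOps { def convertField: String }

  object ParquetTypeOps {
    // Registry lookup: Some(ops) for framework-managed types, None otherwise.
    def apply(dt: DataType): Option[ParquetTypeOps] = dt match {
      case TimeType => Some(new ParquetTypeOps {
        def convertField: String = "INT64 (TIME(MICROS))"
      })
      case _ => None
    }
  }

  // Framework first; falls through to the extracted default on None.
  def convertField(dt: DataType): String =
    ParquetTypeOps(dt).map(_.convertField).getOrElse(convertFieldDefault(dt))

  // The original per-type logic, extracted verbatim as the fallback.
  def convertFieldDefault(dt: DataType): String = dt match {
    case StringType => "BINARY (UTF8)"
    case other => s"<unhandled: $other>"
  }
}
```

Because the fallback is the untouched original code, turning the framework off (or looking up an unmanaged type) reproduces the pre-existing behavior exactly.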

Integration sites (9 existing files modified):

  • ParquetSchemaConverter: write-path schema (framework-first + convertFieldDefault) + read-path reverse lookup
  • ParquetWriteSupport: companion utility extraction (consumeGroup, writeFields) + framework-first makeWriter
  • ParquetRowConverter: framework-first newConverter with method overloading (simple for primitive, extended for struct-backed)
  • ParquetVectorUpdaterFactory (Java): framework-first getUpdater via Java-friendly getVectorUpdaterOrNull
  • VectorizedColumnReader (Java): framework-first isLazyDecodingSupported via Java-friendly isLazyDecodingSupportedFor
  • ParquetFileFormat: framework-first supportDataType
  • ParquetUtils: framework-first isBatchReadSupported
  • ParquetFilters: FrameworkFilterOps custom extractor + orElse on 7 PartialFunctions + framework-first valueCanMakeFilterOn
  • ParquetReadSupport: framework-first clipParquetType for struct-backed types via parquetStructSchema
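The ParquetFilters entry combines two Scala idioms: a custom extractor that performs a single registry lookup, and PartialFunction composition via orElse so the framework case is tried first. A minimal sketch, with all types and predicate strings as simplified stand-ins (the real FrameworkFilterOps operates on ParquetFilters' private ParquetSchemaType):

```scala
// Sketch of the ParquetFilters dispatch: a custom extractor does one
// registry lookup, and the framework PartialFunction is composed in
// front of the existing one with orElse.
object FilterSketch {
  sealed trait DataType
  case object TimeType extends DataType
  case object IntegerType extends DataType

  trait FilterOps { def makeEq(value: Any): String }

  // Custom extractor: matches only when the registry has ops for the type.
  object FrameworkFilterOps {
    def unapply(dt: DataType): Option[FilterOps] = dt match {
      case TimeType => Some(new FilterOps {
        def makeEq(value: Any): String = s"eq(time_micros, $value)"
      })
      case _ => None
    }
  }

  // Framework case first, mirroring the orElse composition on the PR's
  // 7 PartialFunctions.
  val makeEqFramework: PartialFunction[(DataType, Any), String] = {
    case (FrameworkFilterOps(ops), v) => ops.makeEq(v)
  }
  val makeEqDefault: PartialFunction[(DataType, Any), String] = {
    case (IntegerType, v) => s"eq(int32, $v)"
  }
  val makeEq: PartialFunction[(DataType, Any), String] =
    makeEqFramework.orElse(makeEqDefault)
}
```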

Design decisions:

  • ParquetTypeOps is a separate trait in sql/core (not on TypeOps in sql/catalyst) because Parquet types (RecordConsumer, ParquetVectorUpdater, etc.) live in sql/core and catalyst cannot reference them.
  • rowRepresentationType (Phase 1b) is NOT used for Parquet — it is scoped to row infrastructure only. Using it would erase type identity in Parquet value paths, create dispatch asymmetry between struct-backed and primitive types, and extend it beyond its designed scope.
  • parquetStructSchema is independent of PhysicalDataType — Parquet storage representation may differ from internal row representation.
  • recordConsumer is passed as () => RecordConsumer (lazy supplier) because makeWriter is called during init() when recordConsumer is still null (set later in prepareForWrite()).
  • Write utilities (consumeGroup, writeFields) are extracted to ParquetWriteSupport companion as private[parquet] static methods — shared by both existing infrastructure and framework ops.
  • Filter dispatch uses a FrameworkFilterOps custom extractor (single lookup) inside ParquetFilters because ParquetSchemaType is a private inner class that cannot be referenced from outside.
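The lazy-supplier decision can be illustrated in isolation: makeWriter runs during init(), before recordConsumer is assigned, so it must capture a () => RecordConsumer and defer the dereference to write time. RecordConsumer here is a local recording stub, not the real Parquet class:

```scala
// Why recordConsumer is passed as () => RecordConsumer: the writer closure
// is built while the field is still null, so it captures a supplier and
// dereferences it lazily at write time.
object WriterSketch {
  class RecordConsumer {
    val written = scala.collection.mutable.ListBuffer.empty[Long]
    def addLong(v: Long): Unit = written += v
  }

  class WriteSupport {
    private var recordConsumer: RecordConsumer = _

    // Called during init(): recordConsumer is still null at this point.
    def makeWriter(): Long => Unit = {
      val consumer: () => RecordConsumer = () => recordConsumer
      v => consumer().addLong(v) // dereferenced lazily, at write time
    }

    // Called later; only now does the supplier yield a real consumer.
    def prepareForWrite(): RecordConsumer = {
      recordConsumer = new RecordConsumer
      recordConsumer
    }
  }
}
```

Capturing the field value directly in makeWriter() would bake in null and fail on the first write; the supplier makes the init-time/write-time ordering safe.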

Why are the changes needed?

Adding a new data type to Spark currently requires modifying 8+ Parquet files with scattered, type-specific logic. This PR enables the framework to handle all Parquet concerns for new types — a new type implements ParquetTypeOps and registers in the companion's apply(), and all 9 Parquet infrastructure files dispatch through it automatically.
This is the first storage format integration (Phase 3a). TimeType serves as the reference implementation validating all paths in OSS.

Does this PR introduce any user-facing change?

No. This is a refactoring that adds framework dispatch to Parquet infrastructure. When the framework flag (spark.sql.types.framework.enabled) is ON (default in tests), TimeType's Parquet handling goes through the framework. When OFF, the original *Default code paths execute unchanged. Behavior is identical in both cases.

How was this patch tested?

All existing Parquet test suites pass with the framework enabled (default in tests):

  • ParquetSchemaSuite: 131 tests passed
  • ParquetIOSuite: 88 tests passed (including "Read TimeType for the logical TIME type")
  • ParquetVectorizedSuite: 25 tests passed
  • ParquetV1FilterSuite + ParquetV2FilterSuite: 101 tests passed (including "SPARK-51687: filter pushdown - time")
  • Framework ON/OFF equivalence: all tests pass identically with spark.sql.types.framework.enabled = true and false.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

davidm-db force-pushed the davidm-db/types_framework_3a branch from 2d1d82b to e788aef on April 13, 2026 17:39
davidm-db force-pushed the davidm-db/types_framework_3a branch from e788aef to eb12469 on April 13, 2026 18:40
