[SPARK-55444][SQL] Types Framework - Phase 3a - Storage Formats (Parquet)#55326

Draft
davidm-db wants to merge 1 commit into apache:master from davidm-db:davidm-db/types_framework_3a

Conversation


@davidm-db davidm-db commented Apr 13, 2026

What changes were proposed in this pull request?

This PR implements Phase 3a (Storage Formats - Parquet) of the Spark Types Framework (SPARK-55444, parent: SPARK-53504). It adds a new optional ParquetTypeOps trait that enables framework-managed types to participate in all Parquet read/write/filter paths with zero per-type changes to Parquet infrastructure files.

New trait: ParquetTypeOps in sql/core (package o.a.s.sql.execution.datasources.parquet.types.ops) following the Phase 1c pattern (ConnectArrowTypeOps — separate module, separate factory). The trait covers:

  • Schema conversion (Spark DataType ↔ Parquet schema type)
  • Value write (writing values to Parquet RecordConsumer)
  • Row-based read (creating Parquet converters for reading into InternalRow)
  • Vectorized read (batch updaters for columnar reading)
  • Filter pushdown (Parquet filter predicates for predicate pushdown)
  • Type gates (supportDataType, isBatchReadSupported)
  • Schema clipping (column pruning for struct-backed types)

Reference implementation: TimeTypeParquetOps validates all paths (schema, write, row-read, vectorized-read, filter pushdown, type gates).

Dispatch pattern: Framework FIRST at all 9 integration sites, with the entire original code extracted to *Default methods as fallback — the same Ops(dt).map(_.method).getOrElse(methodDefault(dt)) pattern established in Phase 1a (PR #54223) and Phase 1c (PR #54905). TimeType is always tested through the framework path when the feature flag is ON.
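The dispatch pattern can be sketched as follows. The simplified DataType hierarchy and the string-returning convertField are illustrative stand-ins for the real signatures, but the Ops(dt).map(_.method).getOrElse(methodDefault(dt)) shape is the one named above:

```scala
// Minimal, self-contained sketch of the framework-first dispatch pattern:
// the registry is consulted FIRST, and the entire original logic survives
// as the *Default fallback.
object DispatchSketch {
  sealed trait DataType
  case object TimeType extends DataType
  case object StringType extends DataType

  trait ParquetTypeOps { def convertField: String }

  object ParquetTypeOps {
    // Registry lookup: Some(ops) for framework-managed types, None otherwise.
    def apply(dt: DataType): Option[ParquetTypeOps] = dt match {
      case TimeType => Some(new ParquetTypeOps {
        def convertField: String = "INT64 (TIME(MICROS))"
      })
      case _ => None
    }
  }

  // Framework first; falls through to the extracted default on None.
  def convertField(dt: DataType): String =
    ParquetTypeOps(dt).map(_.convertField).getOrElse(convertFieldDefault(dt))

  // The original per-type logic, extracted verbatim as the fallback.
  def convertFieldDefault(dt: DataType): String = dt match {
    case StringType => "BINARY (UTF8)"
    case other => s"<unhandled: $other>"
  }
}
```

Because the fallback is the untouched original code, turning the framework off (or looking up an unmanaged type) reproduces the pre-existing behavior exactly.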

Integration sites (9 existing files modified):

  • ParquetSchemaConverter: write-path schema (framework-first + convertFieldDefault) + read-path reverse lookup
  • ParquetWriteSupport: companion utility extraction (consumeGroup, writeFields) + framework-first makeWriter
  • ParquetRowConverter: framework-first newConverter with method overloading (simple for primitive, extended for struct-backed)
  • ParquetVectorUpdaterFactory (Java): framework-first getUpdater via Java-friendly getVectorUpdaterOrNull
  • VectorizedColumnReader (Java): framework-first isLazyDecodingSupported via Java-friendly isLazyDecodingSupportedFor
  • ParquetFileFormat: framework-first supportDataType
  • ParquetUtils: framework-first isBatchReadSupported
  • ParquetFilters: FrameworkFilterOps custom extractor + orElse on 7 PartialFunctions + framework-first valueCanMakeFilterOn
  • ParquetReadSupport: framework-first clipParquetType for struct-backed types via parquetStructSchema
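The ParquetFilters entry combines two Scala idioms: a custom extractor that performs a single registry lookup, and PartialFunction composition via orElse so the framework case is tried first. A minimal sketch, with all types and predicate strings as simplified stand-ins (the real FrameworkFilterOps operates on ParquetFilters' private ParquetSchemaType):

```scala
// Sketch of the ParquetFilters dispatch: a custom extractor does one
// registry lookup, and the framework PartialFunction is composed in
// front of the existing one with orElse.
object FilterSketch {
  sealed trait DataType
  case object TimeType extends DataType
  case object IntegerType extends DataType

  trait FilterOps { def makeEq(value: Any): String }

  // Custom extractor: matches only when the registry has ops for the type.
  object FrameworkFilterOps {
    def unapply(dt: DataType): Option[FilterOps] = dt match {
      case TimeType => Some(new FilterOps {
        def makeEq(value: Any): String = s"eq(time_micros, $value)"
      })
      case _ => None
    }
  }

  // Framework case first, mirroring the orElse composition on the PR's
  // 7 PartialFunctions.
  val makeEqFramework: PartialFunction[(DataType, Any), String] = {
    case (FrameworkFilterOps(ops), v) => ops.makeEq(v)
  }
  val makeEqDefault: PartialFunction[(DataType, Any), String] = {
    case (IntegerType, v) => s"eq(int32, $v)"
  }
  val makeEq: PartialFunction[(DataType, Any), String] =
    makeEqFramework.orElse(makeEqDefault)
}
```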

Design decisions:

  • ParquetTypeOps is a separate trait in sql/core (not on TypeOps in sql/catalyst) because Parquet types (RecordConsumer, ParquetVectorUpdater, etc.) live in sql/core and catalyst cannot reference them.
  • rowRepresentationType (Phase 1b) is NOT used for Parquet — it is scoped to row infrastructure only. Using it would erase type identity in Parquet value paths, create dispatch asymmetry between struct-backed and primitive types, and extend it beyond its designed scope.
  • parquetStructSchema is independent of PhysicalDataType — Parquet storage representation may differ from internal row representation.
  • recordConsumer is passed as () => RecordConsumer (lazy supplier) because makeWriter is called during init() when recordConsumer is still null (set later in prepareForWrite()).
  • Write utilities (consumeGroup, writeFields) are extracted to ParquetWriteSupport companion as private[parquet] static methods — shared by both existing infrastructure and framework ops.
  • Filter dispatch uses a FrameworkFilterOps custom extractor (single lookup) inside ParquetFilters because ParquetSchemaType is a private inner class that cannot be referenced from outside.
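The lazy-supplier decision can be illustrated in isolation: makeWriter runs during init(), before recordConsumer is assigned, so it must capture a () => RecordConsumer and defer the dereference to write time. RecordConsumer here is a local recording stub, not the real Parquet class:

```scala
// Why recordConsumer is passed as () => RecordConsumer: the writer closure
// is built while the field is still null, so it captures a supplier and
// dereferences it lazily at write time.
object WriterSketch {
  class RecordConsumer {
    val written = scala.collection.mutable.ListBuffer.empty[Long]
    def addLong(v: Long): Unit = written += v
  }

  class WriteSupport {
    private var recordConsumer: RecordConsumer = _

    // Called during init(): recordConsumer is still null at this point.
    def makeWriter(): Long => Unit = {
      val consumer: () => RecordConsumer = () => recordConsumer
      v => consumer().addLong(v) // dereferenced lazily, at write time
    }

    // Called later; only now does the supplier yield a real consumer.
    def prepareForWrite(): RecordConsumer = {
      recordConsumer = new RecordConsumer
      recordConsumer
    }
  }
}
```

Capturing the field value directly in makeWriter() would bake in null and fail on the first write; the supplier makes the init-time/write-time ordering safe.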

Why are the changes needed?

Adding a new data type to Spark currently requires modifying 8+ Parquet files with scattered, type-specific logic. This PR enables the framework to handle all Parquet concerns for new types — a new type implements ParquetTypeOps and registers in the companion's apply(), and all 9 Parquet infrastructure files dispatch through it automatically.
This is the first storage format integration (Phase 3a). TimeType serves as the reference implementation validating all paths in OSS.

Does this PR introduce any user-facing change?

No. This is a refactoring that adds framework dispatch to Parquet infrastructure. When the framework flag (spark.sql.types.framework.enabled) is ON (default in tests), TimeType's Parquet handling goes through the framework. When OFF, the original *Default code paths execute unchanged. Behavior is identical in both cases.

How was this patch tested?

All existing Parquet test suites pass with the framework enabled (default in tests):

  • ParquetSchemaSuite: 131 tests passed
  • ParquetIOSuite: 88 tests passed (including "Read TimeType for the logical TIME type")
  • ParquetVectorizedSuite: 25 tests passed
  • ParquetV1FilterSuite + ParquetV2FilterSuite: 101 tests passed (including "SPARK-51687: filter pushdown - time")
  • Framework ON/OFF equivalence: all tests pass identically with spark.sql.types.framework.enabled = true and false.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

davidm-db force-pushed the davidm-db/types_framework_3a branch from 2d1d82b to e788aef on April 13, 2026 17:39
davidm-db force-pushed the davidm-db/types_framework_3a branch from e788aef to eb12469 on April 13, 2026 18:40
