Spark: Add session-level read split configs#16185

Open
liucao-dd wants to merge 1 commit into apache:main from liucao-dd:spark/split-read-conf-session-config
Conversation

@liucao-dd liucao-dd commented May 1, 2026

Spark Split Planning Session Configs

Summary

This change adds Spark session-level configuration for Iceberg split planning:

  • spark.sql.iceberg.split.target-size
  • spark.sql.iceberg.split.planning-lookback
  • spark.sql.iceberg.split.open-file-cost
  • spark.sql.iceberg.split.adaptive-size.enabled

The goal is to let Spark SQL reads and row-level operations tune split planning without requiring
table property changes or DataFrame read options.
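As a sketch of the intended usage (the table name `db.events` and the values are illustrative, and the config names are as proposed in this PR), the settings would be applied per session:

```sql
-- Tune Iceberg split planning for this session only (byte values are illustrative)
SET spark.sql.iceberg.split.target-size = 268435456;       -- 256 MB target splits
SET spark.sql.iceberg.split.planning-lookback = 10;
SET spark.sql.iceberg.split.open-file-cost = 4194304;      -- 4 MB per-file cost
SET spark.sql.iceberg.split.adaptive-size.enabled = false;

SELECT count(*) FROM db.events;  -- planned with the session-level settings above
```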

Naming

The config names intentionally mirror the existing Iceberg table properties:

  • read.split.target-size
  • read.split.planning-lookback
  • read.split.open-file-cost
  • read.split.adaptive-size.enabled

If reviewers prefer names closer to existing read options, such as spark.sql.iceberg.split-size, I am open to adjusting.

Motivation

Iceberg already supports per-read split planning through DataFrame read options and table
properties. Those are not always practical for SQL-first workloads:

  • SQL statements such as MERGE INTO, UPDATE, and DELETE do not expose DataFrame read options.
  • Changing table properties affects all readers, which is too broad when different jobs need
    different split-planning behavior.
  • Session configs let one Spark application or notebook tune read parallelism without changing code
    or table metadata.

spark.sql.iceberg.split.target-size and spark.sql.iceberg.split.adaptive-size.enabled are the main knobs that applications would commonly tune at the session level; the lookback and open-file-cost configs are included for consistency with the corresponding table properties.
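As a concrete sketch of the SQL-first case this enables (table names are hypothetical), a row-level operation could be tuned without per-read options or table property changes:

```sql
-- Smaller splits to increase read parallelism for a heavy MERGE, scoped to this session
SET spark.sql.iceberg.split.target-size = 67108864;  -- 64 MB

MERGE INTO db.target t
USING db.updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```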

Backports

This PR updates Spark 4.1 first. Backports for Spark 4.0 and 3.5 will follow in separate PRs to keep this change reviewable.

Test Layout

The new TestSparkReadConf file is intended as the read-side counterpart to TestSparkWriteConf.
It covers precedence and accessor behavior that is easiest to validate directly against
SparkReadConf, especially the *Option() accessors used by scan planning.
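Assuming the usual Iceberg config precedence (per-read option over session config over table property; this is an assumption about the intended behavior, not a statement of the final implementation), the interaction the tests cover could look like:

```sql
-- Table property sets a 128 MB default for all readers
ALTER TABLE db.events SET TBLPROPERTIES ('read.split.target-size' = '134217728');

-- Session config overrides the table property for this session only
SET spark.sql.iceberg.split.target-size = 268435456;  -- 256 MB

-- With no per-read option supplied, the session config is expected to win;
-- a DataFrame read option such as option("split-size", ...) would override both.
SELECT * FROM db.events;
```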

Allow SQL paths such as MERGE INTO to tune split planning without per-read options, especially split size and adaptive sizing. The config names mirror the read.split table properties; older Spark version backports will follow separately.
