[GH-2672] Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit by jiayuasu · Pull Request #2673 · apache/sedona

jiayuasu · 2026-02-24T09:31:57Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, this resolves #2672

What changes were proposed in this PR?

This PR adds a new Spark DataSourceV2-based raster data source that can load GeoTIFF files and optionally tile them on read. It replaces the legacy workflow of binaryFile + RS_FromGeoTiff + RS_TileExplode with a simpler one-liner API.

Usage

// Load and tile rasters in one step
val df = spark.read.format("raster")
  .option("retile", "true")
  .option("tileWidth", "256")
  .option("tileHeight", "256")
  .load("/path/to/rasters/")

// Load rasters without tiling
val df = spark.read.format("raster")
  .option("retile", "false")
  .load("/path/to/rasters/")

Key benefits over the legacy binaryFile + RS_TileExplode approach

Bypasses Spark's 2 GB ByteArray limit: With the old binaryFile approach, the entire raster file is loaded as a content column (Array[Byte]). Spark has a hard 2 GB limit on ByteArray values in UnsafeRow, so rasters larger than 2 GB cannot be loaded at all. The new raster data source uses HadoopImageInputStream to stream GeoTIFF data directly from Hadoop FS into GeoTools' GeoTiffReader, never materializing the entire file into a byte array. This removes the 2 GB limitation entirely.
Reduces memory pressure by avoiding double serialization: With binaryFile + RS_TileExplode, the full raster is serialized into Spark's UnsafeRow (via RasterUDT.serialize, which deep-copies all pixel data), then deserialized by RS_TileExplode, then each tile is re-serialized. The new data source tiles rasters directly inside the partition reader - the full raster never enters UnsafeRow, so only small tiles are serialized. This reduces peak memory per task by approximately 50% (e.g., from 64 MB to 32 MB for each 2048x2048 double raster tiled into 256x256 tiles).
Limit and sample pushdown: The data source implements SupportsPushDownLimit and SupportsPushDownTableSample. When retile=false, .limit(N) creates only N input partitions and reads only N files. When retile=true, limit is partially pushed (Spark's LimitExec stops calling next() after N tiles, so remaining tiles are never decoded). The old approach creates partitions for all files and relies on task cancellation.

Large raster benchmark

Tested with a 3.2 GB global nightlight GeoTIFF (86,400 x 33,601 pixels, Float32, 256x256 internal tiles):

Metric	Result
Driver memory	512 MB (JVM heap)
Tile size	256 x 256
Total tiles	44,616 (= ceil(86400/256) x ceil(33601/256))
Time	~95 seconds
Peak memory	Well within 512 MB - no OOM

The old binaryFile approach fails with NegativeArraySizeException on this file because getLen.toInt overflows for files > 2.1 GB.

Files changed

HadoopImageInputStream.java (new, common module): ImageInputStream adapter wrapping Hadoop's FSDataInputStream for GeoTiffReader. Supports (Path, Configuration) and (FSDataInputStream) constructors. Handles S3/cloud partial reads via robust read loop.
HadoopImageInputStreamTest.java (new): 3 tests - sequential read, random seek, and unstable stream (simulates S3 partial reads).
TileGenerator.java (new): Lazy tile generation engine with Tile, TileIterator (abstract), and InDbTileIterator.
RasterConstructors.java (modified): generateTiles() returns TileGenerator.TileIterator; added fromGeoTiff(ImageInputStream) overload.
RasterConstructorsTest.java (modified): References updated to TileGenerator.Tile.
RS_TileExplode in RasterConstructors.scala (modified): Added null check, lazy tileIterator with setAutoDisposeSource(true).
RasterOptions.scala (modified): Added read options (retile, tileWidth, tileHeight, padWithNoData).
RasterDataSource.scala (new): V2 data source entry point, shortName "raster", handles glob rewriting and recursive directory loading.
RasterTable.scala (new): Schema definition with columns rast(RasterUDT), x(Int), y(Int), name(String).
RasterScanBuilder.scala (new): Implements SupportsPushDownTableSample and SupportsPushDownLimit.
RasterInputPartition.scala (new): Simple case class for input partitions.
RasterPartitionReaderFactory.scala (new): Creates RasterPartitionReader instances.
RasterPartitionReader.scala (new): Reads GeoTIFF via HadoopImageInputStream -> GeoTiffReader, creates GridCoverage2D, generates tiles lazily.
META-INF/services (modified): Registered RasterDataSource (replaced RasterFileFormat).
rasterIOTest.scala (modified): 10 new read tests added.
Raster-loader.md (modified): Added raster data source section, deprecated binaryFile section.
common/pom.xml (modified): Added hadoop-client as provided scope dependency.

How was this patch tested?

10 new tests in rasterIOTest.scala (20/20 pass including 10 existing write tests)
3 new tests in HadoopImageInputStreamTest.java (sequential, random, unstable stream)
Manual benchmark: 3.2 GB nightlight GeoTIFF -> 44,616 tiles with 512 MB driver heap in ~95 seconds

Test coverage includes:

Explicit tiling with retile=true and tileWidth/tileHeight
Padding with padWithNoData=true
Limit and sample pushdown verification (inspects physical plan)
Loading without tiling (retile=false)
Auto-tiling (using internal tile size from COG)
Error handling for bad tiling configurations
Partitioned directory reading
Recursive directory loading (trailing /)
Glob pattern support

Did this PR include necessary documentation updates?

Yes. The Raster-loader.md page now documents the new raster data source API with Scala/Java/Python examples, available options, and deprecates the old binaryFile-based approach.

- Add TileGenerator.java with lazy InDbTileIterator for memory-efficient tiling - Update RasterConstructors.generateTiles() to return lazy TileIterator - Update RS_TileExplode to use lazy iterator with autoDispose - Add Spark DataSourceV2 raster reader: RasterDataSource, RasterTable, RasterScanBuilder, RasterInputPartition, RasterPartitionReaderFactory, RasterPartitionReader - Support GeoTiff, AsciiGrid, and NetCDF format detection by extension - Add read options: retile, tileWidth, tileHeight, padWithNoData - Support limit/sample pushdown, glob path rewriting, recursive directory loading - Register V2 RasterDataSource (remove V1 RasterFileFormat from META-INF) - Port 11 read tests from enterprise, adapted for in-db (no out-db references) - All 42 rasterIOTest, 314 rasteralgebraTest, 19 RasterConstructorsTest pass

Copilot

Pull request overview

This PR introduces a Spark DataSourceV2-based raster reader (format("raster")) that reads GeoTIFFs directly and can optionally retile them during read, aiming to simplify the legacy binaryFile + RS_FromGeoTiff + RS_TileExplode workflow while enabling limit/sample pushdown when retile=false.

Changes:

Added a new raster DataSourceV2 implementation (table/scan/partition/reader) with optional retile behavior and recursive/glob path handling.
Introduced a new lazy tiling engine (TileGenerator) and updated raster tiling APIs to return a lazy iterator.
Expanded raster IO tests and updated SQL docs to document the new data source and deprecate the legacy binaryFile-based section.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
spark/common/src/test/scala/org/apache/sedona/sql/rasterIOTest.scala	Adds read-path tests for the new raster data source, including pushdown checks and path handling.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterDataSource.scala	New V2 entry point (`shortName = "raster"`) and path rewrite/recursive-loading behavior.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterTable.scala	New FileTable for raster reads (schema + scan builder wiring).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterScanBuilder.scala	Implements limit + table sample pushdown for file selection (primarily when `retile=false`).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterInputPartition.scala	Defines input partitions used by the raster scan.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterPartitionReaderFactory.scala	Creates partition readers and wraps them with partition values.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterPartitionReader.scala	Implements GeoTIFF reading + optional tiling and writes rows via `UnsafeRowWriter`.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterOptions.scala	Adds read options (`retile`, tile sizes, padding) and exposes options class.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/raster/RasterConstructors.scala	Updates `RS_TileExplode` to use the new lazy tile iterator API.
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister	Registers the new `RasterDataSource` for `format("raster")`.
docs/api/sql/Raster-loader.md	Documents the new raster data source API and deprecates the binaryFile-based approach.
common/src/main/java/org/apache/sedona/common/raster/TileGenerator.java	New lazy in-db tiling engine (tile + iterator).
common/src/main/java/org/apache/sedona/common/raster/RasterConstructors.java	Changes tiling API to return a lazy iterator and updates `rsTile` accordingly.
common/src/test/java/org/apache/sedona/common/raster/RasterConstructorsTest.java	Updates tests to consume the new lazy tile iterator API.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

.../common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterPartitionReader.scala

spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterScanBuilder.scala

common/src/main/java/org/apache/sedona/common/raster/TileGenerator.java

...n/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/raster/RasterConstructors.scala

spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterTable.scala

…w comments - Replace byte-array-based raster loading with stream-based HadoopImageInputStream to support GeoTIFF files larger than 2GB (fixes getLen.toInt overflow) - Add HadoopImageInputStream adapter (common module) wrapping Hadoop FSDataInputStream into javax.imageio.stream.ImageInputStream for GeoTiffReader - Add fromGeoTiff(ImageInputStream) overload to RasterConstructors - Add hadoop-client as provided dependency in common/pom.xml - Add HadoopImageInputStreamTest with sequential, random, and unstable stream tests - Fix table sample pushdown to respect lowerBound (PR review comment #2) - Fix tile rect computation for images with non-zero minX/minY (PR review comment #3)

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterOptions.scala

spark/common/src/test/scala/org/apache/sedona/sql/rasterIOTest.scala

common/src/main/java/org/apache/sedona/common/raster/inputstream/HadoopImageInputStream.java

...on/src/test/java/org/apache/sedona/common/raster/inputstream/HadoopImageInputStreamTest.java

...n/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/raster/RasterConstructors.scala

.../common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterPartitionReader.scala

…fixes - RS_TileExplode: return Iterator.empty when raster is null instead of NPE - HadoopImageInputStream: guard against non-progressing stream (ret_len == 0) - HadoopImageInputStreamTest: fix UnstableInputStream to return at least 1 byte - RasterOptions: only enforce tileWidth/tileHeight pairing when retile=true - RasterPartitionReader: use floating-point division for aspect ratio check - rasterIOTest: remove explain(true) noise from CI logs

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterDataSource.scala

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

spark/common/src/test/scala/org/apache/sedona/sql/rasterIOTest.scala

.../common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterPartitionReader.scala

spark/common/src/test/scala/org/apache/sedona/sql/rasterIOTest.scala

github-actions bot added docs sedona-common sedona-spark labels Feb 24, 2026

jiayuasu changed the title ~~[SEDONA-737] Add DataSourceV2-based raster data source for GeoTIFF loading with tiling on read~~ [GH-2672] Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit Feb 24, 2026

jiayuasu added this to the sedona-1.9.0 milestone Feb 24, 2026

jiayuasu added the affect public APIs label Feb 24, 2026

jiayuasu requested a review from Copilot February 24, 2026 10:19

Copilot started reviewing on behalf of jiayuasu February 24, 2026 10:20 View session

Copilot AI reviewed Feb 24, 2026

View reviewed changes

jiayuasu marked this pull request as draft February 25, 2026 01:08

jiayuasu requested a review from Copilot February 25, 2026 07:12

Copilot started reviewing on behalf of jiayuasu February 25, 2026 07:13 View session

Copilot AI reviewed Feb 25, 2026

View reviewed changes

jiayuasu requested a review from Copilot February 25, 2026 07:43

Copilot started reviewing on behalf of jiayuasu February 25, 2026 07:44 View session

Copilot AI reviewed Feb 25, 2026

View reviewed changes

spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterDataSource.scala Outdated Show resolved Hide resolved

Make glob path regex case-insensitive for .TIF/.TIFF extensions

f64bd56

jiayuasu requested a review from Copilot February 25, 2026 08:04

Copilot started reviewing on behalf of jiayuasu February 25, 2026 08:05 View session

Copilot AI reviewed Feb 25, 2026

View reviewed changes

jiayuasu requested a review from Copilot February 25, 2026 08:59

Copilot started reviewing on behalf of jiayuasu February 25, 2026 09:00 View session

Copilot AI reviewed Feb 25, 2026

View reviewed changes

jiayuasu marked this pull request as ready for review February 25, 2026 09:13

jiayuasu merged commit 189cfd0 into master Feb 25, 2026
50 checks passed

Comments

Conversation

jiayuasu commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Usage

Key benefits over the legacy binaryFile + RS_TileExplode approach

Large raster benchmark

Files changed

How was this patch tested?

Did this PR include necessary documentation updates?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jiayuasu commented Feb 24, 2026 •

edited

Loading