Skip to content

Comments

[GH-2672] Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit#2673

Merged
jiayuasu merged 4 commits intomasterfrom
port-raster-datasource
Feb 25, 2026
Merged

[GH-2672] Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit#2673
jiayuasu merged 4 commits intomasterfrom
port-raster-datasource

Conversation

@jiayuasu
Copy link
Member

@jiayuasu jiayuasu commented Feb 24, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

Yes, this resolves #2672

What changes were proposed in this PR?

This PR adds a new Spark DataSourceV2-based raster data source that can load GeoTIFF files and optionally tile them on read. It replaces the legacy workflow of binaryFile + RS_FromGeoTiff + RS_TileExplode with a simpler one-liner API.

Usage

// Load and tile rasters in one step
val df = spark.read.format("raster")
  .option("retile", "true")
  .option("tileWidth", "256")
  .option("tileHeight", "256")
  .load("/path/to/rasters/")

// Load rasters without tiling
val df = spark.read.format("raster")
  .option("retile", "false")
  .load("/path/to/rasters/")

Key benefits over the legacy binaryFile + RS_TileExplode approach

  1. Bypasses Spark's 2 GB ByteArray limit: With the old binaryFile approach, the entire raster file is loaded as a content column (Array[Byte]). Spark has a hard 2 GB limit on ByteArray values in UnsafeRow, so rasters larger than 2 GB cannot be loaded at all. The new raster data source uses HadoopImageInputStream to stream GeoTIFF data directly from Hadoop FS into GeoTools' GeoTiffReader, never materializing the entire file into a byte array. This removes the 2 GB limitation entirely.
  2. Reduces memory pressure by avoiding double serialization: With binaryFile + RS_TileExplode, the full raster is serialized into Spark's UnsafeRow (via RasterUDT.serialize, which deep-copies all pixel data), then deserialized by RS_TileExplode, then each tile is re-serialized. The new data source tiles rasters directly inside the partition reader - the full raster never enters UnsafeRow, so only small tiles are serialized. This reduces peak memory per task by approximately 50% (e.g., from 64 MB to 32 MB for each 2048x2048 double raster tiled into 256x256 tiles).
  3. Limit and sample pushdown: The data source implements SupportsPushDownLimit and SupportsPushDownTableSample. When retile=false, .limit(N) creates only N input partitions and reads only N files. When retile=true, limit is partially pushed (Spark's LimitExec stops calling next() after N tiles, so remaining tiles are never decoded). The old approach creates partitions for all files and relies on task cancellation.

Large raster benchmark

Tested with a 3.2 GB global nightlight GeoTIFF (86,400 x 33,601 pixels, Float32, 256x256 internal tiles):

Metric Result
Driver memory 512 MB (JVM heap)
Tile size 256 x 256
Total tiles 44,616 (= ceil(86400/256) x ceil(33601/256))
Time ~95 seconds
Peak memory Well within 512 MB - no OOM

The old binaryFile approach fails with NegativeArraySizeException on this file because getLen.toInt overflows for files > 2.1 GB.

Files changed

  • HadoopImageInputStream.java (new, common module): ImageInputStream adapter wrapping Hadoop's FSDataInputStream for GeoTiffReader. Supports (Path, Configuration) and (FSDataInputStream) constructors. Handles S3/cloud partial reads via robust read loop.
  • HadoopImageInputStreamTest.java (new): 3 tests - sequential read, random seek, and unstable stream (simulates S3 partial reads).
  • TileGenerator.java (new): Lazy tile generation engine with Tile, TileIterator (abstract), and InDbTileIterator.
  • RasterConstructors.java (modified): generateTiles() returns TileGenerator.TileIterator; added fromGeoTiff(ImageInputStream) overload.
  • RasterConstructorsTest.java (modified): References updated to TileGenerator.Tile.
  • RS_TileExplode in RasterConstructors.scala (modified): Added null check, lazy tileIterator with setAutoDisposeSource(true).
  • RasterOptions.scala (modified): Added read options (retile, tileWidth, tileHeight, padWithNoData).
  • RasterDataSource.scala (new): V2 data source entry point, shortName "raster", handles glob rewriting and recursive directory loading.
  • RasterTable.scala (new): Schema definition with columns rast(RasterUDT), x(Int), y(Int), name(String).
  • RasterScanBuilder.scala (new): Implements SupportsPushDownTableSample and SupportsPushDownLimit.
  • RasterInputPartition.scala (new): Simple case class for input partitions.
  • RasterPartitionReaderFactory.scala (new): Creates RasterPartitionReader instances.
  • RasterPartitionReader.scala (new): Reads GeoTIFF via HadoopImageInputStream -> GeoTiffReader, creates GridCoverage2D, generates tiles lazily.
  • META-INF/services (modified): Registered RasterDataSource (replaced RasterFileFormat).
  • rasterIOTest.scala (modified): 10 new read tests added.
  • Raster-loader.md (modified): Added raster data source section, deprecated binaryFile section.
  • common/pom.xml (modified): Added hadoop-client as provided scope dependency.

How was this patch tested?

  • 10 new tests in rasterIOTest.scala (20/20 pass including 10 existing write tests)
  • 3 new tests in HadoopImageInputStreamTest.java (sequential, random, unstable stream)
  • Manual benchmark: 3.2 GB nightlight GeoTIFF -> 44,616 tiles with 512 MB driver heap in ~95 seconds

Test coverage includes:

  • Explicit tiling with retile=true and tileWidth/tileHeight
  • Padding with padWithNoData=true
  • Limit and sample pushdown verification (inspects physical plan)
  • Loading without tiling (retile=false)
  • Auto-tiling (using internal tile size from COG)
  • Error handling for bad tiling configurations
  • Partitioned directory reading
  • Recursive directory loading (trailing /)
  • Glob pattern support

Did this PR include necessary documentation updates?

Yes. The Raster-loader.md page now documents the new raster data source API with Scala/Java/Python examples, available options, and deprecates the old binaryFile-based approach.

- Add TileGenerator.java with lazy InDbTileIterator for memory-efficient tiling
- Update RasterConstructors.generateTiles() to return lazy TileIterator
- Update RS_TileExplode to use lazy iterator with autoDispose
- Add Spark DataSourceV2 raster reader: RasterDataSource, RasterTable,
  RasterScanBuilder, RasterInputPartition, RasterPartitionReaderFactory,
  RasterPartitionReader
- Support GeoTiff, AsciiGrid, and NetCDF format detection by extension
- Add read options: retile, tileWidth, tileHeight, padWithNoData
- Support limit/sample pushdown, glob path rewriting, recursive directory loading
- Register V2 RasterDataSource (remove V1 RasterFileFormat from META-INF)
- Port 11 read tests from enterprise, adapted for in-db (no out-db references)
- All 42 rasterIOTest, 314 rasteralgebraTest, 19 RasterConstructorsTest pass
@jiayuasu jiayuasu changed the title [SEDONA-737] Add DataSourceV2-based raster data source for GeoTIFF loading with tiling on read [GH-2672] Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit Feb 24, 2026
@jiayuasu jiayuasu added this to the sedona-1.9.0 milestone Feb 24, 2026
@jiayuasu jiayuasu requested a review from Copilot February 24, 2026 10:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a Spark DataSourceV2-based raster reader (format("raster")) that reads GeoTIFFs directly and can optionally retile them during read, aiming to simplify the legacy binaryFile + RS_FromGeoTiff + RS_TileExplode workflow while enabling limit/sample pushdown when retile=false.

Changes:

  • Added a new raster DataSourceV2 implementation (table/scan/partition/reader) with optional retile behavior and recursive/glob path handling.
  • Introduced a new lazy tiling engine (TileGenerator) and updated raster tiling APIs to return a lazy iterator.
  • Expanded raster IO tests and updated SQL docs to document the new data source and deprecate the legacy binaryFile-based section.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
spark/common/src/test/scala/org/apache/sedona/sql/rasterIOTest.scala Adds read-path tests for the new raster data source, including pushdown checks and path handling.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterDataSource.scala New V2 entry point (shortName = "raster") and path rewrite/recursive-loading behavior.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterTable.scala New FileTable for raster reads (schema + scan builder wiring).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterScanBuilder.scala Implements limit + table sample pushdown for file selection (primarily when retile=false).
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterInputPartition.scala Defines input partitions used by the raster scan.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterPartitionReaderFactory.scala Creates partition readers and wraps them with partition values.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterPartitionReader.scala Implements GeoTIFF reading + optional tiling and writes rows via UnsafeRowWriter.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/io/raster/RasterOptions.scala Adds read options (retile, tile sizes, padding) and exposes options class.
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/raster/RasterConstructors.scala Updates RS_TileExplode to use the new lazy tile iterator API.
spark/common/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister Registers the new RasterDataSource for format("raster").
docs/api/sql/Raster-loader.md Documents the new raster data source API and deprecates the binaryFile-based approach.
common/src/main/java/org/apache/sedona/common/raster/TileGenerator.java New lazy in-db tiling engine (tile + iterator).
common/src/main/java/org/apache/sedona/common/raster/RasterConstructors.java Changes tiling API to return a lazy iterator and updates rsTile accordingly.
common/src/test/java/org/apache/sedona/common/raster/RasterConstructorsTest.java Updates tests to consume the new lazy tile iterator API.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu marked this pull request as draft February 25, 2026 01:08
…w comments

- Replace byte-array-based raster loading with stream-based HadoopImageInputStream
  to support GeoTIFF files larger than 2GB (fixes getLen.toInt overflow)
- Add HadoopImageInputStream adapter (common module) wrapping Hadoop FSDataInputStream
  into javax.imageio.stream.ImageInputStream for GeoTiffReader
- Add fromGeoTiff(ImageInputStream) overload to RasterConstructors
- Add hadoop-client as provided dependency in common/pom.xml
- Add HadoopImageInputStreamTest with sequential, random, and unstable stream tests
- Fix table sample pushdown to respect lowerBound (PR review comment #2)
- Fix tile rect computation for images with non-zero minX/minY (PR review comment #3)
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…fixes

- RS_TileExplode: return Iterator.empty when raster is null instead of NPE
- HadoopImageInputStream: guard against non-progressing stream (ret_len == 0)
- HadoopImageInputStreamTest: fix UnstableInputStream to return at least 1 byte
- RasterOptions: only enforce tileWidth/tileHeight pairing when retile=true
- RasterPartitionReader: use floating-point division for aspect ratio check
- rasterIOTest: remove explain(true) noise from CI logs
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu marked this pull request as ready for review February 25, 2026 09:13
@jiayuasu jiayuasu merged commit 189cfd0 into master Feb 25, 2026
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a new raster data source reader that can automatically tile GeoTiffs and bypass the Spark record limit

1 participant