feat(table): add support for merge-on-read delete #721
zeroshade merged 14 commits into apache:main
Conversation
laskoviymishka left a comment:
Overall, this looks correct (I mostly compared it with Iceberg Java), but I think a couple of things need attention:
- `panic`/`recover` is used in normal write-path error handling;
- position-delete fanout needs additional tests;
- missing focused local unit tests for invariant-heavy code (`enrichRecordsWithPosDeleteFields`, position-delete fanout).
Regarding the API change: the ToDataFile change is internal (package scope), but the ManifestWriter.ToManifestFile signature change is public and should be explicitly called out in the compatibility/release notes.
I would advocate for not changing this API and instead adding a new one with the extra argument, but this decision, in my opinion, is not critical.
zeroshade left a comment:
This is looking good to me. There's a conflict to resolve in README.md, and I'll wait for @laskoviymishka to give feedback before merging.
Thanks for the in-depth review, @laskoviymishka. Here's the updated status of the TODO list:
laskoviymishka left a comment:
Reviewed the latest changes — the core logic looks correct:
- MoR positional delete pipeline ordering is sound (positions assigned before filtering, remain valid through projection)
- Manifest evaluator fix for partition spec evolution is a genuine correctness improvement, applies to CoW paths too
- `pos` type fix (`Int32` → `Int64`) aligns with the Iceberg spec

Two minor issues noted, but I will address them in follow-up PRs to keep this moving:
- Arrow array refcount leak in `enrichRecordsWithPosDeleteFields`: `NewArray()` results are never `Release()`d after `NewRecordBatch` retains them
- Goroutine leak from `iter.Pull` in the partitioned path of `positionDeleteRecordsToDataFiles`: `stopCount` is not called in the partitioned branch (same pattern as in #718)
Neither blocks correctness for typical use. LGTM to merge.
Thanks to both of you!

Thanks for the review, @laskoviymishka and @zeroshade!
…teFields

Arrays returned by `NewArray()` have refcount 1. `NewRecordBatch` calls `Retain()` on each column, bumping it to 2. Without an explicit `Release()` on the temporary arrays, the count never drops back to 1 when the record batch is released by the caller. Fix by assigning `NewArray()` results to local variables and deferring their `Release()`, so the lifecycle is: `NewArray()` -> refcount 1, `NewRecordBatch` `Retain()` -> refcount 2, deferred `Release()` -> refcount 1 (owned by outData), caller releases outData -> refcount 0 -> freed.

Also extend TestEnrichRecordsWithPosDeleteFields to use `memory.NewCheckedAllocator` with `mem.AssertSize(t, 0)` to catch this class of leak going forward. Fixes leak introduced in apache#721.
related to apache#721
* remove premature decoder close in the constructor so that the reader can actually read the entries
* add explicit close method for resource cleanup
* call close in ReadManifest to prevent leak
* add zstd codec based regression test

Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
```go
defer func() {
	_ = dec.Close()
}()
```
I think there is a regression here; the leak fix is tracked in #766.
```go
func (p *positionDeletePartitionedFanoutWriter) partitionPath(partitionContext partitionContext) (string, error) {
	data := partitionRecord(slices.Collect(maps.Values(partitionContext.partitionData)))
	// ...
}
```
Follow-up to apache#721. Test TBD.
Partition record order is expected to match the partition spec, but maps.Values can change it, causing cross-partition writer reuse, deletion of the wrong files, etc.

related to #721

Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
 This adds support for merge-on-read deletes. It offers an alternative to the copy-on-write to generate position delete files instead of rewriting existing data files. I'm not very confident in the elegance of my solution as I'm still new to the internals of iceberg-go but the high-level is: * Reuse the classification code from the existing delete implementation to get the list of files of dropped files vs files with partial deletes * Reuse the arrow scanning facilities to filter records from the data files with partial deletes and emit position delete records with file path and position. * This is done by reusing the pipeline code and function and making the first stage in the pipeline one to enrich the `RecordBatch` with the file Path and position before the original position is lost due to filtering. * After filtering, the RecordBatch is projected to the position delete schema (i.e. the original schema fields are dropped) * Once we have filtered PositionDelete records that need to be emitted, we reuse the record to file writing to generate position delete files. ## Testing Integration tests were added to exercise the partitioned and unpartitioned paths and the data is such that it's meant to actually produce a position delete file rather than just go through the quick path that drops an entire file because all records are gone. ## Indirect fixes While working on this change and adding the testing for the partitioned table deletions, I realized that the manifest evaluation when the filter affected a field that was part of a partition spec was not built correctly. It needed to use similar code as what's done during scanning to build projections and build a manifest evaluator per partition id. This is fixed in this PR but this technically also applies to copy-on-write and overwrite paths so the fix goes beyond the scope of the `merge-on-read`. Fixes apache#487.