feat: add functions for adding and replacing data directly with datafiles #723
zeroshade merged 1 commit into apache:main
Conversation
@rockwotj, @zeroshade & @subkanthi, this pull request has been updated based on the comments left on the other PR: #710. Let me know if anything else needs to be added or changed. I believe this addresses the main points of concern: adding tests, explicitly warning users that this can be dangerous, and attempting to curb that danger as much as possible by validating all of the items we can (without scanning the file).
zeroshade
left a comment
I think the tests are missing some of the error cases, such as when AddDataFiles or ReplaceDataFiles tries to add a duplicate file path.
That makes sense, I will add more test cases that try to reach all possible paths of the functions.
@zeroshade updated to include tests that should cover all possible error paths of the new functions. Also resolved conflicts.
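For concreteness, a rough sketch of the kind of duplicate-path test being discussed; `createTestTable` and `buildTestDataFile` are hypothetical helpers, and the exact `AddDataFiles` signature is assumed rather than taken from this PR:

```go
// Sketch only: the helpers below are hypothetical, and AddDataFiles'
// real signature may differ from the merged API.
func TestAddDataFilesRejectsDuplicatePaths(t *testing.T) {
	tbl := createTestTable(t)
	txn := tbl.NewTransaction()

	df := buildTestDataFile(t, "s3://bucket/data/file-a.parquet")

	// Passing the same file path twice should fail validation rather
	// than silently producing a corrupt snapshot.
	err := txn.AddDataFiles([]iceberg.DataFile{df, df})
	if err == nil {
		t.Fatal("expected an error for duplicate file paths, got nil")
	}
}
```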
zeroshade
left a comment
Only one nitpick, but otherwise this looks good!
Thank you @zeroshade. Any idea when the next release is? I can use tip of main, but it's been since October and there are plenty of good things queued up for release (schema update APIs, this, etc.).
func (t *Transaction) validateDataFilesToAdd(dataFiles []iceberg.DataFile, operation string) (map[string]struct{}, error) {
	currentSpec, err := t.meta.CurrentSpec()
	if err != nil || currentSpec == nil {
		return nil, fmt.Errorf("could not get current partition spec: %w", err)
err might be nil here (when only currentSpec is nil), which would break the %w formatting.
Might be better to return a separate error, e.g. "no current partition spec found".
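A minimal sketch of that suggestion; the `ErrNoCurrentSpec` sentinel name is illustrative, not the name used in the patch:

```go
// Illustrative only; the merged patch may structure this differently.
var ErrNoCurrentSpec = errors.New("no current partition spec found")

currentSpec, err := t.meta.CurrentSpec()
if err != nil {
	return nil, fmt.Errorf("could not get current partition spec: %w", err)
}
if currentSpec == nil {
	// err is nil on this path; fmt.Errorf("%w", nil) would render as
	// "%!w(<nil>)", so return a dedicated error instead.
	return nil, ErrNoCurrentSpec
}
```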
if partitionData == nil {
	partitionData = map[int]any{}
}
Reading a nil map is fine, so we don't need this, right?
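For reference, a small standalone illustration of why the read-only case is safe, and when the initialization would actually matter:

```go
var partitionData map[int]any // nil map

// Reads on a nil map are safe: they return the zero value.
v, ok := partitionData[0] // v == nil, ok == false, no panic
_, _ = v, ok

// Writes are not: the line below would panic with
// "assignment to entry in nil map", which is the case where the
// map[int]any{} initialization would be needed.
// partitionData[0] = "x"
```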
if len(referenced) > 0 {
	return fmt.Errorf("cannot add files that are already referenced by table, files: %s", referenced)
referenced is a slice, not a string; %v would be better, or convert it to a string with strings.Join.
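For example, assuming `referenced` is a `[]string`, the join variant would look like this:

```go
// Sketch of the suggested fix: format the slice explicitly instead of
// relying on %s, which prints Go's default slice representation.
if len(referenced) > 0 {
	return fmt.Errorf("cannot add files that are already referenced by table, files: %s",
		strings.Join(referenced, ", "))
}
```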
related to apache#723 Signed-off-by: ferhat elmas <elmas.ferhat@gmail.com>
Context
If you want to write your own parquet files and only use iceberg to handle the metadata, you are mostly left with the option of leveraging the `ReplaceDataFiles` function. This function takes in a list of existing files and a list of new file paths to override that previous data with.
This function works fine for the most part, but it includes a scan, which means it's not actually taking your word that your new parquet files match the table schema.
This scan proves to be problematic when you are writing files very fast and leveraging multipart uploads. You know the location of all the files and know they are valid parquet files, but the commit can still return an error because, at commit time, a file might not be fully available yet.
The error looks something like this at commit time:
`failed to replace data files: error encountered during file conversion: parquet: could not read 8 bytes from end of file`

Solution
We have tested this out in vendor code and opened a fork that adds a new function.
`ReplaceDataFiles` scans your file paths to try to ensure the schema of those files matches the schema of the table you are inserting them into. We, and I would assume a lot of people writing their own parquet files, don't need this. Our ingestion framework guarantees we will never get an incorrect parquet file, and we also have access to our Parquet schema and Arrow schema for the entirety of the ingestion.
So I can build data files directly and would much rather just pass my own datafiles to this function, since I know the files will eventually be available and that they will be correct. All this is doing is telling the metadata where to look for a given file; there is no real harm in committing before that file is actually available, unless you query it right away and it happens to not be available yet.
This also speeds up the commit time tremendously, as the library no longer needs to scan all of the files for every single commit.
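To make the intended workflow concrete, here is a minimal usage sketch under stated assumptions: `buildDataFile` is a hypothetical helper standing in for however you construct an `iceberg.DataFile`, and the exact `AddDataFiles`/`Commit` signatures may differ from the merged API.

```go
// Sketch only: we describe a parquet file we already know is valid
// and schema-compatible, instead of having the library scan it.
txn := tbl.NewTransaction()

df := buildDataFile( // hypothetical constructor
	"s3://bucket/warehouse/db/tbl/data/part-00001.parquet",
	recordCount,   // row count, known from our own writer
	fileSizeBytes, // size in bytes, known from our own writer
)

// No scan happens here: we are only telling the metadata where the
// file lives, so the commit succeeds even if a multipart upload has
// not fully landed yet.
if err := txn.AddDataFiles([]iceberg.DataFile{df}); err != nil {
	return err
}
tbl, err = txn.Commit(ctx)
if err != nil {
	return err
}
```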