Skip to content

feat(update): add MergingSnapshotUpdate#682

Open
gty404 wants to merge 13 commits into
apache:mainfrom
gty404:merging-snapshot-update
Open

feat(update): add MergingSnapshotUpdate#682
gty404 wants to merge 13 commits into
apache:mainfrom
gty404:merging-snapshot-update

Conversation

@gty404
Copy link
Copy Markdown
Contributor

@gty404 gty404 commented May 26, 2026

Abstract base for merge-based snapshot operations (MergeAppend, OverwriteFiles, RowDelta, etc.), implementing the filter → write → merge pipeline consistent with Java's MergingSnapshotProducer.

Also fixes SnapshotSummaryBuilder manifest count fields and a use-after-free bug in SnapshotUpdate::DeleteFile.

@gty404 gty404 force-pushed the merging-snapshot-update branch 6 times, most recently from e950bed to 0eebdbf Compare May 26, 2026 12:39
Abstract base for merge-based snapshot operations (MergeAppend,
OverwriteFiles, RowDelta, etc.), implementing the filter → write →
merge pipeline consistent with Java's MergingSnapshotProducer.

Also fixes SnapshotSummaryBuilder manifest count fields and a
use-after-free bug in SnapshotUpdate::DeleteFile.
@gty404 gty404 force-pushed the merging-snapshot-update branch from 0eebdbf to 3085e61 Compare May 26, 2026 15:40
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/manifest/manifest_filter_manager.cc Outdated
Comment thread src/iceberg/manifest/manifest_merge_manager.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
@@ -94,7 +94,6 @@ void ManifestFilterManager::DeleteFile(std::string_view path) {
Status ManifestFilterManager::DeleteFile(std::shared_ptr<DataFile> file) {
ICEBERG_PRECHECK(file != nullptr, "Cannot delete file: null");
delete_paths_.insert(file->file_path);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep object deletes distinct from path deletes. Java DeleteFileSet keys delete files by location + content offset + content size; path-only matching can remove every DV blob in the same Puffin file.


// Group delete files by partition spec ID, mirroring WriteNewDataManifests().
std::unordered_map<int32_t, std::vector<PendingDeleteFile>> delete_files_by_spec;
for (const auto& pending_file : new_delete_files_) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalize staged deletes before writing and summarizing. Java dedupes v2 deletes and merges DVs by referenced data file first; writing new_delete_files_ as-is can over-count duplicates and commit multiple DVs for one data file.

const TableMetadata& metadata, int64_t starting_snapshot_id,
const PartitionSet& partition_set, const std::shared_ptr<Snapshot>& parent,
std::shared_ptr<FileIO> io) {
ICEBERG_ASSIGN_OR_RAISE(auto deletes, AddedDeleteFiles(metadata, starting_snapshot_id,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass partition_set into AddedDeleteFiles. Java filters delete manifests before indexing; post-filtering ReferencedDeleteFiles can still fail on unrelated deletes, for example duplicate DVs outside these partitions.

const TableMetadata& /*metadata*/, int64_t /*starting_snapshot_id*/,
std::shared_ptr<Expression> /*conflict_filter*/,
const std::shared_ptr<Snapshot>& /*parent*/, std::shared_ptr<FileIO> /*io*/) {
return NotImplemented(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement Java’s concurrent-DV check here. Returning NotImplemented leaves v3 row-delta style writers unable to validate the guard that rejects another DV for the same referenced data file.

snapshot, tracked_factory));

// Track deleted data files in the summary builder.
for (const auto& file : data_filter_manager_.FilesToBeDeleted()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also propagate duplicate-delete counts into the summary. Java reports deleted-duplicate-files when filtering removes the same file more than once; FilesToBeDeleted() dedupes by path, so that signal is lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants