feat(update): add MergingSnapshotUpdate#682
Conversation
e950bed to
0eebdbf
Compare
Abstract base for merge-based snapshot operations (MergeAppend, OverwriteFiles, RowDelta, etc.), implementing the filter → write → merge pipeline consistent with Java's MergingSnapshotProducer. Also fixes SnapshotSummaryBuilder manifest count fields and a use-after-free bug in SnapshotUpdate::DeleteFile.
0eebdbf to
3085e61
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| @@ -94,7 +94,6 @@ void ManifestFilterManager::DeleteFile(std::string_view path) { | |||
| Status ManifestFilterManager::DeleteFile(std::shared_ptr<DataFile> file) { | |||
| ICEBERG_PRECHECK(file != nullptr, "Cannot delete file: null"); | |||
| delete_paths_.insert(file->file_path); | |||
There was a problem hiding this comment.
Keep object deletes distinct from path deletes. Java DeleteFileSet keys delete files by location + content offset + content size; path-only matching can remove every DV blob in the same Puffin file.
|
|
||
| // Group delete files by partition spec ID, mirroring WriteNewDataManifests(). | ||
| std::unordered_map<int32_t, std::vector<PendingDeleteFile>> delete_files_by_spec; | ||
| for (const auto& pending_file : new_delete_files_) { |
There was a problem hiding this comment.
Normalize staged deletes before writing and summarizing. Java dedupes v2 deletes and merges DVs by referenced data file first; writing new_delete_files_ as-is can over-count duplicates and commit multiple DVs for one data file.
| const TableMetadata& metadata, int64_t starting_snapshot_id, | ||
| const PartitionSet& partition_set, const std::shared_ptr<Snapshot>& parent, | ||
| std::shared_ptr<FileIO> io) { | ||
| ICEBERG_ASSIGN_OR_RAISE(auto deletes, AddedDeleteFiles(metadata, starting_snapshot_id, |
There was a problem hiding this comment.
Pass partition_set into AddedDeleteFiles. Java filters delete manifests before indexing; post-filtering ReferencedDeleteFiles can still fail on unrelated deletes, for example duplicate DVs outside these partitions.
| const TableMetadata& /*metadata*/, int64_t /*starting_snapshot_id*/, | ||
| std::shared_ptr<Expression> /*conflict_filter*/, | ||
| const std::shared_ptr<Snapshot>& /*parent*/, std::shared_ptr<FileIO> /*io*/) { | ||
| return NotImplemented( |
There was a problem hiding this comment.
Implement Java’s concurrent-DV check here. Returning NotImplemented leaves v3 row-delta style writers unable to validate the guard that rejects another DV for the same referenced data file.
| snapshot, tracked_factory)); | ||
|
|
||
| // Track deleted data files in the summary builder. | ||
| for (const auto& file : data_filter_manager_.FilesToBeDeleted()) { |
There was a problem hiding this comment.
Also propagate duplicate-delete counts into the summary. Java reports deleted-duplicate-files when filtering removes the same file more than once; FilesToBeDeleted() dedupes by path, so that signal is lost.
Abstract base for merge-based snapshot operations (MergeAppend, OverwriteFiles, RowDelta, etc.), implementing the filter → write → merge pipeline consistent with Java's MergingSnapshotProducer.
Also fixes SnapshotSummaryBuilder manifest count fields and a use-after-free bug in SnapshotUpdate::DeleteFile.