Skip to content

Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) — antalya-26.3#1814

Open
il9ue wants to merge 2 commits into
Altinity:antalya-26.3from
il9ue:fix/iceberg-empty-stats-26.3
Open

Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) — antalya-26.3#1814
il9ue wants to merge 2 commits into
Altinity:antalya-26.3from
il9ue:fix/iceberg-empty-stats-26.3

Conversation

@il9ue
Copy link
Copy Markdown

@il9ue il9ue commented May 20, 2026

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Iceberg read optimization returning NULL for every column when reading from manifests written without per-file column statistics (typical of non-Spark writers like pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants. Antalya 26.3 fix for Altinity/ClickHouse#1545.

Description

Antalya-specific bug fix on antalya-26.3. No upstream cherry-pick — this bug exists only on Antalya, introduced by Altinity/ClickHouse#1069 ("Read optimization using Iceberg metadata"). Mirror of the 25.8 fix in Altinity/ClickHouse#1688.

Why this fires

When reading an Iceberg table written by a non-Spark writer that omits per-file column statistics from the manifest's Avro schema (pyiceberg with default settings, format v1 writers, and others), the allow_experimental_iceberg_read_optimization path produces silent data loss: correct row counts, every column value NULL. The reporter confirmed it on icebergLocal; investigation showed the same code path fires for icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

Root cause

IcebergIterator always populates file_meta_info before yielding objects, so the file_meta_data.has_value() check in the optimization passes. The issue is what's inside the populated DataFileMetaInfo: when the manifest's data_file.value_counts / column_sizes / null_value_counts Avro fields are all absent (per the Iceberg spec, all three are optional), DataFileMetaInfo::columns_info stays empty.

The optimization's second loop in StorageObjectStorageSource::createReader then iterates every requested column, finds none of them in the empty columns_info map, and adds them all to constant_columns_with_values with Field() (NULL). requested_columns_copy is cleared, need_only_count = true, the Parquet reader returns row count only, and generate() injects every column as a constant-NULL column at the correct row count.

The optimization conflates "no stats were written" with "all columns are absent." Absent stats tell us nothing about which columns are physically present in the file.

The fix

Add any_stats_field_present (bool) to DataFileMetaInfo. Populate it during manifest parsing in AvroForIcebergDeserializer.cpptrue if any of value_counts, column_sizes, or null_value_counts were emitted by the writer. Gate the optimization's absent-NULL loop on this flag: when no stats were emitted, skip the loop entirely and fall through to the Parquet reader, which correctly handles both physically-present columns (read normally) and schema-evolved-absent columns (handled upstream by IcebergMetadata::getInitialSchemaByPath setting the file's own schema as initial_header).

A per-column presence set was considered but is unnecessary because schema evolution is already handled upstream of the optimization; the boolean is sufficient.

JSON serialization (cluster reads via toJson() / JSON-ptr constructor) is updated to round-trip the new field. Missing-on-deserialization defaults to false, which matches pre-fix behavior.

Files changed

  • src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.h: added any_stats_field_present field to DataFileMetaInfo; constructor signature updated.
  • src/Storages/ObjectStorage/DataLakes/IDataLakeMetadata.cpp: JSON serde round-trips the new field; missing-on-deserialize defaults to false.
  • src/Storages/ObjectStorage/DataLakes/Iceberg/ManifestFile.h: header updates for ParsedManifestFileEntry.
  • src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp: tracks whether any stats Avro field was present during manifest parsing on 26.3.
  • src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp: forwards the new bool when constructing DataFileMetaInfo.
  • src/Storages/ObjectStorage/StorageObjectStorageSource.cpp: the absent-NULL loop now skips when any_stats_field_present is false.

Note: 26.3 uses AvroForIcebergDeserializer.cpp for manifest parsing where 25.8 / 26.1 use ManifestFile.cpp (file was split upstream). Same logic, different file.

Tested

  • Integration test tests/integration/test_storage_iceberg_no_spark/test_iceberg_read_optimization_empty_stats.py ported from the 25.8 PR. Test logic and conftest fixture (started_cluster_iceberg_no_spark) compatible with 26.3 as-is. Four scenarios:
    • test_iceberg_local_returns_actual_rows_with_stats_less_manifest — reproducer, fails without the fix.
    • test_iceberg_local_returns_correct_rows_when_optimization_disabled — control.
    • test_iceberg_local_partial_stats_manifest_reads_correctly — manifest with value_counts only.
    • test_iceberg_local_full_stats_manifest_reads_correctly — full Spark-style stats regression guard.
  • Local build verification: changed files passed clang -fsyntax-only against 26.3's source headers in the verification round of Backport #100607 to 25.8.16: Re-add {database} macro support in clickhouse-client prompt #1688. Full integration test execution will run on CI.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@il9ue il9ue requested review from ianton-ru and zvonand May 20, 2026 08:01
@il9ue il9ue force-pushed the fix/iceberg-empty-stats-26.3 branch 2 times, most recently from 4c61947 to 4f483f6 Compare May 20, 2026 12:21
@il9ue
Copy link
Copy Markdown
Author

il9ue commented May 20, 2026

Heads up

CI is failing with a workflow template error:

Error: The template is not valid. .github/workflows/pull_request.yml
(Line: 5285, Col: 18): Error reading JToken from JsonReader.
Path '', line 0, position 0.

This appears to be coming from antalya-26.3's pull_request.yml — that file is identical between this PR and the base branch (verified via git diff origin/antalya-26.3 -- .github/workflows/pull_request.yml), and the failing line references fromJson(needs.config_workflow.outputs.data) which suggests config_workflow isn't producing the expected output upstream.

This looks like an infrastructure issue on antalya-26.3 affecting all PRs targeting that branch, not anything specific to this PR. The code itself builds clean locally with clang-21 + RelWithDebInfo.

Is this something the CI team is aware of, or should I open a separate issue?

@strtgbb
Copy link
Copy Markdown
Collaborator

strtgbb commented May 20, 2026

This is a known issue. We have separate workflows for internal and external branches. When the author of an external branch is an org member, some jobs get confused.


void DataFileMetaInfo::serialize(WriteBuffer & out) const
{
writeIntBinary(static_cast<UInt8>(stats_were_read), out);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this break backward compatibility?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one — yes, the binary serialize/deserialize path is a backward-compat concern for *Cluster reads during rolling upgrades, since writeIntBinary is positional and there's no version byte to branch on. An old worker reading from a new initiator would leave the new UInt8 in the stream and misalign every subsequent field; the reverse direction would short-read or grab garbage.
The JSON path (toJson() / JSON-ptr ctor) is fine — it tolerates the missing key and defaults to false, matching pre-fix behavior. The binary path is the one that needs attention.
Two options I'd like your read:

  • Version the binary format. Prepend a UInt8 format-version byte; old code reads the version, new code reads version + optional new fields. Cleanest forward path but adds machinery the rest of DataFileMetaInfo doesn't currently have.
  • Keep the new field out of the binary stream. The flag is only consumed by StorageObjectStorageSource::createReader on the worker that's actually parsing the manifest, so if the binary serde is only used in paths where the receiver re-parses (or in paths where the optimization is a no-op), we can leave the binary format untouched and rely solely on the JSON round-trip for cluster reads. I want to verify this before claiming it, though — do you know off the top of your head which code paths actually go through the binary serialize here vs. JSON?

If neither works, I'll fall back to (1). Want to make sure I match whatever conventions Antalya has for cross-version compat on this struct.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does logic really need a new field? Is it possible to interpret something like file_meta_data.value()->columns_info.size() == 0 as stats weren't read?

Changing protocol in a fork is not a best idea, even with versioning. It creates a lot of problem in future, if upstream changed something with different way.

Comment thread src/Storages/ObjectStorage/DataLakes/Common/AvroForIcebergDeserializer.cpp Outdated
@alsugiliazova
Copy link
Copy Markdown
Member

Audit: PR #1814 — Iceberg read optimization NULLs for stats-less manifests (antalya-26.3)

AI audit note: This review comment was generated by AI (Composer).

Audit update for PR #1814 (Fix Iceberg read optimization returning NULLs for stats-less manifests — antalya-26.3):

Confirmed defects

Medium: Empty per-column stats arrays still take the all-NULL optimization path

  • Impact: Same silent data loss as Read using IcebergLocal returns Nulls instead of actual rows #1545 (correct row count, every column NULL) when the manifest Avro writer schema includes value_counts / column_sizes / null_value_counts but the per-file values are empty arrays/maps.
  • Anchor: AvroForIcebergDeserializer.cpp (hasPath + any_stats_field_present) → StorageObjectStorageSource::createReader (absent-NULL loop gated only on stats_were_read).
  • Trigger: allow_experimental_iceberg_read_optimization=1 and a manifest entry whose writer schema declares stats fields but emits no column entries (empty value_counts, etc.).
  • Why defect: hasPath reflects schema presence, not non-empty runtime values. That sets any_stats_field_present=true while columns_infos stays empty; stats_were_read is true so the absent-NULL loop treats every nullable column as schema-evolved-absent and injects constant NULLs.
  • Fix direction: Gate the absent-NULL loop on non-empty columns_info as well (e.g. stats_were_read && !columns_info.empty()), or set any_stats_field_present only after at least one stat entry is parsed.
  • Regression test direction: Integration test with stats fields present in the Avro schema but empty value_counts: [] in the data file record.

Low: Binary DataFileMetaInfo serde extended without versioning (latent)

  • Impact: No current production path calls DataFileMetaInfo::serialize / deserialize (cluster distribution uses JSON via CommandInTaskResponse::toString() / toJson()). If binary serde is wired later or reused, a mixed-version cluster would mis-parse the stream (extra leading UInt8 shifts all following fields).
  • Anchor: IDataLakeMetadata.cpp (serialize / deserialize).
  • Trigger: Any future or out-of-tree caller serializing DataFileMetaInfo across nodes running different builds during a rolling upgrade.
  • Why defect: Positional binary format changed with no version byte; reviewer concern on the PR thread is valid even though the path is unused in-tree today.
  • Fix direction: Omit the field from binary serde (JSON-only, as cluster already does) or add a format-version prefix before extending the stream.
  • Regression test direction: N/A until binary path is used; if kept, round-trip test across “old writer / new reader” buffers.

Coverage summary

  • Scope reviewed: End-to-end path from manifest Avro parse → IcebergIterator::nextDataFileMetaInfo (local + JSON cluster task) → StorageObjectStorageSource::createReader optimization branches; compared with closed sibling PR Fix Iceberg read optimization returning NULLs for stats-less manifests (#1545) #1764 (26.1). Integration tests read (4 icebergLocal scenarios); cluster *Cluster path traced via StorageObjectStorageStableTaskDistributor + CommandInTaskResponse JSON (not binary serde).
  • Categories failed: Logical fault injection — “stats fields present in schema but empty values”; protocol — binary serde extension on unused path.
  • Categories passed: Main Read using IcebergLocal returns Nulls instead of actual rows #1545 reproducer (no stats fields in schema); partial/full stats regression cases; JSON cluster round-trip backward compat (missing stats_were_read defaults safe); concurrency (read-only metadata per task); C++ lifetime/UB classes not implicated in the diff.
  • Assumptions/limits: Static audit only; CI not executed; no iceberg*Cluster integration test in the PR; rolling upgrade with old workers still running pre-fix optimization until all nodes are upgraded (operational, not introduced by JSON compat).

@alsugiliazova alsugiliazova added the verified-with-issues Verified by QA and issues found. label May 29, 2026
Copy link
Copy Markdown

@ianton-ru ianton-ru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does logic really need a new field? Is it possible to interpret something like file_meta_data.value()->columns_info.size() == 0 as stats weren't read?

Changing protocol in a fork is not a best idea, even with versioning. It creates a lot of problem in future, if upstream changed something with different way.

When an Iceberg manifest's per-file column statistics are absent or
empty (common for non-Spark writers like pyiceberg with default
settings), DataFileMetaInfo::columns_info is empty. The optimization
in StorageObjectStorageSource::createReader misread this as "all
columns are absent from the file" and returned constant NULLs for
every row while still returning the correct row count. Result: silent
data loss on icebergLocal, icebergS3, icebergAzure, icebergHDFS, and
all *Cluster variants.

Gate the optimization's absent-NULL loop directly on
columns_info.empty() instead of introducing a separate stats-presence
flag. When no usable per-column stats were parsed -- whether the
manifest omitted the stats fields entirely or declared them but left
them empty -- fall through to the Parquet reader, which correctly
handles physically-present columns (read normally) and
schema-evolved-absent columns (handled by
IcebergMetadata::getInitialSchemaByPath setting the file's own schema
as initial_header). columns_info is already serialized to workers in
the cluster JSON path, so this changes no serialization format and
keeps the fork's DataFileMetaInfo serde identical to upstream.

Closes Altinity#1545.
Mirror of Altinity#1688 (antalya-25.8 fix).

Signed-off-by: Daniel Q. Kim <daniel.kim@altinity.com>
@il9ue il9ue force-pushed the fix/iceberg-empty-stats-26.3 branch from 4f483f6 to 5ec03a9 Compare June 1, 2026 13:40
@il9ue
Copy link
Copy Markdown
Author

il9ue commented Jun 1, 2026

Thanks both — agreed on each point, and they turned out to be the same fix.

On the new field: agree that it's unnecessary. columns_info.empty() already expresses the real invariant — "did we parse any usable per-column stats" — so the separate flag was just a proxy for it. It was also a leaky proxy: it was derived from Avro schema presence, so a writer that declares value_counts / column_sizes / null_value_counts in the schema but emits them empty would set the flag true while columns_info stayed empty, re-triggering the same all-NULL data loss through a different door.

I've dropped the field entirely and gated the absent-NULL loop directly on !columns_info.empty(). One condition now covers both cases — stats fields absent, and stats fields present-but-empty — both yield an empty map and fall through to the Parquet reader. Partial- and full-stats manifests produce a non-empty map, so the loop still runs for them and genuinely schema-evolved-absent columns are still nullified; that path is unchanged.

On changing the protocol in a fork: fully agree, and dropping the field removes the concern at the root rather than versioning around it. columns_info is already serialized to workers via the existing JSON task path, so deriving the gate from its emptiness adds nothing to any serde — JSON or binary. There's no new field, no version byte, and no positional format change, so DataFileMetaInfo serialization stays byte-identical to upstream and a future upstream change there can't collide with this fork.

The reworked commit is now effectively two files: the one-line guard change in StorageObjectStorageSource.cpp and the integration tests. The three deserializer/metadata files that previously carried the flag are no longer in the diff. Regression coverage includes the stats-fields-present-but-empty case, which exercises the gate directly, plus partial- and full-stats controls. Force-pushed.

Copy link
Copy Markdown

@ianton-ru ianton-ru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alsugiliazova
Copy link
Copy Markdown
Member

Verification Report: PR #1814 — Iceberg Read Optimization (Stats-Less Manifests)

Verdict

CI status RED (5 failed jobs)
Block merge on CI? Policy-dependent — no failure tied to the PR fix or new tests
Fix validated in CI? Yes — all four new integration scenarios passed on amd_binary
Requested Iceberg regression Did not run (ci_regression_jobs: [], regression jobs skipped)

Failed jobs (5)

Job Failing test(s) Relation to PR
Stateless amd_tsan, sequential, 1/2 03760_backup_tar_archive [FAIL]; filesystem-cache suite [BROKEN] (02242, 02241, 02240, 02226) Unrelated — backup/restore; missing azure_cache storage policy on runner
Stateless amd_tsan, s3 storage, sequential, 1/2 03760_backup_tar_archive [FAIL] Unrelated
Stateless arm_binary, sequential Filesystem-cache tests [BROKEN] (azure_cache); 00157_cache_dictionary [FAIL] Unrelated — env / flaky dictionary
Integration amd_tsan, 4/6 test_storage_iceberg_with_spark/test_remote_initiator.py::test_remote_initiator_after_non_remote[s3] Unrelated — different Iceberg test (Spark path, remote initiator)
Integration amd_msan, 4/6 Same test_remote_initiator_after_non_remote[s3] Unrelated

PR-scoped failures: 0

Filesystem-cache BROKEN pattern

Stateless tests fail at CREATE TABLE ... SETTINGS storage_policy='azure_cache' with:

Code: 478. DB::Exception: Unknown storage policy 'azure_cache'. (UNKNOWN_POLICY)

This is a CI environment/configuration issue, not a regression from StorageObjectStorageSource.cpp.


PR-specific tests (all passed)

On Integration tests (amd_binary, 4/5) — job that runs changed integration tests:

Test Result
test_iceberg_local_returns_actual_rows_with_stats_less_manifest PASSED
test_iceberg_local_returns_correct_rows_when_optimization_disabled PASSED
test_iceberg_local_partial_stats_manifest_reads_correctly PASSED
test_iceberg_local_full_stats_manifest_reads_correctly PASSED

Recommendation

  1. Merge gate (strict CI green): Re-run failed jobs or confirm failures are pre-existing on antalya-26.3 before merging.
  2. Merge gate (fix-only): CI supports the fix — new tests green on primary integration binary job; no failure in changed files’ test path.
  3. Run Iceberg regression if required for release — not executed in this workflow run despite PR checkbox.
  4. Optional: Query gh-data.checks for flake history on 03760_backup_tar_archive and test_remote_initiator_after_non_remote[s3] on PR 1814 vs base branch.

CI verdict: Red workflow with no identified PR-caused test failure; actionable gaps are infra/flaky unrelated tests and missing requested regression run.

@alsugiliazova
Copy link
Copy Markdown
Member

Audit (re-run): PR #1814 — Iceberg read optimization NULLs for stats-less manifests (antalya-26.3)

AI audit note: This review comment was generated by AI (Composer).

Audit update for PR #1814 (Fix Iceberg read optimization returning NULLs for stats-less manifests — antalya-26.3):

No confirmed defects in reviewed scope.

Both defects from the previous audit are resolved by the reworked diff:

  • Previous Medium ("empty stats arrays still take the all-NULL path"): resolved. The new gate !file_meta_data.value()->columns_info.empty() is the real invariant ("did we parse any per-column stats"). It catches both "stats fields absent in writer schema" and "stats fields declared but empty values" with one condition, eliminating the leaky hasPath-based proxy.
  • Previous Low ("binary DataFileMetaInfo serde extended without versioning"): resolved. DataFileMetaInfo is no longer modified; binary/JSON serialization stays byte-identical to upstream.

Coverage summary

  • Scope reviewed: End-to-end path IcebergIterator::nextDataFileMetaInfo (local + cluster JSON task via CommandInTaskResponse::toJson) → StorageObjectStorageSource::createReader optimization branches. Verified the outer if (file_meta_data.has_value()) block (line 820) and the new inner guard interact correctly with the pre-existing first loop computing constants from columns_info stats. Verified schema-evolved-absent columns are still handled upstream by IcebergMetadata::getInitialSchemaByPath setting initial_header to the file's own schema, so falling through to the Parquet reader when columns_info is empty is safe.
  • Categories failed: none.
  • Categories passed: main Read using IcebergLocal returns Nulls instead of actual rows #1545 reproducer (no stats fields in schema); stats fields declared but empty values (same gate, equivalent behavior); partial stats (value_counts only); full Spark-style stats; optimization disabled; cluster path (columns_info already serialized via JSON, no new protocol surface); C++ lifetime/UB/exception classes (single const predicate, no mutations or allocations); concurrency (per-task metadata, no new shared state).
  • Observations (not defects):
    • Sparse-per-column stats (writer emits stats for a subset of top-level columns) is a latent, pre-existing assumption of the optimization, not introduced or worsened by this PR; not realistically triggered by pyiceberg/Spark/others which emit per-column maps either fully or not at all for top-level fields.
    • The new guard redundantly re-checks file_meta_data.has_value() (already established by the enclosing block at line 820); harmless style nit.
    • No dedicated integration test for the "stats fields declared in Avro schema but empty arrays" case; the gate condition (columns_info.empty()) is identical for that input to the covered "no stats fields" case, so behavior is equivalent and the existing *_with_stats_less_manifest test exercises the gate.
  • Assumptions/limits: static audit only; CI verification report (separate comment) shows the four new integration scenarios passing on amd_binary, no PR-caused failures among red jobs.

@alsugiliazova
Copy link
Copy Markdown
Member

Audit (re-run): PR #1814 — Iceberg read optimization NULLs for stats-less manifests (antalya-26.3)

AI audit note: This review comment was generated by AI.

Audit update for PR #1814 (Fix Iceberg read optimization returning NULLs for stats-less manifests — antalya-26.3):

No confirmed defects in reviewed scope.

Both defects from the previous audit are resolved.

@alsugiliazova
Copy link
Copy Markdown
Member

Could you please reopen this pr from Altinity/ClickHouse repo so we get direct .deb package URLs for ClickHouse regression?

Your CI only exposes DEB_ARM_RELEASE as a GitHub Actions artifact zip which I can not use directly to run tests in https://github.com/Altinity/clickhouse-regression/actions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya antalya-26.3 verified-with-issues Verified by QA and issues found.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants