fix(expr): correct CanContainNulls/CanContainNaNs when field missing from stats map#686
fix(expr): correct CanContainNulls/CanContainNaNs when field missing from stats map#686sentomk wants to merge 2 commits into
Conversation
…from stats map When null_value_counts or nan_value_counts is non-empty but does not contain an entry for the queried field, the evaluator incorrectly returned false (cannot contain nulls/NaNs). This caused comparison operators to erroneously return kRowsMustMatch, skipping row-level filtering for rows that may not satisfy the predicate. Fix CanContainNulls to: - Return false for required (non-optional) fields per schema - Return true (conservative) when field is absent from a non-empty map Fix CanContainNaNs to: - Return false for non-floating-point types - Return true (conservative) when field is absent from a non-empty map This aligns with the Java StrictMetricsEvaluator behavior.
There was a problem hiding this comment.
Pull request overview
This PR fixes StrictMetricsEvaluator’s handling of missing per-field entries in null_value_counts / nan_value_counts, making the evaluator conservative (i.e., assume nulls/NaNs may exist) when the stats maps are non-empty but omit the queried field, preventing incorrect kRowsMustMatch results that could skip row-level filtering.
Changes:
- Update
CanContainNullsto returntruewhen a field is absent from a non-emptynull_value_countsmap, andfalsefor required schema fields. - Update
CanContainNaNsto returntruewhen a field is absent from a non-emptynan_value_countsmap, andfalsefor non-floating-point fields. - Add regression tests for missing-entry scenarios (with one test needing adjustment to actually exercise the bug condition).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/iceberg/expression/strict_metrics_evaluator.cc |
Makes null/NaN containment checks conservative when stats maps omit a field, and adds required/non-floating fast paths. |
src/iceberg/test/strict_metrics_evaluator_test.cc |
Adds regression coverage for missing null/NaN count entries (one test currently doesn’t set bounds, so it won’t regress the original issue). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f20015a to
e7344ea
Compare
Cover the scenario where a field is absent from null_value_counts or nan_value_counts while the map is non-empty. Verify that comparison operators conservatively return kRowsMightNotMatch instead of incorrectly claiming kRowsMustMatch.
e7344ea to
68c68b3
Compare
|
|
||
| bool CanContainNulls(int32_t id) { | ||
| auto field_result = schema_.GetFieldById(id); | ||
| if (field_result.has_value() && field_result->has_value() && |
There was a problem hiding this comment.
We should preserve error returned by schema_.GetFieldById, so we need to change the return type to Result<bool>. BTW, why calling GetFieldById instead of FindFieldById? They are semantically different.
| } | ||
|
|
||
| bool CanContainNaNs(int32_t id) { | ||
| // nan counts might be null for early version writers when nan counters are not |
There was a problem hiding this comment.
I think the original implementation matches the Java parity exactly, which assumes null_value_counts should populate all fields but nan_value_counts don't (when they are not empty). I agree current PR is a nice fix. I still suggest keeping this comment since it still holds.
| ExpectShouldRead(Expressions::NotNull("struct.nested_col_with_stats"), false); | ||
| } | ||
|
|
||
| TEST_F(StrictMetricsEvaluatorMigratedTest, MissingNullCountForField) { |
There was a problem hiding this comment.
This is not migrated from Java implementation so please do not use this confusing name.
Summary
StrictMetricsEvaluator::CanContainNullsto returntrue(conservative) when a field is absent from a non-emptynull_value_countsmap, andfalsefor required fields per schemaStrictMetricsEvaluator::CanContainNaNsto returntrue(conservative) when a field is absent from a non-emptynan_value_countsmap, andfalsefor non-floating-point typesClose #685
Motivation
When
null_value_countsornan_value_countsis non-empty but does not contain an entry for the queried field, the evaluator incorrectly returnedfalse. This caused comparison operators to erroneously returnkRowsMustMatch, potentially skipping row-level filtering and returning rows that do not satisfy the predicate.Test plan