bug: StrictMetricsEvaluator returns incorrect result when null/NaN counts are missing for a field. #685

@sentomk

Summary

StrictMetricsEvaluator::CanContainNulls and CanContainNaNs incorrectly return false when the null_value_counts / nan_value_counts map is non-empty but does not contain an entry for the queried field. This causes the evaluator to erroneously return kRowsMustMatch, potentially skipping row-level filtering and returning rows that do not satisfy the predicate.

Root Cause

In src/iceberg/expression/strict_metrics_evaluator.cc:

bool CanContainNulls(int32_t id) {
  if (data_file_.null_value_counts.empty()) {
    return true;
  }
  auto it = data_file_.null_value_counts.find(id);
  return it != data_file_.null_value_counts.cend() && it->second > 0;
  //     ^^^ when the field is missing from the map, this evaluates to false
}

The same pattern exists in CanContainNaNs.

Reproduction

  auto data_file = std::make_shared<DataFile>();
  data_file->record_count = 50;
  data_file->value_counts = {{14, 50L}};
  data_file->null_value_counts = {{4, 0L}, {5, 0L}};  // field 14 missing
  data_file->nan_value_counts = {{8, 0L}};             // field 14 missing
  data_file->upper_bounds = {{14, Literal::Double(100.0).Serialize().value()}};
  data_file->lower_bounds = {{14, Literal::Double(1.0).Serialize().value()}};

  // Evaluating: no_nan_stats < 200.0
  // Expected: kRowsMightNotMatch (null and NaN counts unknown for field 14)
  // Actual:   kRowsMustMatch (incorrectly skips filtering)
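
The lookup behavior can also be checked in isolation, without building a DataFile. The following is a minimal sketch that copies the logic quoted under Root Cause into a free function; CountMap and CanContainNullsCurrent are hypothetical names used only for illustration and are not part of the library.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for DataFile::null_value_counts (field id -> count).
using CountMap = std::unordered_map<int32_t, int64_t>;

// Mirrors the current logic: an empty map is treated as "unknown", but a
// missing entry in a non-empty map is treated as "no nulls".
bool CanContainNullsCurrent(const CountMap& null_value_counts, int32_t id) {
  if (null_value_counts.empty()) {
    return true;
  }
  auto it = null_value_counts.find(id);
  return it != null_value_counts.cend() && it->second > 0;
}
```

With the counts from the reproduction, `{{4, 0}, {5, 0}}`, calling this function for field 14 returns false, even though nothing is known about nulls in that field.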

Proposed Fix

CanContainNulls: if the field is required per schema, return false; if the field is not found in a non-empty map, return true (conservative).
CanContainNaNs: if the field type is not float/double, return false; if the field is not found in a non-empty map, return true (conservative).

This aligns with Java's StrictMetricsEvaluator.canContainNulls() / canContainNaNs() which return true when the field is missing from the map.
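
A rough sketch of the proposed behavior, written as free functions so it can be compiled on its own. FieldInfo, CountMap, CanContainNullsFixed and CanContainNaNsFixed are hypothetical names for illustration; the real evaluator would take the required flag and the field type from the schema rather than from a struct like this.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for the null/NaN count maps (field id -> count).
using CountMap = std::unordered_map<int32_t, int64_t>;

// Hypothetical per-field facts; assumed to come from the schema in practice.
struct FieldInfo {
  bool is_required;        // required fields cannot hold nulls
  bool is_floating_point;  // only float/double fields can hold NaNs
};

bool CanContainNullsFixed(const CountMap& null_value_counts,
                          const FieldInfo& field, int32_t id) {
  if (field.is_required) {
    return false;
  }
  auto it = null_value_counts.find(id);
  if (it == null_value_counts.cend()) {
    return true;  // no stats for this field (or no stats at all): be conservative
  }
  return it->second > 0;
}

bool CanContainNaNsFixed(const CountMap& nan_value_counts,
                         const FieldInfo& field, int32_t id) {
  if (!field.is_floating_point) {
    return false;
  }
  auto it = nan_value_counts.find(id);
  if (it == nan_value_counts.cend()) {
    return true;  // no stats for this field (or no stats at all): be conservative
  }
  return it->second > 0;
}
```

The missing-entry check also covers the empty-map case, so the separate `empty()` branch is no longer needed in this sketch. For an optional double field 14 and the maps from the reproduction, both functions return true, which would lead the evaluator to the expected kRowsMightNotMatch.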
