Skip to content

[MaterializedView] Harden partition-state engine#18642

Open
hongkunxu wants to merge 3 commits into
apache:masterfrom
hongkunxu:feat/sse_mv_hardening
Open

[MaterializedView] Harden partition-state engine#18642
hongkunxu wants to merge 3 commits into
apache:masterfrom
hongkunxu:feat/sse_mv_hardening

Conversation

@hongkunxu
Copy link
Copy Markdown
Contributor

@hongkunxu hongkunxu commented Jun 1, 2026

Description

Hardens the MV partition-state engine: two correctness fixes, then consolidation of mutations and fingerprint logic.

Bugfix

  1. Backfill after retention DELETE — Old DELETE removed the runtime partition entry (VACANT). ConsistencyMgr did not mark absent in-coverage buckets STALE, so backfill never triggered OVERWRITE. DELETE now leaves VALID + empty fingerprint; ConsistencyMgr can synthesize VACANT → STALE for in-coverage windows.
  2. Watermark vs wall clockwatermarkMs was advanced toward now - bufferMs even when the base table had no newer segments, so broker freshness (now - watermarkMs) could report stale MV data as fresh. Watermark advancement is now capped at the latest source segment endTime.

Refactor

  • Single CAS owner: MaterializedViewPartitionManager (replaces three ad-hoc retry loops).
  • Shared MaterializedViewTaskUtils#computeWindowFingerprint for scheduler, executor, and consistency paths.

Design reference

The full partition-state machine — state definitions, operations, allowed transitions, and how the consistency manager keeps the partition map in sync with base-table segment changes — is documented here, with the core tables inlined below for reviewer convenience:

MV partition state engine — design notes (Google Doc)

Partition state definitions

State Description
VACANT No PartitionInfo entry exists for this bucket key in the partition map. The bucket has not been recorded by the MV's per-partition state machine.
VALID A PartitionInfo entry exists in the partition map with state = VALID and a PartitionFingerprint (segment count + farm-hash CRC, or PartitionFingerprint.EMPTY).
STALE A PartitionInfo entry exists in the partition map with state = STALE. The MV's recorded fingerprint is known (or assumed) to no longer match the current base-table source for this bucket — typically because ConsistencyMgr observed a base-table change.

Operation definitions

Trigger Operation Partition effect Notes
MinionTaskCycle MinionTaskCycle_APPEND VACANT → VALID Runs at the frontier (>= watermarkMs); advances watermarkMs; fingerprint is computed over overlapping base-table segments (or PartitionFingerprint.EMPTY if the source window has no overlapping segments).
MinionTaskCycle_OVERWRITE STALE → VALID In-coverage STALE bucket whose source still has data; re-materializes MV segments via segment-lineage replace; watermarkMs unchanged; fingerprint = recomputed value.
MinionTaskCycle_DELETE STALE → VALID In-coverage STALE bucket whose source has been retention-deleted (segmentCount == 0); drops MV segments via segment-lineage replace (segmentsTo = []); watermarkMs unchanged; fingerprint = PartitionFingerprint.EMPTY. The entry is intentionally not removed from the map — keeping it as VALID-empty eliminates VACANT as a runtime state for processed windows so a later backfill flips through the standard VALID → STALE → OVERWRITE cycle.
ConsistencyMgr ConsistencyMgr_MARK_STALE * → STALE Sole side effect of ConsistencyMgr. A base-table segment add / update / delete event is run through a filter pipeline (pre-coverage check, post-coverage / watermark cap, existence + state check); surviving events drive this op. After the bug fix it covers both VALID → STALE and in-coverage VACANT → STALE (synthesizing a fresh STALE entry).
Manual Manual_MARK_STALE * → STALE Shared op for the future REFRESH MV family of admin commands: REFRESH_MV (whole MV), REFRESH_MV_RANGE (time range, with optional coverage extension below min(partKey)), and REFRESH_MV_PARTITION (single bucket). All three variants have identical per-partition semantics — they differ only in selection scope (which buckets are targeted and whether VACANT buckets are synthesized).
Manual_DROP_MV * → VACANT DDL DROP MATERIALIZED VIEW. Atomically deletes the runtime znode (and the table config / schema). Bypasses the per-partition state machine — partitions are not first marked STALE; the entire entry tree disappears in a single ZooKeeper delete.

State transition table

From \ To VACANT VALID STALE
VACANT MinionTaskCycle_APPEND ConsistencyMgr_MARK_STALE, Manual_MARK_STALE
VALID Manual_DROP_MV ConsistencyMgr_MARK_STALE, Manual_MARK_STALE
STALE Manual_DROP_MV MinionTaskCycle_OVERWRITE, MinionTaskCycle_DELETE

Sync with base-table change

The full filter pipeline (pre-coverage, post-coverage / watermark cap, existence + state check) that ConsistencyMgr applies to base-segment add / update / delete events — including how it decides which buckets to enumerate and which transitions to emit — lives in the linked design doc.

This PR is the first implementation pass that fully matches that document; the post-merge state machine has a single owner (MaterializedViewPartitionManager) and one test surface that exercises every transition in the table above.

Why introduce MaterializedViewPartitionManager

Before this PR the per-partition mutation logic was spread across three sites — the minion executor, the controller-side task scheduler, and the consistency manager — each shipping its own ~50-line CAS-write retry loop with its own retry budget, its own jittered backoff, and its own hand-rolled Map<Long, PartitionInfo> copy/edit code:

Site Trigger Bespoke CAS loop
Executor APPEND / OVERWRITE / DELETE task commit DEFAULT_MAX_RUNTIME_UPDATE_ATTEMPTS
Scheduler False-positive STALE revert MAX_PARTITION_STATE_PERSIST_RETRIES (8, fixed)
Consistency mgr Segment-change flush MAX_MARK_RETRIES + ThreadLocalRandom backoff

Concrete problems this caused:

  • Three retry/backoff policies that could (and did) drift apart whenever any one was tuned. Operators had no single knob to control "MV runtime znode contention pressure".
  • Watermark advancement on APPEND was a separate ZK write from the bucket insert in some refactor proposals; under concurrent writers this opens a window where the partition map and watermarkMs disagree, which in turn poisons the broker's now - watermarkMs <= stalenessThresholdMs check.
  • Adding a new state to the state machine required touching ~3 files with ~3 different idioms. Every new transition was an opportunity to forget a precondition check (e.g. existing.getState() == STALE before OVERWRITE).
  • Cross-package callers could synthesize arbitrary PartitionInfo objects directly, bypassing every invariant the manager would otherwise enforce.

The new design replaces all three loops with one CAS engine (MaterializedViewPartitionManager#applyMutation), exposes one public method per documented operation (appendValid / refreshValid / clearValid / revertValid / markStale / deletePartition), and bundles watermark advancement into the same atomic write that mutates the bucket. The PartitionInfo constructor is locked to package access so production code outside the metadata package physically cannot bypass the manager — tests opt in via a single named factory PartitionInfo.forTesting(...) annotated @VisibleForTesting.

Net effect: one retry budget, one backoff implementation, one set of preconditions, one place to add the next state. Future transitions (VACANT → STALE synthesize for in-coverage backfill, an explicit VALID-empty state, etc.) become a one-method addition with predictable behavior, not a three-site coordination exercise.

Bug fixes that landed alongside the refactor

Two separate correctness bugs were fixed as part of this pass — both touch the same partition-state machinery the manager now owns, and both now have regression tests in the manager's unit-test surface:

  1. Watermark over-advance past latest source data. The MV scheduler kept advancing watermarkMs toward now - bufferMs even when the base table had stopped receiving new data, which made the broker's freshness check now - watermarkMs <= stalenessThresholdMs return true for stale MV data. watermarkMs is now capped at the latest endTime present in the source segments, so the staleness check reflects real data freshness instead of wall-clock drift.

  2. Backfill into a previously-deleted bucket silently dropped. When a STALE bucket's source data was retention-deleted, the executor's DELETE branch removed the runtime PartitionInfo entry entirely. That left the bucket in the absent state, which the consistency manager's markStale pass deliberately skips, so any subsequent base-table backfill into that window never propagated to the MV. The DELETE branch now writes VALID + PartitionFingerprint.EMPTY instead, so the bucket follows the standard VALID → STALE → OVERWRITE cycle on backfill. PartitionFingerprint.EMPTY is byte-identical to what computeWindowFingerprint already produces when the overlapping segment list is empty (both feed an empty input through farmHashFingerprint64), so existing ZK records remain comparable across rolling upgrades — no migration required.

Other clean-ups bundled in

Two verbatim copies of the filter + sort + farmHashFingerprint64 window-fingerprint algorithm (one in the scheduler, one in the executor) are collapsed into MaterializedViewTaskUtils#computeWindowFingerprint. Drift between the two copies would silently break the executor's commit-time fingerprint validation and produce tasks that retry forever without advancing the watermark — one of the worst classes of MV bugs to triage.

Tests

  • New: MaterializedViewPartitionManagerTest exercises every public method, including CAS retry / version-conflict paths and precondition violations.
  • Updated: scheduler / executor / consistency-manager / rewrite-engine tests rebased onto the manager and the PartitionInfo.forTesting factory.
  • Manual end-to-end validation against a local Pinot cluster: created airlineStats_mv_hardening_smoke with 1m refresh and exercised the segment-delete → STALE → executor-DELETE → backfill path; verified that the VALID-empty write keeps the bucket re-syncing on backfill, and that the watermark stops advancing once base-table ingestion stalls.

Suggested labels

bugfix, refactor

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 71.15385% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.58%. Comparing base (eddf4c0) to head (a71f721).
⚠️ Report is 39 commits behind head on master.

Files with missing lines Patch % Lines
...materializedview/MaterializedViewTaskExecutor.java 3.70% 26 Missing ⚠️
...iew/metadata/MaterializedViewPartitionManager.java 88.60% 7 Missing and 11 partials ⚠️
...onsistency/MaterializedViewConsistencyManager.java 50.00% 12 Missing and 4 partials ⚠️
...dview/scheduler/MaterializedViewTaskScheduler.java 39.13% 14 Missing ⚠️
...lizedview/scheduler/MaterializedViewTaskUtils.java 93.75% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18642      +/-   ##
============================================
+ Coverage     56.79%   64.58%   +7.79%     
- Complexity        7     1288    +1281     
============================================
  Files          2579     3373     +794     
  Lines        149557   208651   +59094     
  Branches      24165    32583    +8418     
============================================
+ Hits          84939   134761   +49822     
- Misses        57424    63060    +5636     
- Partials       7194    10830    +3636     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.58% <71.15%> (+7.79%) ⬆️
temurin 64.58% <71.15%> (+7.79%) ⬆️
unittests 64.58% <71.15%> (+7.79%) ⬆️
unittests1 56.92% <ø> (+0.13%) ⬆️
unittests2 37.22% <71.15%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal correctness issue; see inline comment.

// see `MaterializedViewPartitionManager#clearValid` for the design rationale (no
// fingerprint validation needed; the source contained zero overlapping segments by
// construction at scheduler dispatch time).
partitionManager.clearValid(tableName, windowStartMs);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This removes the commit-time source check from the DELETE path. If a backfill lands after the scheduler emitted DELETE but before postProcess() runs, the create-segment event is ignored because the partition is already STALE; then clearValid() flips it to VALID + EMPTY and the broker will rewrite queries to an empty MV bucket even though source rows now exist. The old behavior of removing the entry still fell back to base-table routing, so this is a silent wrong-results regression. DELETE needs a commit-time fingerprint/emptiness validation (or another way to preserve STALE on mid-flight backfills) before writing VALID + EMPTY.

@xiangfu0 xiangfu0 added bug Something is not working as expected refactor Code restructuring without changing behavior materialized-view labels Jun 2, 2026
…partition-state centralization

This change consolidates two materialized-view bug fixes and two refactors
that together close the gap between the broker's freshness contract and
the actual MV ingestion state, and remove duplicated CAS / fingerprint
logic that had drifted across the executor, scheduler, and consistency
manager.

1. bugfix: Prevent watermarkMs from advancing past latest source data

   The MV scheduler previously kept advancing watermarkMs toward
   `now - bufferMs` even when the base table had stopped receiving new
   data, which caused the rewrite engine to treat stale MV data as
   fresh.  watermarkMs is now capped at the latest endTime present in
   the source segments, so the staleness check
   `now - watermarkMs <= stalenessThresholdMs` reflects real data
   freshness instead of wall-clock drift.

2. bugfix: Persist VALID-empty on DELETE so backfilled buckets re-sync

   When a STALE bucket's source data was retention-deleted, the
   executor's DELETE branch used to remove the runtime PartitionInfo
   entry entirely.  That left the bucket in the `absent` state, which
   the consistency manager's STALE-marking pass deliberately skips, so
   any subsequent base-table backfill into that window would never
   propagate to the MV.

   Fix: the DELETE branch now writes `VALID + PartitionFingerprint.EMPTY`
   instead of removing the entry.  `absent` is now reserved for
   cold-start only; once any task has run for a bucket, an entry exists.
   A backfill into a previously-empty bucket flows through the standard
   `VALID -> STALE -> OVERWRITE` cycle the consistency manager already
   drives.

   PartitionFingerprint.EMPTY is byte-identical to what the scheduler's
   and executor's `computeWindowFingerprint` already produce when the
   overlapping segment list is empty (both feed an empty input through
   farmHashFingerprint64).  This keeps empty-by-DELETE and
   empty-by-APPEND on a single representation, so existing ZK records
   remain comparable across rolling upgrades.

3. refactor: Centralize computeWindowFingerprint in MaterializedViewTaskUtils

   The scheduler and the minion executor each carried a verbatim copy
   of the filter+sort+farmHash64 algorithm used to fingerprint the
   source segments overlapping a partition window.  Any drift between
   the two copies (different hasher, encoding, or sort key) would
   silently break the executor's commit-time fingerprint validation,
   producing one of the hardest classes of MV bugs to diagnose: tasks
   that fail validation on the way in, retry forever, and never advance
   the watermark.

   The algorithm now lives in
   MaterializedViewTaskUtils#computeWindowFingerprint as the single
   source of truth.  Both the scheduler call sites and the executor
   call site invoke the utility directly with
   `_context.getSegmentsZKMetadata` inlined at the call site - no
   per-module wrapper survives, so there is nowhere for a future
   contributor to silently re-implement it.

4. refactor: Centralize all partition-state mutations through
   MaterializedViewPartitionManager

   Adds a state-change DSL backed by a single CAS engine that
   consolidates every per-partition mutation on the materialized-view
   runtime znode.  The public methods
   (appendValid / refreshValid / clearValid / revertValid / markStale /
   deletePartition) map one-to-one to the per-partition state machine;
   private CRUD primitives enforce structural invariants on the
   in-memory partition map; applyMutation centralizes version-checked
   CAS retries with two retry profiles (critical, cluster-tunable;
   revert, fixed budget).

   The executor (APPEND -> appendValid, OVERWRITE -> refreshValid,
   DELETE -> clearValid), scheduler (false-positive STALE revert ->
   revertValid), and consistency manager (segment-change flush ->
   markStale) are switched onto the manager.  Watermark advancement on
   APPEND now happens inside the manager's mutator under the same
   atomic write that adds the bucket entry, so map state and watermark
   cannot diverge under concurrent writers.

   The PartitionInfo constructor is locked to package access to encode
   the contract at compile time: production code outside the metadata
   package physically cannot synthesize a PartitionInfo without going
   through the manager.  Cross-package tests use a single named factory
   `PartitionInfo.forTesting(...)` annotated `@VisibleForTesting`.

   Three bespoke CAS retry loops (DEFAULT_MAX_RUNTIME_UPDATE_ATTEMPTS in
   the executor, MAX_PARTITION_STATE_PERSIST_RETRIES in the scheduler,
   MAX_MARK_RETRIES + jittered backoff in the consistency manager) and
   their associated ThreadLocalRandom imports are removed; retry budget
   is now governed by a single CLUSTER_CONFIG_KEY_MAX_RUNTIME_UPDATE_ATTEMPTS
   cap.

Signed-off-by: Hongkun Xu <xuhongkun666@163.com>
Co-authored-by: Xiang Fu <xiangfu.1024@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Apache Pinot’s Materialized View (MV) partition-state engine by centralizing runtime znode mutations into a single CAS-based manager, fixing correctness issues around retention-delete/backfill tracking and watermark advancement, and deduplicating window-fingerprint computation across scheduler/executor paths.

Changes:

  • Introduces MaterializedViewPartitionManager as the single owner for MV runtime partition-map + watermark mutations (CAS retries/backoff consolidated).
  • Changes DELETE handling to keep partitions “tracked but empty” (VALID + PartitionFingerprint.EMPTY) instead of removing the entry, enabling later backfill to follow the normal refresh cycle.
  • Caps scheduler APPEND cutoff by the latest source-segment endTimeMs and unifies window fingerprint computation via MaterializedViewTaskUtils#computeWindowFingerprint.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pinot-plugins/pinot-minion-tasks/pinot-minion-builtin-tasks/src/main/java/org/apache/pinot/plugin/minion/tasks/materializedview/MaterializedViewTaskExecutor.java Switches post-commit runtime updates to MaterializedViewPartitionManager and adjusts DELETE semantics to persist VALID-empty.
pinot-plugins/pinot-minion-tasks/pinot-minion-builtin-tasks/src/test/java/org/apache/pinot/plugin/minion/tasks/materializedview/MaterializedViewTaskExecutorTest.java Rebases fingerprint/contiguous-watermark tests onto MaterializedViewTaskUtils and adds coverage for empty-overlap fingerprint equality with EMPTY.
pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/metadata/MaterializedViewPartitionManager.java New central CAS mutation engine + state-change DSL for MV runtime partition state transitions and watermark updates.
pinot-materialized-view/src/test/java/org/apache/pinot/materializedview/metadata/MaterializedViewPartitionManagerTest.java New unit-test surface covering manager operations, preconditions, and CAS retry behavior.
pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/scheduler/MaterializedViewTaskScheduler.java Caps APPEND scheduling cutoff by max source segment end time; routes false-positive STALE reverts through the manager; uses shared fingerprint helper.
pinot-materialized-view/src/test/java/org/apache/pinot/materializedview/scheduler/MaterializedViewTaskSchedulerTest.java Adds tests for computeMaxSourceEndTimeMs behavior, including legacy/unset end times.
pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/scheduler/MaterializedViewTaskUtils.java Adds shared computeWindowFingerprint implementation used by both scheduler and executor.
pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/consistency/MaterializedViewConsistencyManager.java Replaces bespoke CAS retry logic with manager-based markStale; refactors bucket enumeration + watermark cap behavior.
pinot-materialized-view/src/test/java/org/apache/pinot/materializedview/consistency/MaterializedViewConsistencyManagerTest.java Updates fixtures to use PartitionInfo.forTesting(...).
pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/metadata/PartitionFingerprint.java Introduces PartitionFingerprint.EMPTY with farmHash64(empty) to canonicalize “empty window” fingerprints across paths.
pinot-materialized-view/src/test/java/org/apache/pinot/materializedview/metadata/PartitionFingerprintTest.java Adds tests pinning EMPTY’s invariants (segmentCount=0, CRC matches farmHash64(empty), encode/decode).
pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/metadata/PartitionInfo.java Makes constructor package-private and adds PartitionInfo.forTesting(...) to discourage cross-package construction.
pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/metadata/PartitionState.java Updates state-model documentation to reflect tracked-empty semantics and reserved meaning of “absent”.
pinot-materialized-view/src/test/java/org/apache/pinot/materializedview/rewrite/MaterializedViewQueryRewriteEngineTest.java Updates test fixture construction to use PartitionInfo.forTesting(...).

Comment on lines 174 to 176
// DELETE mode: skip query execution; remove existing MV segments and rewrite the
// runtime PartitionInfo to VALID-empty (see updateMaterializedViewRuntime DELETE branch).
if (MaterializedViewTask.TASK_MODE_DELETE.equals(taskMode)) {
Comment on lines +394 to +397
/// via segment lineage replace (segmentsFrom=[old segments], segmentsTo=[]). No query
/// is executed and no new MV segments are created; the runtime PartitionInfo is rewritten
/// to `VALID + PartitionFingerprint.EMPTY` by [#updateMaterializedViewRuntime] in
/// [#postProcess], so the partition stays tracked-but-empty rather than disappearing.
Comment on lines +220 to +228
overlapping.sort(Comparator.comparing(SegmentZKMetadata::getSegmentName));
Hasher hasher = Hashing.farmHashFingerprint64().newHasher();
for (SegmentZKMetadata seg : overlapping) {
hasher.putString(seg.getSegmentName(), StandardCharsets.UTF_8);
hasher.putByte((byte) 0);
hasher.putLong(seg.getCrc());
hasher.putByte((byte) '\n');
}
return new PartitionFingerprint(overlapping.size(), hasher.hash().asLong());
Comment on lines +435 to 440
List<Long> buckets = new ArrayList<>();
long partStart = Math.floorDiv(affectedStartMs, bucketMs) * bucketMs;
while (partStart <= affectedEndMs) {
PartitionInfo info = updatedInfos.get(partStart);
if (info != null && info.getState() == PartitionState.VALID) {
updatedInfos.put(partStart, info.withState(PartitionState.STALE));
anyChanged = true;
markedCount++;
}
while (partStart <= cappedEnd) {
buckets.add(partStart);
partStart += bucketMs;
}
Comment on lines +308 to +312
/// Lenient: if the bucket is absent or already STALE, the call is a no-op. Today the
/// in-coverage VACANT case is a deliberate skip (matching the prior consistency manager
/// behavior); a future change will synthesize a STALE entry for in-coverage VACANT
/// buckets to fix the backfill silent-data-loss case.
///
@xiangfu0
Copy link
Copy Markdown
Contributor

xiangfu0 commented Jun 5, 2026

Review: PR #18642 — [MaterializedView] Harden partition-state engine

Repo: apache/pinot
Branch: feat/sse_mv_hardening
Author: hongkunxu (Hongkun Xu)
Date: 2026-06-02

Repo context loaded:

  • CLAUDE.md read (root CLAUDE.md defining build commands, conventions, mandatory code review)
  • CODEOWNERS read (no CODEOWNERS file in repo)
  • Module-level docs for touched paths read (pinot-materialized-view/DESIGN.md)

Stage 1: Triage & Evidence Collection

PR description

  • Description present
  • States what problem is being solved
  • States why this approach
  • States risky areas

Findings:

  • Severity: observation. Confidence: high. Evidence: PR body. The description is exceptionally thorough — fixes two correctness bugs (watermark over-advance, backfill-after-delete), refactors three retry sites into one CAS engine, centralizes fingerprinting. The "Why introduce MaterializedViewPartitionManager" and "Backward compatibility / rolling upgrade" sections both state the risk surface explicitly. ZK format is unchanged, public REST surface unchanged. No regression-test gaps are claimed.

Change summary

  • Full diff read
  • Summary produced
  • Files listed by module

Summary:
The PR introduces MaterializedViewPartitionManager as the single source of truth for all per-partition state transitions on the MV runtime ZNode. Before this PR, three separate sites (executor, scheduler, consistency manager) each shipped their own CAS retry loop with subtly different budgets, jitter, and exception classification. The new manager exposes a typed DSL (appendValid / refreshValid / clearValid / revertValid / markStale / deletePartition) and routes every mutation through one applyMutation CAS engine with two retry profiles (critical / revert). Two correctness bugs are fixed alongside: (1) the APPEND watermark is now capped at max(source segment endTimeMs) so it cannot drift past real data when ingestion stalls, and (2) the DELETE task now writes VALID + PartitionFingerprint.EMPTY instead of removing the entry so backfill follows the standard VALID → STALE → OVERWRITE cycle. Two duplicate computeWindowFingerprint copies are collapsed into MaterializedViewTaskUtils#computeWindowFingerprint. PartitionInfo's constructor becomes package-private with a named forTesting factory.

Files:

  • pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/metadata/
    • MaterializedViewPartitionManager.java (new, 499 lines)
    • PartitionFingerprint.java (+24 — adds EMPTY constant)
    • PartitionInfo.java (+17 / -1 — package-private constructor + forTesting factory)
    • PartitionState.java (+14 / -7 — javadoc updates clarifying absent semantics)
  • pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/scheduler/
    • MaterializedViewTaskScheduler.java (+62 / -121 — removes inline computeWindowFingerprint and persistPartitionStateChangeWithRetry; adds computeMaxSourceEndTimeMs)
    • MaterializedViewTaskUtils.java (+55 — centralizes computeWindowFingerprint)
  • pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/consistency/
    • MaterializedViewConsistencyManager.java (+97 / -117 — replaces bespoke retry loop with manager routing)
  • pinot-materialized-view/src/test/java/...
    • metadata/MaterializedViewPartitionManagerTest.java (new, 477 lines)
    • metadata/PartitionFingerprintTest.java (+39 — EMPTY contract tests)
    • scheduler/MaterializedViewTaskSchedulerTest.java (+57 — computeMaxSourceEndTimeMs tests)
    • consistency/MaterializedViewConsistencyManagerTest.java, rewrite/MaterializedViewQueryRewriteEngineTest.javaPartitionInfoPartitionInfo.forTesting switches
  • pinot-plugins/pinot-minion-tasks/pinot-minion-builtin-tasks/src/main/java/...
    • materializedview/MaterializedViewTaskExecutor.java (+50 / -178 — replaces bespoke CAS loop with manager calls; removes inline computeWindowFingerprint)
  • Same test class: MaterializedViewTaskExecutorTest.java (+31 / -12 — references the moved helpers + adds an EMPTY-overlap byte-equality test)

Risk classification

  • Classified as low / medium / high
  • Reason stated

Classification: Medium

Reason: The PR touches concurrency-sensitive ZK CAS code paths shared by three subsystems (executor, scheduler, consistency manager) and changes a public API (PartitionInfo constructor) in a way that's source-incompatible across module boundaries. It is contained to the pinot-materialized-view module + the MV minion-task plugin; ZK serialization is unchanged. The materialized-view feature itself is relatively new and gated by table config, so blast radius is bounded to MV-enabled tables. CI is green across binary compatibility, multi-stage query engine, integration, and unit test sets.

Test gaps

  • Changed behaviors/contracts checked for corresponding tests
  • Test coverage of new/changed behavior verified

Gaps:

  • Severity: observation. Confidence: high. Evidence: MaterializedViewPartitionManagerTest.java (new). Every public method of the manager is tested with strict/lenient precondition variants, CAS retry classification (testCasConflictRetriesUntilSuccess), and budget exhaustion (testRevertBudgetExhaustedThrowsRuntimeException). Strong coverage.
  • Severity: observation. Confidence: medium. Evidence: MaterializedViewTaskSchedulerTest adds 5 computeMaxSourceEndTimeMs cases covering empty-list, mixed-negative, all-negative, and zero-endTime boundaries — directly exercising the watermark-cap bug fix.
  • Severity: concern. Confidence: medium. Evidence: MaterializedViewTaskExecutorTest.testWindowFingerprintForEmptyOverlapMatchesEmptyConstant pins the byte-equality contract for the DELETE backfill bug fix, but there is no end-to-end test exercising the full STALE → DELETE → VALID-empty → backfill → STALE → OVERWRITE lifecycle. The PR body cites "manual end-to-end validation against a local Pinot cluster", but no automated regression test covers the backfill-after-delete scenario specifically. A future integration test would close this gap. CI's integration test sets pass but do not appear to exercise this specific MV path.
  • Severity: observation. Confidence: medium. Evidence: scheduler.tryHandleStalePartition. There is no unit test that asserts the scheduler invokes revertValid on a false-positive STALE marking via the manager (rather than the old persistPartitionStateChangeWithRetry it replaced) — coverage of the revertValid wiring relies on the manager test alone. Minor gap; integration coverage is presumed.

Stage 2: Deep Review

Does the implementation address the stated problem?

  • Problem statement identified from the PR description
  • Implementation traced through the code to check whether it appears to address the stated problem
  • No obvious gaps between what the PR claims to fix and what the code actually does

Findings:

  • Severity: observation. Confidence: high. Evidence: MaterializedViewTaskScheduler.java:274-285, computeMaxSourceEndTimeMs:962-971. The watermark over-advance fix is implemented as advertised — the scheduler now resolves the cutoff in two stages and clamps it at max(source segment endTimeMs). The two-stage shape avoids a ZK list call when the rough cutoff already excludes any candidate bucket.
  • Severity: observation. Confidence: high. Evidence: MaterializedViewTaskExecutor.java:457-464 and MaterializedViewPartitionManager.java:248-266. The DELETE backfill fix routes through clearValid, which writes VALID + PartitionFingerprint.EMPTY instead of removing the entry. PartitionFingerprint.EMPTY (PartitionFingerprint.java:61-62) is constructed to be byte-identical to what computeWindowFingerprint produces on an empty overlap (verified by testWindowFingerprintForEmptyOverlapMatchesEmptyConstant).
  • Severity: observation. Confidence: high. Evidence: MaterializedViewPartitionManager.java. The CAS engine consolidation is exhaustive — all six documented state-machine ops have a corresponding public method, every transition flows through applyMutation, and the executor / scheduler / consistency manager construct or hold a manager rather than rolling their own loops.
  • Severity: observation. Confidence: high. Evidence: MaterializedViewTaskUtils.java:210-229. computeWindowFingerprint is the single source of truth — the scheduler and executor both invoke MaterializedViewTaskUtils.computeWindowFingerprint (no remaining duplicate copies). Javadoc captures the byte-equality contract.

High-risk files (medium)

  • Files ranked by logical risk
  • Risk reasons stated per file

Findings:

  1. MaterializedViewPartitionManager.java (new, 499 lines). Severity: observation. Confidence: high. Evidence: file. Center of the change. New CAS engine; if its retry classification or precondition checks are wrong, every state transition is affected. Tests are comprehensive; the architecture has clear separation between public DSL, CRUD primitives, and CAS engine; the Mutator interface's null-return-means-no-op contract is documented and exercised. Risk is contained by the test surface but the class is now the bottleneck for MV correctness.
  2. MaterializedViewTaskExecutor.java. Severity: observation. Confidence: medium. Evidence: file. 178 lines of bespoke CAS loop removed, replaced by 50 lines of manager calls. The simplification is large enough that subtle pre-PR behavior could have shifted — but validateSourceFingerprintAtCommit is still called before appendValid/refreshValid (line 472), and the DELETE branch's lack of fingerprint validation is documented (no source-side state to validate, since clearValid writes a fixed EMPTY constant).
  3. MaterializedViewTaskScheduler.java. Severity: observation. Confidence: medium. Evidence: file. The watermark-cap logic is new and has direct impact on the broker's now - watermarkMs <= stalenessThresholdMs freshness check. The two-stage cutoff resolution adds a code path; the unit tests cover the helper function but not the integrated generateTasks path.
  4. MaterializedViewConsistencyManager.java. Severity: observation. Confidence: medium. Evidence: file. The bespoke retry loop is replaced with the manager. Two ZK reads per flush instead of one — explicitly justified in the new javadoc. The _partitionManager is wired via a volatile field with a deref-on-call lambda so a later setClusterConfigReader propagates correctly (init:139-147). Sequencing is correct: setClusterConfigReader is called AFTER init() in BaseControllerStarter.java:621-636.

Center of gravity (medium)

  • 1-3 key files or decisions identified

Files:

  1. pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/metadata/MaterializedViewPartitionManager.java — the new manager that owns the state machine.
  2. pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/metadata/PartitionFingerprint.java — defines EMPTY; its byte-equality with computeWindowFingerprint-of-empty is the contract that the DELETE backfill bug fix rests on.
  3. pinot-materialized-view/src/main/java/org/apache/pinot/materializedview/scheduler/MaterializedViewTaskScheduler.java — hosts the watermark-cap fix and the shared computeWindowFingerprint (via MaterializedViewTaskUtils).

Why these are central:
Every consumer (executor postProcess, scheduler revert, consistency manager flush) routes mutations through the manager; every fingerprint computation routes through MaterializedViewTaskUtils; the broker's freshness contract depends on the scheduler's watermark advancement.

Invariants — targeted (medium)

  • Authorization — N/A. No new endpoints or permission checks introduced; the PR is internal task plumbing.
  • Idempotency — revertValid, markStale, deletePartition are lenient (no-op if precondition violated). appendValid / refreshValid / clearValid are strict (throw on precondition violation). Helix will retry strict-fail tasks; that retry contract is documented and matches prior behavior.
  • Backward compatibility — APIs — Public REST surface unchanged.
  • Backward compatibility — Schemas — ZNRecord format unchanged. PartitionFingerprint.EMPTY is byte-identical to the existing APPEND-empty fingerprint ((0, farmHash64(""))), explicitly verified in testWindowFingerprintForEmptyOverlapMatchesEmptyConstant. Existing ZNodes are read-compatible.
  • Backward compatibility — SPI — PartitionInfo constructor went from public to package-private. Cross-package production callers would break. Audited — no production caller of new PartitionInfo(...) exists outside pinot.materializedview.metadata (verified via repo grep). All cross-package callers are tests, which now use PartitionInfo.forTesting.
  • Backward compatibility — Configuration — Existing cluster config key pinot.materialized.view.executor.runtime.update.max.attempts (CLUSTER_CONFIG_KEY_MAX_RUNTIME_UPDATE_ATTEMPTS) is now also consulted by the consistency manager and scheduler revert paths (via the manager's criticalMaxAttempts()). The default value is unchanged (128). Operators who tuned this for the executor only will see the new value applied to all three sites — this is the intended unification per the PR design.
  • Ordering guarantees — Watermark advancement is now bundled with the bucket insert in a single CAS write (appendValid). Prior code paths could split these into two ZK writes in some refactor scenarios; the manager closes that window.
  • Null/empty handling — Empty-window fingerprint canonicalized via PartitionFingerprint.EMPTY. markStale no-ops on empty collection (MaterializedViewPartitionManager.java:329-331). Empty-source-segments list returns Long.MIN_VALUE from computeMaxSourceEndTimeMs and caller falls back to roughCutoffMs.
  • Retry safety — CAS retry classification is documented (CasConflictException and other ZkException retry; IllegalStateException/IllegalArgumentException propagate). Mutator may return null to signal no-op without ZK write, used by lenient ops.
  • Observability — All ops emit INFO-level log lines including table, bucket, and transition. Replaces and consolidates ad-hoc log lines that lived in three sites. No metric changes.
  • Resource cleanup — N/A. No new resources (files / connections / locks) introduced.
  • Migration reversibility — No migration. The PR is wire-compatible. A rollback would re-introduce the bespoke loops; partitions persisted as VALID + EMPTY would deserialize without issue under the old code (the old code reads VALID + arbitraryFP).
  • Performance in hot paths — The consistency manager's flush now does two ZK reads (one for enumerateCandidateBuckets, one inside applyMutation). The added cost is one ZK GET per flush per affected MV; debounce window (5 s default) bounds the rate. The scheduler's APPEND path adds one getSegmentsZKMetadata call only when at least one bucket can fit before the rough cutoff (Stage 1 short-circuit). Both costs are bounded and explicitly justified in code comments.
  • API design — The manager DSL has one method per state-machine op, with documented strict / lenient preconditions and a per-method retry profile. Error responses (RuntimeException after budget exhaust, IllegalStateException on precondition violation) carry the offending table name and bucket. Names (appendValid, refreshValid, clearValid, revertValid, markStale, deletePartition) are reasonable. deletePartition is documented as "reserved for future manual admin commands; not currently invoked from any automatic path" — good defensive surface for upcoming REFRESH MV work.

Findings:

  • Severity: concern. Confidence: medium. Evidence: MaterializedViewTaskExecutor.java:457-464 and MaterializedViewTaskExecutor.java:469-484. The DELETE branch in postProcess skips validateSourceFingerprintAtCommit — by design, since the source was empty at scheduler dispatch time. A race window exists: between scheduler dispatch (segmentCount == 0) and executor commit (clearValid), a backfill could land in the same bucket. The MV is then persisted as VALID + EMPTY while the source has data. The consistency manager will mark the bucket STALE when the segment-add event flows through, so the system self-heals on the next debounce cycle. The window is bounded by debounce + scheduler periodicity but is real. A possible enhancement: validate source fingerprint inside clearValid (treat a non-empty source as a precondition violation that propagates back as IllegalStateException, which forces a Helix retry). The PR description acknowledges this design choice indirectly via the VALID-empty semantics. Leaving as concern, not blocker — self-healing path exists.
  • Severity: observation. Confidence: medium. Evidence: MaterializedViewPartitionManager.java:281-301 (revertValid). The revert profile uses REVERT_MAX_ATTEMPTS = 8 and is invoked via new MaterializedViewPartitionManager(...) for each revert (MaterializedViewTaskScheduler.java:390-391). The per-call allocation is cheap; the consistency manager and executor reuse a single instance. The asymmetry is acceptable but worth flagging in case a future refactor wants to hold a single scheduler-side manager too.
  • Severity: observation. Confidence: high. Evidence: MaterializedViewPartitionManager.java:316. The markStale single-bucket overload uses java.util.Collections.singletonList(...) (fully qualified). This works but stylistically inconsistent with the file's other unqualified Collections.emptyList() usages in callers. Minor.

Test adequacy (medium)

  • Tests prove intended behavior (not just that code runs)
  • Failure modes covered
  • Boundary conditions covered
  • Not over-coupled to implementation details

Findings:

  • Severity: observation. Confidence: high. Evidence: MaterializedViewPartitionManagerTest.java. 24 tests cover happy path, strict-precondition-throws, lenient-no-op, CAS retry-until-success, CAS budget exhaustion, and null-arg rejection. The retry test simulates a concurrent writer by bumping the stored version inside an Answer lambda — realistic.
  • Severity: observation. Confidence: medium. Evidence: MaterializedViewConsistencyManagerTest.testCasConflictTriggersRetry. Verifies the new manager routing still retries on CAS conflict. The test re-uses default Stat versions (0) for all reads, which works because set is stubbed to (false, true) regardless of version — slightly loose, but functional.
  • Severity: observation. Confidence: medium. Evidence: MaterializedViewTaskExecutorTest.testWindowFingerprintForEmptyOverlapMatchesEmptyConstant. Pins the byte-equality contract that the DELETE bug fix relies on. Good regression coverage.
  • Severity: observation. Confidence: low. The new tests don't exercise the concurrent interleaving of markStale and appendValid against the same map (e.g. markStale enumerating bucket X while appendValid adds bucket X+1). The manager's CAS engine should serialize these by version, but no test asserts the interleaving. Coverage gap that an integration test would close.

Collateral damage (medium)

  • No copy-paste duplication introduced
  • No partial renames (all references updated)
  • No dead code left behind
  • Docs and config match code changes
  • No one-off exceptions added to shared abstractions

Findings:

  • Severity: observation. Confidence: high. Evidence: diff. The two inline computeWindowFingerprint copies (one in scheduler, one in executor) are both fully removed and replaced with calls to MaterializedViewTaskUtils.computeWindowFingerprint. No leftover dead code.
  • Severity: observation. Confidence: high. Evidence: MaterializedViewTaskExecutorTest.java and MaterializedViewConsistencyManagerTest.java. Cross-package test usages of new PartitionInfo(...) correctly switched to PartitionInfo.forTesting(...). Same-package test (MaterializedViewPartitionManagerTest.java) keeps direct constructor usage — fine, since the constructor remains package-accessible.
  • Severity: observation. Confidence: high. Evidence: MaterializedViewTaskExecutor.java. The old computeContiguousUpperMs and computeWindowFingerprint @VisibleForTesting stubs in the executor are removed; the test file now references MaterializedViewTaskUtils.computeContiguousUpperMs / MaterializedViewTaskUtils.computeWindowFingerprint directly. No partial rename.
  • Severity: observation. Confidence: high. Evidence: PartitionState.java. Javadoc updated to reflect the new VALID-empty shape and the explicit "absent is cold-start only" semantic. Matches the code change.
  • Severity: observation. Confidence: medium. Evidence: MaterializedViewPartitionManager.java:236-242 ("Design context (open question)" javadoc on clearValid). The PR author flags the VALID-empty design as "under review" with an alternative model documented. This is honest — but it means a future PR may revise this decision. Reasonable for now; worth tracking.

Path-triggered review (medium)

For each touched path, the inferred failure mode and what was checked.

  • Touched paths listed
  • Failure mode inferred per path (what can break when this area changes?)
  • Reviewed for inferred risks
  • Boundary areas checked first (auth, APIs, persistence, queues, shared core, infra, multi-tenancy, manifests)

Touched paths and inferred risks:

Path Inferred failure mode What was checked Findings
pinot-materialized-view/.../metadata/MaterializedViewPartitionManager.java (new) Concurrency: CAS race, lost updates, retry-budget thrashing under contention Reviewed retry classification (CasConflictException → retry; ZkException → retry; IllegalStateException/IllegalArgumentException → propagate). Reviewed backoff envelopes (critical: 128 × 50–200 ms ≈ 25 s; revert: 8 × 5–25 ms ≈ 200 ms). Mutator returning null short-circuits without CAS write — lenient ops. No issues
pinot-materialized-view/.../metadata/PartitionInfo.java (constructor package-private) API/SPI: source-incompat for cross-package callers Searched repo for new PartitionInfo( outside the metadata package. Only same-package test (MaterializedViewPartitionManagerTest) uses constructor; all cross-package tests switched to forTesting. No production caller affected. No issues
pinot-materialized-view/.../metadata/PartitionFingerprint.java (EMPTY constant) ZK byte-equality drift across rolling upgrades Verified EMPTY = new PartitionFingerprint(0, farmHash64("")) exactly matches what computeWindowFingerprint produces for an empty segment list. Test testWindowFingerprintForEmptyOverlapMatchesEmptyConstant pins the contract. No issues
pinot-materialized-view/.../scheduler/MaterializedViewTaskScheduler.java (watermark cap) Broker freshness check correctness; APPEND scheduling drift computeMaxSourceEndTimeMs returns Long.MIN_VALUE for empty list (caller falls back to roughCutoffMs). endMs >= 0 filter mirrors the cold-start scan. Two-stage cutoff short-circuits the ZK list when no bucket can fit. No issues
pinot-materialized-view/.../consistency/MaterializedViewConsistencyManager.java (manager routing) Cluster-config reader timing; CAS retry semantic preservation Verified the volatile MaterializedViewPartitionManager _partitionManager is constructed at init() time with a closed-over _clusterConfigReader field deref (so a later setClusterConfigReader propagates). Verified BaseControllerStarter calls setClusterConfigReader AFTER init(). Verified markStaleThroughManager swallows manager RuntimeException with ERROR log — same behavior as the prior loop's exhausted-retry path. No issues
pinot-plugins/.../materializedview/MaterializedViewTaskExecutor.java (postProcess via manager) Task-commit correctness; fingerprint validation timing validateSourceFingerprintAtCommit is still called before appendValid and refreshValid (line 472). The DELETE branch deliberately skips fingerprint validation — see Concern 1 above. The per-call new MaterializedViewPartitionManager(...) allocation is acceptable. One concern flagged on DELETE branch race (see "Invariants — targeted")

Verdict

Risk: medium
Verdict: approved
Verdict confidence: medium

Confidence reasoning

The PR is well-scoped, internally consistent, and addresses the two stated correctness bugs with documented byte-equality guarantees for rolling upgrades. The new manager has comprehensive unit-test coverage (24 tests across happy/strict/lenient/CAS-retry/budget-exhaust paths) and the centralization of three retry loops into one is a genuine maintainability win. CI is green across all check categories including binary compatibility and integration tests.

Confidence is medium (not high) because:

  • The DELETE branch's lack of commit-time fingerprint validation creates a race window where a VALID-empty partition can be persisted while the source has backfilled data. Self-healing exists (consistency manager will re-mark STALE), but the window deserves a regression test or an explicit precondition check in clearValid.
  • No integration-level test exercises the full STALE → DELETE → VALID-empty → backfill → STALE → OVERWRITE lifecycle. The PR cites manual validation against a local cluster.
  • The cluster-config key pinot.materialized.view.executor.runtime.update.max.attempts semantics quietly broaden from "executor only" to "executor + consistency manager + scheduler revert". Operators who tuned this key may see different behavior. The default is unchanged so this is a soft regression only for those who explicitly overrode the value.

These are concerns, not blockers — none would justify holding the merge given the centralization win and the self-healing fallback behavior. The PR meaningfully reduces future surface area for state-machine drift and removes a class of real correctness bugs.

Blockers

None.

Concerns

  • Severity: concern. Confidence: medium. Evidence: MaterializedViewTaskExecutor.java:457-464. DELETE branch skips validateSourceFingerprintAtCommit. Backfill landing between scheduler dispatch and executor commit results in a transient VALID-empty MV partition over a non-empty source. Self-heals on next consistency-manager cycle; bounded by debounce window. Recommend either (a) adding a regression test that simulates the race, or (b) adding a source-fingerprint precondition inside clearValid so a non-empty source forces Helix to retry the DELETE.
  • Severity: concern. Confidence: low. Evidence: PR scope. No integration test covers the full STALE → DELETE → VALID-empty → backfill → STALE → OVERWRITE lifecycle. PR cites manual validation. A future integration test would close the gap.
  • Severity: concern. Confidence: medium. Evidence: CommonConstants.MaterializedViewTask.CLUSTER_CONFIG_KEY_MAX_RUNTIME_UPDATE_ATTEMPTS semantics. Operators who previously tuned this key for the executor only will now see it applied to consistency-manager and scheduler-revert paths too. Default unchanged, so impact is limited to operators who overrode the value. Worth noting in release notes / upgrade guide.

Observations

  • Severity: observation. Confidence: high. Evidence: MaterializedViewPartitionManager.java:39-105. The class-level javadoc is exceptionally thorough — documents the why, the architecture (three layers), the retry profiles with calibration math, the concurrency model, and the scope discipline. This is the level of documentation other shared-state subsystems should aspire to.
  • Severity: observation. Confidence: medium. Evidence: MaterializedViewPartitionManager.java:236-242. clearValid javadoc flags "Design context (open question)" — the VALID-empty shape is "under review" with an alternative model documented. Acknowledging the design tradeoff in code is honest and helps future contributors. Worth tracking through whatever issue replaces this.
  • Severity: observation. Confidence: high. Evidence: MaterializedViewPartitionManager.java:316. markStale(table, bucket) single-bucket overload uses java.util.Collections.singletonList(...) (fully qualified) inconsistently with other unqualified Collections usages. Cosmetic.
  • Severity: observation. Confidence: medium. Evidence: scheduler invokes revertValid via new MaterializedViewPartitionManager(...) per call. Cheap per-call allocation, but inconsistent with the consistency manager and executor that hold or reuse instances. A future refactor could hold a scheduler-side manager too.
  • Severity: observation. Confidence: medium. Evidence: PR description. The two bug fixes (watermark over-advance, backfill-after-delete) and the refactor (single CAS engine + central fingerprint) are bundled in one PR. Each is individually mergeable; bundling is reasonable here because the bug fixes naturally live in the same code paths the refactor centralizes, but a reviewer auditing post-merge would benefit from the per-fix breakdown the PR body provides.

Areas recommended for human focus

  1. DELETE-branch race window (MaterializedViewTaskExecutor.executeDeleteTask + MaterializedViewPartitionManager.clearValid). Reviewer should confirm the self-healing behavior via consistency-manager re-mark is acceptable, or push back on adding source-fingerprint validation inside clearValid.
  2. Cluster-config key semantic broadening (CLUSTER_CONFIG_KEY_MAX_RUNTIME_UPDATE_ATTEMPTS). Reviewer should confirm operators won't be surprised; consider release-note language.
  3. Test gap for full backfill lifecycle. Reviewer should confirm the manual-cluster validation is sufficient for the initial release of this hardening pass, or request a regression integration test for the next iteration.
  4. MaterializedViewPartitionManager.clearValid "open question" javadoc. Reviewer should confirm the VALID-empty semantic is acceptable as a temporary contract or request a follow-up issue tracking the alternative design.
  5. Concurrency interleaving coverage. Manager tests cover single-thread retry classification but not the concurrent markStale + appendValid race. Reviewer should confirm CAS-by-version is sufficient or ask for an interleaving test.

Status

  • All applicable Stage 1 boxes checked
  • All applicable Stage 2 boxes checked (or N/A for skipped tiers)
  • Verdict committed
  • Verdict confidence assigned with reasoning

xiangfu0 and others added 2 commits June 5, 2026 17:12
…ackfills

Addresses the PR review: the DELETE path wrote VALID+EMPTY unconditionally. If a
base-table backfill landed between the scheduler dispatching DELETE (on an empty
source) and the executor committing, the consistency manager's create-segment
event was a no-op (the bucket was already STALE) and clearValid then flipped it to
VALID-empty -- a terminal in-coverage state nothing revisits -- silently dropping
the backfilled rows (empty MV results), not self-healing.

- MaterializedViewPartitionManager#clearValid now takes a Supplier and re-reads the
  source fingerprint inside the CAS mutator on every attempt; when the source is no
  longer empty it leaves the bucket STALE (no-op) so the next scheduler cycle
  re-materializes it via OVERWRITE. The in-mutator re-read closes the dominant,
  contention-driven window the pre-fix snapshot-before-loop left open.
- MaterializedViewTaskExecutor#executeDeleteTask early-aborts the segment delete on
  backfill (avoids a transient empty-results window) and passes the re-read supplier
  to clearValid; the source-fingerprint read is extracted into a shared
  computeSourceWindowFingerprint helper reused by the APPEND/OVERWRITE validation.
- MaterializedViewTaskScheduler stamps SOURCE_TABLE_NAME_KEY on DELETE tasks so the
  executor can recompute the source fingerprint at commit; corrected stale comments
  that said DELETE removes the partition entry.

A narrow, irreducible cross-ZK residual window remains (documented in clearValid);
a follow-up will close it by having the consistency manager re-mark in-coverage
VALID-empty buckets STALE.

Tests: clearValid backfill no-op + in-CAS re-read on retry; scheduler DELETE config
carries the source table; executor abort-criterion branches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the DELETE-vs-backfill hardening and resolves the Copilot review feedback.

VALID-empty re-evaluation sweep (closes the residual race):
The executor's clearValid commit guard closes the dominant DELETE-vs-backfill window,
but a vanishingly narrow residual remains (a backfill's debounced STALE mark firing as a
no-op, while the bucket is still STALE, strictly between clearValid's source re-read and
its VALID-empty write). MaterializedViewConsistencyManager now runs a low-frequency
periodic sweep (default 5m, cluster-tunable via
pinot.materialized.view.consistency.empty.sweep.interval.ms) that re-evaluates in-coverage
VALID-empty partitions against the current source and re-marks STALE any whose source
window has regained segments, so the scheduler re-materializes them via OVERWRITE without
waiting for a fresh base-table event.
  - Reuses computeWindowFingerprint for the emptiness check (no second overlap impl).
  - Skips the source-segment read for MVs with no in-coverage VALID-empty buckets.
  - Routes the re-mark through MaterializedViewPartitionManager#markStale (version-checked
    CAS), so the advisory sweep snapshot is re-validated under the write.
  - Runs on the existing single-threaded scheduler; per-MV failures are isolated.

Review-comment fixes:
  - computeWindowFingerprint now returns the PartitionFingerprint.EMPTY constant for empty
    overlap (matches the documented contract; avoids a per-call allocation; byte-identical
    to the prior hash-of-empty-input result).
  - enumerateCandidateBuckets now iterates the existing partition keys in range instead of
    every bucket slot, bounding allocation to the partition count (a full-range invalidation
    with a small bucketMs and a large watermark no longer allocates ~watermarkMs/bucketMs
    entries). Functionally identical — markStale already no-ops absent buckets.
  - Cleaned up a stale updateMaterializedViewRuntime reference in an executor test comment.

Tests: findStrandedEmptyBuckets decision (source regained / still empty / non-empty-VALID /
STALE / past-watermark), end-to-end per-view re-mark, and per-MV failure isolation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is not working as expected materialized-view refactor Code restructuring without changing behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants