perf(near_v1_signer_users): convert daily full rebuild to incremental append #9764

Open
a-monteiro wants to merge 1 commit into main from andre/near-v1-signer-users-incremental

@a-monteiro a-monteiro commented Jun 12, 2026

Member

near_v1_signer.users was a daily full rebuild that scanned all of near.actions (12B rows, 1.38 TB) plus near.logs (11.8B rows, 386 GB) every run -- 1.77 TB of IO and ~3.9 CPU-hrs per day -- to produce a ~6,100-row dimension of distinct (account_id, derivation_path, key_version) triples. The output is a monotonic set: triples only ever get added, never restated, so rebuilding history daily is pure waste.

This converts the model to incremental with strategy='append' plus a NOT EXISTS anti-join against {{ this }}. Merge with a unique_key was deliberately avoided: key_version is NULL for 655 of 6,136 rows (and derivation_path is nullable), and a merge ON clause never matches NULL keys, which would silently duplicate those rows. The anti-join uses IS NOT DISTINCT FROM to handle NULLs correctly (same pattern as #9754).
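The NULL pitfall can be reproduced on a toy dataset. The sketch below is illustrative only and is not the model's SQL: it uses SQLite through Python's `sqlite3`, where the `IS` operator is the NULL-safe comparison that stands in for Trino's `IS NOT DISTINCT FROM`, and the account names and values are invented. The `=` variant shows the failure mode of plain equality matching on nullable keys; the `IS` variant shows the anti-join appending only the genuinely new triple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (account_id TEXT, derivation_path TEXT, key_version INTEGER);
    CREATE TABLE sign_requests (account_id TEXT, derivation_path TEXT, key_version INTEGER);
    -- Existing table (hypothetical rows): one triple has a NULL key_version.
    INSERT INTO users VALUES ('alice.near', 'eth-1', NULL), ('bob.near', 'btc-1', 0);
    -- Recent window: both existing triples are seen again, plus one new triple.
    INSERT INTO sign_requests VALUES
        ('alice.near', 'eth-1', NULL),
        ('bob.near', 'btc-1', 0),
        ('carol.near', NULL, 0);
""")

# Anti-join template; {op} is the comparison used on the nullable columns.
ANTI_JOIN = """
    SELECT DISTINCT s.account_id, s.derivation_path, s.key_version
    FROM sign_requests AS s
    WHERE NOT EXISTS (
        SELECT 1
        FROM users AS u
        WHERE u.account_id = s.account_id
          AND u.derivation_path {op} s.derivation_path
          AND u.key_version {op} s.key_version
    )
    ORDER BY s.account_id
"""

# Plain equality: NULL = NULL is not true, so alice's existing row looks "new".
naive_rows = conn.execute(ANTI_JOIN.format(op="=")).fetchall()
# NULL-safe comparison: only carol's triple is missing from the table.
null_safe_rows = conn.execute(ANTI_JOIN.format(op="IS")).fetchall()

print("rows to append with '=' :", naive_rows)
print("rows to append with 'IS':", null_safe_rows)
```

With `=`, the already-present `alice.near` triple would be appended again on every run, which is the silent duplication described above.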

Incremental runs read only the last 3 days of block_date partitions on both sides of the join (incremental_predicate on action.block_date and log.block_date; both tables are partitioned by block_date). A constant floor, block_date >= DATE '2024-08-01' (the date of the v1.signer deployment block, 124788114), also bounds full refreshes. The floor is proven to be a semantic no-op: 0 rows exist with block_height >= 124788114 and an earlier block_date.

Proofs (read-only, prod data, spellbook-cd-large, UTC)

  • Floor safety: count(*) of rows with block_height >= 124788114 AND block_date < DATE '2024-08-01' = 0 on both near.actions and near.logs.
  • Floored full SELECT vs current full SELECT: identical row count (6,136) and identical checksum() on all three output columns.
  • Incremental simulation (current table UNION ALL 3-day window with the anti-join) vs full rebuild: FULL OUTER JOIN diff = 0 rows both ways.
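The incremental-simulation proof can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the verification query that was run: the events, day numbers, and function names (`full_rebuild`, `incremental_run`) are hypothetical, and Python's `None == None` plays the role of the NULL-safe comparison. It starts from a full-history table, applies daily runs over a 3-day window, and then diffs the result against a full rebuild in both directions.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, Optional[str], Optional[int]]  # (account_id, derivation_path, key_version)
Event = Tuple[int, Triple]                         # (day number, triple observed that day)


def full_rebuild(events: List[Event]) -> List[Triple]:
    """Distinct triples over the whole history, in first-seen order."""
    seen, out = set(), []
    for _, triple in sorted(events, key=lambda e: e[0]):
        if triple not in seen:
            seen.add(triple)
            out.append(triple)
    return out


def incremental_run(table: List[Triple], events: List[Event],
                    today: int, lookback: int = 3) -> List[Triple]:
    """Append triples seen in the last `lookback` days that the table lacks."""
    window = [t for day, t in events if today - lookback < day <= today]
    present = set(table)
    appended = list(table)
    for triple in window:
        if triple not in present:  # anti-join against the existing table
            present.add(triple)
            appended.append(triple)
    return appended


# Hypothetical history; some triples carry NULL (None) components.
events: List[Event] = [
    (1, ("alice.near", "eth-1", None)),
    (1, ("bob.near", "btc-1", 0)),
    (2, ("alice.near", "eth-1", None)),
    (3, ("carol.near", "sol-1", 0)),
    (5, ("bob.near", "btc-1", 0)),
    (6, ("dave.near", None, 0)),
]

# The table already holds full history through day 4; run days 5 and 6 incrementally.
table = full_rebuild([e for e in events if e[0] <= 4])
for today in (5, 6):
    table = incremental_run(table, events, today)

expected = full_rebuild(events)
only_in_incremental = set(table) - set(expected)
only_in_full = set(expected) - set(table)
print(only_in_incremental, only_in_full, len(table) == len(set(table)))
```

Both difference sets being empty corresponds to the "0 rows both ways" result, and the length check confirms no triple was appended twice.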

A/B (medians of warm runs, same cluster)

| | full rebuild (current) | incremental run | ratio |
|---|---|---|---|
| IO scanned | 1,766 GB | 4.37 GB | 404x |
| CPU | ~12,100 s | ~24 s | ~500x |
| wall | 53-106 s (540 s on prod cluster) | ~1.7 s | |
| peak memory | 37-67 GB | 0.4 GB | |

Projected per-day cost on spellbook-daily (1 build/day): CPU 3.9 CPU-hrs -> ~0.01 CPU-hrs; IO 1.77 TB -> ~4.4 GB.
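As a quick sanity check, the ratios and the projected CPU figure follow from the numbers quoted in this description. The snippet below only redoes that arithmetic; it assumes decimal units (1 TB = 1000 GB) and takes the rounded measurements at face value.

```python
# Inputs copied from the PR description (assumption: 1 TB = 1000 GB).
actions_gb, logs_gb = 1380.0, 386.0      # near.actions 1.38 TB, near.logs 386 GB
full_io_gb = actions_gb + logs_gb        # IO scanned by the full rebuild
incr_io_gb = 4.37                        # IO scanned by one incremental run
full_cpu_s, incr_cpu_s = 12_100, 24      # approximate CPU seconds per run

io_ratio = full_io_gb / incr_io_gb
cpu_ratio = full_cpu_s / incr_cpu_s
incr_cpu_hours = incr_cpu_s / 3600       # incremental CPU cost in CPU-hours

print(f"full IO: {full_io_gb:.0f} GB")
print(f"IO ratio: {io_ratio:.0f}x, CPU ratio: {cpu_ratio:.0f}x")
print(f"incremental CPU: {incr_cpu_hours:.2f} CPU-hrs")
```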

No backfill needed: the existing table is already the correct full-history set; the first incremental run simply appends from it. A --full-refresh reproduces the table exactly (proven by checksum above).

@github-actions bot added the WIP (work in progress) and dbt: daily (covers the Daily dbt subproject) labels on Jun 12, 2026

@a-monteiro a-monteiro marked this pull request as ready for review June 12, 2026 09:26
@a-monteiro a-monteiro requested a review from a team June 12, 2026 09:26

cursor Bot commented Jun 12, 2026


PR Summary

Medium Risk
Logic change to how the small dimension table is built; NULL-safe dedup and partition bounds must stay correct or rows could be missed or duplicated.

Overview
near_v1_signer.users stops doing a daily full scan of near.actions and near.logs and instead runs as incremental + append, only reading recent block_date partitions via incremental_predicate on both join sides.

The sign-request logic is unchanged but wrapped in a sign_requests CTE; incremental runs add a NOT EXISTS anti-join against {{ this }} with IS NOT DISTINCT FROM on nullable derivation_path / key_version so new (account_id, derivation_path, key_version) triples are appended without merge unique_key NULL pitfalls.

Full refreshes also get a block_date >= DATE '2024-08-01' floor (alongside the existing deployment block_height) so partition pruning matches v1.signer deployment semantics.

Reviewed by Cursor Bugbot for commit 10b3d6d.

@github-actions bot added the ready-for-review (this PR development is complete, please review) label and removed the WIP (work in progress) label on Jun 12, 2026

@tomfutago tomfutago left a comment

Contributor

Regression Report: PR #9764

Overall status: pass. CI output for near_v1_signer_users matches production exactly.

near_v1_signer_users

  • Profile: grain=(account_id, derivation_path, key_version) dimension; time=none; bound=none
  • Relations: prod=near_v1_signer.users; ci=dune.dune_spellbook_ci__tmp_pr9764_27405626462_1.near_v1_signer_users
  • Row count: prod 6,136, ci 6,136 -- pass
  • Summary metrics: both have 575 distinct accounts, 0 null account_id, 0 null derivation_path, 655 null key_version
  • Time coverage/range: skipped because the model has no time or block columns
  • Key-match: ci_only=0, prod_only=0 -- pass
  • Uniqueness: both prod and CI have 6,136 distinct keys and 0 duplicate rows

Verification queries:

@tomfutago added the ready-for-merging label and removed the ready-for-review (this PR development is complete, please review) label on Jun 12, 2026