Skip to content

test(jepsen): add ZSet safety workload with model-based checker#550

Open
bootjp wants to merge 32 commits into
mainfrom
feat/jepsen-zset-safety
Open

test(jepsen): add ZSet safety workload with model-based checker#550
bootjp wants to merge 32 commits into
mainfrom
feat/jepsen-zset-safety

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 19, 2026

Summary

Adds a Jepsen workload (elastickv-zset-safety-test) that verifies ZSet-specific safety properties under faults (network partitions, node kills), using a custom model-based checker. Goes beyond the simple add->read visibility check.

Properties verified:

  • Score correctness: a ZRANGE result's score for a member must equal the model's latest committed score, OR equal a score written by an operation concurrent with the read.
  • Order preservation: ZRANGE 0 -1 must be sorted by (score asc, member lex asc).
  • ZRANGEBYSCORE correctness: bounded range queries return exactly the members whose score falls in the bound, modulo concurrent mutations.
  • No phantom members: every read member must have been introduced by some successful or in-flight ZADD/ZINCRBY.
  • Atomicity: the checker treats every :ok operation as atomic; any visible inconsistency is reported.

Concurrent-ZADD handling uses an invoke/complete windowing approach. A mutation is "committed before" a read iff its :complete index is strictly less than the read's :invoke index. Mutations whose intervals overlap are "concurrent" and contribute to a per-member allowed-score set. Indeterminate (:info) mutations are treated as possibly-concurrent. ZINCRBY whose response is unknown sets :unknown-score? so the checker skips the strict score check for concurrent reads. ZREM carries the actual removed? boolean from the server reply so a no-op ZREM does not falsely mark the member as deleted.

Files

  • New: jepsen/src/elastickv/redis_zset_safety_workload.clj
  • New: jepsen/test/elastickv/redis_zset_safety_workload_test.clj (test-spec construction + checker edge cases: no-op ZREM, :info ZINCRBY, deterministic score mismatch).
  • Modified: jepsen/src/elastickv/jepsen_test.clj (entry point added).
  • Modified: .github/workflows/jepsen-test.yml (5s smoke run on every push).
  • Modified: .github/workflows/jepsen-test-scheduled.yml (150s default run every 6h).

Running locally

The workload is invoked directly via its own -main, not through jepsen-test/-main:

cd jepsen
lein run -m elastickv.redis-zset-safety-workload \
  --time-limit 60 --rate 10 --concurrency 5 \
  --ports 63791,63792,63793 --host 127.0.0.1

(elastickv.jepsen-test exposes elastickv-zset-safety-test only as a Clojure function for REPL use; CI and ad-hoc runs use the namespace's own -main.

Test plan

  • passes (Java 21).
  • Run with partition + kill faults: confirm checker emits a clear failure when a stale-leader read returns a divergent score.
    EOF
    )

Summary by CodeRabbit

  • Tests
    • Added a new safety test suite for Redis ZSet operations, validating data consistency and correctness under concurrent operations and fault conditions.
    • Extended CI/CD pipeline to automatically run ZSet safety tests in both scheduled and on-demand test workflows.

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants