fix: atomic subsystem namespace slot check via FDB transaction #1065

Open
boddumanohar wants to merge 1 commit into main from fix/parallel-clone-subsys-namespace-race

@boddumanohar

Problem

Parallel clone() and add_lvol_ha() requests with namespaced=True can target the same subsystem (NQN). Every thread reads the same namespace count from the DB, each concludes that a slot is free, and each writes an lvol with that NQN. The result is more lvols in one subsystem than max_namespace_per_subsys allows.

This is a classic TOCTOU race: the check (get_next_available_subsystem_on_node) and the act (write_to_db) were not atomic.
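To make the race concrete, here is a minimal, self-contained simulation. It is not code from this PR. A plain list stands in for the LVol records and MAX_NS stands in for max_namespace_per_subsys. A threading.Barrier, used only in this sketch, forces every worker to finish its count before any worker writes. That reproduces the worst-case interleaving described above deterministically.

```python
import threading

# Hypothetical in-memory stand-in for the LVol table: one (node_id, nqn) per lvol.
db: list[tuple[str, str]] = []
MAX_NS = 2      # stand-in for max_namespace_per_subsys
N_WORKERS = 5   # parallel clone()/add_lvol_ha() requests on the same NQN

# Every worker must reach the barrier (i.e. finish its read) before any can write.
barrier = threading.Barrier(N_WORKERS)


def racy_create(node_id: str, nqn: str) -> None:
    # Check: count namespaces already used on this subsystem.
    used = sum(1 for n, q in db if n == node_id and q == nqn)
    barrier.wait()  # all workers now hold the same, soon-to-be-stale count
    # Act: write if the stale count said there was room.
    if used < MAX_NS:
        db.append((node_id, nqn))


threads = [threading.Thread(target=racy_create, args=("node-1", "nqn-a"))
           for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(db))  # 5: every worker saw used == 0, so the limit of 2 was exceeded
```

Every worker observed zero used namespaces, so all five writes went through even though the limit is two.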

Fix

Replace the bare lvol.write_to_db() with a new db_controller.write_lvol_with_ns_check(lvol) that wraps both the count check and the write in a single FDB transaction:

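```python
import json

def _write_lvol_with_ns_check_tx(self, tr, node_id, nqn, max_ns, lvol_key, lvol_data):
    # Runs inside one FDB transaction (tr); the range read below registers a
    # read conflict on every key under object/LVol/.
    # STATUS_IN_DELETION / STATUS_DELETED are the lvol status constants from the codebase.
    live = 0
    for _, v in tr.get_range_startswith(b"object/LVol/"):
        d = json.loads(v)
        # Count lvols on this node/subsystem that still occupy a namespace.
        if (d.get("node_id") == node_id and d.get("nqn") == nqn
                and d.get("status") not in (STATUS_IN_DELETION, STATUS_DELETED)):
            live += 1
    if live >= max_ns:
        return False  # subsystem already full: write nothing
    tr[lvol_key] = lvol_data  # buffered; applied only if the commit succeeds
    return True
```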

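FDB's optimistic concurrency control (OCC) means that if two transactions both read the object/LVol/ range and write into it, the one that commits second fails with a conflict because its read is stale. It is retried with fresh data and then correctly sees that the slot is taken. There is no explicit lock. Creates on different subsystems are never rejected by this check. However, every transaction reads the whole object/LVol/ prefix, so a concurrent lvol write on any subsystem can still cause a commit conflict and a retry.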
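The OCC behaviour can be illustrated with a toy model. The sketch below is a deliberately simplified stand-in for FDB, not its API or implementation. Store, Tx, Conflict, write_with_ns_check and run_with_retry are hypothetical names. A transaction records the key prefixes it read, and commit fails if any key under those prefixes was written after the transaction's read version. Transactions A and B start from the same state: one namespace is already used and the limit is two.

```python
import json


class Conflict(Exception):
    """Commit rejected: a key in a read range changed after the read version."""


class Store:
    def __init__(self) -> None:
        self.data: dict[bytes, bytes] = {}
        self.last_write: dict[bytes, int] = {}  # key -> version of its last commit
        self.version = 0


class Tx:
    def __init__(self, store: Store) -> None:
        self.store = store
        self.read_version = store.version
        self.snapshot = dict(store.data)  # reads see the state at read_version
        self.read_prefixes: list[bytes] = []
        self.writes: dict[bytes, bytes] = {}

    def get_range_startswith(self, prefix: bytes):
        self.read_prefixes.append(prefix)  # remember the read conflict range
        return sorted((k, v) for k, v in self.snapshot.items() if k.startswith(prefix))

    def __setitem__(self, key: bytes, value: bytes) -> None:
        self.writes[key] = value  # buffered until commit

    def commit(self) -> None:
        if not self.writes:
            return  # read-only transaction: nothing to validate
        s = self.store
        for key, ver in s.last_write.items():
            if ver > self.read_version and any(key.startswith(p) for p in self.read_prefixes):
                raise Conflict(key)
        s.version += 1
        for k, v in self.writes.items():
            s.data[k] = v
            s.last_write[k] = s.version


def write_with_ns_check(tr, node_id, nqn, max_ns, key, value) -> bool:
    # Same shape as the transactional body above, minus the status filter.
    live = 0
    for _, v in tr.get_range_startswith(b"object/LVol/"):
        d = json.loads(v)
        if d["node_id"] == node_id and d["nqn"] == nqn:
            live += 1
    if live >= max_ns:
        return False
    tr[key] = value
    return True


def run_with_retry(store, fn, *args):
    while True:
        tr = Tx(store)
        result = fn(tr, *args)
        try:
            tr.commit()
            return result
        except Conflict:
            continue  # re-run the whole function against fresh data


rec = json.dumps({"node_id": "n1", "nqn": "nqn-a"}).encode()
store = Store()
seed = Tx(store)
seed[b"object/LVol/1"] = rec  # one namespace already in use
seed.commit()

# A and B start at the same read version, as concurrent requests would.
a, b = Tx(store), Tx(store)
ok_a = write_with_ns_check(a, "n1", "nqn-a", 2, b"object/LVol/2", rec)
ok_b = write_with_ns_check(b, "n1", "nqn-a", 2, b"object/LVol/3", rec)
a.commit()  # first committer wins
try:
    b.commit()
except Conflict:
    # B's read of object/LVol/ is stale: retry it from scratch.
    ok_b = run_with_retry(store, write_with_ns_check, "n1", "nqn-a", 2, b"object/LVol/3", rec)

print(ok_a, ok_b, sorted(store.data))  # True False [b'object/LVol/1', b'object/LVol/2']
```

A commits first. B's commit conflicts because A wrote under object/LVol/ after B's read version. The retry then sees two live namespaces and declines to write. Because the toy conflict check works by key prefix rather than by NQN, a concurrent write for any lvol would also force B to retry.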

Callers changed

| File | Location |
|---|---|
| snapshot_controller.py | clone(): namespaced clones |
| lvol_controller.py | add_lvol_ha(): namespaced lvol creates |

Both callers now return a retryable error ("Subsystem namespace limit reached concurrently; retry") instead of silently over-allocating when the in-transaction count finds the subsystem already full.
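A client that receives this error can simply resubmit. The sketch below is illustrative only. create_with_retry and the (ok, error_message) return shape are hypothetical, not the controllers' real signatures. It also assumes that a resubmitted request re-runs get_next_available_subsystem_on_node and can therefore land on a subsystem that still has room.

```python
import random
import time

RETRY_MSG = "Subsystem namespace limit reached concurrently; retry"


def create_with_retry(attempt, max_attempts=5, base_delay=0.05):
    """Call attempt() until it succeeds, retrying only the namespace-limit error.

    attempt is a hypothetical callable returning (ok, error_message), standing in
    for a clone() or add_lvol_ha() request made by an API client.
    """
    err = None
    for i in range(max_attempts):
        ok, err = attempt()
        if ok:
            return True, None
        if err != RETRY_MSG:
            return False, err  # not the retryable error: surface it unchanged
        if i < max_attempts - 1:
            # Jittered exponential backoff so colliding callers spread out.
            time.sleep(base_delay * (2 ** i) * random.uniform(0.5, 1.0))
    return False, err


# Usage: a fake request that loses the race twice, then succeeds.
outcomes = iter([(False, RETRY_MSG), (False, RETRY_MSG), (True, None)])
calls = []


def fake_clone():
    calls.append(1)
    return next(outcomes)


result = create_with_retry(fake_clone, base_delay=0.001)
print(result, len(calls))  # (True, None) 3
```

Only the specific namespace-limit message is retried. Any other error is returned on the first attempt so that real failures are not masked.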

Why not a mutex?

A per-node lock would serialise all clone requests on a node (even ones targeting different subsystems), adding ~15–25 ms of queue wait per request under parallel load. With OCC, a request pays only when its transaction actually conflicts at commit time, and the cost is a retry rather than a queue wait.

Diff size

55 lines across 3 files.


🤖 Generated with Claude Code

Parallel clone/create requests sharing a subsystem (namespaced=True) all
read the same namespace count from DB and can each decide the slot is
free, resulting in more lvols than max_namespace_per_subsys being written
to one NQN.

Fix: replace the bare lvol.write_to_db() call with a single FDB
transactional function (write_lvol_with_ns_check) that re-counts active
namespaces for the target NQN inside the transaction and writes the new
lvol record only when the subsystem still has room.

Because the range-read (b'object/LVol/') and the write share one FDB
transaction, concurrent writers that race on the same NQN trigger an OCC
conflict on commit. FDB automatically retries the loser with fresh data,
serialising the slot allocation without any explicit lock. The read
conflict covers the whole object/LVol/ prefix, so concurrent creates on
*different* subsystems can also conflict and be retried, but the
namespace check never rejects them.

Affected callers:
  - snapshot_controller.clone()      (namespaced clones)
  - lvol_controller.add_lvol_ha()    (namespaced lvol creates)

Both paths now return a retryable error instead of silently over-
allocating when the slot is taken after the OCC conflict is resolved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>