Skip to content

Implement TABLETS_ROUTING_V2#913

Open
dawmd wants to merge 5 commits into
scylladb:masterfrom
dawmd:leader-awareness
Open

Implement TABLETS_ROUTING_V2#913
dawmd wants to merge 5 commits into
scylladb:masterfrom
dawmd:leader-awareness

Conversation

@dawmd

@dawmd dawmd commented Jun 26, 2026

Copy link
Copy Markdown

Placeholder for the time being. I'll edit the cover letter tomorrow.

The PR was tested against scylladb/scylladb#30291.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

dawmd added 5 commits June 26, 2026 16:13
Add per-connection negotiation of the TABLETS_ROUTING_V2 extension, the
successor to TABLETS_ROUTING_V1. When the server advertises it in the
SUPPORTED response, the driver echoes it back during STARTUP to opt in;
a driver that negotiates v2 does not negotiate v1.

While the feature is experimental the wire name carries the
`_EXPERIMENTAL` suffix (TABLETS_ROUTING_V2_EXPERIMENTAL), and the server
only advertises it when started with the `strongly-consistent-tables`
experimental feature enabled.

Also add the optional trailing tablet_version_block byte to the EXECUTE
message body, written only when set, so later commits can carry the
cached tablet version to the server.
Store the server-provided 64-bit tablet_version on each cached Tablet
(None until learned) and add helpers to encode it into the one-byte
tablet_version_block exchanged on the wire:

* choose_tablet_version_block() packs a randomly chosen block index in
  the high nibble and that block's value in the low nibble, matching the
  server's locator::compare_tablet_version_block layout. Blocks are
  indexed from the least significant bits, so block i covers bits
  [i*4, i*4 + 4) of the version. A random index avoids any shared mutable
  counter on the hot path while still probing every nibble often enough
  to detect a server-side version change quickly.
* random_tablet_version_block() returns a random byte for cold start,
  when no version is cached yet.
Add KeyspaceMetadata.strongly_consistent, derived from the per-keyspace
`consistency` option in system_schema.scylla_keyspaces. A keyspace is
strongly consistent when its consistency is `local` or `global`
(null/`eventual` means eventually consistent). The lookup is cached per
schema refresh and degrades to "eventually consistent" on non-Scylla
clusters or older Scylla versions that lack the table or column.

This flag lets the leader-aware routing logic tell which tablet tables
actually have a Raft leader.
With TABLETS_ROUTING_V2 the server returns, on a tablet_version
mismatch, the tablet's replica set with the Raft leader first plus the
new tablet_version. Use this to route strongly-consistent reads and
writes straight to the leader:

* Every EXECUTE on a V2 connection carries a tablet_version_block
  computed from the cached version (or a random byte on cold start and
  for non-single-partition requests). The block is attached per
  connection, keyed off the connection that actually serves the request,
  so a rolling upgrade that mixes v1/v2 connections never desyncs the
  frame.
* On the response, the routing payload is parsed according to what the
  serving connection negotiated; the v2 tuple additionally carries the
  tablet_version, which is stored back on the tablet.
* TokenAwarePolicy yields the leader (the first replica) first, but only
  for strongly-consistent keyspaces -- a tablet_version is assigned to
  eventually-consistent tablet tables too and must not be mistaken for a
  leader hint.

To keep this hot path cheap, the ring token is memoized once per
statement (Statement.routing_token) and reused by the load balancing
policy, the shard selection in the pool, and the tablet_version_block
computation, instead of re-hashing the routing key three times.
Cover the end-to-end behaviour against a live ScyllaDB started with the
`strongly-consistent-tables` experimental feature: v2 negotiation,
payload-driven cache population, the tablet_version_block matching rules
(no payload on a matching block, exactly one matching value per index,
precedence over v1), the protocol error on a missing block, and
leader-aware routing targeting the Raft leader for a strongly-consistent
keyspace.
@dawmd dawmd requested review from Copilot and piodul June 26, 2026 16:55
@dawmd dawmd self-assigned this Jun 26, 2026
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The driver adds tablets routing v2 support across feature negotiation, tablet version block generation, execute message framing, and shard selection. Response handling now parses tablets-routing-v1/v2 payloads into cluster tablet metadata, and schema parsing marks keyspaces as strongly consistent from Scylla system tables. Token-aware routing now uses tablet replicas and prefers the tablet leader for strongly consistent keyspaces.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant ResponseFuture
  participant Statement
  participant HostConnection
  participant ExecuteMessage
  participant ScyllaServer
  participant ClusterMetadata

  Client->>ResponseFuture: execute(query)
  ResponseFuture->>Statement: routing_token()
  ResponseFuture->>ResponseFuture: _compute_tablet_version_block(query)
  ResponseFuture->>HostConnection: borrow_connection(...)
  HostConnection-->>ResponseFuture: connection
  ResponseFuture->>ExecuteMessage: set tablet_version_block
  ResponseFuture->>ScyllaServer: send EXECUTE
  ScyllaServer-->>ResponseFuture: response + tablets-routing payload
  ResponseFuture->>ClusterMetadata: register tablets
Loading
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description is mostly a placeholder and does not summarize the changes or complete the required checklist. Replace the placeholder with a real PR summary, explain why the changes are needed, and complete the checklist items relevant to this patch.
Docstring Coverage ⚠️ Warning Docstring coverage is 54.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: adding TABLETS_ROUTING_V2 support.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements Scylla TABLETS_ROUTING_V2 negotiation and request/response handling in the driver, including encoding the per-request tablet_version_block, parsing the new tablets-routing-v2 custom payload (with tablet_version), and adding leader-aware routing behavior for strongly-consistent keyspaces.

Changes:

  • Add TABLETS_ROUTING_V2 protocol feature negotiation and ensure V2 subsumes V1 in STARTUP options.
  • Attach a per-request tablet_version_block to EXECUTE messages on V2-negotiated connections; parse and cache V2 tablet routing payloads including tablet_version.
  • Introduce strongly-consistent keyspace detection (Scylla-only) and leader-first routing in TokenAwarePolicy; add unit and integration coverage.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/test_tablets.py Adds unit tests for tablet_version_block encoding and for storing tablet_version on Tablet.
tests/unit/test_protocol_features.py Adds negotiation tests ensuring V2 is preferred over V1 in STARTUP options.
tests/unit/test_policies.py Adds unit tests for leader-aware routing behavior (leader-first, fallback, and non-SC cases).
tests/integration/standard/test_tablets_routing_v2.py Adds opt-in end-to-end tests validating negotiation, payload decoding, and wire behavior against a V2-capable Scylla cluster.
cassandra/tablets.py Adds tablet_version_block helpers and extends Tablet to store tablet_version.
cassandra/query.py Adds memoization for routing-key token computation via Statement.routing_token().
cassandra/protocol.py Extends ExecuteMessage to optionally append tablet_version_block byte when present.
cassandra/protocol_features.py Adds TABLETS_ROUTING_V2 constant and negotiation logic; makes V2 mutually exclusive with V1 in STARTUP.
cassandra/pool.py Adds per-pool V2 capability detection and reuses memoized routing token to avoid repeated hashing.
cassandra/policies.py Adds leader-first routing for strongly-consistent keyspaces when tablet replicas are available.
cassandra/metadata.py Adds KeyspaceMetadata.strongly_consistent and populates it from Scylla system_schema.scylla_keyspaces.
cassandra/cluster.py Computes/attaches tablet_version_block per connection and parses V2 routing payloads based on the serving connection’s negotiated features.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cassandra/query.py
Comment on lines +353 to +357
token = self._routing_token
if token is None:
token = token_class.from_key(routing_key)
self._routing_token = token
return token
Comment thread cassandra/pool.py
Comment on lines +451 to +453
@property
def tablets_routing_v2(self):
return any(c.features.tablets_routing_v2 for c in self._connections.values())
Comment thread cassandra/cluster.py
Comment on lines +5055 to +5076
if isinstance(message, ExecuteMessage):
# Attach the tablet_version_block only when the *specific connection*
# we are about to send on negotiated TABLETS_ROUTING_V2. The server
# reads the trailing tablet_version_block byte only on connections
# that negotiated V2 (gated on the cluster-wide feature), so keying
# off the borrowed connection -- which is already in hand here, since
# borrow_connection() ran above -- is both necessary and sufficient:
# * a V2 connection always gets the block, even if this pool was
# created (and any cached flag latched) before the cluster
# feature was enabled, e.g. mid rolling-upgrade;
# * a non-V2 connection never gets it, even if a sibling shard
# connection in the same pool already negotiated V2 -- which can
# happen transiently while connections opened before and after
# the feature flip coexist. Attaching the block to a non-V2
# connection would leave an unread trailing byte and desync the
# frame, so a pool-level flag cannot get this right regardless of
# how it is latched.
if connection.features.tablets_routing_v2:
message.tablet_version_block = self.session._compute_tablet_version_block(self.query)
else:
message.tablet_version_block = None

Comment thread cassandra/cluster.py
Comment on lines +3099 to +3102
"""
Compute the tablet_version_block byte for a BoundStatement.
Returns an int in [0, 255] or None if V2 should not be used.
"""
Comment thread cassandra/cluster.py
Comment on lines +5072 to +5075
if connection.features.tablets_routing_v2:
message.tablet_version_block = self.session._compute_tablet_version_block(self.query)
else:
message.tablet_version_block = None

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (2)
tests/unit/test_policies.py (1)

1043-1082: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Optional: test name/docstring don't match what's actually exercised.

This test asserts the "no tablet_version (V1)" path, but the production code (make_query_plan) never reads tablet.tablet_version; leader-first is gated solely on ks_meta.strongly_consistent, which is set to False here. So the test passes because of the strongly_consistent=False flag, not because tablet_version=None. Consider renaming/reframing it (or setting strongly_consistent=True with tablet_version=None) so the test actually guards the behavior its name implies; otherwise it overlaps with test_no_leader_routing_for_eventually_consistent_keyspace.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_policies.py` around lines 1043 - 1082, The test name and
docstring in test_no_leader_routing_without_tablet_version do not match the
behavior actually being exercised, because make_query_plan only gates
leader-first routing on keyspace metadata strong consistency, not
Tablet.tablet_version. Update the test so it truly covers the intended case by
either renaming/reframing it to reflect strongly_consistent=False, or by setting
cluster.metadata.keyspaces['ks'].strongly_consistent to True while keeping
tablet_version=None to verify the V1 tablet path; keep the assertions aligned
with the TokenAwarePolicy routing behavior.
cassandra/policies.py (1)

510-514: 🚀 Performance & Scalability | 🔵 Trivial

Optional: Cache the child query plan to prevent double invocation and potential inconsistency.

child.make_query_plan(keyspace, query) is invoked at line 512 to filter replicas and is called again at line 551 to build the final plan. If the child policy is stateful (e.g., RoundRobinPolicy or DCAwareRoundRobinPolicy), the second invocation may advance internal pointers and return a different host sequence, causing the filtering logic to diverge from the execution order and introducing unnecessary overhead.

♻️ Suggested approach

Materialize the child plan once and reuse it:

-        if tablet is not None:
-            replicas_mapped = set(map(lambda r: r[0], tablet.replicas))
-            child_plan = child.make_query_plan(keyspace, query)
-
-            replicas = [host for host in child_plan if host.host_id in replicas_mapped]
+        child_plan = list(child.make_query_plan(keyspace, query))
+        if tablet is not None:
+            replicas_mapped = set(map(lambda r: r[0], tablet.replicas))
+            replicas = [host for host in child_plan if host.host_id in replicas_mapped]

Update the subsequent loop (line 551) to iterate over the cached child_plan instead of re-invoking child.make_query_plan(...).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cassandra/policies.py` around lines 510 - 514, Cache the result of
child.make_query_plan(keyspace, query) in the policy logic so it is only
evaluated once, then reuse that same child_plan for both the replica filtering
in the tablet block and the final plan construction loop. Update the query
planning flow in the policy method that handles tablet replicas to iterate over
the cached child_plan instead of calling child.make_query_plan(...) a second
time, which avoids stateful policy divergence and redundant work.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cassandra/cluster.py`:
- Around line 5055-5075: The ExecuteMessage tablet_version_block is only being
set in the pooled connection path, so control-connection fallback can miss the
mandatory V2 trailing byte. Update _query_control_connection() in
cassandra/cluster.py to use the same tablets_routing_v2 check and
_compute_tablet_version_block(self.query) logic as the ExecuteMessage handling
in the main send path, placing it after self._connection = connection and before
connection.send_msg(...) so V2 control connections always include the block and
non-V2 ones do not.
- Around line 5212-5231: The tablet-routing cache is using only
self.query.keyspace when adding tablets, which can store entries under a None
key and miss cache hits for session-level keyspaces. Update the tablet handling
in ResponseFuture to use the effective keyspace consistently, matching
_compute_tablet_version_block() by falling back to self.keyspace when
self.query.keyspace is unset, and then pass that resolved keyspace into
metadata._tablets.add_tablet().

In `@cassandra/metadata.py`:
- Around line 2658-2660: The fallback in metadata schema refresh is too broad
because the except block in the system_schema.scylla_keyspaces read path catches
every Exception, which masks unexpected failures. Narrow the handler around the
code in metadata.py that logs “Could not read system_schema.scylla_keyspaces” so
it only catches the specific expected driver/server schema-unavailable error(s),
and let other refresh errors propagate. Keep the existing debug log and exc_info
behavior for the expected case, but do not use a blanket Exception catch in this
branch.

In `@cassandra/pool.py`:
- Around line 451-453: The tablets_routing_v2 property in Pool should not
iterate the live _connections view directly, because concurrent
replacement/shutdown can mutate it and closed connections can incorrectly keep
V2 enabled. Snapshot the current connections first, then compute the flag only
from still-open/live connections before checking each connection’s
features.tablets_routing_v2.

In `@cassandra/query.py`:
- Line 278: The cached routing token in Statement is being reused without
considering token_class, which can cause the wrong ring/token to be used across
sessions or clusters. Update the token cache logic in Statement-related routing
code so _routing_token is keyed by token_class as well, and make the
lookup/reuse path in the affected routing methods distinguish tokens by the
partitioner class used. Ensure the changes cover the Statement fields and the
routing/token selection flow that currently reads and writes _routing_token.

---

Nitpick comments:
In `@cassandra/policies.py`:
- Around line 510-514: Cache the result of child.make_query_plan(keyspace,
query) in the policy logic so it is only evaluated once, then reuse that same
child_plan for both the replica filtering in the tablet block and the final plan
construction loop. Update the query planning flow in the policy method that
handles tablet replicas to iterate over the cached child_plan instead of calling
child.make_query_plan(...) a second time, which avoids stateful policy
divergence and redundant work.

In `@tests/unit/test_policies.py`:
- Around line 1043-1082: The test name and docstring in
test_no_leader_routing_without_tablet_version do not match the behavior actually
being exercised, because make_query_plan only gates leader-first routing on
keyspace metadata strong consistency, not Tablet.tablet_version. Update the test
so it truly covers the intended case by either renaming/reframing it to reflect
strongly_consistent=False, or by setting
cluster.metadata.keyspaces['ks'].strongly_consistent to True while keeping
tablet_version=None to verify the V1 tablet path; keep the assertions aligned
with the TokenAwarePolicy routing behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 587d3541-27be-4132-8dda-c90e66cff85c

📥 Commits

Reviewing files that changed from the base of the PR and between c1bfd54 and 7cbb8a9.

📒 Files selected for processing (12)
  • cassandra/cluster.py
  • cassandra/metadata.py
  • cassandra/policies.py
  • cassandra/pool.py
  • cassandra/protocol.py
  • cassandra/protocol_features.py
  • cassandra/query.py
  • cassandra/tablets.py
  • tests/integration/standard/test_tablets_routing_v2.py
  • tests/unit/test_policies.py
  • tests/unit/test_protocol_features.py
  • tests/unit/test_tablets.py

Comment thread cassandra/cluster.py
Comment on lines +5055 to +5075
if isinstance(message, ExecuteMessage):
# Attach the tablet_version_block only when the *specific connection*
# we are about to send on negotiated TABLETS_ROUTING_V2. The server
# reads the trailing tablet_version_block byte only on connections
# that negotiated V2 (gated on the cluster-wide feature), so keying
# off the borrowed connection -- which is already in hand here, since
# borrow_connection() ran above -- is both necessary and sufficient:
# * a V2 connection always gets the block, even if this pool was
# created (and any cached flag latched) before the cluster
# feature was enabled, e.g. mid rolling-upgrade;
# * a non-V2 connection never gets it, even if a sibling shard
# connection in the same pool already negotiated V2 -- which can
# happen transiently while connections opened before and after
# the feature flip coexist. Attaching the block to a non-V2
# connection would leave an unread trailing byte and desync the
# frame, so a pool-level flag cannot get this right regardless of
# how it is latched.
if connection.features.tablets_routing_v2:
message.tablet_version_block = self.session._compute_tablet_version_block(self.query)
else:
message.tablet_version_block = None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Attach the V2 block on control-connection fallback too.

Line 5072 handles pooled connections, but _query_control_connection() can also send an ExecuteMessage; on a V2 control connection that path omits the mandatory trailing byte and can trigger a protocol error.

Proposed direction
+    def _set_tablet_version_block_for_connection(self, message, connection):
+        if not isinstance(message, ExecuteMessage):
+            return
+        if connection.features.tablets_routing_v2:
+            message.tablet_version_block = self.session._compute_tablet_version_block(self.query)
+        else:
+            message.tablet_version_block = None
+
     def _query(self, host, message=None, cb=None):
         ...
-            if isinstance(message, ExecuteMessage):
-                ...
-                if connection.features.tablets_routing_v2:
-                    message.tablet_version_block = self.session._compute_tablet_version_block(self.query)
-                else:
-                    message.tablet_version_block = None
+            self._set_tablet_version_block_for_connection(message, connection)

Then call the same helper in _query_control_connection() after self._connection = connection and before connection.send_msg(...).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cassandra/cluster.py` around lines 5055 - 5075, The ExecuteMessage
tablet_version_block is only being set in the pooled connection path, so
control-connection fallback can miss the mandatory V2 trailing byte. Update
_query_control_connection() in cassandra/cluster.py to use the same
tablets_routing_v2 check and _compute_tablet_version_block(self.query) logic as
the ExecuteMessage handling in the main send path, placing it after
self._connection = connection and before connection.send_msg(...) so V2 control
connections always include the block and non-V2 ones do not.

Comment thread cassandra/cluster.py
Comment on lines +5212 to +5231
keyspace = self.query.keyspace
table = self.query.table
if tablet:
self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)
elif connection.features.tablets_routing_v1 and 'tablets-routing-v1' in self._custom_payload:
protocol = self.session.cluster.protocol_version
info = self._custom_payload.get('tablets-routing-v1')
ctype = ResponseFuture._TABLET_ROUTING_CTYPE
if ctype is None:
ctype = types.lookup_casstype('TupleType(LongType, LongType, ListType(TupleType(UUIDType, Int32Type)))')
ResponseFuture._TABLET_ROUTING_CTYPE = ctype
tablet_routing_info = ctype.from_binary(info, protocol)
first_token = tablet_routing_info[0]
last_token = tablet_routing_info[1]
tablet_replicas = tablet_routing_info[2]
tablet = Tablet.from_row(first_token, last_token, tablet_replicas)
keyspace = self.query.keyspace
table = self.query.table
if tablet:
self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Cache routing payloads under the effective keyspace.

Line 5212 and Line 5228 use only self.query.keyspace, but _compute_tablet_version_block() looks up tablets with query.keyspace or self.keyspace. Prepared statements executed in a session keyspace can therefore cache under (None, table) and miss the cache on every later request.

Proposed fix
-                    keyspace = self.query.keyspace
+                    keyspace = self.query.keyspace or self.session.keyspace
                     table = self.query.table
-                    if tablet:
+                    if tablet and keyspace and table:
                         self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)
...
-                    keyspace = self.query.keyspace
+                    keyspace = self.query.keyspace or self.session.keyspace
                     table = self.query.table
-                    if tablet:
+                    if tablet and keyspace and table:
                         self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
keyspace = self.query.keyspace
table = self.query.table
if tablet:
self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)
elif connection.features.tablets_routing_v1 and 'tablets-routing-v1' in self._custom_payload:
protocol = self.session.cluster.protocol_version
info = self._custom_payload.get('tablets-routing-v1')
ctype = ResponseFuture._TABLET_ROUTING_CTYPE
if ctype is None:
ctype = types.lookup_casstype('TupleType(LongType, LongType, ListType(TupleType(UUIDType, Int32Type)))')
ResponseFuture._TABLET_ROUTING_CTYPE = ctype
tablet_routing_info = ctype.from_binary(info, protocol)
first_token = tablet_routing_info[0]
last_token = tablet_routing_info[1]
tablet_replicas = tablet_routing_info[2]
tablet = Tablet.from_row(first_token, last_token, tablet_replicas)
keyspace = self.query.keyspace
table = self.query.table
if tablet:
self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)
keyspace = self.query.keyspace or self.session.keyspace
table = self.query.table
if tablet and keyspace and table:
self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)
elif connection.features.tablets_routing_v1 and 'tablets-routing-v1' in self._custom_payload:
protocol = self.session.cluster.protocol_version
info = self._custom_payload.get('tablets-routing-v1')
ctype = ResponseFuture._TABLET_ROUTING_CTYPE
if ctype is None:
ctype = types.lookup_casstype('TupleType(LongType, LongType, ListType(TupleType(UUIDType, Int32Type)))')
ResponseFuture._TABLET_ROUTING_CTYPE = ctype
tablet_routing_info = ctype.from_binary(info, protocol)
first_token = tablet_routing_info[0]
last_token = tablet_routing_info[1]
tablet_replicas = tablet_routing_info[2]
tablet = Tablet.from_row(first_token, last_token, tablet_replicas)
keyspace = self.query.keyspace or self.session.keyspace
table = self.query.table
if tablet and keyspace and table:
self.session.cluster.metadata._tablets.add_tablet(keyspace, table, tablet)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cassandra/cluster.py` around lines 5212 - 5231, The tablet-routing cache is
using only self.query.keyspace when adding tablets, which can store entries
under a None key and miss cache hits for session-level keyspaces. Update the
tablet handling in ResponseFuture to use the effective keyspace consistently,
matching _compute_tablet_version_block() by falling back to self.keyspace when
self.query.keyspace is unset, and then pass that resolved keyspace into
metadata._tablets.add_tablet().

Comment thread cassandra/metadata.py
Comment on lines +2658 to +2660
except Exception:
log.debug("Could not read system_schema.scylla_keyspaces; treating all "
"keyspaces as eventually consistent", exc_info=True)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Narrow the Scylla-keyspaces fallback exception.

Line 2658 should tolerate the expected “table/column unavailable” schema error, but catching every Exception also hides timeouts, connection failures, and parser bugs; Ruff flags this as BLE001. Please catch the specific expected driver/server exception(s) and let unexpected refresh failures surface.

🧰 Tools
🪛 Ruff (0.15.18)

[warning] 2658-2658: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cassandra/metadata.py` around lines 2658 - 2660, The fallback in metadata
schema refresh is too broad because the except block in the
system_schema.scylla_keyspaces read path catches every Exception, which masks
unexpected failures. Narrow the handler around the code in metadata.py that logs
“Could not read system_schema.scylla_keyspaces” so it only catches the specific
expected driver/server schema-unavailable error(s), and let other refresh errors
propagate. Keep the existing debug log and exc_info behavior for the expected
case, but do not use a blanket Exception catch in this branch.

Source: Linters/SAST tools

Comment thread cassandra/pool.py
Comment on lines +451 to +453
@property
def tablets_routing_v2(self):
return any(c.features.tablets_routing_v2 for c in self._connections.values())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Snapshot live connections before scanning V2 flags.

Line 453 iterates the mutable _connections dict view while replacement/shutdown paths can mutate it, which can raise RuntimeError on the request path; it also lets closed V2 connections keep the pool in V2 mode.

Proposed fix
     `@property`
     def tablets_routing_v2(self):
-        return any(c.features.tablets_routing_v2 for c in self._connections.values())
+        with self._lock:
+            connections = tuple(self._connections.values())
+        return any(
+            not c.is_closed and not c.is_defunct and c.features.tablets_routing_v2
+            for c in connections
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@property
def tablets_routing_v2(self):
return any(c.features.tablets_routing_v2 for c in self._connections.values())
`@property`
def tablets_routing_v2(self):
with self._lock:
connections = tuple(self._connections.values())
return any(
not c.is_closed and not c.is_defunct and c.features.tablets_routing_v2
for c in connections
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cassandra/pool.py` around lines 451 - 453, The tablets_routing_v2 property in
Pool should not iterate the live _connections view directly, because concurrent
replacement/shutdown can mutate it and closed connections can incorrectly keep
V2 enabled. Snapshot the current connections first, then compute the flag only
from still-open/live connections before checking each connection’s
features.tablets_routing_v2.

Comment thread cassandra/query.py

_serial_consistency_level = None
_routing_key = None
_routing_token = None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Key the cached token by token_class too.

Line 353 reuses _routing_token regardless of which partitioner class was passed. If the same Statement is executed against another session/cluster with a different token_class, tablet lookup and shard selection can use a token from the wrong ring.

Proposed fix
     _serial_consistency_level = None
     _routing_key = None
     _routing_token = None
+    _routing_token_class = None
@@
         if consistency_level is not None:
             self.consistency_level = consistency_level
         self._routing_key = routing_key
+        self._routing_token = None
+        self._routing_token_class = None
@@
         # The memoized ring token is derived from the routing key; invalidate it.
         self._routing_token = None
+        self._routing_token_class = None
 
     def _del_routing_key(self):
         self._routing_key = None
         self._routing_token = None
+        self._routing_token_class = None
@@
         token = self._routing_token
-        if token is None:
+        if token is None or self._routing_token_class is not token_class:
             token = token_class.from_key(routing_key)
             self._routing_token = token
+            self._routing_token_class = token_class
         return token

Also applies to: 318-323, 338-357

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cassandra/query.py` at line 278, The cached routing token in Statement is
being reused without considering token_class, which can cause the wrong
ring/token to be used across sessions or clusters. Update the token cache logic
in Statement-related routing code so _routing_token is keyed by token_class as
well, and make the lookup/reuse path in the affected routing methods distinguish
tokens by the partitioner class used. Ensure the changes cover the Statement
fields and the routing/token selection flow that currently reads and writes
_routing_token.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants