
Make sure system.peers and peers_v2 match cluster metadata #4630

Open
minal-kyada wants to merge 2 commits into apache:trunk from minal-kyada:cassandra-168065336

Conversation

@minal-kyada commented Feb 24, 2026

Description:
system.peers and system.peers_v2 can drift out of sync with ClusterMetadata, causing clients running older C* versions and tools that read these legacy tables to observe incorrect cluster topology.

Adds SystemPeersValidator, which reconciles both peer tables against ClusterMetadata on startup: removing stale entries for nodes no longer in the cluster, repairing missing entries for joined nodes, and updating stale fields for nodes present in both. Also exposes this as a JMX operation via StorageServiceMBean so operators can trigger it on demand without restarting.
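The reconciliation described above can be sketched with plain Java collections standing in for the peers-table rows and ClusterMetadata; the names (`knownPeers`, `peersRows`) and the shape of the repair actions are assumptions for illustration, not the actual patch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PeersReconcileSketch
{
    public static void main(String[] args)
    {
        // Hypothetical stand-in for ClusterMetadata: peer address -> data center
        Map<String, String> knownPeers = new HashMap<>();
        knownPeers.put("10.0.0.1", "dc1");
        knownPeers.put("10.0.0.2", "dc2");

        // Hypothetical stand-in for the current system.peers_v2 rows
        Map<String, String> peersRows = new HashMap<>();
        peersRows.put("10.0.0.1", "dc1");   // in sync
        peersRows.put("10.0.0.3", "dc1");   // stale: no longer in the cluster

        List<String> actions = new ArrayList<>();

        // 1. Delete rows for nodes no longer present in cluster metadata
        for (String peer : new ArrayList<>(peersRows.keySet()))
        {
            if (!knownPeers.containsKey(peer))
            {
                peersRows.remove(peer);
                actions.add("DELETE " + peer);
            }
        }

        // 2. Insert missing rows and repair drifted fields
        for (Map.Entry<String, String> expected : knownPeers.entrySet())
        {
            String current = peersRows.get(expected.getKey());
            if (!expected.getValue().equals(current))
            {
                peersRows.put(expected.getKey(), expected.getValue());
                actions.add((current == null ? "INSERT " : "UPDATE ") + expected.getKey());
            }
        }

        System.out.println(actions);
    }
}
```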

Quick diagram of what the patch is about:
[image]

patch by @minal-kyada; reviewed by @ifesdjeen @krummas for CASSANDRA-21187

@minal-kyada minal-kyada requested a review from ifesdjeen March 4, 2026 16:27
@minal-kyada minal-kyada force-pushed the cassandra-168065336 branch from f9e0982 to a23e2b7 Compare March 4, 2026 17:39
@beobal beobal self-requested a review March 5, 2026 07:41
import static org.apache.cassandra.db.SystemKeyspace.LEGACY_PEERS;
import static org.apache.cassandra.db.SystemKeyspace.PEERS_V2;

/**
Contributor:

I have a couple of suggestions for this comment:

  • I would soften the language slightly
  • Remove mention of startup, as it can also be triggered directly
  • Mention that there is a virtual table which presents a live view of ClusterMetadata.
  • I prefer not to use TCM in the codebase as it's vague and is sometimes incorrectly used as a catch-all label for anything related to internal metadata.

So, I'd reword this slightly to:

Validator to ensure system.peers and system.peers_v2 tables match ClusterMetadata.
These tables are maintained for existing clients and tools which read from them to determine 
topology and schema information, while Cassandra itself uses ClusterMetadata as the source of truth.

This validator detects inconsistencies and automatically repairs them by synchronizing
the peers tables with the current ClusterMetadata.

The system_views.peers virtual table provides a live view on the current ClusterMetadata 
and includes all members of the cluster, unlike the legacy tables which exclude the local node. 

expectedAddresses.put(endpoint.getAddress(), nodeId);
}

String deleteV2Query = String.format("DELETE FROM %s.%s WHERE peer = ? AND peer_port = ?",
Contributor:

This can be static and final
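A minimal illustration of the suggestion: hoisting the query string into a `static final` field so it is built once at class load rather than on every call. The constant names here stand in for the real `SchemaConstants.SYSTEM_KEYSPACE_NAME` and `PEERS_V2` and are assumptions for the sketch:

```java
public class DeleteQuerySketch
{
    // Hypothetical stand-ins for SchemaConstants.SYSTEM_KEYSPACE_NAME and PEERS_V2
    private static final String SYSTEM_KEYSPACE_NAME = "system";
    private static final String PEERS_V2 = "peers_v2";

    // Built once at class-initialization time instead of per invocation
    private static final String DELETE_PEERS_V2_QUERY =
        String.format("DELETE FROM %s.%s WHERE peer = ? AND peer_port = ?",
                      SYSTEM_KEYSPACE_NAME, PEERS_V2);

    public static void main(String[] args)
    {
        System.out.println(DELETE_PEERS_V2_QUERY);
    }
}
```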

}
}

String deleteLegacyQuery = String.format("DELETE FROM %s.%s WHERE peer = ?",
Contributor:

This can be static and final

public static void validateAndRepair(ClusterMetadata metadata)
{
Map<InetAddressAndPort, UntypedResultSet.Row> peersV2Rows = getPeersV2Rows();
Map<InetAddress, UntypedResultSet.Row> legacyPeersRows = getLegacyPeersRows();
Contributor:

Nit: I would generally just refer to these two tables as the peers/peersV2 tables, so peersRows/getPeersRows and peersV2Rows/getPeersV2Rows here. Same for the CQL queries (i.e. deletePeersQuery/deletePeersV2Query and so on).

Map<InetAddress, UntypedResultSet.Row> legacyPeersRows = getLegacyPeersRows();

Map<InetAddressAndPort, NodeId> expectedEndpoints = new HashMap<>();
Map<InetAddress, NodeId> expectedAddresses = new HashMap<>();
Contributor:

This only needs to be a Set, the node id is never retrieved using the address.

Contributor:

nit: as well as changing to a Set I would maybe rename to knownAddresses
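The two suggestions above (a `Set` instead of a `Map`, renamed to `knownAddresses`) sketched with plain Java; only a membership check is ever needed for the legacy table, since the node id is never retrieved by address:

```java
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

public class KnownAddressesSketch
{
    public static void main(String[] args) throws Exception
    {
        // A Set suffices: the validator only asks "is this address still in the cluster?"
        Set<InetAddress> knownAddresses = new HashSet<>();
        knownAddresses.add(InetAddress.getByName("10.0.0.1"));

        // getByName with an IP literal does no DNS lookup
        System.out.println(knownAddresses.contains(InetAddress.getByName("10.0.0.1")));
        System.out.println(knownAddresses.contains(InetAddress.getByName("10.0.0.9")));
    }
}
```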


Map<InetAddressAndPort, NodeId> expectedEndpoints = new HashMap<>();
Map<InetAddress, NodeId> expectedAddresses = new HashMap<>();
for (NodeId nodeId : getExpectedPeerNodes(metadata))
Contributor:

You could do this without getExpectedPeerNodes by just iterating metadata.directory.allJoinedEndpoints() here.

Author (@minal-kyada, Mar 6, 2026):

I had this because I wanted to eliminate localEndpoints, but there's a better way to do it: I inlined the logic directly into validateAndRepair, iterating allJoinedEndpoints() once, looking up the nodeId for each endpoint directly, and skipping the local endpoint.
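A rough shape of that inlined single pass, with a plain map standing in for `metadata.directory.allJoinedEndpoints()` and the node-id lookup; all names here are assumptions for the sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InlinedIterationSketch
{
    public static void main(String[] args)
    {
        // Stand-in for the directory: joined endpoint -> node id
        Map<String, Integer> allJoinedEndpoints = new LinkedHashMap<>();
        allJoinedEndpoints.put("10.0.0.1:7000", 1);  // local node
        allJoinedEndpoints.put("10.0.0.2:7000", 2);
        allJoinedEndpoints.put("10.0.0.3:7000", 3);

        String localEndpoint = "10.0.0.1:7000";

        // Single pass: skip the local endpoint, collect the peers expected in the tables
        Map<String, Integer> knownEndpoints = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : allJoinedEndpoints.entrySet())
        {
            if (e.getKey().equals(localEndpoint))
                continue;
            knownEndpoints.put(e.getKey(), e.getValue());
        }

        System.out.println(knownEndpoints);
    }
}
```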

Map<InetAddressAndPort, UntypedResultSet.Row> peersV2Rows = getPeersV2Rows();
Map<InetAddress, UntypedResultSet.Row> legacyPeersRows = getLegacyPeersRows();

Map<InetAddressAndPort, NodeId> expectedEndpoints = new HashMap<>();
Contributor:

nit: I would consider renaming to knownEndpoints

UntypedResultSet.Row v2Row = peersV2Rows.get(endpoint);
UntypedResultSet.Row legacyRow = legacyPeersRows.get(endpoint.getAddress());

List<String> v2Discrepancies = collectV2Discrepancies(v2Row, nodeId, metadata);
Contributor:

I think this can be trimmed down a bit, as there's no need to collect differences and log them individually.

I would implement a boolean method (or methods) which checks equivalence between a row from one of the tables and the values in cluster metadata. If not equivalent, then just log that there was a mismatch and include everything in the row (you can use UntypedResultSet.Row::toString for that as it handles all the data typing and so on for you).

This hypothetical equivalence method should take into account potentially missing/unset columns, which I don't think the existing checks do (i.e. what if the row has no value for one of the columns?)

There is a lot of commonality between the two table schemas so it should be possible to come up with something reasonably generic to check both (may need a couple of methods, but I expect they can share a lot).
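One possible shape for the generic equivalence check suggested above, using a `Map` in place of `UntypedResultSet.Row`; the method name and column set are assumptions for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class RowEquivalenceSketch
{
    // True only if the row has a value for every expected column and all values match
    static boolean matchesMetadata(Map<String, Object> row, Map<String, Object> expected)
    {
        if (row == null)
            return false;
        for (Map.Entry<String, Object> e : expected.entrySet())
        {
            Object actual = row.get(e.getKey());
            if (actual == null || !actual.equals(e.getValue()))
                return false;  // a missing/unset column counts as a mismatch
        }
        return true;
    }

    public static void main(String[] args)
    {
        Map<String, Object> expected = Map.of("data_center", "dc1", "rack", "rack1");

        Map<String, Object> inSync = new HashMap<>(expected);
        Map<String, Object> missingColumn = new HashMap<>(Map.of("data_center", "dc1"));

        System.out.println(matchesMetadata(inSync, expected));
        System.out.println(matchesMetadata(missingColumn, expected));
        System.out.println(matchesMetadata(null, expected));
    }
}
```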

Author:

null safety

Thanks for sharing this, I should definitely add it, as except for getSet(), none of the getters have null guards. I'll add has() checks before each getter and treat a missing column as a discrepancy.

consolidating the two methods

I did consider it, but reverted that change because I found it more convenient to see the complete set of validations for a given table at a glance without jumping between methods. A shared helper would still need table-specific checks alongside it, so I'd end up fragmenting the logic across multiple methods for what is a fairly simple use case. That said, happy to revisit if you feel strongly about it.

per-field discrepancy logging

It tells an operator exactly what's wrong, e.g. Updating peer 10.0.0.1 in peers_v2 for stale fields [data_center, rack]. With Row::toString(), we would get pipe-delimited positional values without column names, and the operator would have to work out the diff manually to identify what's stale. For something meant as a diagnostic/repair tool, isn't having the specific out-of-sync fields more helpful? I also checked the rest of the codebase and we always extract fields and log them by name. Please let me know whether this information is useful from an operator's POV; otherwise we can definitely make this change.

}

@Test
public void testStaleFieldInPeersV2IsRepaired()
Contributor:

Is it possible to add similar verification for every field in the table, preferably without too much duplication in the test code?
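The per-field verification could be driven by a table of field names, so one loop exercises every field instead of one test method per field. A plain-Java sketch of the idea; the field set and the `repair` stand-in are assumptions, not the actual test code:

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldRepairTestSketch
{
    // Stand-in for the repaired row the validator would produce
    static Map<String, Object> repair(Map<String, Object> stale, Map<String, Object> truth)
    {
        Map<String, Object> repaired = new HashMap<>(stale);
        repaired.putAll(truth);  // every field is overwritten from cluster metadata
        return repaired;
    }

    public static void main(String[] args)
    {
        Map<String, Object> truth = Map.of("data_center", "dc1",
                                           "rack", "rack1",
                                           "release_version", "5.1");

        int repairedFields = 0;
        // One loop covers every field with no duplicated test bodies
        for (String field : truth.keySet())
        {
            Map<String, Object> stale = new HashMap<>(truth);
            stale.put(field, "STALE");                    // corrupt exactly one field
            Map<String, Object> repaired = repair(stale, truth);
            if (truth.get(field).equals(repaired.get(field)))
                repairedFields++;
        }
        System.out.println(repairedFields);
    }
}
```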

Contributor:

Relatedly, what if the mismatch is due to a null column in the system table?

if (table.equals(PEERS_V2))
result = executeInternal(
String.format("SELECT data_center FROM %s.%s WHERE peer = ? AND peer_port = ?",
SchemaConstants.SYSTEM_KEYSPACE_NAME, table),
Contributor:

I generally tend to add static imports for things like SchemaConstants.SYSTEM_KEYSPACE_NAME as it removes a fair bit of noise.
