
Make sure system.peers and peers_v2 match cluster metadata #4630

Open
minal-kyada wants to merge 2 commits into apache:trunk from minal-kyada:cassandra-168065336

Conversation

@minal-kyada commented Feb 24, 2026

Description:
system.peers and system.peers_v2 can drift out of sync with ClusterMetadata, causing clients running older C* versions and tools that read these legacy tables to observe incorrect cluster topology.

Adds SystemPeersValidator, which reconciles both peer tables against ClusterMetadata on startup: removing stale entries for nodes no longer in the cluster, repairing missing entries for joined nodes, and updating stale fields for nodes present in both. Also exposes this as a JMX operation via StorageServiceMBean so operators can trigger it on demand without restarting.
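The reconciliation described above can be sketched with plain Java collections standing in for the peers-table rows and ClusterMetadata; the names (`knownPeers`, `peersRows`) and the shape of the repair actions are assumptions for illustration, not the actual patch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PeersReconcileSketch
{
    public static void main(String[] args)
    {
        // Hypothetical stand-in for ClusterMetadata: peer address -> data center
        Map<String, String> knownPeers = new HashMap<>();
        knownPeers.put("10.0.0.1", "dc1");
        knownPeers.put("10.0.0.2", "dc2");

        // Hypothetical stand-in for the current system.peers_v2 rows
        Map<String, String> peersRows = new HashMap<>();
        peersRows.put("10.0.0.1", "dc1");   // in sync
        peersRows.put("10.0.0.3", "dc1");   // stale: no longer in the cluster

        List<String> actions = new ArrayList<>();

        // 1. Delete rows for nodes no longer present in cluster metadata
        for (String peer : new ArrayList<>(peersRows.keySet()))
        {
            if (!knownPeers.containsKey(peer))
            {
                peersRows.remove(peer);
                actions.add("DELETE " + peer);
            }
        }

        // 2. Insert missing rows and repair drifted fields
        for (Map.Entry<String, String> expected : knownPeers.entrySet())
        {
            String current = peersRows.get(expected.getKey());
            if (!expected.getValue().equals(current))
            {
                peersRows.put(expected.getKey(), expected.getValue());
                actions.add((current == null ? "INSERT " : "UPDATE ") + expected.getKey());
            }
        }

        System.out.println(actions);
    }
}
```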

Quick diagram of what the patch is about:
[image]

patch by @minal-kyada; reviewed by @ifesdjeen @krummas for CASSANDRA-21187

@minal-kyada minal-kyada requested a review from ifesdjeen March 4, 2026 16:27
@minal-kyada minal-kyada force-pushed the cassandra-168065336 branch from f9e0982 to a23e2b7 Compare March 4, 2026 17:39
@beobal beobal self-requested a review March 5, 2026 07:41
import static org.apache.cassandra.db.SystemKeyspace.LEGACY_PEERS;
import static org.apache.cassandra.db.SystemKeyspace.PEERS_V2;

/**
Contributor:

I have a couple of suggestions for this comment:

  • I would soften the language slightly
  • Remove mention of startup, as it can also be triggered directly
  • Mention that there is a virtual table which presents a live view of ClusterMetadata.
  • I prefer not to use TCM in the codebase as it's vague and is sometimes incorrectly used as a catch-all label for anything related to internal metadata.

So, I'd reword this slightly to:

Validator to ensure system.peers and system.peers_v2 tables match ClusterMetadata.
These tables are maintained for existing clients and tools which read from them to determine 
topology and schema information, while Cassandra itself uses ClusterMetadata as the source of truth.

This validator detects inconsistencies and automatically repairs them by synchronizing
the peers tables with the current ClusterMetadata.

The system_views.peers virtual table provides a live view on the current ClusterMetadata 
and includes all members of the cluster, unlike the legacy tables which exclude the local node. 

expectedAddresses.put(endpoint.getAddress(), nodeId);
}

String deleteV2Query = String.format("DELETE FROM %s.%s WHERE peer = ? AND peer_port = ?",
Contributor:

This can be static and final
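A minimal illustration of the suggestion: hoisting the query string into a `static final` field so it is built once at class load rather than on every call. The constant names here stand in for the real `SchemaConstants.SYSTEM_KEYSPACE_NAME` and `PEERS_V2` and are assumptions for the sketch:

```java
public class DeleteQuerySketch
{
    // Hypothetical stand-ins for SchemaConstants.SYSTEM_KEYSPACE_NAME and PEERS_V2
    private static final String SYSTEM_KEYSPACE_NAME = "system";
    private static final String PEERS_V2 = "peers_v2";

    // Built once at class-initialization time instead of per invocation
    private static final String DELETE_PEERS_V2_QUERY =
        String.format("DELETE FROM %s.%s WHERE peer = ? AND peer_port = ?",
                      SYSTEM_KEYSPACE_NAME, PEERS_V2);

    public static void main(String[] args)
    {
        System.out.println(DELETE_PEERS_V2_QUERY);
    }
}
```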

}
}

String deleteLegacyQuery = String.format("DELETE FROM %s.%s WHERE peer = ?",
Contributor:

This can be static and final

public static void validateAndRepair(ClusterMetadata metadata)
{
Map<InetAddressAndPort, UntypedResultSet.Row> peersV2Rows = getPeersV2Rows();
Map<InetAddress, UntypedResultSet.Row> legacyPeersRows = getLegacyPeersRows();
Contributor:

Nit: I would generally just refer to these two tables as the peers/peersV2 tables, so peersRows/getPeersRows and peersV2Rows/getPeersV2Rows here. Same for the CQL queries (i.e. deletePeersQuery/deletePeersV2Query and so on).

Map<InetAddress, UntypedResultSet.Row> legacyPeersRows = getLegacyPeersRows();

Map<InetAddressAndPort, NodeId> expectedEndpoints = new HashMap<>();
Map<InetAddress, NodeId> expectedAddresses = new HashMap<>();
Contributor:

This only needs to be a Set, the node id is never retrieved using the address.

Contributor:

nit: as well as changing to a Set I would maybe rename to knownAddresses
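The two suggestions above (a `Set` instead of a `Map`, renamed to `knownAddresses`) sketched with plain Java; only a membership check is ever needed for the legacy table, since the node id is never retrieved by address:

```java
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

public class KnownAddressesSketch
{
    public static void main(String[] args) throws Exception
    {
        // A Set suffices: the validator only asks "is this address still in the cluster?"
        Set<InetAddress> knownAddresses = new HashSet<>();
        knownAddresses.add(InetAddress.getByName("10.0.0.1"));

        // getByName with an IP literal does no DNS lookup
        System.out.println(knownAddresses.contains(InetAddress.getByName("10.0.0.1")));
        System.out.println(knownAddresses.contains(InetAddress.getByName("10.0.0.9")));
    }
}
```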


Map<InetAddressAndPort, NodeId> expectedEndpoints = new HashMap<>();
Map<InetAddress, NodeId> expectedAddresses = new HashMap<>();
for (NodeId nodeId : getExpectedPeerNodes(metadata))
Contributor:

You could do this without getExpectedPeerNodes by just iterating metadata.directory.allJoinedEndpoints() here.

Author (@minal-kyada, Mar 6, 2026):

I had this because I wanted to eliminate localEndpoints, but there's a better way to do it: I inlined the logic directly into validateAndRepair, iterating allJoinedEndpoints() once, looking up the nodeId for each endpoint directly, and skipping the local endpoint.
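A rough shape of that inlined single pass, with a plain map standing in for `metadata.directory.allJoinedEndpoints()` and the node-id lookup; all names here are assumptions for the sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InlinedIterationSketch
{
    public static void main(String[] args)
    {
        // Stand-in for the directory: joined endpoint -> node id
        Map<String, Integer> allJoinedEndpoints = new LinkedHashMap<>();
        allJoinedEndpoints.put("10.0.0.1:7000", 1);  // local node
        allJoinedEndpoints.put("10.0.0.2:7000", 2);
        allJoinedEndpoints.put("10.0.0.3:7000", 3);

        String localEndpoint = "10.0.0.1:7000";

        // Single pass: skip the local endpoint, collect the peers expected in the tables
        Map<String, Integer> knownEndpoints = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : allJoinedEndpoints.entrySet())
        {
            if (e.getKey().equals(localEndpoint))
                continue;
            knownEndpoints.put(e.getKey(), e.getValue());
        }

        System.out.println(knownEndpoints);
    }
}
```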

Map<InetAddressAndPort, UntypedResultSet.Row> peersV2Rows = getPeersV2Rows();
Map<InetAddress, UntypedResultSet.Row> legacyPeersRows = getLegacyPeersRows();

Map<InetAddressAndPort, NodeId> expectedEndpoints = new HashMap<>();
Contributor:

nit: I would consider renaming to knownEndpoints

UntypedResultSet.Row v2Row = peersV2Rows.get(endpoint);
UntypedResultSet.Row legacyRow = legacyPeersRows.get(endpoint.getAddress());

List<String> v2Discrepancies = collectV2Discrepancies(v2Row, nodeId, metadata);
Contributor:

I think this can be trimmed down a bit, as there's no need to collect differences and log them individually.

I would implement a boolean method (or methods) which checks equivalence between a row from one of the tables and the values in cluster metadata. If not equivalent, then just log that there was a mismatch and include everything in the row (you can use UntypedResultSet.Row::toString for that as it handles all the data typing and so on for you).

This hypothetical equivalence method should take into account potentially missing/unset columns, which I don't think the existing checks do (i.e. what if the row has no value for one of the columns?)

There is a lot of commonality between the two table schemas so it should be possible to come up with something reasonably generic to check both (may need a couple of methods, but I expect they can share a lot).
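One possible shape for the generic equivalence check suggested above, using a `Map` in place of `UntypedResultSet.Row`; the method name and column set are assumptions for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class RowEquivalenceSketch
{
    // True only if the row has a value for every expected column and all values match
    static boolean matchesMetadata(Map<String, Object> row, Map<String, Object> expected)
    {
        if (row == null)
            return false;
        for (Map.Entry<String, Object> e : expected.entrySet())
        {
            Object actual = row.get(e.getKey());
            if (actual == null || !actual.equals(e.getValue()))
                return false;  // a missing/unset column counts as a mismatch
        }
        return true;
    }

    public static void main(String[] args)
    {
        Map<String, Object> expected = Map.of("data_center", "dc1", "rack", "rack1");

        Map<String, Object> inSync = new HashMap<>(expected);
        Map<String, Object> missingColumn = new HashMap<>(Map.of("data_center", "dc1"));

        System.out.println(matchesMetadata(inSync, expected));
        System.out.println(matchesMetadata(missingColumn, expected));
        System.out.println(matchesMetadata(null, expected));
    }
}
```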

Author:

null safety

Thanks for sharing this, I should definitely add it, as except for getSet(), none of the getters have null guards. I'll add has() checks before each getter and treat a missing column as a discrepancy.

consolidating the two methods

I did consider it, but reverted that change because I found it more convenient to see the complete set of validations for a given table at a glance without jumping between methods. A shared helper would still need table-specific checks alongside it, so I'd end up fragmenting the logic across multiple methods for what is a fairly simple use case. That said, happy to revisit if you feel strongly about it.

per-field discrepancy logging

It tells an operator exactly what's wrong, e.g. Updating peer 10.0.0.1 in peers_v2 for stale fields [data_center, rack]. With Row::toString(), we would get pipe-delimited positional values without column names, and the operator would have to work out the diff manually to identify what's stale. For something meant as a diagnostic/repair tool, isn't having the specific out-of-sync fields more helpful? I also checked the rest of the codebase and we always extract fields and log them by name. Please let me know whether this information is useful from an operator's POV; otherwise we can definitely make this change.

}

@Test
public void testStaleFieldInPeersV2IsRepaired()
Contributor:

Is it possible to add similar verification for every field in the table, preferably without too much duplication in the test code?
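The per-field verification could be driven by a table of field names, so one loop exercises every field instead of one test method per field. A plain-Java sketch of the idea; the field set and the `repair` stand-in are assumptions, not the actual test code:

```java
import java.util.HashMap;
import java.util.Map;

public class PerFieldRepairTestSketch
{
    // Stand-in for the repaired row the validator would produce
    static Map<String, Object> repair(Map<String, Object> stale, Map<String, Object> truth)
    {
        Map<String, Object> repaired = new HashMap<>(stale);
        repaired.putAll(truth);  // every field is overwritten from cluster metadata
        return repaired;
    }

    public static void main(String[] args)
    {
        Map<String, Object> truth = Map.of("data_center", "dc1",
                                           "rack", "rack1",
                                           "release_version", "5.1");

        int repairedFields = 0;
        // One loop covers every field with no duplicated test bodies
        for (String field : truth.keySet())
        {
            Map<String, Object> stale = new HashMap<>(truth);
            stale.put(field, "STALE");                    // corrupt exactly one field
            Map<String, Object> repaired = repair(stale, truth);
            if (truth.get(field).equals(repaired.get(field)))
                repairedFields++;
        }
        System.out.println(repairedFields);
    }
}
```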

Contributor:

Relatedly, what if the mismatch is due to a null column in the system table?

if (table.equals(PEERS_V2))
result = executeInternal(
String.format("SELECT data_center FROM %s.%s WHERE peer = ? AND peer_port = ?",
SchemaConstants.SYSTEM_KEYSPACE_NAME, table),
Contributor:

I generally tend to add static imports for things like SchemaConstants.SYSTEM_KEYSPACE_NAME as it removes a fair bit of noise.
