HDDS-13891. SCM-based health monitoring and batch processing in Recon#9258

Open
devmadhuu wants to merge 38 commits into apache:master from devmadhuu:HDDS-13891

Conversation


@devmadhuu devmadhuu commented Nov 7, 2025

What changes were proposed in this pull request?

This PR implements ContainerHealthTaskV2 by extending SCM's ReplicationManager for use in Recon. This approach evaluates container health locally, using SCM's proven health-check logic, without requiring network communication between SCM and Recon.

Design
https://docs.google.com/document/d/1iea0eC4IpPa4Qpmc47Ae3KyneFCZ_fyyuhbZwqrR3cM/edit?pli=1&tab=t.0#heading=h.986yaoz7wnxv

Implementation Approach

Introduces ContainerHealthTaskV2, a new implementation that determines container health states by:

  1. Extending SCM's ReplicationManager as ReconReplicationManager
  2. Calling processAll() to evaluate all containers using SCM's proven health check logic

Container Health States Detected

ContainerHealthTaskV2 detects 5 distinct health states:

SCM Health States (Inherited)

  • MISSING - Container has no replicas available
  • UNDER_REPLICATED - Fewer replicas than required by replication config
  • OVER_REPLICATED - More replicas than required
  • MIS_REPLICATED - Replicas violate placement policy (rack/datanode distribution)

Recon-Specific Health State

  • REPLICA_MISMATCH - Container replicas have different data checksums, indicating:
    • Bit rot (silent data corruption)
    • Failed writes to some replicas
    • Storage corruption on specific datanodes
    • Network corruption during replication

Implementation: ReconReplicationManager first runs SCM's health checks, then additionally checks for REPLICA_MISMATCH by comparing checksums across replicas. This ensures both replication health and data integrity are monitored.
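The Recon-specific check can be pictured with a minimal sketch. This is an illustrative stand-in, not the actual Ozone code: the Replica class and field names here are hypothetical, but the decision rule matches the description above (a container is REPLICA_MISMATCH when its replicas report more than one distinct data checksum).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical minimal model of a container replica carrying a data checksum.
// Class and field names are illustrative, not the actual Ozone types.
public class ReplicaMismatchSketch {

  static final class Replica {
    final String datanode;
    final long dataChecksum;
    Replica(String datanode, long dataChecksum) {
      this.datanode = datanode;
      this.dataChecksum = dataChecksum;
    }
  }

  /**
   * A container is REPLICA_MISMATCH when it has at least two replicas
   * and they report more than one distinct data checksum.
   */
  static boolean isReplicaMismatch(List<Replica> replicas) {
    Set<Long> checksums = new HashSet<>();
    for (Replica r : replicas) {
      checksums.add(r.dataChecksum);
    }
    return replicas.size() > 1 && checksums.size() > 1;
  }

  public static void main(String[] args) {
    List<Replica> healthy = Arrays.asList(
        new Replica("dn1", 0xABCL), new Replica("dn2", 0xABCL),
        new Replica("dn3", 0xABCL));
    List<Replica> corrupted = Arrays.asList(
        new Replica("dn1", 0xABCL), new Replica("dn2", 0xDEFL),
        new Replica("dn3", 0xABCL));
    System.out.println(isReplicaMismatch(healthy));    // false
    System.out.println(isReplicaMismatch(corrupted));  // true
  }
}
```

Note the checksum comparison by itself cannot say which replica is the corrupted one; it only flags the container so the inconsistency can be surfaced and repaired.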

Testing

  • Build compiles successfully
  • Unit tests pass
  • Integration tests pass (failures are pre-existing flaky tests)
  • ContainerHealthTaskV2 runs successfully in test cluster
  • All containers evaluated correctly
  • All 5 health states (including REPLICA_MISMATCH) captured in UNHEALTHY_CONTAINERS table
  • No performance degradation observed
  • REPLICA_MISMATCH detection verified (same logic as legacy)

Database Schema

Uses existing UNHEALTHY_CONTAINERS_V2 table with support for all 5 health states:

  • MISSING - No replicas available
  • UNDER_REPLICATED - Insufficient replicas
  • OVER_REPLICATED - Excess replicas
  • MIS_REPLICATED - Placement policy violated
  • REPLICA_MISMATCH - Data checksum inconsistency across replicas

Each record includes:

  • Container ID
  • Health state
  • Expected vs actual replica counts
  • Replica delta (actual - expected)
  • Timestamp (in_state_since)
  • reason

Some code optimizations in this PR for Recon's ContainerHealthTask were done using the Cursor AI tool.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13891

How was this patch tested?

Added JUnit test cases and tested using a local Docker cluster.

bash-5.1$ ozone admin container report
Container Summary Report generated at 2025-11-06T17:10:27Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 3
QUASI_CLOSED: 3
CLOSED: 0
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
UNDER_REPLICATED: 1
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 3
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 1
OPEN_WITHOUT_PIPELINE: 0

First 100 UNDER_REPLICATED containers:
#1

First 100 MISSING containers:
#3, #5, #6

First 100 QUASI_CLOSED_STUCK containers:
#1

bash-5.1$ ozone admin container report
Container Summary Report generated at 2025-11-06T17:11:42Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 2
QUASI_CLOSED: 1
CLOSED: 3
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
UNDER_REPLICATED: 1
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 2
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 1
OPEN_WITHOUT_PIPELINE: 0

First 100 UNDER_REPLICATED containers:
#1

First 100 MISSING containers:
#5, #6

First 100 QUASI_CLOSED_STUCK containers:
#1

bash-5.1$ ozone admin container report
Container Summary Report generated at 2025-11-06T17:12:42Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 2
QUASI_CLOSED: 1
CLOSED: 3
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 1
MISSING: 0
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0

First 100 OVER_REPLICATED containers:
#1


Recon UNHEALTHY_CONTAINERS Table — Performance Optimisations

Summary

This PR also improves the read throughput of the UNHEALTHY_CONTAINERS Derby table by
43–67×, fixes a latent ERROR XBCM4 crash that would occur on any cluster large
enough to trigger a >2,000-container DELETE in one statement, and removes a redundant
Java-side sort that was executing on every paginated API response.

Changes

1. New composite index — ContainerSchemaDefinition.java

Old index:

CREATE INDEX idx_container_state ON UNHEALTHY_CONTAINERS (container_state)

New index:

CREATE INDEX idx_state_container_id ON UNHEALTHY_CONTAINERS (container_state, container_id)

Why this matters for paginated reads:

With the old single-column index, Derby had to:

  1. Scan all rows matching container_state = ? (up to 200K entries).
  2. Sort those rows by container_id on every single page call — an O(n) operation repeated once per page.

With the composite index, Derby seeks directly to (state, minContainerId) with an O(log n) index lookup and then reads the next LIMIT entries sequentially, so the per-page cost is independent of cursor position and total row count.

The composite index also covers COUNT(*) WHERE container_state = ? and GROUP BY container_state queries via its leading column prefix, so those queries retain their index-only access path.
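The access pattern the composite index enables can be illustrated with an in-memory analogue. This is a conceptual sketch, not Recon code: a sorted map keyed on the same composite key (state, containerId) supports keyset pagination by seeking straight to the cursor and reading the next page sequentially, instead of scanning and sorting all matching rows on every page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Conceptual analogue of the composite index (container_state, container_id):
// a TreeMap sorted on the same composite key supports keyset pagination by
// seeking to (state, minContainerId) and reading `limit` entries in order.
public class KeysetPaginationSketch {

  // Composite key ordered by state first, then containerId, like the index.
  // Zero-padding keeps numeric order under lexicographic comparison.
  static String key(String state, long containerId) {
    return String.format("%s#%019d", state, containerId);
  }

  private final TreeMap<String, Long> index = new TreeMap<>();

  void add(String state, long containerId) {
    index.put(key(state, containerId), containerId);
  }

  // Next page of container IDs in `state`, strictly after `minContainerId`.
  List<Long> nextPage(String state, long minContainerId, int limit) {
    List<Long> page = new ArrayList<>();
    for (Long id : index.subMap(
            key(state, minContainerId), false,   // seek past the cursor
            key(state, Long.MAX_VALUE), true)    // stay within this state
        .values()) {
      page.add(id);
      if (page.size() == limit) {
        break;                                   // sequential read stops at LIMIT
      }
    }
    return page;
  }

  public static void main(String[] args) {
    KeysetPaginationSketch sketch = new KeysetPaginationSketch();
    for (long id = 1; id <= 10; id++) {
      sketch.add("UNDER_REPLICATED", id);
      sketch.add("MISSING", id + 100);
    }
    // First page: cursor starts at 0; next page resumes from the last id seen.
    System.out.println(sketch.nextPage("UNDER_REPLICATED", 0, 4)); // [1, 2, 3, 4]
    System.out.println(sketch.nextPage("UNDER_REPLICATED", 4, 4)); // [5, 6, 7, 8]
  }
}
```

The seek is O(log n) and each page costs O(limit) regardless of how deep the cursor is, which is the same property the (container_state, container_id) index gives the SQL query.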


2. ContainerHealthSchemaManagerV2.getUnhealthyContainers() — two fixes

a) Conditional Java-side sort removed for forward pagination

// Before — sort ran on every page even though SQL ORDER BY already returned ASC
return query.fetchInto(UnhealthyContainersRecord.class).stream()
    .sorted(Comparator.comparingLong(UnhealthyContainersRecord::getContainerId))
    ...

// After — sort kept only for the reverse-pagination path (maxContainerId > 0)
Stream<UnhealthyContainersRecord> stream = query.fetchInto(...).stream();
if (maxContainerId > 0) {
    stream = stream.sorted(...); // DESC→ASC flip for reverse cursor
}
return stream.map(...).collect(toList());

For the common forward-pagination path (minContainerId), the SQL ORDER BY container_id ASC
already delivers sorted rows. The redundant Java sort was calling
Comparator.comparingLong on every page (up to 200 pages per state per request).

b) JDBC fetch-size hint added

query.fetchSize(limit);   // pre-buffers `limit` rows per JDBC round-trip

Derby's default JDBC fetch size is 1 row per wire call. For a 5,000-row page this
meant 5,000 individual JDBC fetch round-trips inside the driver before any data reached
the application layer. Setting fetchSize(limit) pre-buffers the full page in a single
JDBC call.


3. ContainerHealthSchemaManagerV2.batchDeleteSCMStatesForContainers() — internal chunking

Bug fixed: Derby's SQL compiler generates a Java class per prepared statement.
A WHERE container_id IN (N values) predicate combined with the 7-state
container_state IN (…) predicate generates an expression tree whose compiled
bytecode can exceed the JVM 65,535-byte per-method limit (ERROR XBCM4).

IN-clause size   Bytecode generated   Within limit?
1,000 IDs        ~30 KB               Yes
2,000 IDs        ~60 KB               Borderline
5,000 IDs        ~148 KB              No

The method previously delegated chunking to callers. On a large cluster
ReconReplicationManager.persistUnhealthyRecords() passes the full container list
in one call — which would crash Derby on any cluster with >2,000 containers in a
single scan batch.

Fix: the method now chunks internally at MAX_DELETE_CHUNK_SIZE = 1,000 IDs per
SQL statement. Callers can pass a list of any size; the method is safe by construction.

static final int MAX_DELETE_CHUNK_SIZE = 1_000;

for (int from = 0; from < containerIds.size(); from += MAX_DELETE_CHUNK_SIZE) {
    List<Long> chunk = containerIds.subList(from,
        Math.min(from + MAX_DELETE_CHUNK_SIZE, containerIds.size()));
    dslContext.deleteFrom(UNHEALTHY_CONTAINERS)
        .where(UNHEALTHY_CONTAINERS.CONTAINER_ID.in(chunk))
        .and(UNHEALTHY_CONTAINERS.CONTAINER_STATE.in(/* 7 states */))
        .execute();
}

Performance Test (Added in last commit)

A new test class TestUnhealthyContainersDerbyPerformance benchmarks all operations
at 1 million records (5 states × 200,000 container IDs).

Test environment: macOS 14 (Apple M-series), JDK 8, Derby 10.14 embedded,
derby.storage.pageCacheSize = 20,000 (~80 MB page cache).

Results — baseline vs. optimised

Operation                                      Baseline                    Optimised                   Improvement
INSERT 1 M records                             18,843 ms · 53,070 rec/s    48,326 ms · 20,693 rec/s    2.6× slower
COUNT(*) total                                 250 ms                      337 ms                      –35%
COUNT by state (avg of 5)                      67 ms                       83 ms                       –19%
GROUP BY summary (all states)                  360 ms                      653 ms                      –45%
Paginated read (1 state, 200K rows, 40 × 5K)   41,963 ms · 4,766 rec/s     964 ms · 200,000 rec/s      43.5× faster
Full 1M read (all 5 states, paged)             281,349 ms · 3,554 rec/s    4,175 ms · 239,521 rec/s    67× faster
DELETE 500K rows (100K IDs × 5 states)         48,681 ms · 10,271 rows/s   37,268 ms · 13,416 rows/s   1.3× faster
COUNT by state after delete (avg)              58 ms                       73 ms                       –21%

Raw logs

Optimised run output (click to expand)
[main] INFO  - === Derby Performance Benchmark — Setup ===
[main] INFO  - Dataset: 5 states × 200000 container IDs = 1000000 total records
[main] INFO  - Starting bulk INSERT: 1000000 records  (2000 containers/tx, 100 transactions)
[main] INFO  - INSERT complete: 1000000 records in 48326 ms  (20693 rec/sec, 100 tx)
[main] INFO  - --- Test 1: Verify total row count = 1000000 ---
[main] INFO  - COUNT(*) = 1000000 rows in 337 ms
[main] INFO  - --- Test 2: COUNT(*) by state (index-covered, 200000 records each) ---
[main] INFO  -   COUNT(UNDER_REPLICATED)  = 200000 rows in 111 ms
[main] INFO  -   COUNT(MISSING)           = 200000 rows in  78 ms
[main] INFO  -   COUNT(OVER_REPLICATED)   = 200000 rows in  72 ms
[main] INFO  -   COUNT(MIS_REPLICATED)    = 200000 rows in 102 ms
[main] INFO  -   COUNT(EMPTY_MISSING)     = 200000 rows in  51 ms
[main] INFO  - --- Test 3: GROUP BY summary over 1000000 rows ---
[main] INFO  - GROUP BY summary: 5 state groups returned in 653 ms
[main] INFO  - --- Test 4: Paginated read of UNDER_REPLICATED (200000 records, page size 5000) ---
[main] INFO  - Paginated read: 200000 records in 40 pages, 964 ms  (200000 rec/sec)
[main] INFO  - --- Test 5: Full 1 M record read (all states, paged) ---
[main] INFO  -   State UNDER_REPLICATED:  200000 records in   989 ms
[main] INFO  -   State MISSING:           200000 records in   785 ms
[main] INFO  -   State OVER_REPLICATED:   200000 records in   709 ms
[main] INFO  -   State MIS_REPLICATED:    200000 records in   805 ms
[main] INFO  -   State EMPTY_MISSING:     200000 records in   876 ms
[main] INFO  - Full dataset read: 1000000 total records in 4175 ms  (239521 rec/sec)
[main] INFO  - --- Test 6: Batch DELETE — 100000 IDs × 5 states = 500000 rows (100 SQL statements of 1000 IDs) ---
[main] INFO  - DELETE complete: 100000 IDs (500000 rows) in 37268 ms via 100 SQL statements  (13416 rows/sec)
[main] INFO  - Rows remaining after delete: 500000 (expected 500000)
[main] INFO  - --- Test 7: COUNT by state after 50% delete (expected 100000 each) ---
[main] INFO  -   COUNT(UNDER_REPLICATED)  = 100000 rows in 79 ms
[main] INFO  -   COUNT(MISSING)           = 100000 rows in 76 ms
[main] INFO  -   COUNT(OVER_REPLICATED)   = 100000 rows in 90 ms
[main] INFO  -   COUNT(MIS_REPLICATED)    = 100000 rows in 62 ms
[main] INFO  -   COUNT(EMPTY_MISSING)     = 100000 rows in 58 ms

@devmadhuu devmadhuu changed the title Hdds 13891 HDDS-13891. Recon - Introduce ContainerHealthTaskV2 with SCM-based health monitoring and batch processing Nov 7, 2025
@devmadhuu devmadhuu marked this pull request as ready for review November 7, 2025 11:42
@adoroszlai adoroszlai changed the title HDDS-13891. Recon - Introduce ContainerHealthTaskV2 with SCM-based health monitoring and batch processing HDDS-13891. SCM-based health monitoring and batch processing in Recon Nov 8, 2025
@devmadhuu devmadhuu requested a review from sodonnel November 11, 2025 15:31
@devmadhuu devmadhuu marked this pull request as draft November 21, 2025 07:30
@devmadhuu devmadhuu marked this pull request as ready for review November 24, 2025 03:48
@devmadhuu devmadhuu marked this pull request as draft February 21, 2026 15:17
@devmadhuu devmadhuu requested a review from yandrey321 February 21, 2026 16:38
@devmadhuu devmadhuu marked this pull request as ready for review February 21, 2026 16:38
0L, 0L, 1000);

// Should be empty in normal operation (all replicas healthy)
assertEquals(0, underReplicatedContainers.size());


is there a test that reproduces a situation with different types of unhealthy containers?

Here both tests expect 0 unhealthy containers.

Contributor Author

Updated TestReconTasks — I updated the tests to cover actual unhealthy-container scenarios with ContainerHealthTaskV2 instead of only query-path checks that return 0.

  1. Kept SCM sync coverage
  2. Added/updated end-to-end unhealthy transitions for V2:
    • MISSING detection after DN shutdown + cleanup after DN restart
    • EMPTY_MISSING detection (with key count 0) + cleanup after restart

Updated hadoop-ozone/recon/src/test/java/org/apache/hadoop/ozone/recon/fsck/TestContainerHealthTask.java

Multibinder.newSetBinder(binder(), ReconSchemaDefinition.class);
schemaBinder.addBinding().to(UtilizationSchemaDefinition.class);
schemaBinder.addBinding().to(ContainerSchemaDefinition.class);
schemaBinder.addBinding().to(ContainerSchemaDefinitionV2.class);
@yandrey321 yandrey321 Feb 25, 2026

what code is going to delete 'ContainerSchemaDefinition' from the existing DB during the upgrade?


is there a plan to migrate v1 -> v2 rows to avoid delay with displaying the status after upgrade?

Contributor Author

There is no separate V2 schema, and since the new schema is the same as the old legacy table, no changes are needed during upgrade. The existing upgrade action classes InitialConstraintUpgradeAction and ContainerReplicaMismatchAction should suffice.

@devmadhuu devmadhuu marked this pull request as draft February 27, 2026 06:29
@devmadhuu devmadhuu requested a review from yandrey321 February 27, 2026 16:41
@devmadhuu devmadhuu marked this pull request as ready for review March 1, 2026 09:00
Contributor

@dombizita dombizita left a comment

Thank you for working on this @devmadhuu! This is quite a big PR; I went through mostly the REPLICA_MISMATCH-related changes (which look good), but tried to go over the whole change set, except the test changes.
Two overall comments: I believe you have a design document for this solution; please share that on the JIRA and in the PR description too. Also, I'm not sure if you used any kind of AI tool; if yes, please mention it in the description (as the ASF asks for that). Thanks!

Comment on lines +63 to +67
@SuppressWarnings("checkstyle:ParameterNumber")
public UnhealthyContainerMetadata(long containerID, String containerState,
long unhealthySince, long expectedReplicaCount, long actualReplicaCount,
long replicaDeltaCount, String reason, List<ContainerHistory> replicas,
UUID pipelineID, long keyCount) {
Contributor

Why did you change from UnhealthyContainers to this parameter list?

Contributor Author

@devmadhuu devmadhuu Mar 4, 2026

Thanks for pointing that out. I was doing some testing and refactoring and forgot to revert that. I reverted this to the previous UnhealthyContainers-based constructor pattern; ContainerEndpoint now constructs the jOOQ-generated POJO and passes it into UnhealthyContainerMetadata.

ReconReplicationManagerReport report,
List<ContainerInfo> allContainers) {

long currentTime = System.currentTimeMillis();
Contributor

This means that every time we call processAll() it'll use the current time as inStateSince? That makes inStateSince irrelevant, as it'll always be the last timestamp the task ran, not the time since the container entered that state.

Contributor Author

You are right. I fixed this. We now read existing rows first and preserve in_state_since for the same (containerId, state); we only use the current scan time when the container enters a new state.

Comment on lines +345 to +348
/**
* POJO representing a record in UNHEALTHY_CONTAINERS table.
*/
public static class UnhealthyContainerRecordV2 {
Contributor

Previously these classes in Recon were generated by jOOQ, weren't they? Like UnhealthyContainersRecord:

package org.apache.ozone.recon.schema.generated.tables.records;


import javax.annotation.Generated;

import org.apache.ozone.recon.schema.generated.tables.UnhealthyContainersTable;
import org.jooq.Field;
import org.jooq.Record2;
import org.jooq.Record7;
import org.jooq.Row7;
import org.jooq.impl.UpdatableRecordImpl;


/**
 * This class is generated by jOOQ.
 */
@Generated(
    value = {
        "http://www.jooq.org",
        "jOOQ version:3.11.9"
    },
    comments = "This class is generated by jOOQ"
)
@SuppressWarnings({ "all", "unchecked", "rawtypes" })
public class UnhealthyContainersRecord extends UpdatableRecordImpl<UnhealthyContainersRecord> implements Record7<Long, String, Long, Integer, Integer, Integer, String> {

Are you intentionally moving away from that pattern?

Also it was UnhealthyContainersRecord and now you are using UnhealthyContainerRecordV2, container was in plural, it might be good to follow that.

Contributor Author

@devmadhuu devmadhuu Mar 4, 2026

No. For API transport we still use the jOOQ-generated UnhealthyContainers at the endpoint boundary. UnhealthyContainerRecordV2 remains an internal service DTO used by the batch write/read flow and does not replace the generated DB records globally.

public void testUnhealthyContainersFilteredResponse()
throws IOException, TimeoutException {
String missing = UnHealthyContainerStates.MISSING.toString();
String emptyMissing = UnHealthyContainerStates.EMPTY_MISSING.toString();
Contributor

I'd keep it the way it was, it's easier to maintain.

Contributor Author

Agreed. I switched the test back to enum-driven state values (UnHealthyContainerStates.EMPTY_MISSING/NEGATIVE_SIZE) for easier maintenance.

Comment on lines +180 to +187
public String getJdbcUrl() {
return "jdbc:derby:" + tempDir.getAbsolutePath() +
File.separator + "derby_recon.db";
File.separator + "derby_recon.db;create=true";
}

@Override
public String getUserName() {
return null;
return "RECON";
Contributor

Why are these changes needed?

Contributor Author

Good catch. I was doing some performance testing with multiple users and multiple DBs; those changes were not needed for this PR, so I reverted them to keep the test DB wiring unchanged from baseline.

@devmadhuu devmadhuu requested a review from dombizita March 4, 2026 17:48
try {
initializeAndRunTask();
long elapsed = Time.monotonicNow() - cycleStart;
long sleepMs = Math.max(0, interval - elapsed);
Contributor

To avoid a continuous loop, we can enforce a minimum interval of 1 minute before the next run.

* @param container The container ID to record
*/
@Override
public void incrementAndSample(ContainerHealthState stat, ContainerInfo container) {
Contributor

We can remove the other methods and keep just the two methods for the replica-mismatch state.

*/
@InterfaceAudience.Private
@Metrics(about = "ContainerHealthTaskV2 Metrics", context = OzoneConsts.OZONE)
public final class ContainerHealthTaskV2Metrics {
Contributor

We can remove "V2" from all class and variable names.


@Override
public int hashCode() {
return Long.hashCode(containerId) * 31 + state.hashCode();
Contributor

Can use return Objects.hash(containerId, state); to combine the two fields instead of a hand-rolled implementation.

int to = Math.min(from + MAX_DELETE_CHUNK_SIZE, containerIds.size());
List<Long> chunk = containerIds.subList(from, to);

try {
Contributor

The delete and insert for the same container should be in the same transaction.

}

private void persistUnhealthyRecords(
List<Long> containerIdsToDelete,
Contributor

Perform the delete and insert as one atomic transaction.

recordsToInsert, chunkContainerIdSet);
totalReplicaMismatchCount += chunkReplicaMismatchCount;
totalStats.add(chunkStats);
persistUnhealthyRecords(chunkContainerIds, recordsToInsert);
Contributor

We can pass the existingInStateSinceByContainerAndState values to improve deletion performance when few or no entries are present in the DB.


for (int from = 0; from < allContainers.size(); from += PERSIST_CHUNK_SIZE) {
int to = Math.min(from + PERSIST_CHUNK_SIZE, allContainers.size());
List<Long> chunkContainerIds = collectContainerIds(allContainers, from, to);
Contributor

We can use only chunkContainerIdSet; chunkContainerIds is not required.

ProcessingStats chunkStats = new ProcessingStats();
Set<Long> negativeSizeRecorded = new HashSet<>();

report.forEachContainerByState((state, cid) -> {
Contributor

We can use SCM's default report with a sample limit of 0, so there is no need to additionally capture state vs. containerId. For REPLICA_MISMATCH, a separate list can be passed; it need not be part of the report.

And here, we can loop over allContainers from 'from' to 'to' with index access, since it is an ArrayList (no need to loop over the report), and the chunkContainerIdSet check can be avoided.

Contributor

@ArafatKhan2198 ArafatKhan2198 left a comment

Some comments on the patch @devmadhuu

Set<Long> negativeSizeRecorded,
ProcessingStats stats) throws ContainerNotFoundException {
switch (state) {
case MISSING:
Contributor

SCM's handler chain can emit composite health states like:

UNHEALTHY_UNDER_REPLICATED
QUASI_CLOSED_STUCK_MISSING
QUASI_CLOSED_STUCK_UNDER_REPLICATED
MISSING_UNDER_REPLICATED

etc.

But the switch statement in handleScmStateContainer() only handles 4 states: MISSING, UNDER_REPLICATED, OVER_REPLICATED, MIS_REPLICATED. Everything else falls into default: break; and is silently thrown away.

This means a container that is both quasi-closed-stuck AND has no replicas (QUASI_CLOSED_STUCK_MISSING) will never appear in the Recon UI or the UNHEALTHY_CONTAINERS table.

Are these composite states intentionally excluded from V2? Or should we map them to their base state (e.g., QUASI_CLOSED_STUCK_UNDER_REPLICATED stored as UNDER_REPLICATED with an appropriate reason string)?

handleMissingContainer(containerId, currentTime,
existingInStateSinceByContainerAndState, recordsToInsert, stats);
break;
case UNDER_REPLICATED:
Contributor

Shouldn't we also have a case for the REPLICA_MISMATCH state here?

List<UnhealthyContainerRecordV2> recordsToInsert,
Set<Long> negativeSizeRecorded,
ProcessingStats stats) throws ContainerNotFoundException {
switch (state) {
Contributor

Remember, in our offline discussion we tried checking what happens if we attempt to add a state to the Derby table that violates the allowed-state constraint. I believe this switch case will prevent a new state from being added to the database.

Comment on lines +522 to +526
healthSchemaManager.batchDeleteSCMStatesForContainers(containerIdsToDelete);

LOG.info("Inserting {} unhealthy container records", recordsToInsert.size());
healthSchemaManager.insertUnhealthyContainerRecords(recordsToInsert);
}
Contributor

healthSchemaManager.batchDeleteSCMStatesForContainers(containerIdsToDelete);  // step 1
healthSchemaManager.insertUnhealthyContainerRecords(recordsToInsert);          // step 2

These two calls are not in the same transaction. If Recon crashes between step 1 and step 2, all old health data is gone but the new data was never written. The API would return 0 unhealthy containers until the next scan runs. Also, if someone queries the API between step 1 and step 2, they get empty or partial results.
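The all-or-nothing publish being asked for can be pictured with a small sketch. This is a conceptual analogue, not the actual patch; in jOOQ the real fix would wrap both statements in dslContext.transaction(...) so readers see either the old records or the new ones, never the gap between the DELETE and the INSERT. Here an AtomicReference swap of an immutable snapshot plays the same role.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Conceptual sketch: "delete old rows + insert new rows" collapsed into a
// single step that becomes visible to readers atomically.
public class AtomicPublishSketch {

  // containerId -> health state; readers always see a complete snapshot.
  private final AtomicReference<Map<Long, String>> snapshot =
      new AtomicReference<>(Collections.<Long, String>emptyMap());

  void publishScanResults(List<Long> containerIdsToDelete,
                          Map<Long, String> recordsToInsert) {
    Map<Long, String> next = new HashMap<>(snapshot.get());
    containerIdsToDelete.forEach(next::remove);       // step 1: delete
    next.putAll(recordsToInsert);                     // step 2: insert
    snapshot.set(Collections.unmodifiableMap(next));  // single atomic publish
  }

  Map<Long, String> read() {
    return snapshot.get();
  }

  public static void main(String[] args) {
    AtomicPublishSketch store = new AtomicPublishSketch();
    Map<Long, String> firstScan = new HashMap<>();
    firstScan.put(1L, "MISSING");
    firstScan.put(2L, "UNDER_REPLICATED");
    store.publishScanResults(Collections.<Long>emptyList(), firstScan);

    // Second scan: container 1 recovered, container 3 is now missing.
    Map<Long, String> secondScan = new HashMap<>();
    secondScan.put(3L, "MISSING");
    store.publishScanResults(Arrays.asList(1L), secondScan);

    System.out.println(store.read().keySet()); // containers 2 and 3 remain
  }
}
```

A DB transaction additionally gives crash safety (rollback on failure), which the in-memory swap only approximates; the point of the sketch is the reader-visibility guarantee.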


for (int from = 0; from < allContainers.size(); from += PERSIST_CHUNK_SIZE) {
int to = Math.min(from + PERSIST_CHUNK_SIZE, allContainers.size());
List<Long> chunkContainerIds = collectContainerIds(allContainers, from, to);
Contributor

This collects every single container ID in the cluster (healthy and unhealthy) and runs DELETE statements for all of them. On a cluster with 1 million containers, that means:

  • Allocating a list with 1M entries
  • Running 1,000 chunked DELETE statements
  • Most of those containers are healthy and have no rows in the table, so the DELETEs are wasted work

Suggestion: Instead, just delete all rows by state: DELETE FROM UNHEALTHY_CONTAINERS WHERE container_state IN (...) — one statement, no chunking, much faster.


// Call inherited processContainer - this runs SCM's health check chain
// readOnly=true ensures no commands are generated
processContainer(container, nullQueue, report, true);
Contributor

Minor Suggestion -

processContainer(container, nullQueue, report, true);  // calls getContainerReplicas() internally
Set<ContainerReplica> replicas = containerManager.getContainerReplicas(cid);  // calls again

Every container's replicas are fetched twice: once inside the inherited processContainer() and once for the REPLICA_MISMATCH check. On a 1-million-container cluster, that's 2M replica lookups. We can fetch the replicas once and pass them to both operations.

Comment on lines +97 to +100
dslContext.createIndex("idx_state_container_id")
.on(DSL.table(UNHEALTHY_CONTAINERS_TABLE_NAME),
DSL.field(name(CONTAINER_STATE)),
DSL.field(name(CONTAINER_ID)))
Contributor

if (!TABLE_EXISTS_CHECK.test(conn, UNHEALTHY_CONTAINERS_TABLE_NAME)) {
    createUnhealthyContainersTable();  // creates table + composite index
}

The composite index idx_state_container_id is created inside createUnhealthyContainersTable(). This method is only called if the table doesn't exist.

If someone upgrades an existing Recon deployment, the UNHEALTHY_CONTAINERS table already exists (from V1), so this entire method is skipped. The new composite index is never created on existing clusters; they're stuck with the old single-column index and get none of the 43–67× performance improvement.
