CKS cluster remains in Alert state if the scaling fails due to capacity issue on the hypervisor host #12699

@kiranchavala

problem

CKS cluster remains in Alert state if the scaling fails due to capacity issue on the hypervisor host

versions

ACS 4.22

The steps to reproduce the bug

Set up a CloudStack environment with 2 KVM hosts in a cluster.

  1. Launch a CKS cluster with size 2 (worker nodes). The worker nodes are deployed on KVM host 2.

  2. Confirm that the CKS cluster is in the Running state.

  3. Deploy other VMs in the CloudStack environment until the KVM host capacity is reached.

  4. Scale the CKS cluster to size 3.

  5. Scaling of the CKS cluster fails due to the capacity issue. The new worker node is left in the Stopped state.

  6. The CKS cluster moves to the Alert state.

2026-02-24 11:12:14,223 DEBUG [c.c.k.c.KubernetesClusterManagerImpl] (Kubernetes-Cluster-State-Scanner-1:[ctx-c196e036]) (logid:43979d1a) Found VM: VM instance {"id":16,"instanceName":"i-2-16-VM","state":"Stopped","type":"User","uuid":"47386d74-3c9f-49aa-b102-1c10537c8350"} in the Kubernetes cluster KubernetesCluster {"id":2,"name":"test","uuid":"e155ab23-68ca-4c3e-b8c5-7175a3f65fda"} in state: Stopped while expected to be in state: Running. So moving the cluster to Alert state for reconciliation
2026-02-24 11:12:14,224 DEBUG [c.c.k.c.KubernetesClusterManagerImpl] (Kubernetes-Cluster-State-Scanner-1:[ctx-c196e036]) (logid:43979d1a) Found VM: VM instance {"id":9,"instanceName":"i-2-9-VM","state":"Running","type":"User","uuid":"ebf0a5a6-01b7-462a-bad6-1f61887f0f41"} in the Kubernetes cluster KubernetesCluster {"id":2,"name":"test","uuid":"e155ab23-68ca-4c3e-b8c5-7175a3f65fda"} in state: Running while expected to be in state: Stopped. So moving the cluster to Alert state for reconciliation

  7. The worker node that is in the Stopped state cannot be removed.

Exception thrown

[Three screenshots attached to the original issue]
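The two state-scanner log lines above report a VM whose actual state differs from the state the scanner expected, after which the cluster is moved to Alert. A minimal sketch of that kind of comparison follows. `ScannedVm`, `StateScanSketch`, and `firstMismatch` are hypothetical names used for illustration only; this is not the code in `KubernetesClusterManagerImpl`.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical VM snapshot holding only the fields visible in the log lines.
final class ScannedVm {
    final String instanceName;
    final String state;

    ScannedVm(String instanceName, String state) {
        this.instanceName = instanceName;
        this.state = state;
    }
}

final class StateScanSketch {
    /**
     * Returns the first VM whose state differs from the expected state,
     * or an empty Optional when every VM matches.
     * Assumption: a single mismatch is enough to flag the cluster.
     */
    static Optional<ScannedVm> firstMismatch(List<ScannedVm> vms, String expectedState) {
        return vms.stream()
                .filter(vm -> !vm.state.equals(expectedState))
                .findFirst();
    }
}
```

With the two VMs from the log, an expected state of `Running` flags `i-2-16-VM` and an expected state of `Stopped` flags `i-2-9-VM`, so a cluster holding both a running and a stopped node is flagged whichever state is expected.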

What to do about it?

The CKS cluster should go back to the Running state, since the scaling failed due to insufficient capacity.
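One way to picture the expected behaviour is a rollback step that discards the node the failed scale-up could not start and then re-evaluates the cluster. The sketch below is illustrative only: `SketchClusterState`, `SketchNode`, and `ScaleRollbackSketch` are hypothetical names, and the real fix may handle cleanup quite differently.

```java
import java.util.List;

// Hypothetical cluster states; the names mirror the states mentioned in this issue.
enum SketchClusterState { RUNNING, ALERT }

// Hypothetical node snapshot, not a CloudStack class.
final class SketchNode {
    final String name;
    final boolean running;
    final boolean addedByFailedScale;

    SketchNode(String name, boolean running, boolean addedByFailedScale) {
        this.name = name;
        this.running = running;
        this.addedByFailedScale = addedByFailedScale;
    }
}

final class ScaleRollbackSketch {
    /**
     * Removes nodes that the failed scale-up created but could not start,
     * then reports the state the cluster could return to.
     * The list is modified in place, so it must be mutable.
     */
    static SketchClusterState rollBackFailedScaleUp(List<SketchNode> nodes) {
        // Drop only the leftovers of the failed operation.
        nodes.removeIf(n -> n.addedByFailedScale && !n.running);
        // If every remaining node is running, the cluster is healthy at its old size.
        boolean allRunning = nodes.stream().allMatch(n -> n.running);
        return allRunning ? SketchClusterState.RUNNING : SketchClusterState.ALERT;
    }
}
```

For the scenario in this report (two running workers plus one stopped worker created by the failed scale), the sketch removes the stopped worker and returns `RUNNING`. A stopped node that existed before the scale attempt still yields `ALERT`.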

Currently, only the resource limit is checked during the scaling operation, via PR #12167.

We should also check host capacity before scaling.
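A host capacity pre-check could look roughly like the sketch below, which tests whether the additional worker nodes fit on the available hosts before any VM is created. Everything here is an assumption for illustration: `HostCapacity` and `ScaleCapacityCheck` are hypothetical classes, the CPU and memory figures are simplified, and CloudStack's own capacity accounting (overprovisioning factors, reserved capacity, deployment planners) is not modelled.

```java
import java.util.List;

// Hypothetical snapshot of the free resources on one host; not a CloudStack class.
final class HostCapacity {
    final String name;
    final int freeCpuMhz;
    final long freeMemoryMb;

    HostCapacity(String name, int freeCpuMhz, long freeMemoryMb) {
        this.name = name;
        this.freeCpuMhz = freeCpuMhz;
        this.freeMemoryMb = freeMemoryMb;
    }
}

final class ScaleCapacityCheck {
    /**
     * Returns true if all additional nodes can be placed on the given hosts.
     * Uses first-fit placement, which is sufficient here because every
     * new node has the same CPU and memory requirement.
     */
    static boolean canFit(List<HostCapacity> hosts, int additionalNodes,
                          int cpuMhzPerNode, long memoryMbPerNode) {
        // Work on copies so the caller's snapshots are not modified.
        int[] cpu = new int[hosts.size()];
        long[] mem = new long[hosts.size()];
        for (int h = 0; h < hosts.size(); h++) {
            cpu[h] = hosts.get(h).freeCpuMhz;
            mem[h] = hosts.get(h).freeMemoryMb;
        }
        for (int n = 0; n < additionalNodes; n++) {
            boolean placed = false;
            for (int h = 0; h < hosts.size() && !placed; h++) {
                if (cpu[h] >= cpuMhzPerNode && mem[h] >= memoryMbPerNode) {
                    cpu[h] -= cpuMhzPerNode;
                    mem[h] -= memoryMbPerNode;
                    placed = true;
                }
            }
            if (!placed) {
                return false; // at least one node has nowhere to go
            }
        }
        return true;
    }
}
```

If such a check returned false, the scale request could be rejected up front, leaving the cluster in the Running state instead of creating a worker node that cannot start.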
