
Auto scaling of a group of nodes

For your information

Autoscaling is not available:

  • for node groups with GPUs without drivers;
  • for node groups on dedicated servers.

In a Managed Kubernetes cluster, you can use Cluster Autoscaler or Karpenter to autoscale node groups. These tools help to utilize cluster resources optimally: depending on the load on the cluster, the number of nodes in a group is automatically decreased or increased. When using autoscaling tools, take the recommendations into account.

Managed Kubernetes uses Metrics Server for autoscaling pods.
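
For illustration only: Metrics Server supplies the resource metrics that the Horizontal Pod Autoscaler consumes. A minimal HorizontalPodAutoscaler sketch (the Deployment name web and the thresholds are hypothetical, not taken from this instruction) could look like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                  # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out when average CPU use exceeds 70% of requests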

Recommendations

For optimal autoscaling performance, we recommend:

  • do not use more than one autoscaling tool at the same time;
  • make sure that the project has quotas for vCPU, RAM, GPU, and disk capacity sufficient to create the maximum number of nodes in the group;
  • specify resource requests in pod manifests (see the sketch after this list). For more information, see the Resource Management for Pods and Containers instruction in the Kubernetes documentation;
  • configure a PodDisruptionBudget for pods that must not be stopped (see the sketch after this list). This will help avoid downtime when pods are moved between nodes;
  • do not manually modify node resources through the control panel: Cluster Autoscaler and Karpenter will not take these changes into account;
  • when using Cluster Autoscaler, check that nodes in the group have the same configuration and labels.
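
A minimal sketch of the recommendations about resource requests and PodDisruptionBudget, assuming a hypothetical workload named web: the container declares explicit resource requests that the autoscalers use to estimate node load, and a PodDisruptionBudget keeps at least one replica running while nodes are drained.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          resources:
            requests:            # explicit requests let the autoscaler plan node capacity
              cpu: 250m
              memory: 256Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1                # keep at least one pod available during node scale-down
  selector:
    matchLabels:
      app: web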

Autoscaling with Cluster Autoscaler

Cluster Autoscaler does not need to be installed in the cluster: it is installed automatically when the cluster is created. To use Cluster Autoscaler, enable node group autoscaling. After autoscaling is enabled, the default settings are used, but you can configure Cluster Autoscaler for each node group.

Working principle

Cluster Autoscaler works with existing node groups and pre-selected configurations. If a node group is in the ACTIVE status, Cluster Autoscaler checks every 10 seconds whether there are pods in the PENDING status and analyzes the load: vCPU, RAM, and GPU requests from pods. Depending on the results of the check, nodes are added or removed. During this time the node group goes to the PENDING_SCALE_UP or PENDING_SCALE_DOWN status, while the cluster status remains ACTIVE. For more information about cluster statuses, see View Cluster Status.

The minimum and maximum number of nodes in a group can be set when autoscaling is enabled — Cluster Autoscaler will only change the number of nodes within these limits.

Adding a node

If there are pods in the PENDING status and there are not enough free resources in the cluster to schedule them, the required number of nodes will be added to the cluster. In a cluster with Kubernetes version 1.28 or higher, Cluster Autoscaler works with several groups at once and distributes nodes evenly.

note

For example, you have two node groups with autoscaling enabled. The load on the cluster has increased and requires the addition of four nodes. Two new nodes will be created simultaneously in each node group.

In a cluster with Kubernetes version 1.27 and below, nodes are added one per check cycle.

Deleting a node

If there are no pods in the PENDING status, Cluster Autoscaler checks the amount of resources requested by pods.

If the pods on a node request less than 50% of its resources, Cluster Autoscaler marks the node as unneeded. If resource requests on the node do not increase within 10 minutes, Cluster Autoscaler checks whether the pods can be moved to other nodes.

Cluster Autoscaler will not migrate pods, and therefore will not delete a node, if one of the following conditions is met:

  • pods use a PodDisruptionBudget;
  • kube-system pods do not have a PodDisruptionBudget;
  • pods are created without a controller such as a Deployment, ReplicaSet, or StatefulSet;
  • pods use local storage;
  • the other nodes do not have enough resources for the pods' requests;
  • there is a mismatch with nodeSelector, affinity and anti-affinity rules, or other parameters.

You can allow such pods to be moved by adding an annotation:

cluster-autoscaler.kubernetes.io/safe-to-evict: "true"

If there are no restrictions, the pods will be moved and low-load nodes will be removed. Nodes are removed one at a time per check cycle.
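
As a sketch of where the annotation goes, assuming a hypothetical Deployment named worker: the annotation is set on the pod template, so Cluster Autoscaler is allowed to evict the resulting pods and remove the underutilized node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                   # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"   # allow moving these pods during scale-down
    spec:
      containers:
        - name: worker
          image: busybox:1.36
          command: ['sleep', '3600']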

Autoscaling to zero nodes

In a node group, you can configure autoscaling to zero nodes: at low load, all nodes of the group are deleted. The node group card with all its settings is not deleted, and when the load increases, nodes can be added to this node group again.

Autoscaling to zero nodes works only if at least two working nodes remain in the other node groups of the cluster: the cluster must still have working nodes to host the system components needed for the cluster to function.

note

For example, autoscaling to zero nodes will not work if the cluster has:

  • two groups of nodes, with one working node in each group;
  • one node group with two working nodes.

When there are no nodes in the group, you don't pay for unused resources.

Enable autoscaling with Cluster Autoscaler

For your information

If you set the minimum number of nodes in the group to be greater than the current number of nodes, the group will not scale up to the lower limit immediately: it will scale only after pods appear in the PENDING status. The same applies to the upper limit of nodes in the group: if the current number of nodes is greater than the upper limit, deletion will start only after the pods are checked.

You can enable autoscaling with Cluster Autoscaler in the dashboard, via the Managed Kubernetes API, or via Terraform.

  1. In the dashboard, on the top menu, click Products and select Managed Kubernetes.
  2. Open the Cluster page → Cluster Composition tab.
  3. From the menu of the node group, select Change Number of Nodes.
  4. In the Number of nodes field, open the With autoscaling tab.
  5. Set the minimum and maximum number of nodes in the group: the number of nodes will change only within this range. For fault-tolerant operation of system components, we recommend keeping at least two working nodes in the cluster; the nodes can be in different groups.
  6. Click Save.

Configure Cluster Autoscaler

You can configure Cluster Autoscaler separately for each node group.

Parameters, their descriptions, and default values can be viewed in the Cluster Autoscaler Settings table. If you do not specify a parameter in the manifest, the default value will be used.

Example manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-nodegroup-options
  namespace: kube-system
data:
  config.yaml: |
    150da0a9-6ea6-4148-892b-965282e195b0:
      scaleDownUtilizationThreshold: 0.55
      scaleDownUnneededTime: 7m
      zeroOrMaxNodeScaling: true
    e3dc24ca-df9d-429c-bcd5-be85f8d28710:
      scaleDownGpuUtilizationThreshold: 0.25
      ignoreDaemonSetsUtilization: true

Here 150da0a9-6ea6-4148-892b-965282e195b0 and e3dc24ca-df9d-429c-bcd5-be85f8d28710 are the unique identifiers (UUIDs) of the node groups in the cluster. You can view them in the control panel: in the top menu, click Products → Managed Kubernetes → cluster page → copy the UUID above the node group card, next to the pool segment.

Cluster Autoscaler Settings

  • scaleDownUtilizationThreshold: the minimum vCPU and RAM utilization of a node at which the system can delete the node. If the node uses less than the specified share of vCPU and RAM (for example, less than 50% with a value of 0.5), the system removes the node. Default value: 0.5
  • scaleDownGpuUtilizationThreshold: the minimum GPU utilization at which the system can delete a node. If the node uses less than the specified share of GPU (for example, less than 50% with a value of 0.5), the system removes the node. Default value: 0.5
  • scaleDownUnneededTime: the wait time before removing a low-load node. The system does not remove a node as soon as its load drops; it waits the specified time to make sure that the load drop is stable. Default value: 10m
  • scaleDownUnreadyTime: the wait time before deleting a node in the NotReady status. The system does not leave a node in the NotReady status in the cluster; it waits the specified time to make sure that the node is hung and will not recover, and then deletes it. Default value: 20m
  • maxNodeProvisionTime: the wait time for adding a new node. If an error occurs and a node is not added within the specified time, the system restarts the node addition process. Default value: 15m
  • zeroOrMaxNodeScaling: allows the number of nodes to change only to zero or to the maximum you set. This is useful if you want the system to deploy all nodes in a group at once when load appears and remove all nodes when there is no load. Default value: false
  • ignoreDaemonSetsUtilization: allows DaemonSet pods to be disregarded when the system decides whether to reduce the number of nodes in a group. If true, DaemonSet workloads are not counted. Default value: false

Autoscaling with Karpenter

Working principle

Karpenter is a cluster autoscaling tool with flexible settings. Unlike Cluster Autoscaler, Karpenter not only uses existing node groups but can also create new node groups.

Karpenter integrates directly with the OpenStack API, which is used to create cloud platform resources. This allows Karpenter to choose the optimal node configuration, taking into account not only technical parameters but also cost: Karpenter selects the cheapest option that fits the current workload.

If the cluster is in the ACTIVE status, Karpenter checks whether there are pods in the PENDING status and analyzes the load: vCPU, RAM, and GPU requests from pods. Depending on the results of the check, node groups and nodes are added or removed. Karpenter can only remove nodes and node groups that it has created.

During this time the node groups go to the PENDING_SCALE_UP or PENDING_SCALE_DOWN status, while the cluster status during autoscaling remains ACTIVE. For more information about cluster statuses, see the View Cluster Status instruction.

Install Karpenter

  1. In the dashboard, on the top menu, click Products and select Managed Kubernetes.

  2. Open the cluster page → Settings tab.

  3. Click Download kubeconfig. Downloading the kubeconfig file is not available if the cluster is in the PENDING_CREATE, PENDING_ROTATE_CERTS, PENDING_DELETE, or ERROR status.

  4. Export the path to the kubeconfig file to the KUBECONFIG environment variable:

    export KUBECONFIG=<path>

    Specify <path>: the path to the kubeconfig file <cluster_name>.yaml, where <cluster_name> is the cluster name.

  5. Install Karpenter with Helm:

    helm install karpenter-helmrelease oci://ghcr.io/selectel/mks-charts/karpenter:0.1.0 \
    --namespace kube-system \
    --set controller.settings.clusterID=<cluster_id>

    Specify <cluster_id>: the Managed Kubernetes cluster ID. You can view it in the control panel: in the top menu, click Products → Managed Kubernetes → cluster page → copy the ID under the cluster name, next to the region and pool.

Customize Karpenter

To configure autoscaling with Karpenter, configure the NodePool and NodeClass objects.

NodePool describes the rules for selecting and scaling nodes. For example:

  • what types of nodes can be created;
  • with what configurations (flavors) and resources;
  • when these nodes can be deleted or recreated.

Each NodePool refers to a specific NodeClass. For configurations (flavors) with network disks, the NodeClass defines the infrastructure parameters of the network disks that the cluster nodes will use. There can be several different NodeClasses in the same cluster, for example, differing in the type or size of the network disk. Configurations (flavors) with a local boot disk can also be used, but the local disk parameters are determined by the selected configuration.

Read more about NodePool in the NodePools article of the Karpenter documentation.

  1. Check that the cluster meets the requirements.
  2. Create a NodeClass.
  3. Create a NodePool.

1. Verify that the cluster meets the requirements

  1. Make sure the Kubernetes version is 1.28 or higher. You can upgrade the cluster version.

  2. Ensure that there is at least one node in the cluster with at least 2 vCPUs and 4 GiB of RAM. For optimal Karpenter performance, we recommend adding two nodes to the cluster, each with at least 2 vCPUs and 4 GiB of RAM.

  3. Make sure node group autoscaling is turned off.

  4. Make sure auto-recovery is turned off.

2. Create NodeClass

  1. Create a nodeclass.yaml file with a manifest for the NodeClass object.

    An example of a NodeClass manifest for a network disk of type Universal:

    apiVersion: karpenter.k8s.selectel/v1alpha1
    kind: SelectelNodeClass
    metadata:
      name: default
    spec:
      disk:
        categories:
          - universal
        sizeGiB: 30

    Here categories is the category (type) of the network disk and sizeGiB is the size of the network disk in GiB.

  2. Apply the manifest:

    kubectl apply -f nodeclass.yaml

3. Create NodePool

  1. Create a nodepool.yaml file with a manifest for the NodePool object. The description of all parameters, except for the requirements block parameters, can be found in the NodePools instruction of the Karpenter documentation. The description of the requirements block parameters can be found in the table Parameters of requirements block in NodePool.

    NodePool manifest example:

    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        spec:
          nodeClassRef:
            name: default
            kind: SelectelNodeClass
            group: karpenter.k8s.selectel
          requirements:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ['ru-7a', 'ru-7b']
            - key: node.kubernetes.io/instance-type
              operator: In
              values: ['SL1.1-2048', 'SL1.2-4096', 'SL1.2-8192']
            - key: karpenter.sh/capacity-type
              operator: In
              values: ['on-demand']
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 0s
        expireAfter: 720h
      limits:
        cpu: '1000'
        memory: 1000Gi
  2. Apply the manifest:

    kubectl apply -f nodepool.yaml

Parameters of requirements block in NodePool

In the NodePool object, the requirements block describes the requirements for the nodes to be created.

  • karpenter.sh/capacity-type: the type of nodes to be created: on-demand (uninterrupted nodes) or spot (interruptible nodes)
  • karpenter.k8s.selectel/instance-category: the configuration line. For example, Standard Line (SL) or GPU Line (GL). For more information about configurations, see the Configurations manual
  • karpenter.k8s.selectel/instance-family: the configuration line and its generation. For example, SL1 or GL1
  • karpenter.k8s.selectel/instance-generation: the generation of the line. For example, 1 or 2
  • node.kubernetes.io/instance-type: configurations (flavors). For example, ["SL1.1-2048", "SL1.2-4096", "SL1.2-8192"]. Configurations can be viewed in the subsection List of fixed configuration flavors in all pools of the Configurations instruction
  • karpenter.k8s.selectel/instance-cpu: the number of vCPUs. For example, Gt: "4" means more than 4
  • karpenter.k8s.selectel/instance-memory: the amount of RAM in GB. For example, Gt: "8" means more than 8 GB
  • karpenter.k8s.selectel/instance-gpu-manufacturer: the manufacturer of the graphics processing unit (GPU). The available value is ["NVIDIA"]
  • karpenter.k8s.selectel/instance-gpu-name: the GPU name for configurations with GPUs. For example, ["A100", "H100"]. Available GPUs can be viewed in the Available GPUs subsection of the Create a Managed Kubernetes Cluster with GPUs instruction
  • karpenter.k8s.selectel/instance-gpu-count: the number of GPUs. For example, Gt: "0" means more than 0, that is, at least one GPU
  • karpenter.k8s.selectel/instance-local-disk: specifies whether the local disk of the cloud platform is used as the boot disk. The available values are ["true"] and ["false"]
  • topology.kubernetes.io/zone: pool segments where node groups can be created. For example, ["ru-7a", "ru-7b"]. Available values can be viewed in the Managed Kubernetes subsection of the Availability Matrices instruction
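
As a sketch only (the zone and the thresholds are arbitrary examples, not recommendations), a requirements block that combines several of these keys could look like this; it would sit under spec.template.spec.requirements of the NodePool manifest above:

requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values: ['ru-7a']            # create nodes only in this pool segment
  - key: karpenter.k8s.selectel/instance-cpu
    operator: Gt
    values: ['4']                # more than 4 vCPUs
  - key: karpenter.k8s.selectel/instance-memory
    operator: Gt
    values: ['8']                # more than 8 GB of RAM
  - key: karpenter.sh/capacity-type
    operator: In
    values: ['on-demand']        # uninterrupted nodes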