Scale inference service

You can scale the inference service depending on the number of queries to the model:

  • configure a fixed number of inference instances — the actually deployed instances of the model. For example, add inference instances when the number of requests to the model increases, or remove them when it decreases;

  • or configure autoscaling. The number of inference instances will change automatically within the specified range, depending on the number of requests and their processing time.
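The autoscaling behavior described above can be illustrated with a toy scaling rule. This is only a conceptual sketch, not the provider's actual algorithm: the function name, the latency threshold, and the scale-down condition are all assumptions made for illustration. The key point it demonstrates is that the instance count always stays clamped to the configured range.

```python
def desired_instances(current, queue_len, avg_latency_s, target_latency_s,
                      min_instances, max_instances):
    """Toy autoscaling rule (illustrative only, not the service's real logic):
    scale up when average processing time exceeds the target,
    scale down when there is latency headroom and no queued requests."""
    if avg_latency_s > target_latency_s:
        current += 1                      # requests are too slow: add an instance
    elif avg_latency_s < 0.5 * target_latency_s and queue_len == 0:
        current -= 1                      # plenty of headroom: remove an instance
    # the result never leaves the configured [min, max] range
    return max(min_instances, min(current, max_instances))


# Example: 2 instances, latency above target -> scale up to 3
print(desired_instances(2, queue_len=4, avg_latency_s=1.5,
                        target_latency_s=1.0, min_instances=1, max_instances=5))
```

However the scaling decision is made, the clamping step is what the "specified range" in the setting corresponds to: the service never deploys fewer than the minimum or more than the maximum number of inference instances.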

To scale the inference service, change the number of inference instances.

Change the number of inference instances

You can use a fixed number of inference instances or set up autoscaling.

  1. In the control panel, on the top menu, click Products and select Inference Services.

  2. Open the inference service page → Service tab.

  3. In the Service Auto Scaling block, click Edit.

  4. Open the Fixed tab and specify the number of inference instances.

  5. Click Save. Applying the new scaling settings may take more than 10 minutes, and the inference service will be unavailable during this time.