Inference service model parameters

You can specify model parameters when creating an inference service. They determine how the inference service will process requests and consume computing resources.

Model parameter data type

The format used to store model weights. Choosing a lower-precision format is a model optimization method called quantization: floating-point weights (for example, FP32) are converted to a more compact format, such as 8-bit integers (INT8). More compact formats reduce the model size, can speed up token generation, and lower computing resource consumption. This makes it possible to run models on resource-constrained devices, such as mobile phones or embedded systems, usually with only a small loss of accuracy.
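As a rough illustration of the quantization idea described above, the following sketch converts a few floating-point weights to INT8 with a single shared scale and estimates how the weight size changes. It is a simplified, assumed scheme (symmetric per-tensor quantization), not necessarily the method the inference service uses; the function names and the 7-billion-parameter figure are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from INT8 values and the scale."""
    return [q * scale for q in quantized]


def weight_size_bytes(num_params, bits_per_param):
    """Approximate storage needed for the weights alone."""
    return num_params * bits_per_param // 8


weights = [0.1, -1.0, 0.42]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Hypothetical model with 7 billion parameters.
fp32_size = weight_size_bytes(7_000_000_000, 32)
int8_size = weight_size_bytes(7_000_000_000, 8)
```

The restored values differ from the originals by at most about half of `scale`, which is the accuracy cost paid for storing each weight in one byte instead of four.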

Data type for KV cache (Key-Value Cache)

The storage format for the intermediate results (attention keys and values) that the model caches during token generation. The Key-Value Cache (KV cache) is a mechanism that speeds up token generation in LLMs based on the transformer architecture by reusing the keys and values already computed for previous tokens. More compact formats reduce memory consumption and allow longer contexts to be processed.
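To show why the KV cache data type matters for memory, the sketch below estimates the cache size with the commonly used formula for standard transformer attention: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions are hypothetical and chosen only for illustration; real models and serving engines may add overhead or use different layouts.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value, batch_size=1):
    """Estimate KV cache size in bytes for a given context length."""
    # Factor 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 8192 tokens.
cache_16bit = kv_cache_bytes(32, 8, 128, 8192, bytes_per_value=2)
cache_8bit = kv_cache_bytes(32, 8, 128, 8192, bytes_per_value=1)

GIB = 1024 ** 3
cache_16bit_gib = cache_16bit / GIB
cache_8bit_gib = cache_8bit / GIB
```

Under these assumptions, halving the bytes per cached value halves the cache size, so the same amount of memory can hold a context roughly twice as long.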

Maximum context length

The maximum number of tokens the model can process within a single request.
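A client can use this limit to budget a request before sending it. The sketch below assumes that the limit covers both the prompt tokens and the tokens generated in response, which is common for LLM serving but is not stated in this document; the function names and numbers are hypothetical.

```python
def remaining_token_budget(prompt_tokens, max_context_len):
    """Tokens left for generation after the prompt (assumes a shared limit)."""
    return max(0, max_context_len - prompt_tokens)


def fits_in_context(prompt_tokens, max_new_tokens, max_context_len):
    """Check whether the prompt plus the requested output fits in the limit."""
    return prompt_tokens + max_new_tokens <= max_context_len


# Hypothetical request against a service with a 4096-token context limit.
budget = remaining_token_budget(prompt_tokens=3500, max_context_len=4096)
ok = fits_in_context(prompt_tokens=3500, max_new_tokens=512, max_context_len=4096)
too_long = fits_in_context(prompt_tokens=3500, max_new_tokens=1024, max_context_len=4096)
```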