Parameters of the inference-service model

You can specify model parameters when creating an inference service. These parameters determine how the service processes requests and how much computational resources it consumes.

Data type of model parameters

The format in which the model weights are stored. This is a model optimization method (quantization) in which floating-point numbers (e.g., FP32) are converted to a lower-precision format such as 8-bit integers (INT8). Smaller formats reduce the model's size, speed up token generation, and lower the consumption of computational resources. This allows models to run on resource-constrained devices, such as cell phones or embedded systems, with minimal loss of accuracy.
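The FP32-to-INT8 conversion described above can be sketched as symmetric per-tensor quantization: each weight is divided by a scale factor and rounded to an 8-bit integer, which cuts storage 4x relative to FP32. This is a minimal illustration, not the exact algorithm any particular inference service uses; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of FP32 weights to INT8 (illustrative)."""
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Restore approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# INT8 codes take 1 byte per value instead of 4; rounding error is at most scale/2
print(np.abs(weights - restored).max() <= scale / 2 + 1e-6)
```

Real serving stacks typically use finer-grained schemes (per-channel or per-group scales) to keep accuracy loss minimal, but the size and speed trade-off is the same.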

Data type for KV cache (Key-Value Cache)

The format for storing intermediate model calculations during token generation. The Key-Value Cache is a mechanism for accelerating token generation in LLMs based on the Transformer architecture. More compact formats reduce memory consumption and allow longer contexts to be processed.

Maximum context length

The maximum number of tokens a model can process within a single request.
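Because the limit covers the whole request, the prompt and the tokens to be generated must fit within it together. A minimal sketch of that check, with a hypothetical helper name and an assumed 4096-token limit:

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, max_context_len: int) -> bool:
    """Check that the prompt plus the requested generation fits the model context."""
    return prompt_tokens + max_new_tokens <= max_context_len

print(fits_context(3500, 500, 4096))  # True: 4000 tokens fit in 4096
print(fits_context(4000, 500, 4096))  # False: 4500 tokens exceed 4096
```

When the check fails, a client typically either truncates the prompt or lowers the number of tokens requested for generation.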