Parameters of the inference-service model

You can specify model parameters when creating an inference service. These parameters determine how the service processes requests and how much computational resources it consumes.

Data type of model parameters

The format in which the model weights are stored. This is a model optimization method (quantization) in which floating-point numbers (e.g., FP32) are converted to a lower-precision format such as 8-bit integers (INT8). Smaller formats reduce the model's size, speed up token generation, and lower the consumption of computational resources. This allows models to run on resource-constrained devices, such as cell phones or embedded systems, with minimal loss of accuracy.
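The FP32-to-INT8 conversion described above can be sketched as symmetric per-tensor quantization: each weight is divided by a scale factor and rounded to an 8-bit integer, which cuts storage 4x relative to FP32. This is a minimal illustration, not the exact algorithm any particular inference service uses; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of FP32 weights to INT8 (illustrative)."""
    scale = np.abs(weights).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Restore approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# INT8 codes take 1 byte per value instead of 4; rounding error is at most scale/2
print(np.abs(weights - restored).max() <= scale / 2 + 1e-6)
```

Real serving stacks typically use finer-grained schemes (per-channel or per-group scales) to keep accuracy loss minimal, but the size and speed trade-off is the same.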

Data type for KV cache (Key-Value Cache)

The format for storing intermediate model calculations during token generation. The Key-Value Cache is a mechanism for accelerating token generation in LLMs based on the Transformer architecture. More compact formats reduce memory consumption and allow longer contexts to be processed.

Maximum context length

The maximum number of tokens a model can process within a single request.
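Because the limit covers the whole request, the prompt and the tokens to be generated must fit within it together. A minimal sketch of that check, with a hypothetical helper name and an assumed 4096-token limit:

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, max_context_len: int) -> bool:
    """Check that the prompt plus the requested generation fits the model context."""
    return prompt_tokens + max_new_tokens <= max_context_len

print(fits_context(3500, 500, 4096))  # True: 4000 tokens fit in 4096
print(fits_context(4000, 500, 4096))  # False: 4500 tokens exceed 4096
```

When the check fails, a client typically either truncates the prompt or lowers the number of tokens requested for generation.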