
Inference Servers

Setting Up an Inference Server

Users can interact with an Inference Server through the SDK or through the API. Code snippets unique to your server can be found by clicking the Connect to server button.

[Screenshot: Connect to Inference Server]
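
For example, for an object detection model, a generated snippet typically follows this pattern (the model and project names below are placeholders; the snippet shown by the Connect to server button is specific to your server):

from chariot.client import connect
from chariot.models import Model

# Authenticate with Chariot, then load the model behind the Inference Server.
connect()
model = Model(name="my_model_name", project="my_project_name")

# Run a single inference request against the server.
result = model.detect("path/to/image.png")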

To create an Inference Server, click the Create button. You can modify the server's settings by clicking the Cog button. When the Model Version Settings modal window opens, you'll see tabs for Inference Server, Monitoring, and Storage. These tabs let you customize the deployment of your model so that you can minimize costs while tailoring performance to your specific requirements. Below, we review these options and highlight additional features available for certain task types.

Inference Server Components

Chariot Inference Servers consist of two components: a predictor and a transformer. Options in the Model Version Settings window (and the SDK's Inference Server settings functions) are handled by one of these two resources. Core Inference Server features, such as inference itself, are handled by the predictor, which is the resource that loads your model data when you start an Inference Server. Other features, such as inference storage and drift detection, are handled by the transformer, a separate resource that communicates automatically with the predictor.

[Screenshot: Create Inference Server]

Compute Resources

Within the Inference Server tab, the Base Resources section allows you to configure your model to run on either a GPU or CPU and allocate the necessary resources accordingly. Note that in some cases, the Inference Server may take several minutes to initialize if the required resources are not immediately available.

[Screenshot: Inference Server base settings]

Inference Server Scaling

The Scaling Resources section offers controls for adjusting the server's capacity based on demand. Setting the minimum scale to zero allows the server to shut down when idle, which helps reduce costs. This is similar to putting a computer into sleep mode: it will automatically wake up and resume operations when there is activity. You can control scaling behavior by specifying the threshold for concurrent requests and setting the duration of inactivity before scaling down begins.
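
If scale-to-zero is enabled, a client can tolerate the wake-up period by retrying a request that fails or times out while the server scales back up. The following is a minimal sketch under that assumption, not required Chariot usage; whether a request errors or simply blocks during wake-up depends on your deployment:

import time

from chariot.client import connect
from chariot.models import Model

connect()
model = Model(name="my_model_name", project="my_project_name")

# Hypothetical retry loop for a server that has scaled to zero: back off and
# retry while the server wakes. Adjust the attempt count and delays as needed.
result = None
for attempt in range(5):
    try:
        result = model.detect("path/to/image.png")
        break
    except Exception:
        time.sleep(2 ** attempt)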

Monitoring Settings

Drift detection is available for Chariot models that have an associated Training Run and a corresponding dataset. First, register the model and datasets so that the Inference Server can monitor incoming data for potential data drift. This lets you detect when production data begins to deviate from the data used during training. View these drift detections within the Monitor tab on your model's page. For more details, see Drift Detection.

[Screenshot: Inference Server monitoring settings]

Inference Storage

The Storage tab is focused on storage management. You have the option of storing the images sent to your model, along with the corresponding inferences. These inferences can be viewed under the Inferences tab on your model's page, and you can curate new datasets from this data to retrain your model. Because storing this data can consume significant storage space, Chariot offers retention policies to help manage and clean up outdated data that is no longer required. For more details on inference storage, see Inference Store.

[Screenshot: Inference Server storage settings]

Inference Engine Selection

Some models support multiple "inference engines," the underlying code and methods used to load the model and run inference.

When creating an Inference Server, if multiple inference engines are supported for the model, the Inference engine setting will be available, and the supported options will appear in the drop-down.

[Screenshot: Inference engine selection]

Sending Data to Models

The UI offers a try-it-out feature where inference can be run through a web interface. To use this feature, click the Actions button on the model's page and select Create Inference Server. After the Inference Server has started, the Try it Out tab will be enabled.

[Screenshot: Try it Out]

Batching

Batching is the process of passing a collection of inputs (images, strings, etc.) to a model in a single inference request rather than sending each input as a separate request. Depending on the model and the hardware it runs on, this can be more efficient. There are two ways that batching is done in Chariot: client side and server side. Client-side batching is when the user explicitly passes all of their inputs in one request. For example, with the SDK this looks like:

from chariot.client import connect
from chariot.models import Model

# Authenticate with Chariot and load the model backing the Inference Server.
connect()

model = Model(name="my_model_name", project="my_project_name")

# Client-side batching: all three inputs are sent in a single request.
model.detect(["path/to/image1.png", "path/to/image2.png", "path/to/image3.png"])

Server-side batching (also called adaptive batching) is when the Inference Server intercepts separate inference requests that arrive within a predetermined timeframe and automatically groups them into a single batched inference request that is then sent to the model. For example, if you send your inference requests separately as:

# Three separate requests; with adaptive batching enabled, the server
# groups them into a single batch before they reach the model.
for path in ["path/to/image1.png", "path/to/image2.png", "path/to/image3.png"]:
    result = model.detect(path)

then adaptive batching will automatically group these into a single batch before they reach the model.

Benefits

There are two main benefits of batching:

  • Maximize resource usage: Inference operations are usually "vectorized" (i.e., designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time.
  • Minimize inference overhead: Overhead includes things like input/output (IO) to communicate with the GPU or preprocessing of the incoming data. Up to a certain batch size, this overhead does not scale linearly with the number of data points, so it is beneficial to send batches that are as large as possible without degrading performance.

However, these benefits only scale up to a certain point, which is determined by the infrastructure, the machine learning framework used to train your model, or a combination of the two. To maximize the performance gains from batching, tailor the configuration to your specific model, environment, and use case; the best settings are typically found through experimentation.
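
To see why amortizing overhead matters, consider a toy cost model in which every request pays a fixed overhead and each input adds a roughly constant amount of compute (the numbers below are made up purely for illustration):

# Toy cost model with illustrative numbers only.
overhead_per_request = 0.050  # seconds of fixed cost per request (IO, preprocessing, ...)
time_per_item = 0.010         # seconds of compute per input

n_items = 32
unbatched = n_items * (overhead_per_request + time_per_item)  # pay the overhead 32 times
batched = overhead_per_request + n_items * time_per_item      # pay the overhead once

print(f"{n_items} separate requests: {unbatched:.2f}s")
print(f"one batched request of {n_items}: {batched:.2f}s")

In this toy model, batching wins as long as the fixed overhead dominates; in practice, the crossover point depends on your hardware and framework, which is why experimentation is recommended.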

Using Adaptive Batching

Chariot lets you configure adaptive batching independently through two parameters:

  • Maximum batch size: The maximum number of inference requests that the server will group together to process as a batch.
  • Maximum batch delay: The maximum time, in seconds, that the server should wait for new requests to group together for batch processing.
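
The sketch below shows, in simplified form, how these two parameters interact: a batch is closed as soon as it reaches the maximum size or the delay window expires, whichever comes first. This is an illustration of the concept, not Chariot's actual implementation:

import time
from queue import Empty, Queue

def collect_batch(requests: Queue, max_batch_size: int, max_batch_delay: float) -> list:
    """Gather requests until the batch is full or the delay window expires."""
    batch = []
    deadline = time.monotonic() + max_batch_delay
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay window expired; send whatever has accumulated
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break  # no more requests arrived within the window
    return batch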

Authentication

You will need to authenticate with Chariot in order to send data to models and get predictions. For detailed information on authentication methods, including creating client credentials, using the SDK, and programmatic authentication, see the Authentication documentation.

Large Language Models (LLMs)

Large Language Models may have additional settings. For comprehensive information about serving Large Language Models in Chariot, see the LLM documentation.