Inference Protocols
Overview
Inference protocols are named specifications that define how Chariot communicates with inference engines. When creating a Custom Inference Engine, understanding inference protocols is essential for ensuring that your inference engine integrates properly with Chariot features.
Think of an inference protocol as a contract that describes:
- URL paths: The endpoints within the inference engine used for different operations
- Request format: The structure and data types expected for incoming inference requests
- Response format: The structure and data types returned in inference responses
Each inference protocol is identified with a unique string (e.g., chariot-v2-kserve) and can support multiple task types. An inference engine that adheres to an inference protocol doesn't need to support every task type defined by that protocol—just the task types that the inference engine supports.
Why Inference Protocols Matter
Protocols determine which Chariot features are available for your Inference Server:
- Drift detection: Requires Chariot to extract and compare request data against baseline data
- Inference storage: Requires Chariot to parse responses to identify inferences and metadata
- Evaluation: Requires Chariot to parse inferences and score against ground truth labels
Inference engines that do not specify an inference protocol or use one that is not known by Chariot can still run on Chariot, but they won't have access to these advanced features. Inference engines implementing known inference protocols integrate more tightly with the Chariot platform and enable the full monitoring and evaluation capabilities.
Chariot-Recognized Inference Protocols
While Chariot supports a few legacy protocols, chariot-v2-kserve is the recommended format for new inference engines.
Chariot V2 KServe Protocol
chariot-v2-kserve is a versatile protocol focused on the JSON schema of request and response payloads. While the protocol defines the structure, specific field values may be determined by the inference engine itself.
URL Paths
The chariot-v2-kserve protocol expects an inference endpoint at /infer. The full request path is formed by prepending the inference engine's container_root_relative_base_url, which, for the chariot-v2-kserve protocol, is expected to be /v2/models/m-<lower-case-model-id>/.
The complete path for an inference request to a specific model is:
https://%%CHARIOT-HOST%%/api/serve/
serve-model-<lower-case-model-id>-pred/v2/models/m-<lower-case-model-id>/infer/
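As a sketch, the path can be assembled from the model id like this (the host name and model id below are placeholders, not real values):

```python
def inference_url(host: str, model_id: str) -> str:
    """Build the full chariot-v2-kserve inference URL for a model.

    `host` and `model_id` are placeholders; substitute your Chariot
    host and the model's id.
    """
    mid = model_id.lower()
    return (
        f"https://{host}/api/serve/"
        f"serve-model-{mid}-pred/v2/models/m-{mid}/infer/"
    )

# Hypothetical host and model id:
url = inference_url("chariot.example.com", "ABC123")
```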
Requests
chariot-v2-kserve uses an envelope format that comes from the KServe ecosystem. Each request is formatted as JSON and wrapped in a required array of objects called inputs:
{
  "inputs": [
    {
      "name": "input-image-0",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["input data"],
      "parameters": {"action": "predict_proba"}
    }
  ]
}
The following fields are required for each object in the inputs array: name, datatype, data, and shape. An optional parameters field is also supported.
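A minimal request-side check of those required fields might look like this (an illustrative sketch; Chariot does not ship this helper):

```python
REQUIRED_FIELDS = {"name", "datatype", "data", "shape"}

def validate_request(payload: dict) -> list:
    """Return a list of problems found in a chariot-v2-kserve request body.

    An empty list means every object in `inputs` carries the four
    required fields; the optional `parameters` field is not checked.
    """
    inputs = payload.get("inputs")
    if not isinstance(inputs, list) or not inputs:
        return ["'inputs' must be a non-empty array"]
    problems = []
    for i, obj in enumerate(inputs):
        missing = REQUIRED_FIELDS - obj.keys()
        if missing:
            problems.append(f"inputs[{i}] missing {sorted(missing)}")
    return problems
```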
The name field is an arbitrary client-specified string.
The datatype field is described here in the context of MLServer. Strings and complex objects use BYTES. The supported values for datatype and the other fields are defined by the inference engine, and may include values like INT32.
The data for the request is passed in the data field. The data field is defined to be one dimensional, or a "flat array," and can contain elements of the following types:
- Strings of text
- Encoded binary data (e.g., base64 encoded image data)
- Numbers
The shape field is used to describe the actual shape of multi-dimensional data for the model to evaluate. The inference engine is expected to use shape to re-format the data to the provided dimensions. A single dimension can be specified as [N] or [N, 1] where N is the length of the data array. For images, that length is often just 1—indicating that one image is being passed in. Variable-size dimensions are specified as -1.
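As an illustration of how an engine might apply shape, here is a sketch that nests a flat data array back into the declared dimensions (pure Python to stay self-contained; a real engine would likely just call numpy.reshape):

```python
from math import prod

def restore_shape(data, shape):
    """Nest a flat `data` array into the dimensions declared in `shape`.

    A single -1 entry marks a variable-size dimension whose length is
    inferred from the data.
    """
    shape = list(shape)
    if -1 in shape:
        known = prod(d for d in shape if d != -1)
        shape[shape.index(-1)] = len(data) // known

    def nest(flat, dims):
        if len(dims) == 1:
            return flat
        step = len(flat) // dims[0]
        return [nest(flat[i * step:(i + 1) * step], dims[1:]) for i in range(dims[0])]

    return nest(list(data), shape)

# A 2x3 matrix sent as a flat array of six numbers:
matrix = restore_shape([1, 2, 3, 4, 5, 6], [2, 3])   # [[1, 2, 3], [4, 5, 6]]
# A variable-length batch of pairs:
batch = restore_shape([1, 2, 3, 4], [-1, 2])         # [[1, 2], [3, 4]]
```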
The optional parameters field allows the inference engine to provide different behaviors based on request-time settings. For example, some models accept a content_type parameter and treat the value np as an indication that the data is a NumPy array. Other models use the action parameter to differentiate between predict and predict_proba requests for image classification. The ImageText2Text task type uses the prompts parameter to pass an array of string prompts to the model.
For the computer vision task types IMAGE_CLASSIFICATION, OBJECT_DETECTION, IMAGE_SEGMENTATION, and ORIENTED_OBJECT_DETECTION, some inference engines support multiple values in the inputs array, where each input is expected to be an encoded image to process. Those inference engines return multiple outputs, one per input, as described below. Chariot can parse requests in this format with the datatype BYTES to extract encoded images or text to calculate semantic score.
Some task types (e.g., non-computer-vision tasks for Hugging Face) treat multiple "inputs" as parameters of a single request. This is how MLServer encodes inputs in mlserver-huggingface. Here is an example for a Hugging Face (non-OpenAI-API) text generation model, where the prompt is separate from the max_new_tokens, temperature, and do_sample parameters:
{
  "inputs": [
    {
      "data": ["Prompt here"],
      "datatype": "BYTES",
      "name": "args",
      "parameters": {},
      "shape": [1]
    },
    {
      "data": [500],
      "datatype": "INT64",
      "name": "max_new_tokens",
      "parameters": {"content_type": "raw"},
      "shape": [1]
    },
    {
      "data": [0.8],
      "datatype": "FP64",
      "name": "temperature",
      "parameters": {"content_type": "raw"},
      "shape": [1]
    },
    {
      "data": [true],
      "datatype": "INT8",
      "name": "do_sample",
      "parameters": {"content_type": "raw"},
      "shape": [1]
    }
  ]
}
In all cases, the format of the request is the same, but the inference engine is interpreting it in its own way.
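As an illustration, an engine that treats inputs as named parameters might unpack them along these lines (a sketch; the "args" convention follows the mlserver-huggingface example above):

```python
def unpack_named_inputs(payload: dict):
    """Split a multi-input request into positional args and keyword params.

    Following the convention in the example above, the input named "args"
    carries the prompt(s); every other input becomes a keyword parameter
    with its single data value unwrapped.
    """
    args, kwargs = [], {}
    for inp in payload["inputs"]:
        if inp["name"] == "args":
            args = inp["data"]
        else:
            kwargs[inp["name"]] = inp["data"][0]
    return args, kwargs

request = {
    "inputs": [
        {"data": ["Prompt here"], "datatype": "BYTES", "name": "args",
         "parameters": {}, "shape": [1]},
        {"data": [500], "datatype": "INT64", "name": "max_new_tokens",
         "parameters": {"content_type": "raw"}, "shape": [1]},
        {"data": [0.8], "datatype": "FP64", "name": "temperature",
         "parameters": {"content_type": "raw"}, "shape": [1]},
    ]
}
args, kwargs = unpack_named_inputs(request)
```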
Custom Metadata
Inference requests can include metadata in the parameters section under the metadata key. The value should be a JSON-encoded string and can be broken into standard and extended sections. The inference engine should copy this metadata through to the response, where it becomes queryable/filterable values in the Inference Store.
"metadata": json.dumps(
    {
        "standard_metadata": {},
        "extended_metadata": [
            {"key": "a", "type": "int", "value": "1"},
            {
                "key": "score_threshold",
                "type": "float",
                "value": "0.55",
            },
        ],
    }
),
The standard_metadata field should be left empty. It is reserved for internal Chariot processes, which populate this field with system-defined metadata keys. For more details on these reserved keys, see here.
Each entry in extended_metadata is a dictionary with three fields:
- key is the string name of the metadata and is used in filtering
- value is a string version of the value (numbers are passed as strings)
- type is one of str, float, int, or dict and is used by the Inference Store to type the data for operations like "greater than"
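As a request-side sketch, metadata in this shape could be built with a small helper (the helper name is hypothetical; types are inferred from the Python values):

```python
import json

def make_metadata(**values) -> str:
    """Encode extended metadata as the JSON string the protocol expects.

    This helper is hypothetical. Every value is passed as a string, and
    standard_metadata is left empty because it is reserved for Chariot.
    """
    type_names = {str: "str", float: "float", int: "int", dict: "dict"}
    return json.dumps({
        "standard_metadata": {},
        "extended_metadata": [
            {"key": k, "type": type_names[type(v)], "value": str(v)}
            for k, v in values.items()
        ],
    })

# Reproduces the example above:
metadata = make_metadata(a=1, score_threshold=0.55)
```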
Responses
chariot-v2-kserve uses an envelope for the output format, similar to the one used for requests. Each response is formatted as JSON and wrapped in a required array of dictionaries called outputs:
{
  "outputs": [
    {
      "name": "input-image-0",
      "shape": [1],
      "datatype": "BYTES",
      "data": []
    }
  ]
}
As described in the next few sections, the actual inferences are returned in the data array based on the task type.
The inference engine may return extra data in the output dictionaries. Chariot considers parameters fields at the root level or on individual output objects to be valid. Currently, parameters are used to return the semantic score alongside a header, but Chariot is moving toward using only an HTTP header, to reduce changes to the data formatted by the inference engine. The inference engine may also return parameters that match those passed in, such as the action parameter, to indicate which operation was used.
{
  "id": "output-1",
  "model_name": "mymodel",
  "model_version": "1.0.0",
  "outputs": [
    {
      "name": "myinput",
      "shape": [1, 777],
      "datatype": "BYTES",
      "data": [],
      "parameters": {}
    }
  ],
  "parameters": {}
}
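On the engine side, a response envelope could be assembled along these lines. This sketch assumes the task-specific inferences are serialized as a single JSON string in data with datatype BYTES; the exact encoding is determined by the inference engine:

```python
import json

def make_response(input_name: str, inferences: list) -> dict:
    """Wrap task-specific inferences in the chariot-v2-kserve envelope.

    Assumption: the inferences array is serialized to one JSON string and
    returned as the single element of data (hence shape [1] and BYTES).
    """
    return {
        "outputs": [
            {
                "name": input_name,
                "shape": [1],
                "datatype": "BYTES",
                "data": [json.dumps(inferences)],
            }
        ]
    }

response = make_response("input-image-0", [{"label": "cat", "score": 0.7}])
```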
Image Classification
Image classification inferences are formatted as an array of dictionaries in JSON. Each dictionary should contain a label field with the string label being scored and a score field with the numeric score.
[
  {"label": "cat", "score": 0.7},
  {"label": "dog", "score": 0.3}
]
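For example, a client could recover the top prediction from such a result with a one-liner (an illustrative sketch, not part of the protocol):

```python
def top_prediction(inferences):
    """Return the label with the highest score from a classification result."""
    return max(inferences, key=lambda d: d["score"])["label"]

label = top_prediction([
    {"label": "cat", "score": 0.7},
    {"label": "dog", "score": 0.3},
])  # "cat"
```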
Object Detection
Object detection inferences are formatted as an array of dictionaries in JSON. Each dictionary contains the label and score as well as the bounding box information encoded in the fields:
- xmin: The "left" side of the bounding box on the x-axis
- xmax: The "right" side of the bounding box on the x-axis
- ymin: The "top" of the bounding box on the y-axis
- ymax: The "bottom" of the bounding box on the y-axis
Coordinates for object detection are "flipped," with the y-axis value of 0 at the top and increasing as you go down. The x-axis is left to right, and units are in pixels. So, an xmin of 7 would mean 7 pixels from the left while a ymin of 7 would mean 7 pixels from the top.
[
  {"label": "cat", "score": 0.6, "xmin": 7, "xmax": 17, "ymin": 50, "ymax": 77},
  {"label": "dog", "score": 0.2, "xmin": 12, "xmax": 21, "ymin": 17, "ymax": 71}
]
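Because the coordinates are plain pixel values, standard box math applies directly. A sketch using the example detections above (the iou helper is illustrative, not part of the protocol):

```python
def box_area(det):
    """Pixel area of a detection's axis-aligned bounding box."""
    return (det["xmax"] - det["xmin"]) * (det["ymax"] - det["ymin"])

def iou(a, b):
    """Intersection-over-union of two detections. The top-left origin does
    not matter here because only coordinate differences are used."""
    ix = max(0, min(a["xmax"], b["xmax"]) - max(a["xmin"], b["xmin"]))
    iy = max(0, min(a["ymax"], b["ymax"]) - max(a["ymin"], b["ymin"]))
    inter = ix * iy
    return inter / (box_area(a) + box_area(b) - inter)

cat = {"label": "cat", "score": 0.6, "xmin": 7, "xmax": 17, "ymin": 50, "ymax": 77}
dog = {"label": "dog", "score": 0.2, "xmin": 12, "xmax": 21, "ymin": 17, "ymax": 71}
overlap = iou(cat, dog)
```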
Oriented Object Detection
For oriented detections, Chariot still expects an array of dictionaries, but now each dictionary has the concept of a rotated bounding box. The label and score fields remain the same, but the outline of the detection is specified with:
- cx: The center of the bounding box on the x-axis
- cy: The center of the bounding box on the y-axis
- w: The width of the bounding box
- h: The height of the bounding box
- r: The rotation in radians; axes point rightward and downward, and rotation is clockwise for positive r
Coordinates for oriented object detection are "flipped," with the y-axis value of 0 at the top and increasing as you go down. The x-axis is left to right. Units are normalized from 0 to 1, where 1 is the width or height of the image. So, a cx of 0.5 with a cy of 0.5 would be the center of the image.
[
  {"label": "cat", "score": 0.99, "cx": 0.5, "cy": 0.5, "w": 0.01, "h": 0.01, "r": 0.123},
  {"label": "dog", "score": 0.99, "cx": 0.3, "cy": 0.7, "w": 0.08, "h": 0.04, "r": -0.1}
]
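Given this convention, the corner points of a detection can be recovered with a small rotation. A sketch (the helper is illustrative; with the y-axis pointing down, the standard rotation matrix turns points clockwise for positive r):

```python
import math

def obb_corners(det):
    """Corner points of an oriented box in normalized image coordinates.

    With the y-axis pointing down, this rotation matrix rotates points
    clockwise for positive r, matching the convention described above.
    """
    cx, cy, w, h, r = det["cx"], det["cy"], det["w"], det["h"], det["r"]
    cos_r, sin_r = math.cos(r), math.sin(r)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [
        (cx + dx * cos_r - dy * sin_r, cy + dx * sin_r + dy * cos_r)
        for dx, dy in offsets
    ]
```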
Image Segmentation
Image segmentation models should return an array of contours, where each contour is defined as a sequence of points in pixel coordinates following the OpenCV contour specification. A contour, per OpenCV, can be explained simply as a curve joining all the continuous points (along the boundary) having the same color or intensity. As with other inferences, there are label and score fields.
[
  {
    "label": "cat",
    "score": 0.5,
    "contour": [
      [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}],
      [{"x": 5.0, "y": 6.0}, {"x": 7.0, "y": 8.0}]
    ]
  },
  {
    "label": "dog",
    "score": 0.75,
    "contour": [
      [{"x": 9.0, "y": 10.0}, {"x": 11.0, "y": 12.0}],
      [{"x": 13.0, "y": 14.0}, {"x": 15.0, "y": 16.0}]
    ]
  }
]
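Since each curve is a closed point sequence in pixel coordinates, a client can, for example, compute the enclosed area with the shoelace formula (an illustrative sketch, not part of the protocol):

```python
def contour_area(contour):
    """Total area in square pixels enclosed by a contour's point
    sequences, via the shoelace formula on each closed curve."""
    total = 0.0
    for curve in contour:
        s = 0.0
        # Pair each point with the next, wrapping around to close the curve.
        for p, q in zip(curve, curve[1:] + curve[:1]):
            s += p["x"] * q["y"] - q["x"] * p["y"]
        total += abs(s) / 2
    return total
```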
Image Text to Text
For image plus text to text models, there may be multiple prompts. The inferences returned are linked to those prompts using dictionaries with a prompt field and an answer field. The answer is the text generated by the model based on the input image and related prompt.
[
  {"prompt": "what in picture?", "answer": "dogs and cats"},
  {"prompt": "what colors in picture?", "answer": "red blue and green"}
]
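A request for this task type might pair an encoded image with the prompts parameter described in the Requests section (a sketch; the input name and the exact parameters shape are assumptions, not fixed by the protocol):

```python
import base64

def image_text_request(image_bytes: bytes, prompts: list) -> dict:
    """Build a chariot-v2-kserve request for an ImageText2Text model.

    The image is base64-encoded into data; the prompts array rides in
    the optional parameters field. Both the input name and the
    parameters layout here are assumptions.
    """
    return {
        "inputs": [
            {
                "name": "input-image-0",
                "shape": [1],
                "datatype": "BYTES",
                "data": [base64.b64encode(image_bytes).decode()],
                "parameters": {"prompts": prompts},
            }
        ]
    }

request = image_text_request(b"\x89PNG...", ["what in picture?"])
```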