Custom Inference Engines
Chariot supports defining custom inference engines for serving Custom models. A custom inference engine consists of a container image, a name, a version, and a set of configuration parameters. Chariot can use this engine information to run the inference engine as a service, configured with a specific Custom model.
Creating and using custom engines requires custom code and some developer knowledge. The engine writer will also need to be able to push the image to a registry that is accessible from the cluster where the engine is used. In private clusters, an administrator may be required to install the image.
Creating a Custom Inference Engine
At its core, an inference engine is an OCI image that has been uploaded to a container registry and registered with Chariot. When a model's inference server settings point to the inference engine, that image will be used when starting an inference server for the model. Since engines contain custom code, they can be used to support many different types of models.
To interoperate with the rest of Chariot, engines must follow some conventions. For example, the container must have `/ready` and `/infer` endpoints. For inference storage, drift detection, and SDK interop, the container must accept a certain input shape and produce a certain output shape.
Parts of the engine
Engines consist of the following components:
- An HTTP Server, written in your language of choice, following some guidelines,
- A Containerfile to package your server into an OCI image, and
- An OCI image stored in a Chariot-accessible container registry
For example, to write an engine that serves HuggingFace models, you could use:
- Python and the FastAPI framework for the HTTP API
- The HuggingFace transformers Python library to run the model
- podman and an OCI Containerfile to build the image
Server
The HTTP server must implement `/ready` (GET) and `/infer` (POST) endpoints. For example, with a FastAPI server, this might look like:
```python
from http import HTTPStatus
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse


class ModelAPI:
    def __init__(self):
        self.model = None
        self._ready = False

    def _load_sync(self):
        """Load your model (synchronous)."""
        # Your model loading logic here
        self.model = "model_instance"
        self._ready = True

    async def load(self):
        """Load the model asynchronously."""
        await asyncio.to_thread(self._load_sync)

    def _infer_sync(self, body: dict):
        """Run inference on your model (synchronous)."""
        assert self.model is not None
        # Your inference logic here
        return {"result": "inference_output"}

    async def infer(self, request: Request) -> JSONResponse:
        """Handle the /infer endpoint."""
        post_body = await request.json()
        # Run the (potentially blocking) inference in a worker thread
        result = await asyncio.to_thread(self._infer_sync, post_body)
        return JSONResponse(content=result)

    async def ready(self) -> Response:
        """Handle the /ready endpoint."""
        return Response(status_code=HTTPStatus.OK if self._ready else HTTPStatus.BAD_REQUEST)


def main():
    import uvicorn

    api = ModelAPI()

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Load the model on startup
        await api.load()
        yield

    app = FastAPI(lifespan=lifespan)

    # Add routes
    app.add_api_route("/ready", api.ready, methods=["GET"])
    app.add_api_route("/infer", api.infer, methods=["POST"])

    uvicorn.run(app, host="0.0.0.0", port=8080)


if __name__ == "__main__":
    main()
```
The ready endpoint is used to determine whether the model is ready to process inference requests, and should return a 200 response when it is. The infer endpoint performs the actual inference for the model. Your model artifact (what you uploaded to Chariot) will be unzipped and copied to `/mnt/models` within your container. Additionally, Chariot will set the `CHARIOT_MODEL_ID` and `CHARIOT_TASK_TYPE` environment variables on your engine container; these can be used by the container for any purpose.
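For instance, the `_load_sync` method of the `ModelAPI` class above could read these environment variables and load weights from the mounted artifact. The sketch below is only illustrative: it assumes the uploaded archive contains a transformers-compatible model, and the task name is a placeholder for whatever your model actually does.

```python
import os

from transformers import pipeline


class ModelAPI:
    ...

    def _load_sync(self):
        """Load the model from the artifact Chariot mounts at /mnt/models (synchronous)."""
        # Environment variables set by Chariot on the engine container
        model_id = os.environ.get("CHARIOT_MODEL_ID", "unknown")
        task_type = os.environ.get("CHARIOT_TASK_TYPE", "unknown")
        print(f"Loading model {model_id} (task type: {task_type}) from /mnt/models")

        # Assumes the uploaded archive is a transformers-compatible model directory;
        # "text-classification" is a placeholder task name.
        self.model = pipeline(task="text-classification", model="/mnt/models")
        self._ready = True
```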
Any logs sent to stdout will be collected by Chariot and shown to users.
Be careful to consider thread safety in your engine code. FastAPI can serve multiple requests concurrently using async Python. We don't want to block the event loop, which is why the example above uses `asyncio.to_thread` to call the model's inference method. This means two inference requests can run at the same time, so you should ensure that the underlying code running your model is thread safe (or use a lock).
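If your model is not thread safe, a simple approach is to serialize inference calls with a lock. A minimal sketch, adapting the `_infer_sync` method from the example above:

```python
import threading


class ModelAPI:
    def __init__(self):
        self.model = None
        self._ready = False
        # Serializes access to the model if it is not thread safe
        self._lock = threading.Lock()

    def _infer_sync(self, body: dict):
        """Run inference while holding the lock (synchronous)."""
        assert self.model is not None
        with self._lock:
            # Only one thread runs the model at a time
            return {"result": "inference_output"}
```

Because `asyncio.to_thread` runs `_infer_sync` on a worker thread, holding a `threading.Lock` here blocks only that thread, not the event loop.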
If you want the option of running your model on a GPU, be sure your model loading and inference pipelines look for a GPU and use it appropriately.
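For example, if your engine happens to use PyTorch (an assumption; use whatever framework your model needs), a common pattern is to pick the device at load time and reuse it during inference. `load_my_model()` below is a hypothetical placeholder for your own loading logic.

```python
import torch


class ModelAPI:
    ...

    def _load_sync(self):
        """Load the model onto a GPU if one is available, otherwise fall back to CPU."""
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # load_my_model() is a hypothetical placeholder for your own loading logic
        self.model = load_my_model().to(self.device)
        self._ready = True
```

At inference time, remember to move input tensors to `self.device` as well.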
Containerfile
The engine needs a `Containerfile` to create the container image that runs your server. Any valid OCI image that will run on Linux is okay to use.
When you build your engine, ensure that you install the necessary dependencies (e.g. for the example above, `fastapi` and `uvicorn`) as well as any dependencies needed to run your model. Additionally, ensure that the image runs your HTTP server on startup.
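A minimal `Containerfile` for the FastAPI example might look like the sketch below. The base image, file names (`server.py`, `requirements.txt`), and exact dependencies are assumptions; adjust them to your project.

```dockerfile
# Minimal sketch of a Containerfile for the FastAPI engine above
FROM python:3.11-slim

WORKDIR /app

# Install server and model dependencies (e.g. fastapi, uvicorn, transformers)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the HTTP server code (assumed to contain the FastAPI example above)
COPY server.py .

# Start the HTTP server on the port your readiness probe expects (8080 here)
CMD ["python", "server.py"]
```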
Creating the Engine in Chariot
After you have created the engine image, you will need to upload it to your container repository. Keep in mind that depending on where you are using Chariot you may need to ask an administrator for help uploading the image to an internal repository. Once uploaded you should have a URI for the image that you can use to register it with Chariot.
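For example, with podman (as in the earlier example), building and pushing might look like the following, where the registry host and repository path are placeholders for your own registry:

```shell
# Build the engine image from your Containerfile
podman build -t registry.example.com/my-team/my-engine:1.0.0 .

# Push it to a registry that the Chariot cluster can reach
podman push registry.example.com/my-team/my-engine:1.0.0
```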
You can now register your engine with Chariot. This can be done with the SDK:
```python
from chariot.client import connect

connect()

from chariot.models import engines

engine_version_id = engines.create_engine_version(
    name=<the engine name>,
    project_id=<the project id>,
    version=<the engine version, like "1.0.0">,
    container_image_cpu=<the URL for the image>,
    container_image_gpu=<the URL for the image>,  # Can be the same as the CPU image, if built properly
    is_default=True,
    env=[{"name": "SOME_ENV_VAR", "value": "some value"}],  # Optional env vars added to the engine
    readiness_probe=engines.ReadinessProbe(
        path="/ready",
        port=8080,  # Ensure you use the same port as what your server used
        initial_delay_seconds=5,
        timeout_seconds=10,
        period_seconds=1,
        success_threshold=1,
        failure_threshold=3,
    ),
)
```
Some more details about this operation:
- `container_image_cpu` and `container_image_gpu` specify the image URI to pull from when running without or with a GPU, respectively. If the image can run on both, you can use the same URI in both places. If neither is set, an error is returned. If only one is set, the engine will not run in the other mode.
- `is_default` is used to flag the default engine version.
- `env` is a list of environment variables (name/value pairs) to pass to the engine on startup.
- `readiness_probe` is the Kubernetes configuration for this engine's readiness probe.
Not shown in this example, you can also configure the following engine settings:
- `command` - maps to the `ENTRYPOINT` in the container
- `args` - maps to the `CMD` passed to the OCI container
Once your engine is registered, you can use the SDK to associate it with a model using its engine selector, as described below in Using a Custom Inference Engine.
Model Storage
A Chariot model is created by uploading model files through the UI or SDK. The model files are stored by Chariot and will be mounted in the engine container at `/mnt/models` when an inference server is started. This is a common place to put model weights and other model configuration.
Using a Custom Inference Engine
To use an existing inference engine you will need to ensure you have a Custom model in Chariot that will work with your engine. The contents of this model's files can be anything.
Given a model archive, call it `model.zip`, you can use the SDK to upload the model to Chariot in the same way other models are uploaded:
```python
from chariot.client import connect

connect()

from chariot.models import import_model, ArtifactType

class_labels_list = {"cat": 0, "dog": 1}

model = import_model(
    project_id=<the project id for the model>,
    name=<model name>,
    summary=<a short text summary of the model>,
    version=<the model's version>,
    task_type=<a supported task type, like TaskType.OBJECT_DETECTION>,
    artifact_type=ArtifactType.CustomEngine,
    model_path="./model.zip",
    class_labels=class_labels_list,
)
```
Then you can associate an engine with the model through its inference server settings:
```python
model.set_inference_server_settings(
    {
        "engine_selector": {
            "org_name": <org containing the engine>,
            "project_name": <project containing the engine>,
            "engine_name": <engine name>,
        },
    }
)
```
Once this is set, you can start the model's inference server using the UI or SDK and it will use your custom engine to perform inference.