Custom Inference Engines
Chariot supports defining custom Inference Engines for serving Custom models. At the most basic level, an Inference Engine consists of:
- An OCI container image
- A name
- A version
- A set of configuration parameters
Chariot can use this Inference Engine information to run the Inference Engine as a service, configured with a specific Custom model.
Creating and using custom Inference Engines requires custom code and some developer knowledge. The Inference Engine writer will also need to be able to publish the container image to a registry that is accessible by the cluster where the Inference Engine is used.
Creating a Custom Inference Engine
At its core, an Inference Engine is an OCI image that has been uploaded to a container registry and registered with Chariot. When a model's Inference Server settings point to the Inference Engine, that image will be used when starting an Inference Server for the model. Since Inference Engines can contain custom code, they can be used to support many different types of models.
To interoperate with Chariot, Inference Engines must expose a small set of well-known endpoints and can, optionally, implement a recognized inference protocol. At a minimum, your container must implement a GET health endpoint so Chariot can determine if the model is ready to service requests. To enable advanced features (Inference Store, drift detection, evaluation, and SDK compatibility), you should implement and declare a supported inference protocol—we recommend chariot-v2-kserve.
Parts of the Inference Engine
An Inference Engine is an OCI image that runs an HTTP server. The code itself can be written in the language of your choice. This server must be written to process inference requests for your target models, and the image must be stored in a Chariot-accessible container registry.
For example, to write an Inference Engine that serves Hugging Face models, you might use Python with FastAPI and the transformers library, and then build and push the image with podman.
Server
The HTTP server must implement:
- Health check: A GET endpoint that returns HTTP 200 when the model is ready to service requests. This is configured via the readiness_probe field when creating an Inference Engine. See the example below.
- Inference endpoint: A POST endpoint that accepts inference requests.
You control the request path prefix via the container_root_relative_base_url setting when registering the Inference Engine. This value is prepended to the endpoints above. For example:
- Simple base: / or empty → endpoints are /ready and /infer.
- Model-specific base: /v2/models/m-<lower-case-model-id> → endpoints are /v2/models/m-<lower-case-model-id>/ready and /v2/models/m-<lower-case-model-id>/infer.
Choose the pattern that matches your Inference Engine architecture; model-specific prefixes are useful when a single Inference Engine instance serves multiple models or versions.
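To make the composition concrete, here is a small sketch of how a base URL combines with the two required endpoints (the helper function and example model ID are illustrative, not part of Chariot's API):

```python
# Sketch: composing endpoint paths from container_root_relative_base_url.

def endpoint_paths(base_url: str) -> tuple[str, str]:
    """Join the configured base URL with the /ready and /infer endpoints."""
    base = base_url.rstrip("/")
    return f"{base}/ready", f"{base}/infer"

# Simple base: "/" (or empty) yields top-level endpoints
print(endpoint_paths("/"))  # ('/ready', '/infer')

# Model-specific base keeps each model's routes distinct
print(endpoint_paths("/v2/models/m-abc123"))
# ('/v2/models/m-abc123/ready', '/v2/models/m-abc123/infer')
```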
For example, with a FastAPI server, this might look like:
from fastapi import FastAPI, Response, Request
from fastapi.responses import JSONResponse
from http import HTTPStatus
import asyncio
from contextlib import asynccontextmanager


class ModelAPI:
    def __init__(self):
        self.model = None
        self._ready = False

    def _load_sync(self):
        """
        Load your model (synchronous)
        """
        # Your model loading logic here
        self.model = "model_instance"
        self._ready = True

    async def load(self):
        """
        Load model asynchronously
        """
        await asyncio.to_thread(self._load_sync)

    def _infer_sync(self, body: dict):
        """
        Infer on your model (synchronous)
        """
        assert self.model is not None
        # Your inference logic here
        return {"result": "inference_output"}

    async def infer(self, request: Request) -> JSONResponse:
        """
        Handle inference endpoint
        """
        post_body = await request.json()
        # Run inference in a worker thread so the event loop is not blocked
        result = await asyncio.to_thread(self._infer_sync, post_body)
        return JSONResponse(content=result)

    async def ready(self) -> Response:
        """
        Handle ready endpoint
        """
        return Response(status_code=HTTPStatus.OK if self._ready else HTTPStatus.BAD_REQUEST)


def main():
    import uvicorn

    api = ModelAPI()

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Load model on startup
        await api.load()
        yield

    app = FastAPI(lifespan=lifespan)

    # Add routes
    app.add_api_route("/ready", api.ready, methods=["GET"])
    app.add_api_route("/infer", api.infer, methods=["POST"])

    uvicorn.run(app, host="0.0.0.0", port=8080)


if __name__ == "__main__":
    main()
Be sure to bind to host="0.0.0.0". If you instead bind to the default host="127.0.0.1", the Kubernetes readiness probe will fail with an error similar to the following:
Readiness probe failed: Get "http://10.30.66.14:8080/engine-test/ready": dial tcp 127.0.0.1:8080: connect: connection refused
The readiness probe endpoint will be used to determine if the model is ready to process inference requests or not and should return a 200 response if it is. The infer endpoint does the actual inference for the model. Your model artifact (what you uploaded to Chariot) will be unzipped and copied to /mnt/models within your container. Additionally, Chariot will set the CHARIOT_MODEL_ID and CHARIOT_TASK_TYPE environment variables on your Inference Engine container; these can be used by the container for any purpose.
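A sketch of reading what Chariot provides at startup (CHARIOT_MODEL_ID, CHARIOT_TASK_TYPE, and the /mnt/models mount come from Chariot; the local-testing fallbacks are our own):

```python
import os
from pathlib import Path

# Environment variables set by Chariot; defaults here are for local testing only
model_id = os.environ.get("CHARIOT_MODEL_ID", "unknown-model")
task_type = os.environ.get("CHARIOT_TASK_TYPE", "unknown-task")

# The unzipped model artifact is mounted here when the Inference Server starts
model_dir = Path("/mnt/models")
files = sorted(p.name for p in model_dir.glob("*")) if model_dir.exists() else []

print(f"Serving model {model_id} (task: {task_type}); artifact files: {files}")
```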
Any logs sent to stdout will be collected by Chariot and shown to users.
Be careful to consider thread safety in your Inference Engine code. FastAPI is capable of serving multiple requests concurrently using async Python. We don't want to block the event loop, which is why we used asyncio.to_thread in the above example to call our model's inference methods. This means that two inference requests can be performed at the same time. Therefore, you should ensure that the underlying code running your model is thread safe (or use a lock).
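If your model is not thread safe, one option is to serialize access with a lock while still keeping the event loop free. A minimal sketch (the predict() method is a stand-in for your model's real inference call):

```python
import asyncio
import threading

class LockedModel:
    def __init__(self):
        self._lock = threading.Lock()

    def predict(self, body: dict) -> dict:
        # Stand-in for a non-thread-safe model call
        return {"echo": body}

    def _infer_sync(self, body: dict) -> dict:
        # Only one thread runs the model at a time; concurrent requests queue here
        with self._lock:
            return self.predict(body)

    async def infer(self, body: dict) -> dict:
        # The event loop stays free while a worker thread holds the lock
        return await asyncio.to_thread(self._infer_sync, body)

result = asyncio.run(LockedModel().infer({"x": 1}))
print(result)  # {'echo': {'x': 1}}
```

Note that a lock trades throughput for safety; if your model supports batching or multiple instances, those may scale better.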
If you want the option of running your model on a GPU, be sure your model loading and inference pipelines look for a GPU and use it appropriately.
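Assuming a PyTorch-based model, device selection at load time might look like the following sketch, which falls back to CPU when no GPU is visible to the container:

```python
# Choose a device once at load time; reuse it for model and inputs
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # torch not installed; CPU-only serving
    device = "cpu"

print(f"Loading model on {device}")
# In real loading code: model.to(device), then inputs.to(device) per request
```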
Containerfile
Continuing the above example, an example Containerfile to build your image is:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "main.py"]
Here, main.py is a file that contains your server code, and requirements.txt is the list of requirements for your server (e.g., FastAPI, Uvicorn, transformers, etc.).
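For this sketch, the requirements file might contain the following (unpinned here for brevity; pin exact versions in practice for reproducible builds):

```text
fastapi
uvicorn
transformers
torch
```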
To build this image, put the Containerfile in your app's working directory and use podman build:
podman build -t my-custom-engine .
Creating the Inference Engine in Chariot
After you have created the Inference Engine image, you will need to upload it to your container repository. Keep in mind that, depending on where you are using Chariot, you may need to ask an administrator for help uploading the image to an internal repository. Once it's uploaded, you should have a URI for the image that you can use to register it with Chariot.
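With podman, tagging and pushing might look like this (the registry host, repository path, and tag below are placeholders for your own registry's URI):

```shell
# Tag the locally built image with your registry's URI
podman tag my-custom-engine registry.example.com/my-team/my-custom-engine:1.0.0

# Log in first if the registry requires authentication, then push
podman login registry.example.com
podman push registry.example.com/my-team/my-custom-engine:1.0.0
```

The full pushed URI (registry.example.com/my-team/my-custom-engine:1.0.0 in this sketch) is what you register with Chariot below.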
You can now register your Inference Engine with Chariot. This can be done with the SDK:
from chariot.client import connect
connect()
from chariot.models import engines
engine_name = "My Engine Name"
project_id = "abc"
version = "1.0.0"
container_image_cpu = "docker.io/my-engine-image-cpu"
container_image_gpu = "docker.io/my-engine-image-gpu" # Can be the same as the CPU image, if built properly
env = [{"name": "SOME_ENV_VAR", "value": "some value"}] # Optional env vars added to the Inference Engine
engine_version_id = engines.create_engine_version(
name=engine_name,
project_id=project_id,
version=version,
container_image_cpu=container_image_cpu,
container_image_gpu=container_image_gpu,
is_default=True,
supports_multi_gpu=False,
env=env,
readiness_probe=engines.ReadinessProbe(
path="/ready",
port=8080, # Ensure you use the same port that your server used
initial_delay_seconds=5,
timeout_seconds=10,
period_seconds=1,
success_threshold=1,
failure_threshold=3,
),
predictor_env_schema=[
{
"name": "LOG_LEVEL",
"type": "string",
"default_value": "INFO",
"display_text": "Log level",
},
{
"name": "QUANTIZE_WEIGHTS",
"type": "bool",
"default_value": "false",
"display_text": "Quantize weights",
},
],
enforce_predictor_env_schema=False,
# Optional inference protocol
inference_protocol="chariot-v2-kserve"
)
Some more details about this operation:
- container_image_cpu and container_image_gpu define which image URI to pull the image from when running without or with a GPU, respectively. If the image can run on both, you can use the same URI in both places. If neither of these is set, an error is returned. If only one is set, the Inference Engine will not run in the other mode.
- is_default is used to flag the default Inference Engine version.
- supports_multi_gpu is used to tell the system that the Inference Engine can use more than 1 GPU on a single node (see Multi-GPU for context). The Inference Engine must have container_image_gpu set if you want to set this to true. The default is false. When set to false, the Inference Engine can still support a single GPU if container_image_gpu is set. This flag is primarily used to validate inference service settings.
- env is a list of name/value environment variables to pass to the Inference Engine on startup.
- readiness_probe is the Kubernetes configuration for this Inference Engine's readiness probe.
- predictor_env_schema is used to specify a list of possible config options. These options can then be set as Inference Server settings (under the predictor_env_var key) and will be passed to the Inference Engine container as environment variables when it is started. Valid entries for the type field are "string", "int", "bool", and "float". In the example, users will be able to specify LOG_LEVEL and QUANTIZE_WEIGHTS values when creating an Inference Server through the UI.
- If enforce_predictor_env_schema is true, only variables listed in predictor_env_schema will be allowed as entries in the Inference Server settings predictor_env_var list. When false, arbitrary entries are allowed, and the UI provides a way to create new name/value pairs. All such pairs implicitly have type "string".
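Since schema values arrive in the container as environment variable strings, your server code must parse them back to the declared types. A sketch for the LOG_LEVEL and QUANTIZE_WEIGHTS options registered above (the env_bool helper is our own convention, not part of Chariot):

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean-typed predictor env var from its string form."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

# Values chosen by the user in the Inference Server settings, or defaults
log_level = os.environ.get("LOG_LEVEL", "INFO")
quantize = env_bool("QUANTIZE_WEIGHTS")

print(f"log_level={log_level} quantize={quantize}")
```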
Although it is not shown in this example, you can also configure the following Inference Engine settings:
- entrypoint: Maps to the ENTRYPOINT in the container
- command: Maps to the CMD passed to the OCI container
Once your Inference Engine is registered, you can use the SDK to associate it with a model using its selector as described earlier.
Model Storage
A Chariot model is created by uploading model files through the UI or SDK. The model files are stored by Chariot and will be mounted in the Inference Engine container at /mnt/models when an Inference Server is started. This is a common place to put model weights and other model configuration.
Using a Custom Inference Engine
To use an existing Inference Engine, you will need to ensure you have a Custom model in Chariot that will work with your Inference Engine. The contents of this model's files can be anything.
Given a model archive, call it model.zip, you can use the SDK to upload the model to Chariot in the same way other models are uploaded:
from chariot.client import connect
connect()
from chariot.models import import_model, ArtifactType
class_labels_list = {"cat": 0, "dog": 1}
model = import_model(
project_id=<the project id for the model>,
name=<model name>,
summary=<a short text summary of the model>,
version=<the model's version>,
task_type=<a supported task type, like TaskType.OBJECT_DETECTION>,
artifact_type=ArtifactType.CUSTOM_ENGINE,
model_path="./model.zip",
class_labels=class_labels_list,
)
Then you can associate an Inference Engine to its Inference Server settings:
model.set_inference_server_settings(
{
"engine_selector": {
"org_name": <org containing the Inference Engine>,
"project_name": <project containing the Inference Engine>,
"engine_name": <Inference Engine name>,
},
}
)
Once this is set, you can start the model's Inference Server using the UI or SDK, and it will use your custom Inference Engine to perform inference.