Model Monitoring

Chariot captures runtime metrics for Inference Servers including latency, system utilization, errors, and throughput. To view them, navigate to the Models page and select the Monitor tab.

Monitor tab

The top of the Monitor tab shows the number of running instances and a time range selector. Historical data availability depends on your deployment but is generally up to 60 days.

Metrics are organized into five collapsible sections. Select any section header to expand or collapse it.

Data Drift

When drift detection is enabled, drift results appear in this section. See Drift Detection for details.

If drift detection is not enabled, a message is displayed.

Drift disabled message

Request Latency

Request latency is captured in a histogram and displayed as aggregate statistics over the selected time range. Empty intervals indicate periods with no inference requests.

Metric	Description
Average	Total processing time divided by number of inferences
Median (P50)	50% of requests completed at or below this time
P90	90% of requests completed at or below this time
P95	95% of requests completed at or below this time
P99	99% of requests completed at or below this time

Use these values together to assess whether your model meets latency requirements across typical and worst-case requests.

Request latency graph

System Utilization

The System Utilization section shows resource usage for the Inference Server:

Metric	Description
GPU Utilization	Percentage of time that the GPU's compute cores (streaming multiprocessors [SMs]) were actively executing work during the sampling window. Reflects the fraction of time at least one task was running on the GPU and is more representative of sustained activity than instantaneous GPU utilization. Based on `DCGM_FI_PROF_SM_ACTIVE`. Empty for CPU-only models.
VRAM	GPU memory usage. Empty for CPU-only models.
CPU	Processor utilization across all cores.
RAM	System memory usage.

If utilization is consistently at the maximum, consider allocating more resources. If utilization is very low, consider reducing allocated resources.

GPU utilization graph

CPU and RAM graph

note

On some installations, Chariot may not have permission to capture system utilization. In those cases, these values will always be blank.

Response Status Codes

Each inference request returns an HTTP status code, grouped into four buckets:

Code range	Meaning
2xx	Success
3xx	Redirection (rare for Inference Servers)
4xx	Bad request (invalid data, authorization errors)
5xx	Server error (including rate limiting)

The top of the Response Status Codes section of the monitoring UI shows total counts per bucket for the selected time range, followed by a time-series graph.

Response status codes graph

note

Status code counts are stored in histogram buckets chunked by time. Off-by-one discrepancies can occur near bucket or time boundaries. Adjusting the time range may help.

Queue and Throughput

The Queue and Throughput section shows three metrics:

Metric	Description
Requests Per Minute	Rate of incoming requests; shown as current, average, and peak over the selected time range
Requests in Queue	Requests waiting to be forwarded to the Inference Server
Concurrent Requests	Requests currently being processed by the Inference Server

Queue and throughput graph

Max Concurrent Requests

The Max Concurrent Requests Inference Server setting controls how many simultaneous requests Chariot forwards to each Inference Server instance. The queue is also per instance; each running instance maintains its own independent queue with a capacity of 10× the max concurrent requests value.

Setting	Behavior (per instance)
Unset or `0`	All requests are forwarded directly; no queue is used; concurrent requests may exceed 1 under load
Positive value `N`	Up to `N` requests run concurrently; up to `10N` requests are held in queue; excess requests receive an HTTP 503 error

Incoming requests are load balanced across all running instances. This means that the total system capacity scales with the number of instances: With five instances and max concurrent requests set to 1, each instance holds up to 10 queued requests, giving the system a total capacity of 50 open requests at once.

Requests can remain in queue for up to 30 seconds before being rejected.

Example: With a single instance and max concurrent requests set to 2, Chariot allows two simultaneous requests and holds up to 20 in queue (10×). If 25 requests arrive at once and each takes one second to complete:

Requests 1–2 are forwarded to the model immediately (concurrent slots full).
Requests 3–22 are held in queue.
Requests 23–25 arrive with both concurrent slots and the queue full, and receive an immediate HTTP 503 error.

As each request is completed, Chariot pulls the next one from the queue until it empties. Adding more instances would increase both the concurrent and queue capacity proportionally.

Interpreting the Metrics

Use Requests in Queue and Concurrent Requests, together with Requests Per Minute and latency, to assess whether your deployment is appropriately sized:

High queue depth or frequent 503 errors indicate the need for more Inference Server instances.
Consistently low utilization may indicate over-provisioning.

Data Drift​

Request Latency​

System Utilization​

Response Status Codes​

Queue and Throughput​

Max Concurrent Requests​

Interpreting the Metrics​