
Monitoring a Training Run

The status, checkpoints, and metrics of a Training Run can be retrieved through the UI or SDK.
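For scripted monitoring, one common pattern is to poll the run's status until it reaches a terminal state. The sketch below is a minimal illustration and does not use the actual SDK: `fetch_status` is a hypothetical callable that you would implement with whatever SDK call returns the run's status, and the status names in `TERMINAL_STATUSES` are assumptions that may not match the platform's real values.

```python
import time
from typing import Callable

# Assumed terminal status names; the platform's real values may differ.
TERMINAL_STATUSES = frozenset({"COMPLETED", "FAILED", "CANCELLED"})


def wait_for_training_run(
    fetch_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it returns a terminal status or max_polls is hit.

    fetch_status is a hypothetical wrapper around the SDK call that returns
    the Training Run's current status as a string. The last status seen is
    returned, which may be non-terminal if the polling budget runs out.
    """
    status = fetch_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
        status = fetch_status()
    return status
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays; in normal use the default `time.sleep` applies.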

Within a project, the Training Runs page lists all Training Runs associated with that project, along with each run's status and the actions available for it.

training-run-monitoring

Click a run name to view detailed information about that Training Run, organized into the tabs below.

Details

The Details tab summarizes key aspects of your Training Run, including its status, selected settings, and information about the dataset you chose to train on.

training-run-details-tab

Logs

The Logs tab provides access to two types of logging information from your Training Runs: container logs and pod events.

Container Logs

Container logs show output directly from your Training Run container, including your application logs, print statements, and any error messages from your training code.

Select the Container Logs radio button to view logs from the training container.

training-run-logs

Pod Events

Pod events provide infrastructure-level logs from the Kubernetes system that schedules and manages your training containers. These logs are useful for troubleshooting deployment and resource issues.

Select the Pod Events radio button to view infrastructure logs from Kubernetes.

training-run-kube-logs
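When troubleshooting scheduling or resource issues, it can help to look only at events that Kubernetes marks as warnings. Kubernetes Event objects carry `type` (`Normal` or `Warning`), `reason`, and `message` fields. The sketch below assumes you have the events available as a list of dictionaries with those keys; how you obtain them from the platform is not covered here, and the sample events are illustrative rather than real output.

```python
from typing import Dict, List


def warning_events(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the events whose type is "Warning", preserving order."""
    return [event for event in events if event.get("type") == "Warning"]


# Illustrative events only; not captured from a real Training Run.
sample_events = [
    {"type": "Normal", "reason": "Scheduled", "message": "pod assigned to a node"},
    {"type": "Warning", "reason": "FailedScheduling", "message": "insufficient resources"},
    {"type": "Normal", "reason": "Pulled", "message": "container image pulled"},
]

for event in warning_events(sample_events):
    print(f'{event["reason"]}: {event["message"]}')
```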

Metrics

The Metrics tab displays plots of metrics recorded during training, such as training loss and validation accuracy. You can also view your model's performance at different checkpoints within this tab. When a checkpoint reaches the performance you want, you can export that checkpoint to the model catalog.

training-run-metrics
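If you retrieve per-checkpoint metrics programmatically, choosing which checkpoint to export can be reduced to a simple comparison. The sketch below assumes the metrics are already loaded as a list of dictionaries; the record layout, the `checkpoint` key, and the metric name `validation_accuracy` are all hypothetical and are not taken from the SDK.

```python
from typing import Any, Dict, List


def best_checkpoint(
    records: List[Dict[str, Any]],
    metric: str = "validation_accuracy",
    higher_is_better: bool = True,
) -> Dict[str, Any]:
    """Return the record with the best value for the given metric.

    Records that lack the metric are ignored. Set higher_is_better=False
    for metrics such as loss, where smaller values are preferable.
    """
    scored = [record for record in records if metric in record]
    if not scored:
        raise ValueError(f"no records contain the metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda record: record[metric])


# Hypothetical metrics for three checkpoints.
records = [
    {"checkpoint": "step-1000", "validation_accuracy": 0.81, "training_loss": 0.52},
    {"checkpoint": "step-2000", "validation_accuracy": 0.88, "training_loss": 0.34},
    {"checkpoint": "step-3000", "validation_accuracy": 0.86, "training_loss": 0.29},
]

print(best_checkpoint(records)["checkpoint"])
```

Note that the checkpoint with the lowest training loss is not necessarily the one with the best validation metric, which is why the metric and its direction are explicit parameters.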