
Introduction

Chariot provides a codeless environment for deep learning training, so anyone from curious novices to experienced data scientists can train models without worrying about writing code, provisioning servers, or installing software.

Currently, the following computer vision tasks are supported:

  • Image classification
  • Object detection
  • Image segmentation

The following natural language processing (NLP) tasks are supported:

  • Text classification
  • Token classification

Chariot’s model training functions are primarily built on top of PyTorch but also support additional training frameworks. For further information regarding its ecosystem of models and descriptions of the model architectures, navigate to the PyTorch Document Library. For more information on choosing the best architecture, check out our Data Science Fundamentals guide on neural network architecture. We also provide a complete list of supported models, their sizes (number of trainable parameters), and their disk footprint (checkpoint size, in MB) in the appendix.

Training a Model

Before training a model, you must determine your training parameters, data processing settings, and various other critical details that will go into the Training Run. Chariot provides default settings for many of these variables, and you can choose from very minimal options in the Quick Start training template as outlined below.

If you would like greater control over the model training process, we suggest researching the optimal settings for your use case and having this information available prior to starting a Training Run so that the process is smooth and effective. You can specify all your model parameters and settings in the Full Configuration template as outlined below. Your training parameters, data processing settings, and resource selections can be provided either via the UI or the SDK.
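
For example, a Training Run can be scripted end to end. The snippet below is a minimal sketch only: the `chariot_sdk` module, client, and method names are illustrative assumptions rather than the documented SDK API, so consult the SDK reference for the actual calls.

```python
# Hypothetical sketch of creating a Training Run programmatically.
# NOTE: the module, client, and method names here are illustrative
# assumptions, not the documented Chariot SDK API.
from chariot_sdk import Client  # hypothetical import

client = Client()  # assumed to authenticate against your Chariot instance

run = client.training_runs.create(      # hypothetical method
    project="my-project",
    name="resnet18-classifier-v1",
    task_type="Image Classification",
    train_dataset="my-dataset",         # dataset, View, and Snapshot
    architecture="resnet18",
    epochs=10,
)
print(run.status)                       # hypothetical status attribute
```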

To start a Training Run from the Chariot UI, open the associated project and click the Training Runs icon on the left-hand navigation bar. Click the Start New Run button to begin creating a Training Run in Chariot.

training-run-list

Select a Training Blueprint

Blueprints are training templates that guide users through different ways to train a machine learning model. You will see the following options to select from:

  • Quick Start: This is a simple training template that is ideal for first-time users, less experienced practitioners, or those who simply wish to get started quickly!
  • Full Configuration: This is a training template that provides a high degree of customization on model architecture, parameters, and other settings. This is ideal for data scientists who want to customize their Training Run based on experience, research, and use case.
  • Advanced Configuration (JSON): This option is typically used by advanced users and Striveworks personnel and includes the ability to provide a JSON that matches an expected Training Run configuration (see the illustrative sketch after this list).
  • Any other custom training Blueprints created by users in the project. (Click here for more information on training Blueprints and how to create your own.)
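
To give a feel for the Advanced Configuration option, here is an illustrative JSON fragment. The field names and values are assumptions for the sketch, not the actual schema; a reliable starting point is the Copy Training Config option on an existing run (described under Cloning a Training Run below).

```json
{
  "name": "resnet18-classifier-v1",
  "version": "1.0",
  "task_type": "Image Classification",
  "architecture": "resnet18",
  "train_dataset": "my-dataset",
  "num_train_steps": 5000,
  "optimizer": "SGD",
  "learning_rate": 0.01
}
```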

blueprint-selection

If you selected Quick Start, the following instructions will guide you through the remaining steps. If you selected Full Configuration, skip to the Full Configuration instructions.


Quick Start

Name the Run and Specify the Task

  • Name: Name your run whatever you'd like! We recommend including distinguishing characteristics in the name to help differentiate and manage your runs; this will be particularly helpful when you're viewing them in a list on the Training Run home page.
  • Version: Set the version of this Training Run for your own records. By default, this will be set to 1.0, but it can be overridden by any string.
  • Data Type: Select the kind of data you are training your model to make predictions on:
    • Images: For training computer vision models
    • Text: For training natural language processing models
  • Task Type: Select the type of task your model will perform. Image classification, object detection, and image segmentation are available for computer vision model training, while text classification and token classification are available for text-based natural language processing.

data-type-task-type-selection

Select Your Dataset

  • Add training dataset: Choose the training dataset you wish to train your model on, the specific View, and the Snapshot.
  • Add validation dataset: Choose the validation dataset you wish to use during training to evaluate the performance of your model, the specific View, and the Snapshot.
  • Negative samples: Select whether you would like to include negative samples (data that do not contain any cases of the selected labels you are training for) while training your model.
  • Select labels: Choose a subset of the available classes of a dataset that you would like to train your model to predict.

dataset-selection


Choose Your Model, Training Settings, and Resources

In this simplified training template, select some basic training characteristics based on your needs.

  • Model settings: Choose Start from scratch to build your model without any preloaded model weights, or choose Start from previously trained model to fine-tune from an existing model in Chariot.
  • Model architecture: If you choose to start from scratch, we provide three options for Image Classification that vary based on the size of the architecture. Typically, larger models require more computing power. The architecture that will best suit your use case varies widely based on many factors and can only be properly determined through experimentation. As a starting point, choosing a smaller model is a good way to get started and receive feedback to iterate on in the training process.
  • Training length: Select the length of the Training Run. One epoch is one complete pass through the entire training dataset by the machine learning model. During each epoch, the model processes every data point in the training set once, adjusts its internal parameters (weights), and learns from the data. (A worked example follows this list.)
  • Resource selection: Specify the allocation of GPU and CPU resources for your Training Run. Please note that usable compute resources may vary depending on the other entities that your organization is actively running, like Inference Servers, Workspaces, and other Training Runs. If available, GPUs will show up in the drop-down menu; click Show Limits to get an idea of the limits of CPU cores and RAM that you can allocate for the run. The Resources Available message provides feedback on whether your chosen allotments are valid. For a more thorough deep dive into how to allocate resources effectively, refer to the Resource Management documentation.
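
To make training length concrete, here is a quick back-of-the-envelope calculation relating epochs to optimizer steps; the dataset size and batch size below are arbitrary example values, not Chariot defaults.

```python
import math

# Example values only; your dataset size and batch size will differ.
dataset_size = 10_000   # samples in the training dataset
batch_size = 32         # samples per mini-batch

steps_per_epoch = math.ceil(dataset_size / batch_size)  # 313 optimizer steps
epochs = 5
total_steps = steps_per_epoch * epochs                  # 1,565 optimizer steps

print(f"{steps_per_epoch} steps per epoch; {total_steps} steps over {epochs} epochs")
```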

data-type-task-type-selection


Review Your Selections

Review all the selections you made in prior steps. You can always go back to previous steps and edit any settings that you wish to change. When you're ready, click Submit to start training.

review-step


Full Configuration

Data Type Selection

Select the kind of data you are training your model to make predictions on and click Next.

  • Images: For training computer vision models
  • Text: For training natural language processing models

data-type-selection


General Information

  • Name: Name your run whatever you'd like! We recommend including distinguishing characteristics in the name to help differentiate and manage your runs; this will be particularly helpful when you're viewing them in a list on the Training Run home page.
  • Version: Set the version of this Training Run for your own records. By default, this will be set to 0.1, but it can be overridden by any string.
  • Task Type: Select the type of task your model will perform. Image classification, object detection, and image segmentation are available for computer vision models while text classification and token classification are available for text-based natural language processing models.

name-and-task-type-selection


Dataset Selection

  • Training and validation dataset selection: The training and validation dataset selection allows you to specify which Chariot-catalogued datasets to train/validate on.
  • Label selection: The label selector allows training on a subset of the available classes of a dataset.

dataset-selection


Training Settings

If you selected Full Configuration, you can fully customize your model settings and training parameters.

  • Model Settings: Choose Start from scratch to build your model without any preloaded model weights, or choose Start from model to fine-tune from an existing model in Chariot. When starting from scratch, you must choose a model architecture. For the complete list of supported model architectures, along with their sizes (number of parameters and depth) and relative inference speeds, see our Data Science Fundamentals guide. For specific implementation details of the architectures, navigate to the PyTorch Document Library.
  • Train Data Settings: You can choose to apply augmentations to your training data, which are applied based on the RandAugment strategy (see the sketch after this list). If this fits your use case, you can read more about data augmentations in the Data Science Fundamentals guide.
    • Random Crop: A data augmentation that crops each image at a randomly chosen location.
    • Apply Gradient Clipping: Option to prevent model over-correction during training.
    • Apply Data Augmentations: This option enables various data augmentations, such as rotation or horizontal shear.
      • Use Color Transformation: This option allows you to utilize computer vision color transformations, including auto contrast, brightness, color, contrast, equalize, posterize, sharpness, solarize, and invert.
      • Number of Random Transformations: This option allows you to set the number of random transformations applied during the Training Run.
      • Transformation Strength: This option allows you to set the strength of the transformations applied in the above settings.
  • Validation Data Settings:
    • Center crop: This option enables the center crop transformation for validation data, cropping images to a target aspect ratio.
    • Center crop height: This option allows you to set the height dimension for the target transformation ratio.
    • Center crop width: This option allows you to set the width dimension for the target transformation ratio.
  • Training settings:
    • Number of training steps: This option allows you to set the number of optimizer steps to train for; note that this is not the number of epochs but rather the number of mini-batches processed by the trainer.
    • Step frequency to evaluate: This option allows you to set how often to run evaluation against the provided validation dataset.
    • Step frequency to save checkpoints: This option allows you to set how often to save model checkpoints, which you can use to resume a run if it gets prematurely stopped either manually or by a system failure. Checkpoints can also be exported to the Model Catalog.
    • Optimizer: You have the option to use any of the PyTorch optimizers.
    • Learning Rate: This configuration controls the step size during optimization. It determines how much model weights are updated in response to the gradient of the loss function. See Learning Rate for more information.
  • Advanced Optimizer Settings: You have the option to define the selected optimizer's parameters.
  • Advanced Augmentation Settings: Allows disabling of specific data augmentations and color transformations if you have those applied.
  • Image Preprocessing Settings: Contains configuration options for Linear Stretch and CLAHE.

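Because Chariot's training is built on PyTorch and the augmentation options follow the RandAugment strategy, the settings above map naturally onto standard torchvision and PyTorch constructs. The sketch below illustrates that mapping only; it is not Chariot's internal implementation, and all numeric values are placeholders.

```python
import torch
from torch import nn
from torchvision import transforms

# Train Data Settings: RandomCrop crops at a random location; RandAugment's
# num_ops and magnitude correspond to "Number of Random Transformations"
# and "Transformation Strength".
train_tf = transforms.Compose([
    transforms.RandomCrop(224, pad_if_needed=True),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# Validation Data Settings: center crop to a target height and width.
val_tf = transforms.Compose([
    transforms.CenterCrop((224, 224)),  # (height, width)
    transforms.ToTensor(),
])

# Training settings: any PyTorch optimizer with a chosen learning rate;
# gradient clipping caps the update size to prevent over-correction.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))  # stand-in architecture
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def training_step(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    return loss
```
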
full-configuration-training-settings-1-of-2
full-configuration-training-settings-2-of-2

Resource Requirements

  • Specify the allocation of GPU and CPU resources for your Training Run. Please note that usable compute resources may vary depending on the other entities that your organization is actively running, like Inference Servers, Workspaces, and other Training Runs. If available, GPUs will show up in the drop-down menu; click Show Limits to get an idea of the limits of CPU cores and RAM that you can allocate for the run. The Resources Available message provides feedback on whether your chosen allotments are valid. For a more thorough deep dive into how to allocate resources effectively, refer to the Resource Management documentation.

training-resource-requirements

Monitoring a Training Run

The status, checkpoints, and metrics of a Training Run can be retrieved through the UI or SDK.
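
As with creating a run, monitoring can be scripted. The snippet below reuses the same hypothetical client as the earlier SDK sketch; the method and attribute names are illustrative assumptions, not the documented SDK API.

```python
# Hypothetical monitoring sketch; method and attribute names are
# illustrative assumptions, not the documented Chariot SDK API.
from chariot_sdk import Client  # hypothetical import

client = Client()
run = client.training_runs.get(project="my-project",
                               name="resnet18-classifier-v1")  # hypothetical

print(run.status)                            # e.g. running / stopped / complete
for ckpt in run.checkpoints():               # hypothetical: saved checkpoints
    print(ckpt.step, ckpt.path)
for name, series in run.metrics().items():   # hypothetical: recorded metrics
    print(name, series[-1])                  # latest value of each metric
```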

Within a project, the Training Runs page lists all Training Runs associated with that project, along with details about their status and any actions that can be accomplished with that Training Run.

training-run-monitoring

Click on the run name for detailed information about your Training Run, including the tabs below.

Details

The Details tab summarizes key aspects of your Training Run, including its status, selected settings, and information associated with the dataset you chose to train on.

training-run-details-tab

Checkpoints

As your Training Run progresses, Chariot saves the model periodically, based on the checkpoint frequency that you specified when setting up the run. The checkpoints table, found via the Metrics tab, lists those checkpoints for your Training Run. Click the Catalog Checkpoint button to export a checkpoint directly into the Model Catalog. Checkpoints may also be removed by clicking the Delete Checkpoint button.

training-run-checkpoints

Logs

The Logs tab provides access to two types of logging information from your Training Runs: container logs and pod events.

Container Logs

Container logs show output directly from your Training Run container, including your application logs, print statements, and any error messages from your training code.

Select the Container Logs radio button to view logs from the training container.

training-run-logs

Pod Events

Pod events provide infrastructure-level logs from the Kubernetes system that schedules and manages your training containers. These logs are useful for troubleshooting deployment and resource issues.

Select the Pod Events radio button to view infrastructure logs from Kubernetes.

training-run-kube-logs

Metrics

This tab displays plots of metrics recorded during training, such as the training loss and validation accuracies. You can also view the performance of your model at different checkpoints within this tab.

training-run-metrics

Restarting a Training Run

A Training Run that resulted in an error or that has been stopped may be restarted. When you select the Restart Run button, you'll be presented with a modal for adjusting resources from the previous settings.

training-restart-run

The modal will be populated with the previous settings. To restart the run with the same settings, simply click Restart; otherwise, adjust the resources as needed and then click Restart.

training-restart-modal

Note that this will update the existing run rather than create a new run. If you would like to view previous resource settings from the run, you can view those in the Events tab under Run Restart Requested statuses.

Cloning a Training Run

There may be cases when you want to restart a Training Run using a previous run's settings with some tweaks. Chariot supports this via the ability to clone a Training Run.

To clone a Training Run, go to the Training Runs page and click the Clone button on the right side of the page. You will then be redirected to the Training Run process described earlier, with the settings of the selected Training Run pre-populated in all the fields. You may still make changes to any Training Run settings prior to running it.

On this page, you will find additional options:

  • Delete Run: Delete the Training Run and its associated data.
  • Copy Training Config: Copy the JSON configuration text that specifies the training settings to be utilized with the Use Custom Config option when creating a new Training Run.

training-clone-run

Appendix

Sizes of Available Models

The size of a model depends on the number of trainable parameters it has. This is often a consideration when choosing a model, as larger models tend to take longer to train but can often achieve higher accuracy. Below is a table of currently supported models in Chariot and their sizes.

note

We call a model "small" if it has fewer than 10,000,000 trainable parameters, "medium" if it has between 10,000,000 and 30,000,000 trainable parameters, and "large" if it has more than 30,000,000 trainable parameters.

| Model Name | Architecture | Size | Number of Trainable Parameters | Memory Footprint |
| --- | --- | --- | --- | --- |
| squeezenet1_1 | squeezenet1_1 (Torchvision) | Small | 1,245,506 | 4 MB |
| squeezenet1_0 | squeezenet1_0 (Torchvision) | Small | 1,258,434 | 4 MB |
| shufflenet_v2_x0_5 | shufflenet_v2_x0_5 (Torchvision) | Small | 1,376,802 | 5 MB |
| mnasnet0_5 | mnasnet0_5 (Torchvision) | Small | 2,228,522 | 8 MB |
| shufflenet_v2_x1_0 | shufflenet_v2_x1_0 (Torchvision) | Small | 2,288,614 | 8 MB |
| mobilenet_v3_small | mobilenet_v3_small (Torchvision) | Small | 2,552,866 | 9 MB |
| mnasnet0_75 | mnasnet0_75 (Torchvision) | Small | 3,180,218 | 12 MB |
| shufflenet_v2_x1_5 | shufflenet_v2_x1_5 (Torchvision) | Small | 3,513,634 | 13 MB |
| mobilenet_v2 | mobilenet_v2 (Torchvision) | Small | 3,514,882 | 13 MB |
| regnet_y_400mf | regnet_y_400mf (Torchvision) | Small | 4,354,154 | 16 MB |
| mnasnet1_0 | mnasnet1_0 (Torchvision) | Small | 4,393,322 | 16 MB |
| efficientnet_b0 | efficientnet_b0 (Torchvision) | Small | 5,298,558 | 20 MB |
| mobilenet_v3_large | mobilenet_v3_large (Torchvision) | Small | 5,493,042 | 21 MB |
| regnet_x_400mf | regnet_x_400mf (Torchvision) | Small | 5,505,986 | 21 MB |
| mnasnet1_3 | mnasnet1_3 (Torchvision) | Small | 6,292,266 | 24 MB |
| regnet_y_800mf | regnet_y_800mf (Torchvision) | Small | 6,442,522 | 24 MB |
| regnet_x_800mf | regnet_x_800mf (Torchvision) | Small | 7,269,666 | 27 MB |
| shufflenet_v2_x2_0 | shufflenet_v2_x2_0 (Torchvision) | Small | 7,404,006 | 28 MB |
| efficientnet_b1 | efficientnet_b1 (Torchvision) | Small | 7,804,194 | 30 MB |
| densenet121 | densenet121 (Torchvision) | Small | 7,988,866 | 30 MB |
| efficientnet_b2 | efficientnet_b2 (Torchvision) | Small | 9,120,004 | 35 MB |
| regnet_x_1_6gf | regnet_x_1_6gf (Torchvision) | Small | 9,200,146 | 35 MB |
| regnet_y_1_6gf | regnet_y_1_6gf (Torchvision) | Medium | 11,212,440 | 42 MB |
| resnet18 | resnet18 (Torchvision) | Medium | 11,699,522 | 44 MB |
| efficientnet_b3 | efficientnet_b3 (Torchvision) | Medium | 12,243,242 | 47 MB |
| googlenet | googlenet (Torchvision) | Medium | 13,014,898 | 49 MB |
| densenet169 | densenet169 (Torchvision) | Medium | 14,159,490 | 54 MB |
| regnet_x_3_2gf | regnet_x_3_2gf (Torchvision) | Medium | 15,306,562 | 58 MB |
| efficientnet_b4 | efficientnet_b4 (Torchvision) | Medium | 19,351,626 | 74 MB |
| regnet_y_3_2gf | regnet_y_3_2gf (Torchvision) | Medium | 19,446,348 | 74 MB |
| densenet201 | densenet201 (Torchvision) | Medium | 20,023,938 | 77 MB |
| resnet34 | resnet34 (Torchvision) | Medium | 21,807,682 | 83 MB |
| resnext50_32x4d | resnext50_32x4d (Torchvision) | Medium | 25,038,914 | 95 MB |
| resnet50 | resnet50 (Torchvision) | Medium | 25,567,042 | 97 MB |
| inception_v3 | inception_v3 (Torchvision) | Medium | 27,171,274 | 103 MB |
| swin_t | swin_t (Torchvision) | Medium | 28,298,364 | 108 MB |
| convnext_tiny | convnext_tiny (Torchvision) | Medium | 28,599,138 | 109 MB |
| densenet161 | densenet161 (Torchvision) | Medium | 28,691,010 | 110 MB |
| efficientnet_b5 | efficientnet_b5 (Torchvision) | Large | 30,399,794 | 116 MB |
| regnet_y_8gf | regnet_y_8gf (Torchvision) | Large | 39,391,482 | 150 MB |
| regnet_x_8gf | regnet_x_8gf (Torchvision) | Large | 39,582,658 | 151 MB |
| efficientnet_b6 | efficientnet_b6 (Torchvision) | Large | 43,050,714 | 165 MB |
| resnet101 | resnet101 (Torchvision) | Large | 44,559,170 | 170 MB |
| swin_s | swin_s (Torchvision) | Large | 49,616,268 | 189 MB |
| convnext_small | convnext_small (Torchvision) | Large | 50,233,698 | 191 MB |
| regnet_x_16gf | regnet_x_16gf (Torchvision) | Large | 54,288,546 | 207 MB |
| resnet152 | resnet152 (Torchvision) | Large | 60,202,818 | 230 MB |
| alexnet | alexnet (Torchvision) | Large | 61,110,850 | 233 MB |
| efficientnet_b7 | efficientnet_b7 (Torchvision) | Large | 66,357,970 | 254 MB |
| wide_resnet50_2 | wide_resnet50_2 (Torchvision) | Large | 68,893,250 | 263 MB |
| resnext101_64x4d | resnext101_64x4d (Torchvision) | Large | 83,465,282 | 319 MB |
| regnet_y_16gf | regnet_y_16gf (Torchvision) | Large | 83,600,150 | 319 MB |
| vit_b_16 | vit_b_16 (Torchvision) | Large | 86,577,666 | 330 MB |
| swin_b | swin_b (Torchvision) | Large | 87,778,234 | 335 MB |
| vit_b_32 | vit_b_32 (Torchvision) | Large | 88,234,242 | 336 MB |
| convnext_base | convnext_base (Torchvision) | Large | 88,601,474 | 337 MB |
| resnext101_32x8d | resnext101_32x8d (Torchvision) | Large | 88,801,346 | 339 MB |
| regnet_x_32gf | regnet_x_32gf (Torchvision) | Large | 107,821,570 | 411 MB |
| wide_resnet101_2 | wide_resnet101_2 (Torchvision) | Large | 126,896,706 | 484 MB |
| vgg11 | vgg11 (Torchvision) | Large | 132,873,346 | 506 MB |
| vgg11_bn | vgg11_bn (Torchvision) | Large | 132,878,850 | 506 MB |
| vgg13 | vgg13 (Torchvision) | Large | 133,057,858 | 507 MB |
| vgg13_bn | vgg13_bn (Torchvision) | Large | 133,063,746 | 507 MB |
| vgg16 | vgg16 (Torchvision) | Large | 138,367,554 | 527 MB |
| vgg16_bn | vgg16_bn (Torchvision) | Large | 138,376,002 | 527 MB |
| vgg19 | vgg19 (Torchvision) | Large | 143,677,250 | 548 MB |
| vgg19_bn | vgg19_bn (Torchvision) | Large | 143,688,258 | 548 MB |
| regnet_y_32gf | regnet_y_32gf (Torchvision) | Large | 145,056,780 | 553 MB |
| convnext_large | convnext_large (Torchvision) | Large | 197,777,346 | 754 MB |
| vit_l_16 | vit_l_16 (Torchvision) | Large | 304,336,642 | 1,160 MB |
| vit_l_32 | vit_l_32 (Torchvision) | Large | 306,545,410 | 1,169 MB |
| vit_h_14 | vit_h_14 (Torchvision) | Large | 632,055,810 | 2,411 MB |
| regnet_y_128gf | regnet_y_128gf (Torchvision) | Large | 644,822,904 | 2,461 MB |
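
The parameter counts above can be reproduced with torchvision, and the memory footprint is roughly the parameter count times 4 bytes (one 32-bit float per weight); for example, resnet50's 25,567,042 parameters work out to about 97 MiB. A quick sketch follows; exact counts may differ slightly from the table depending on the torchvision version and the number of output classes in the classification head.

```python
import torchvision.models as models

# Count trainable parameters and estimate the float32 checkpoint size.
model = models.resnet50(weights=None)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
size_mib = n_params * 4 / 2**20  # 4 bytes per float32 parameter

print(f"{n_params:,} trainable parameters, ~{size_mib:.0f} MiB of weights")
```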