Model Training

In this section we discuss what is happening behind the scenes while a model is being trained and dive into (some) details about how neural networks specifically are trained. We also share a concise dictionary of common terms associated with neural network training.

What is model training?

All machine learning models have some parameters which are learned. For example, finding a best-fit line is a (simple) form of machine learning. The model has two parameters: a slope (m) and a y-intercept (b). The generic model for a best-fit line is:

f(x) = mx + b

Training a model simply means learning/tuning/setting/finding/choosing specific values for the learnable parameters. That process could be achieved in any number of ways.
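To make this concrete, here is a minimal sketch of "training" this two-parameter model with an off-the-shelf least-squares fit in Python. The data points are invented purely for illustration:

```python
import numpy as np

# Invented data that roughly follows y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# "Training" here is a least-squares fit of a degree-1 polynomial,
# which finds the two learnable parameters: slope m and intercept b.
m, b = np.polyfit(x, y, deg=1)
print(f"learned parameters: m={m:.3f}, b={b:.3f}")
```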

How are neural networks trained?

Training a neural network is an iterative process of trying to minimize a loss function. Loss functions are a way of measuring how bad a model is; by reducing the loss, one (hopefully) reduces the badness of the model. A useful analogy to keep in mind throughout this discussion is a ball rolling down a hill: the height (altitude) of the ball corresponds to the loss value, and the goal is to get the ball as far down the hill as possible.

The iterative process of training a neural network proceeds as follows:

  1. Pick random values for the (learnable) parameters to start from

  2. Collect a sample of training data (a batch, or mini-batch)

  3. Compute the gradient of the loss function with respect to the model's parameters, using the small training sample

  4. Take a small step (adjust the model's parameters) in the (opposite) direction of the gradient

  5. Repeat (many, many, many, many times) steps 2-4.
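
In code, steps 2-4 map directly onto the standard PyTorch training loop. The minimal sketch below uses a toy model and synthetic data purely for illustration:

```python
import torch
from torch import nn

# Step 1: PyTorch initializes the learnable parameters at random.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for step in range(1000):              # step 5: repeat many times
    # Step 2: collect a (mini-)batch; synthetic tensors stand in for real data.
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                   # step 3: compute the gradient on this batch
    optimizer.step()                  # step 4: small step opposite the gradient
```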

Returning to the analogy of the ball rolling down the hill, the process is to place the ball somewhere on the hill (at random); examine the hill where the ball sits to see which direction has the steepest descent; move the ball a small distance in that direction; then repeat. The process stops when the loss value (the altitude of the ball) stops trending downward.

Note: The process of training a neural network is random for (at least) two reasons. First, the starting weights (the starting location of the ball) are chosen at random. Second, the order in which the training data is sampled is random, and those samples are used to determine the direction of "steepest descent." Altering the samples at any point in training can (and usually does) alter, to some degree, the trajectory of training. Both the random starting location and the random order of sampling data affect the final state of the trained model.

Neural network training dictionary

In this section we define a number of terms which are reasonably standardized in the ML/DL community but may be unfamiliar to new practitioners. Along with each term, we provide suggestions for good places to start. These suggestions are usually the default values that Chariot supplies when creating new training runs.

Batch size

The batch size is the number of examples (data points, images, etc.) to use for each model update step; think of this as the number of examples processed at a time. The batch (also sometimes called a minibatch) is used to determine (approximate) the direction of steepest descent.

Recommendations: How big should the batch size be?

If you are training on a single GPU (or a small cluster of GPUs), it is frequently advantageous to use the largest batch size for which the model (and its intermediate computations) fits in memory on the GPU. If the batch size is too large, CUDA out-of-memory errors will occur. Using as large a batch as possible is usually more computationally efficient (faster); however, when training in a regime where the batch size can be very large (e.g. flirting with 10k or larger), there are known issues with convergence (model training tends to fail badly). This is almost never an issue when training on a single GPU. The choice of model will dictate the largest possible batch size for the training hardware you have: larger models necessitate smaller batch sizes because they occupy more memory on the GPU.
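
In PyTorch, the batch size is typically set on the data loader. Here is a minimal sketch, with made-up tensor shapes standing in for a real dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fake image-like data: 1000 examples of shape 3x224x224 with integer labels.
dataset = TensorDataset(torch.randn(1000, 3, 224, 224),
                        torch.randint(0, 10, (1000,)))

# Try the largest batch_size that fits on your GPU; if you hit CUDA
# out-of-memory errors, reduce it (e.g. halve it) and try again.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 3, 224, 224])
```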

Optimizer

An optimizer is the algorithm used to (try to) train your (neural network) model by minimizing a loss function. Essentially, an optimizer is just a formula for how to use the current gradient (direction of steepest descent) to compute a small update to your model. Depending on the choice of optimizer, this formula may include information about previously seen gradients or the direction(s) you have recently moved (e.g. to track things like momentum, which works exactly as you would expect if you recall the ball-rolling-down-a-hill analogy).

Chariot provides easy access to most optimizers available in PyTorch. For the curious, you can read in-depth discussions of the various algorithms here. The choice of optimizer can have a large impact on whether your model trains effectively, and there are many optimizers to choose from.

Recommendations: Which optimizer should I choose?

A good first choice is Adam. There is an entire body of research variously claiming one optimizer is better than another; most of these claims should be taken with a grain of salt.
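
For reference, constructing Adam in plain PyTorch looks like the following; the model here is just a stand-in, and lr=0.001 (PyTorch's default) matches the learning-rate recommendation later in this section:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)  # any nn.Module works here

# Adam with PyTorch's default settings; lr=0.001 is also PyTorch's default.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```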

Data Augmentations

Data augmentation is a term for a collection of transformations that can be applied to data to increase the variety of data in a dataset. For example, resizing (to something large-ish, possibly a larger size chosen at random) and then cropping (choosing where to crop at random) to a pre-determined size (e.g. 256x256 pixels) is a commonly used data augmentation. Other types of augmentations include flipping an image (horizontally or vertically) or randomly (usually subtly) perturbing the colors of some or all of the pixels.
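
As an illustrative sketch, the resize-then-random-crop pipeline described above might look like this with torchvision's transforms; the sizes and jitter amounts are arbitrary choices, not recommendations:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize(288),                 # resize the short side to 288 px
    transforms.RandomCrop(256),             # crop a random 256x256 window
    transforms.RandomHorizontalFlip(),      # flip left-right half the time
    transforms.ColorJitter(brightness=0.1,  # subtly perturb colors
                           contrast=0.1),
])

# augmented = augment(image)  # `image` would be a PIL image (or tensor)
```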

Recommendations: Should I use (random) data augmentations?

It depends a lot on your dataset and what you are trying to learn. If random augmentations make your labels wrong, then you probably want to avoid them. For example, if your model is learning to detect the color of vehicles but a random augmentation alters the vehicle's color, then that augmentation breaks your labels (annotations) and using it will be problematic. You should know which augmentations might be applied and whether or not they break your annotations before you elect to use them. With that said, data augmentations usually improve model performance, especially when your dataset is small, provided the augmentations don't ruin your annotations.

Learning rate (abbrev. lr)

The learning rate influences the size of a step taken after approximating the direction of steepest descent. If the learning rate is very large, the updates to the model at each step can also be quite large and this can result in erratic behavior (e.g. enormous swings in training loss) while training. On the other hand, if the learning rate is very small then the updates are also (usually) very small and the model may only change very slowly (requiring many more steps to converge to a well-trained model).

Recommendations: How big should the learning rate be?

The default value for the learning rate is usually 0.001, and that is frequently a good starting point. If you observe that training loss is not, on average, improving over a few thousand training batches, it may be advisable to increase the learning rate a bit (e.g. to 0.0015 or 0.002). If you do increase the learning rate, you typically will not want to train for too long at the higher learning rate before reducing it back to around 0.001. If you have been training for a while and the model appears to have reached a plateau in performance, you can often (but not always) get a modest bump in performance by halving the learning rate and resuming training for a (short) period of time.

If you want to train with a non-constant learning rate using Chariot, you have two options. If you are comfortable creating your own JSON training config (using Advanced mode), you can set up a learning rate scheduler and specify which learning rates to use when; see the Teddy documentation for options/parameters/etc. You can also alter learning rates manually (using Simple mode) by training and snapshotting a model. You can then export your favorite snapshot from the first phase and create a second training run where you start from the previously exported model. This can be repeated as many times as desired.
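
For reference, the "halve the learning rate at a plateau" advice above can also be expressed as a scheduler in plain PyTorch. This is an illustrative sketch, not a Chariot training config:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Halve the learning rate (factor=0.5) when the monitored loss has not
# improved for 5 consecutive validation passes (patience=5).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=5)

# Inside the training loop, after computing a validation loss:
# scheduler.step(val_loss)
```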

Weight decay

Weight decay is a method to encourage models to be (effectively) smaller. Including weight decay in the training process penalizes the model in proportion to the size of each weight (learned parameter), nudging weights toward zero. The amount of weight decay controls the tradeoff: larger weight decays convey to the training process that being smaller is a higher priority than being accurate, while very small weight decays (or omitting weight decay entirely) indicate that you strongly prefer accuracy (or other goodness metrics) over reducing the effective model size.

Recommendations: Should I use weight decay? and how much?

First, if you have reason to want/need a small model, then you should strongly consider using weight decay. You may want/need a smaller model particularly when your dataset is small (many fewer data examples than learnable parameters in the model). Weight decay does not need to be very large in order to be effective; increasing the weight decay from 0 to even 0.001 can have a dramatic impact on model training.
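
In plain PyTorch, weight decay is a single optimizer argument. The sketch below uses the 0.001 value mentioned above; the model is a stand-in:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)

# weight_decay=0.001 adds a penalty proportional to each weight's size,
# nudging the learned parameters toward zero.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.001)
```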

Momentum

Momentum can be used by a variety of optimizers as a means to include recent trends when making an update. Think of momentum in the context of the ball rolling down the hill: if the ball is rolling (quickly) in one direction, then small (local) bumps only alter its direction a bit; if the bumps are large enough and/or common enough, they can redirect the ball's momentum. The purpose of using momentum in an optimizer is to avoid getting trapped in a local minimum (a small dip on the hill). The greater the momentum, the less likely you are to get trapped in a small local minimum. However, when the momentum is very high, it is easy to fly past optimal locations (the ball keeps rolling at the bottom of the hill and perhaps rolls up the next hill).

Recommendations: Should I use momentum? and how much?

In most cases, yes; you should use momentum. Amongst the myriad papers comparing optimizers, one common conclusion is that methods using momentum pretty consistently beat those without it (e.g. stochastic gradient descent (SGD) without momentum vs. any other optimizer with it). How much is harder to answer, but the default values provided by PyTorch (and wrapped in Chariot) tend to be very good starting points. The thing to keep in mind is that low momentum emphasizes (magnifies) the impact of the current batch you are training on (giving very little weight to past batches), while higher momentum values emphasize the collection of recent batches over the current batch.
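
As a point of comparison, SGD in PyTorch defaults to momentum=0, and enabling it is a single argument; 0.9 is a commonly used value, shown here purely as an illustration:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)

# Plain SGD has no momentum by default (momentum=0); setting momentum=0.9
# blends the current gradient with the direction of recent updates.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```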