Appendix: Gradient Clipping
What problem can gradient clipping help solve?
Gradient clipping alleviates the problem of model over-correction, preventing dramatic downward spirals in model performance during training.
Each training step includes computing a gradient. The gradient vector provides both a direction for the model update and, through its magnitude, a size for the update. Occasionally, gradients may be unexpectedly large. Any number of things can cause an explosion in the gradients, including peculiarities in the data (e.g., incorrect labels, significant differences in image histograms), unlucky model initialization (the initial random weights happen to be very bad), artifacts introduced by data augmentation, and many others.
When a batch of data produces a large gradient, a correspondingly large update to the model typically occurs. Updates to a model during training should be viewed as iterative corrections for mistakes the model makes on its training data. Large model updates are therefore large model corrections; unfortunately, large model updates are frequently over-corrections and can leave the model in a worse state than it was before the update. In some cases, over-correction becomes a rapid downward spiral in model quality. These rapid downward spirals can be observed in a model’s metrics, usually the training loss.
Figure 1: A high-level view of a loss spike, as frequently observed during object detection training.
Figure 1 contains sample metrics from an object detection training run that experienced a rapid downward spiral, manifested as a spike in the training loss. Notably, even though the loss eventually came back down significantly, the model never recovered to its pre-spike levels. While the spike appears sudden and massive, zooming in around its start shows the escalation happening over several batches.
Figure 2: A closer look at the same loss spike as in Figure 1. The first spike is small but quickly grows as iterative over-correction occurs.
Notice in Figure 2, which shows the same training run as Figure 1 but zoomed in on the start of the spike, that the first spike in the loss is modest and the model momentarily recovers. The model then experiences a sequence of progressively worse over-corrections, which produce the full spike.
What is gradient clipping?
Gradient clipping is a method that preserves the direction of a gradient while capping the maximum impact a single gradient update can have on a model.
As previously mentioned, a gradient is a vector quantity, so it contains both a direction for the model update and a magnitude of the update. The gradient points in the direction of the fastest improvement (on the batch of data for which the gradient was computed). Gradient clipping alters a gradient with a large magnitude by preserving the direction and reducing the magnitude to a user-specified threshold.
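As a concrete illustration, the clipping rule fits in a few lines. Below is a minimal NumPy sketch (the function name and the threshold value are our own, for illustration only): if the gradient's norm exceeds the threshold, the gradient is rescaled so that its norm equals the threshold; otherwise it passes through unchanged.

```python
import numpy as np

def clip_gradient(grad: np.ndarray, threshold: float) -> np.ndarray:
    """Rescale `grad` so its L2 norm is at most `threshold`, keeping its direction."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        # Multiply by threshold / norm: the direction is preserved,
        # but the magnitude is reduced to exactly `threshold`.
        return grad * (threshold / norm)
    return grad  # small gradients pass through untouched (as in Figure 3)

# Example: a gradient with norm 6 clipped down to norm 4 (as in Figure 4).
g = np.array([3.6, 4.8])                                # ||g|| = 6.0
print(np.linalg.norm(clip_gradient(g, threshold=4.0)))  # -> 4.0
```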
Figure 3: Example gradient whose magnitude is smaller than the clipping threshold. Since the magnitude of the raw gradient is smaller than the threshold, no change (clip) is made.
In the example gradient in Figure 3, the vector (arrow) points in the direction to adjust the model for fastest improvement. The length of the vector is its magnitude. In this figure, the length of the vector is smaller than the threshold (the radius of the drawn circle), so no clipping (magnitude adjustment) is applied.
Figure 4: Example gradient whose magnitude is larger than the clipping threshold. Since the magnitude of the raw gradient is larger than the threshold, the gradient is clipped to the threshold, preserving its direction.
In Figure 4 we see another example of a gradient. This time, however, the magnitude of the raw gradient is larger than the clipping threshold. To clip this gradient, we preserve the direction but reduce the length to our limit (the threshold). Large gradients like this raw gradient frequently lead to over-correction and sometimes result in the runaway effect seen in Figure 1. By preserving the direction of the gradient, we still make an incremental improvement to the model; by capping the maximum size of the update, we dramatically reduce the likelihood of a catastrophic downward spiral.
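In practice, clipping is usually applied to the global norm of all parameter gradients at once, just before the optimizer step. The sketch below shows how that might look in a PyTorch training loop using torch.nn.utils.clip_grad_norm_; the model, data loader, loss function, and threshold of 4.0 are placeholders for illustration, not part of our training code.

```python
import torch

def train_one_epoch(model, loader, optimizer, loss_fn, clip_threshold=4.0):
    model.train()
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()  # gradients are computed here

        # Clip the *global* gradient norm before the optimizer step:
        # if the combined norm of all parameter gradients exceeds
        # `clip_threshold`, every gradient is scaled down proportionally,
        # preserving the update direction but capping its size.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_threshold)

        optimizer.step()
```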
How large (or small?) should the gradient clipping threshold be?
It depends. Unfortunately, there’s no “single right answer,” and the reasonable ranges for these values vary with the size of the model you are training.
As with many aspects of model training, it is worthwhile to experiment with clipping threshold values.
Figure 5: A plot of the magnitude of the gradients during the same training run as Figures 1 and 2.
In Figure 5 we continue our examination of the training run that experienced the loss spike, again zoomed in on the region where the spike began. Note that the first perceptible spike in the training loss (Figure 2) happens around step 2675 (roughly halfway between steps 2650 and 2700). In Figure 5, however, we see six or seven spikes in the magnitude of the gradient before the first over-correction occurred. When choosing a good threshold, we first need to understand what is normal. In this case, the normal gradients have magnitudes slightly under 2.5. Since training was progressing well at this point, we likely don't want a clipping threshold that small. The first spike has a magnitude of approximately six, which we may want to clip, since it may have silently started the ball rolling toward the real problems later. So, for this model, we likely want a threshold between three and five, which informs the default (placeholder) value of four in the training wizard.
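One practical way to arrive at a number like this is to log the gradient norm for a while and then pick a threshold comfortably above the typical values. A minimal sketch, assuming a PyTorch model whose gradients have already been computed by a backward pass:

```python
import torch

def global_grad_norm(model) -> float:
    """Return the L2 norm of the full gradient vector across all parameters."""
    squared_sum = 0.0
    for p in model.parameters():
        if p.grad is not None:
            squared_sum += p.grad.detach().pow(2).sum().item()
    return squared_sum ** 0.5

# Inside the training loop, after loss.backward():
#   history.append(global_grad_norm(model))
# Plot `history`, as in Figure 5, and choose a threshold somewhat above
# the "normal" magnitudes (here, slightly under 2.5).
```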
It is worth noting that in Figures 1, 2, and 5 we are training a Faster R-CNN detection model with a ResNet-18 backbone (FasterRCNNResnet18 in the architecture drop-down). That model has approximately 28 million parameters. Larger models will have, on average, larger gradient magnitudes because there are more components in the gradient vector.
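To see why, note that the L2 norm of a vector with many components of similar size grows roughly with the square root of the number of components. A quick illustration, using random vectors as stand-ins for gradients (the component scale of 1e-3 is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
for n_params in (1_000_000, 28_000_000):
    # Pretend each gradient component is drawn from the same distribution;
    # the overall norm then scales like sqrt(n_params).
    g = rng.normal(scale=1e-3, size=n_params)
    print(f"{n_params:>10,d} components -> norm ~ {np.linalg.norm(g):.2f}")
```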
When choosing a threshold, one of three things could happen. First, the threshold could be in an ideal range; in this case, model training happens normally, any loss spiking is averted, and the result is a good model. Second, the threshold could be too large; in this case, the model is susceptible to loss spikes. You might get lucky and not experience any issues (our limited testing saw issues in approximately one in three training runs). Or, you might see a spike and need to restart training (with a smaller threshold). Depending on how good your model was before the spike, you could either start over (from scratch) or start from the checkpoint closest to—but preceding—the spike. Finally, the third option is that you choose a threshold that is too small. In this case, you will not see loss spikes, and the model may turn out to be good. However, training will be very slow. While gradient clipping doesn’t make a significant difference in the per-batch training speed, a very tight threshold will slow down the model’s convergence, requiring many more training steps to arrive at a good model.