Data Collection
How much training data do I need?
This is probably the most common question, and it is among the hardest to answer definitively. For some types of models, like general linear models (GLMs), which are fit (trained) by solving a matrix equation, one needs more data points than parameters in the model, and one needs those points to be sufficiently independent of each other. For example, if we have 2D data but have sampled all of our points from a single line (a 1D subspace), then the points are not sufficiently independent to fit these models. Without diving deeply into the math: if we have fewer data points than parameters, or if those data points are too interdependent, then solving the matrix equation requires inverting a matrix that is not invertible. The consequence is that there is no single best solution to the equation (no single best-fit model); instead, infinitely many models are considered equally good. The intuition we derive from this is:
Intuition: We want (relatively) independent examples in our data, and we want to have more examples than we have parameters in the model.
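To make the matrix intuition concrete, here is a minimal NumPy sketch (the data and numbers are illustrative stand-ins, not from any real dataset) showing that when our 2D points all lie on a single line, the matrix we would need to invert to solve the least-squares equation is singular:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 "2D" data points that all lie on a single line (a 1D subspace):
# the second feature is an exact multiple of the first.
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2.0 * x1])   # two perfectly dependent columns
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

# Solving the least-squares matrix equation requires inverting X^T X,
# but the dependent columns make it rank deficient (rank 1, not 2).
XtX = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(XtX))

try:
    np.linalg.inv(XtX)
except np.linalg.LinAlgError:
    # No unique inverse exists, so there is no single best-fit model;
    # infinitely many parameter vectors fit the data equally well.
    print("X^T X is singular and cannot be inverted")
```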
When we aren't fitting a general linear model (or another model whose training has a non-iterative, direct, provably optimal solution), the question becomes fuzzier. The intuition still (roughly) holds. However, especially when training neural networks, even questions like how many parameters your model has are not always easy to answer. Most modern convolutional neural networks have anywhere from millions to billions of parameters, and most datasets are nowhere near that large. People frequently do two things to try to mitigate this issue, both of which are addressed in more detail in the model training section. First, we typically use weight decay when training models. Weight decay penalizes large parameter values, pulling weights toward zero, so trained models tend to rely on fewer parameters when possible; this makes the effective number of parameters in a model (roughly, the number of meaningfully non-zero parameters) variable. The second tactic is data augmentation: artificially perturbing the data we already have to introduce additional (albeit only minimally independent) examples into the dataset.
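As a rough illustration of both tactics, the sketch below uses PyTorch and torchvision; the specific model and hyperparameter values are arbitrary assumptions for illustration, not recommendations:

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Weight decay: the optimizer adds a penalty on parameter magnitudes,
# pulling weights toward zero during training.
model = nn.Linear(32 * 32 * 3, 10)   # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Data augmentation: random perturbations applied to each image as it is
# loaded, producing extra (albeit highly correlated) training examples.
augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomRotation(degrees=10),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])
```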
An often overlooked aspect of data collection, when discussing data quantity, is data quality. As just described, we need enough examples to be able to train, but we also need those examples to be reasonably independent (also read: different). Data quality is most often a function of the methodology used to generate the data, and it is very easy to inadvertently introduce biases that make it easy for a model to learn spurious shortcuts. For example, consider an image dataset where we want a model to distinguish clothing types, such as pants versus shirts. Our data generation/collection strategy may be to hire people to be photographed wearing different clothing examples. If one person wears the pants in every photo and a different person wears the shirts in every photo, then a model could simply learn to distinguish the two faces and conclude: Face 1 = pants, Face 2 = shirt. The model may be extremely accurate on the dataset we curated, but it would fail miserably when applied to data "in the wild."
Can I recycle an old, but related, dataset?
Sometimes you might need to construct a model for a specialized task when one or more related, but perhaps more general, datasets already exist. Since collecting and annotating new data is time-consuming and labor-intensive, it would be nice if old datasets could be easily (and safely) recycled for new, or more specific, tasks. If this is your situation, there are a couple of things to consider. Most importantly: how similar is the dataset you have (or some subset of it) to the data on which you ultimately want to deploy your model? If you have black-and-white imagery taken from a grainy CCTV camera, but you want to apply your model to color imagery taken with a high-resolution phone camera, you might run into trouble (even if the labels you have, say recognizing a person's face, are precisely the kind of labels you need). If you determine that the data you have is similar enough to the data you want to deploy the model on, then it is usually safe to use that data for training. Even then, you will likely want to curate some validation/testing data from your target data source so that you can reasonably estimate your model's performance.
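Here is a minimal sketch of that last point, using scikit-learn and synthetic stand-in data (the datasets and the added feature shift are hypothetical placeholders for a real recycled dataset and real target-domain samples):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-ins: a large recycled ("old") dataset and a small
# sample from the deployment ("target") domain with shifted features.
X, y = make_classification(n_samples=2200, n_features=10, random_state=0)
X_old, y_old = X[:2000], y[:2000]
X_target, y_target = X[2000:] + 0.75, y[2000:]   # crude stand-in for domain shift

# Train on the recycled data...
model = LogisticRegression(max_iter=1000).fit(X_old, y_old)

# ...but estimate performance on target-domain data, since accuracy on
# the old dataset can overestimate how well the model transfers.
print("old-data accuracy:     ", accuracy_score(y_old, model.predict(X_old)))
print("target-domain accuracy:", accuracy_score(y_target, model.predict(X_target)))
```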
Recognizing, and reacting to, extreme imbalances in a dataset
There are a variety of reasons a dataset may be imbalanced, and both the impact of an imbalance and the best way to react to it depend greatly on its source.
An imbalance in a dataset is the result of a lack of representation of some data. That lack of representation might occur because the event (or whatever the data is recording) is rare in the wild compared to other events. For example, if your dataset consists of traffic camera footage, and the annotations include vehicle descriptions (e.g., passenger car, pickup truck, police car, etc.), then the annotation for ice-cream truck may be rare compared to passenger cars because there are far fewer ice-cream trucks driving on the road, and this is likely true regardless of your location around the world. This kind of imbalance is representative of the real world.
The other source of imbalance in datasets is bias in the data collection process. For example, suppose you are training a model to recognize faces, and you curate a dataset of faces from yearbook/class photos from all Ivy League schools since 1900. Because many of these schools did not admit women until the 1960s–80s, the faces (especially in the older yearbooks) will be overwhelmingly white and male; in other words, there will be a heavy imbalance in the data, with relatively few female faces or faces of color. This dataset is considered biased because the imbalance isn't representative of the population as a whole: neither white faces nor male faces are a majority worldwide.
Are imbalances always bad?
Not necessarily. If you do nothing to mitigate an imbalance, models trained on that dataset will typically perform poorly on examples from the underrepresented class(es). If poor performance on an underrepresented class is acceptable, then the imbalance isn't a problem. It is incumbent upon the data scientist (or ML practitioner) to understand the impacts of using a model that is known to perform poorly on these underrepresented classes and to determine whether or not that poor performance is acceptable (or legal, or ethical, etc.). As an example, if your goal is to use a model to count passenger cars, and the model usually misidentifies the very rare ice-cream truck, then the effect on your passenger-car count is minimal (the count is only off in the rare instance that an ice-cream truck is seen and mistaken for a passenger car). For the facial recognition example, the bias in the dataset could be a problem if the goal of the model is to recognize faces all over the world (or, for example, if you are using facial recognition to unlock a phone); the model will likely perform poorly on the majority of the world's population. If, however, your goal is to recognize attendees at an alumni gathering, then the model may perform reasonably well.
What should I do if there is an imbalance?
If the imbalance isn't likely to cause an issue (see: "Are imbalances always bad?"), then you probably don't need to do anything. However, if the imbalance is a problem, then there are (at least) three things you should consider doing.
- Expand the dataset to better represent the (previously) underrepresented examples. If possible (examples exist that could be included; budgets, both time and money, allow for expanding the dataset; etc.), this is the best option. If the imbalance was a result of flawed sampling, that flaw should be fixed and the data resampled; do not repeat the mistakes that were previously made. If the imbalance was the result of the fact that some things are simply rare (though important for your application), then you may specifically seek out examples of the rare thing to add. In some cases, it is hard (or impossible) to find real examples of the classes you are most concerned about (extreme rarity); in those cases, synthetic data may be able to help.
- Oversample underrepresented classes (see the sketch after this list). There are many methods to accomplish this, but, essentially, each expands your dataset by adding duplicates (or near duplicates) of the underrepresented data. While these methods can help, they do not increase diversity within the underrepresented class, and that in itself can cause problems. So, this method is worth trying, though you shouldn't expect it to solve every problem.
- Weight loss functions so that incorrect predictions are penalized more heavily when made on the underrepresented class (also sketched after this list). As with oversampling, this method cannot solve every problem: it doesn't increase diversity within the underrepresented class, and weighting loss functions in this manner mostly only works on classification tasks; it's hard to apply this kind of technique to other tasks.
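To make the last two options concrete, here is a minimal scikit-learn sketch on a synthetic imbalanced dataset (the class counts and the choice of model are illustrative assumptions, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic imbalanced dataset: 1,000 majority examples, 20 minority.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (1000, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 1000 + [1] * 20)

# Oversampling: duplicate minority examples (sampled with replacement)
# until the classes are balanced. Note this adds no new diversity.
X_min_over, y_min_over = resample(X[y == 1], y[y == 1],
                                  n_samples=1000, random_state=0)
X_bal = np.vstack([X[y == 0], X_min_over])
y_bal = np.concatenate([y[y == 0], y_min_over])
clf_oversampled = LogisticRegression().fit(X_bal, y_bal)

# Loss weighting: class_weight="balanced" scales each class's loss
# contribution inversely to its frequency, so minority-class mistakes
# cost more during training.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)
```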