Annotation Best Practices
Accurate data annotations are critical to building reliable machine learning (ML) models with supervised training. Inaccurate annotations degrade model performance and lead to misleading evaluation metrics.
This guide provides best practices for creating accurate and consistent annotations as efficiently as possible.
Different types of ML models require different types of annotations. For best practices specific to sub-datum annotations, see the Extra Guidance for Sub-Datum Annotations section of this guide.
Create an Annotation Guide
Before you start annotating, create an annotation guide tailored to the scope and complexity of the annotation task. This will help ensure alignment and consistency across your team on the classes you’ll be labeling, how to treat partially obscured objects, and other important considerations. Spending time on this guide up front will set you and your fellow annotators up to make good decisions later.
Your annotation guide should include the sections outlined below.
Description of the Use Case
Include a detailed description of the use case for the ML model. This will help inform decisions about the target level of accuracy and how to handle edge cases and ambiguities.
Be sure to include details about:
- Whether the model’s predictions will be used by a human or an automated system
- How the ML model’s predictions will contribute value to the customer
- The relative costs of false positive and false negative predictions made by the model
Description of the Data
Include a description of the data to be annotated: the type of data, how it was collected, and relevant details like image resolution.
Defined Labels
List the labels that you will be using for your annotations, and include detailed definitions and examples for each label. This information should mostly come from the use case and customer requirements.
Label definitions should make it clear how to determine which label is correct. Label names should be helpful shorthand for the definitions.
Use examples to show the expected variation in the data for each label. Examples can be expanded upon as annotators find new variations in the data.
If the annotation tool doesn’t support a way to flag invalid or unusable datums, include a label for that scenario.
Ideally, labels should be mutually exclusive, but in some cases, hierarchical labels may be necessary. When hierarchical labels are used, be sure that the structure is clear in the annotation guide. An example of hierarchical labels for classifying images of ships would be a “Military” label with sub-labels “Aircraft Carrier,” “Submarine,” and “Surface Combatant.” The “Military” label would be used for any military ship that does not fit one of the sub-labels, like support ships and personnel carriers. The “Surface Combatant” label could also have sub-labels such as “Destroyer,” “Frigate,” and “Corvette.”
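If your workflow processes labels programmatically, a nested mapping can make the hierarchy explicit. A minimal Python sketch (the structure and helper below are illustrative, not part of any annotation tool):

```python
# Illustrative label hierarchy for classifying ships; a parent label is
# used whenever no child label applies (e.g., military support ships).
LABEL_HIERARCHY = {
    "Military": {
        "Aircraft Carrier": {},
        "Submarine": {},
        "Surface Combatant": {
            "Destroyer": {},
            "Frigate": {},
            "Corvette": {},
        },
    },
}

def ancestors(label, tree=LABEL_HIERARCHY, path=()):
    """Return the chain of parent labels above `label` (outermost first),
    or None if the label is not in the hierarchy."""
    for name, children in tree.items():
        if name == label:
            return path
        found = ancestors(label, children, path + (name,))
        if found is not None:
            return found
    return None

ancestors("Destroyer")  # ("Military", "Surface Combatant")
```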
In some cases, finer-grained labels than what the use case calls for can be useful. For example, if the task involves detecting cats in images, and specific labels like “house cat,” “lynx,” “tiger,” and “lion” are used during annotation, these labels can be automatically changed to “cat” for model training. Then, model performance metrics can be computed for the more specific categories to see which types of cats the model does not perform well on. Ultimately, consider what might be valuable for the use case beyond the immediate prediction problem.
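A relabeling step like this is straightforward to script. A minimal sketch, assuming annotations are available as Python dictionaries with a "label" field (the field names are assumptions, not a specific platform's format):

```python
# Map fine-grained annotation labels onto the coarser label the use case
# needs; keep the fine label so per-category metrics can still be computed.
FINE_TO_COARSE = {
    "house cat": "cat",
    "lynx": "cat",
    "tiger": "cat",
    "lion": "cat",
}

def coarsen(annotation: dict) -> dict:
    """Return a copy of the annotation relabeled for model training."""
    out = dict(annotation)
    out["fine_label"] = annotation["label"]  # preserved for later analysis
    out["label"] = FINE_TO_COARSE.get(annotation["label"], annotation["label"])
    return out

train_ready = [coarsen(a) for a in [{"label": "lynx"}, {"label": "tiger"}]]
# [{'label': 'cat', 'fine_label': 'lynx'}, {'label': 'cat', 'fine_label': 'tiger'}]
```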
Note that changing the label set during annotation can be costly, because existing annotations might need to be revisited. But sometimes it’s necessary, particularly if there are unanticipated cases in the data that the labels don’t cover.
Guidance on Edge Cases
Include guidance on how annotators should handle edge cases, ambiguities, and unusable datums, along with examples.
First, describe some expected cases in which the data might not be sufficient to be certain about the label. These can arise from natural characteristics of the data (e.g., occluded objects in object detection) or from low data quality (e.g., low-resolution or corrupted images, or misspellings in text).
Then describe what annotators should do in these situations, using the details of the use case to inform your decisions. Some options are:
- Add metadata to the datum to indicate that the correct annotation is not known.
- Use a special label, if the annotation platform does not support metadata.
- Choose a particular label from the label set, using your best judgment.
- Err on the side of a particular label, if the ambiguity is between two or more specific labels.
Decide how your team will discuss and track these cases. Some options are:
- Discuss them in the chat channel
- Track and record them in a shared document
Note that these datums may need to be excluded from training, validation, and testing splits.
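As an illustration of the metadata option and the split exclusion above, assuming annotations are exported as Python records (the "label_uncertain" key is hypothetical, not a fixed schema):

```python
# A datum whose correct label could not be determined gets a metadata flag.
datum = {
    "id": "img_0042",
    "label": "cat",  # best-judgment label
    "metadata": {"label_uncertain": True},  # hypothetical metadata key
}

def usable_for_splits(d: dict) -> bool:
    """Exclude flagged datums from training, validation, and test splits."""
    return not d.get("metadata", {}).get("label_uncertain", False)

clean = [d for d in [datum] if usable_for_splits(d)]  # [] -- datum excluded
```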
Accuracy Targets
Include the target level of accuracy for the annotations. This target should balance the costs of incorrect model predictions against the annotation resources available.
Perfect accuracy is often not necessary for creating ML models that deliver value. In fact, perfect accuracy in annotations can be extremely costly, and practices for achieving this require multiple experts independently annotating the entire dataset.[1] Instead, strive for consistency.
Quality Control Plan
Determine how accuracy will be measured when initial annotations are completed. Some options are:
- Review a random sample of the annotations and record any errors.
- Annotate a random sample of the data again from scratch, and compare your results to the existing annotations.
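For the second option, agreement between the original annotations and the independent re-annotation pass gives an accuracy estimate and shows which label pairs are being confused. A minimal sketch, assuming classification labels keyed by datum ID:

```python
from collections import Counter

def annotation_agreement(original: dict, reannotated: dict):
    """Compare two independent passes over the same sample, keyed by datum
    ID; return the agreement rate and a tally of disagreeing label pairs."""
    shared = original.keys() & reannotated.keys()
    disagreements = Counter(
        (original[i], reannotated[i]) for i in shared if original[i] != reannotated[i]
    )
    rate = 1 - sum(disagreements.values()) / len(shared)
    return rate, disagreements

rate, errors = annotation_agreement(
    {"d1": "cat", "d2": "dog", "d3": "cat"},
    {"d1": "cat", "d2": "cat", "d3": "cat"},
)
# rate == 2/3, errors == Counter({('dog', 'cat'): 1})
```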
If the measured accuracy doesn’t meet your targets, locate and correct the errors. Options for locating errors include:
- Using errors discovered while measuring accuracy
- Reviewing a random sample of datums not included in the accuracy measurement
- Training a model with the annotated data, running predictions with that model on the annotated data, and reviewing the mistakes made in the predictions, which are likely to be annotation errors
Specify whether the final decision on corrections will be made by a single expert or a committee of annotators.
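As a sketch of the last error-location option above, confident model predictions that disagree with the annotation are strong candidates for review (the record fields here are assumptions, not a specific platform's output format):

```python
def likely_label_errors(records: list, min_confidence: float = 0.9) -> list:
    """Surface datums where a confident model prediction contradicts the
    annotation; each record is assumed to carry 'label', 'predicted', and
    'confidence' fields. Review the most confident disagreements first."""
    suspects = [
        r for r in records
        if r["predicted"] != r["label"] and r["confidence"] >= min_confidence
    ]
    return sorted(suspects, key=lambda r: r["confidence"], reverse=True)
```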
Description of the Annotation Tool
Include a description of the annotation tool you will be using. Include links to the tool’s user guide and any other relevant resources.
If existing ML models will be used to provide annotation hints, specify which models.
Datum Metadata
Include keys and values to use for metadata associated with datums and annotations, if the annotation platform supports them. Metadata can provide valuable contextual information to help understand dataset characteristics and enable detailed performance analysis.
Examples of datum metadata for imagery are location information, image quality, and lighting conditions.
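For example, datum metadata for overhead imagery might be recorded as simple key-value pairs like these (the keys and values are illustrative, not a fixed schema):

```python
datum_metadata = {
    "location": "34.05N,118.25W",  # where the image was collected
    "image_quality": "good",       # e.g., good / blurry / corrupted
    "lighting": "overcast",        # e.g., daylight / overcast / night
}
```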
Supplemental Information
If necessary and available, provide links to supplemental sources of information. This can include:
- For overhead (satellite or aerial) imagery data, higher-resolution images of the area, ideally captured around the same time, that show more detail
- For imagery data, other images of the broader area that show context or images from different perspectives that provide more information
- Reference material about the items of interest, such as technical documents that describe terms in scientific text data or Wikipedia pages about military assets to be detected in imagery
Extra Guidance for Sub-Datum Annotations
Some annotation tasks involve assigning a label to an entire datum, like the type of animal an image contains or the topic of a document. Other annotation tasks involve identifying subsets of datums and assigning labels to those, such as drawing labeled regions around objects of interest in images (object detection and image segmentation) or identifying sentences in a document that express opinions. For these other types of tasks that involve sub-datum annotation, there can also be ambiguity about whether and where an item of interest is present in a datum. Give clear guidance about what to do in these situations, informed by the details of the use case.
For object detection and image segmentation, clearly describe when to annotate partially visible objects (whether they are occluded by another object or on the edge of the image). For example, your guidance might be to only annotate partial objects that are at least 50% visible, or it might be to annotate partial objects only when you are unambiguously certain of the object’s correct label.
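If you adopt a visibility threshold, defining it precisely helps annotators apply it consistently. As an illustration, the visible fraction of an edge-truncated object can be estimated by comparing the annotator's estimate of the object's full extent with the portion inside the image (a hypothetical helper, not part of any annotation tool):

```python
def fraction_inside_image(box, width, height):
    """Fraction of a bounding box (x1, y1, x2, y2) that lies within the
    image bounds; the box may extend past the edge to estimate the full
    extent of a truncated object."""
    x1, y1, x2, y2 = box
    inter_w = max(0.0, min(x2, width) - max(x1, 0.0))
    inter_h = max(0.0, min(y2, height) - max(y1, 0.0))
    area = (x2 - x1) * (y2 - y1)
    return (inter_w * inter_h) / area if area > 0 else 0.0

# Annotate an edge-truncated object only if at least 50% is in view.
annotate = fraction_inside_image((-20, 10, 80, 110), 640, 480) >= 0.5  # True (80%)
```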
If the annotation platform supports it, include a “None” label to indicate that there are no items of interest in the datum to be annotated. This allows these datums to be distinguished from datums that have not yet been annotated, which should be excluded from model training and evaluation. Some ML platforms assume that datums without annotations contain no items of interest. Chariot uses a negative sample label to indicate a datum has been reviewed and contains no items of interest.
Use annotation metadata, if your annotation tool supports them, to record when the annotated data contains characteristics of interest. For example, metadata can be used in image data annotations to indicate objects that are partially visible or affected by image corruption. These can provide valuable information about how well models perform under different conditions in the data.
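For instance, an object-detection annotation might carry metadata flags like these (the keys are illustrative):

```python
annotation = {
    "label": "Frigate",
    "bbox": [412, 238, 506, 301],  # x1, y1, x2, y2 in pixels
    "metadata": {
        "occluded": True,           # partially hidden by another object
        "truncated": False,         # not cut off by the image edge
        "image_corruption": False,
    },
}
```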
Best Practices Checklist
Before Annotating
- Read the annotation guide.
  - Ask for clarification if something is unclear.
- Get to know your annotation tool.
  - Read the user guide.
  - Understand the pitfalls and common mistakes.
  - Be aware of any limitations the tool has on your ability to go back and change annotations.
- Familiarize yourself with the data.
  - Look at several datums, including those that have already been annotated, if available.
  - Look at examples of edge cases, if available, and note any ambiguities in the data.
- Know where to go for help, such as an expert on the annotation task, the project lead, or other annotators.
- Set up an ergonomic workspace. Make sure that it’s comfortable and as distraction-free as possible.
While Annotating
- Efficiency is important, but don’t go too fast. Check that your annotations are correct before moving on to the next datum.
- Use the annotation tool’s shortcuts and hotkeys for efficiency.
- If you’re using an existing ML model to provide annotation hints, be sure that you are using the latest and best model. The annotation guide should indicate which model to use.
- Base your annotations only on the information in the datum; supplemental information referenced by the annotation guide should only be used to clarify what is present in each datum. For example, when working with images from full-motion video, don’t let your knowledge of previous or subsequent frames influence your annotations. Use only what is visible within the image to determine which label to use and, for object detection and image segmentation tasks, whether objects are present.
- Raise ambiguous or difficult examples that are not covered by the annotation guide.
- Record these cases and any decisions about how to handle them in the agreed-upon location (annotation guide, chat channel, other shared document, etc.).
- Be mindful of your energy level and ability to focus. Take short breaks when needed.
- For sub-datum annotations:
  - Include all visible parts of the object.
  - When annotating partially visible objects, include only the parts that are visible.
  - Be careful not to focus only on the center of the image; items of interest near the edges are easy to miss.
  - Use image zoom and annotation refinement features to align the bounding geometry closely with the edges of the object. Annotations do not have to be pixel-perfect, but loose bounding annotations will degrade model performance.[2]
  - Do not duplicate annotations.
  - Make sure all items of interest in each datum are annotated.