Photo by Fernando (@cferdophotography) on Unsplash

Long-Tailed Object Detection

In many real-world datasets, object classes follow a long-tailed distribution: a few classes are very common, while many others appear infrequently. For example, you will find a huge amount of data on cats and dogs, but very little on rare or specific species. The frequently occurring classes are called “head” classes and the rare ones are called “tail” classes. Performing object detection on such a dataset poses a clear challenge: the detector is dominated by the head classes and generalises poorly to the tail classes.

So, how can we improve feature learning so the model can perform well across the full spectrum of class frequencies?

Popular methods for mitigating long-tailed distributions in object detection

The most popular ways to mitigate the effects of a long-tailed distribution fall into two categories:

Data-level strategies

  1. Address the class imbalance by resampling: under-sample the majority classes and over-sample the minority classes (see the sampler sketch after this list)
  2. Bring in open-source datasets to add more samples for the minority classes
  3. Expand the dataset with augmentations or synthetic data
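
As a concrete illustration, here is a minimal PyTorch sketch of inverse-frequency resampling. It assumes one class label per sample for simplicity; a real detection dataset would need a per-image heuristic (for example, weighting each image by its rarest class), and the labels tensor below is purely illustrative.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Illustrative per-sample class labels; class 0 is the "head" class.
labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2])

# Weight each sample by the inverse frequency of its class, so tail
# classes are drawn more often and head classes less often.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),  # draw one epoch's worth of samples
    replacement=True,         # lets rare samples repeat within an epoch
)
# loader = DataLoader(dataset, batch_size=..., sampler=sampler)
```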

Model-level or architecture-level strategies

  1. Weight the loss function so that wrong predictions on minority classes are penalised more heavily than those on majority classes
  2. Use a loss function designed to handle class imbalance, e.g. focal loss (a minimal sketch follows this list)
  3. Leverage a pre-trained model backbone
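
For reference, below is a minimal sketch of the binary focal loss from “Focal Loss for Dense Object Detection” (Lin et al., 2017) in plain PyTorch. The focal_loss helper and its default alpha/gamma values follow the common formulation; this is a sketch, not any particular library's implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    Down-weights easy examples so training focuses on hard (often
    rare-class) examples.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: four binary predictions against their targets
logits = torch.tensor([2.0, -1.0, 0.5, -2.0])
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])
print(focal_loss(logits, targets))
```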

These are effective strategies for addressing class imbalance and enhancing the model's learning. However, typical pre-training paradigms leave some key detection components randomly initialised, and they tend to overlook the suboptimal representations caused by long-tailed distributions during the pre-training process.

In a recent paper, I came across a nice way to mitigate the long-tailed distribution problem starting from the pre-training phase, using contrastive learning.

Quick Refresher on Contrastive Learning

Contrastive learning is a self-supervised approach that helps models learn useful representations by comparing data samples.

The core idea is to bring similar samples (called positives) closer in the feature space while pushing dissimilar ones (negatives) apart. In computer vision tasks, this often involves taking two augmented views of the same image as positives and treating views from other images as negatives. This encourages the model to focus on underlying semantic features rather than low-level noise.

It’s a foundational idea behind methods like SimCLR and MoCo, and has been widely adopted in pre-training for tasks such as classification and detection.
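
To make this concrete, here is a minimal sketch of the NT-Xent loss popularised by SimCLR, assuming a batch of paired augmented-view embeddings; the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR-style) contrastive loss.

    z1, z2: (N, D) embeddings of two augmented views of the same N
    images. Each pair (z1[i], z2[i]) is a positive; every other sample
    in the batch acts as a negative.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.T / temperature                         # scaled cosine sims
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.shape[0]
    # The positive for sample i is sample i + n, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)  # two views, batch of 8
print(nt_xent_loss(z1, z2))
```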

Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual Reconstruction

The paper stresses that pre-training plays a crucial role in object detection, especially when the data is imbalanced. It introduces a framework called Dynamic Rebalancing Contrastive Learning with Dual Reconstruction, which consists of three components: Holistic-Local Contrastive Learning, Dynamic Rebalancing, and Dual Reconstruction.

Figure: the proposed method (source: the paper)

Holistic-Local Contrastive Learning (HLCL)

This module aims to pretrain both the backbone and the detection head by encouraging learning at both global (image) and local (object proposal) levels.
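
The paper's exact formulation isn't reproduced here, but structurally the idea can be sketched as a weighted sum of an image-level term and a proposal-level term, reusing the nt_xent_loss helper from the refresher above. The lam weight and the embedding arguments are assumptions for illustration, not the paper's notation.

```python
def holistic_local_loss(img_emb1, img_emb2, prop_emb1, prop_emb2,
                        lam=0.5, temperature=0.5):
    """Hypothetical holistic-local contrastive objective (structure only).

    The holistic term contrasts whole-image embeddings (a pre-training
    signal for the backbone); the local term contrasts object-proposal
    embeddings (a pre-training signal for the detection head).
    """
    holistic = nt_xent_loss(img_emb1, img_emb2, temperature)
    local = nt_xent_loss(prop_emb1, prop_emb2, temperature)
    return lam * holistic + (1.0 - lam) * local
```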

Dynamic Rebalancing

This part focuses on improving representation learning for tail classes by dynamically resampling the training data.

Unlike traditional methods such as Repeat Factor Sampling (RFS), which considers only how many training images contain each class, this method looks at both image-level and instance-level class frequencies.

It calculates a harmonic mean of the two to get a balanced repeat factor, which shifts over time to focus more on instance-level balancing as training progresses.
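
A hypothetical sketch of what such a schedule could look like, using the standard RFS square-root formula at both levels and a time-varying weighted harmonic mean; the threshold, schedule, and function name are assumptions, not the paper's exact recipe.

```python
import math

def repeat_factor(img_freq, inst_freq, t, thresh=0.001):
    """Hypothetical dynamic repeat factor for one class.

    img_freq:  fraction of training images containing the class
    inst_freq: fraction of all annotated instances belonging to the class
    t:         training progress in [0, 1]
    """
    # RFS-style factors computed at two levels of granularity.
    r_img = max(1.0, math.sqrt(thresh / img_freq))
    r_inst = max(1.0, math.sqrt(thresh / inst_freq))
    # Weighted harmonic mean: early training leans on image-level
    # balance, later training shifts toward instance-level balance.
    return 1.0 / ((1.0 - t) / r_img + t / r_inst)

# Example: a tail class present in 0.05% of images and 0.01% of instances
print(repeat_factor(img_freq=0.0005, inst_freq=0.0001, t=0.5))
```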

Dual Reconstruction

To counter simplicity bias (where models overly rely on shortcuts), this module encourages the model to learn both visual fidelity and semantic consistency.
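
Structurally, this can be sketched as two complementary terms: a pixel-space reconstruction for visual fidelity and a feature-space consistency term for semantic meaning. The loss choices, argument names, and beta weight below are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def dual_reconstruction_loss(decoded_img, target_img,
                             student_feat, teacher_feat, beta=1.0):
    """Hypothetical dual reconstruction objective (structure only)."""
    # Visual fidelity: reconstruct the input image in pixel space.
    appearance = F.l1_loss(decoded_img, target_img)
    # Semantic consistency: match a reference embedding in feature space.
    semantic = 1 - F.cosine_similarity(student_feat, teacher_feat, dim=-1).mean()
    return appearance + beta * semantic

img = torch.rand(2, 3, 64, 64)
feat_s, feat_t = torch.randn(2, 256), torch.randn(2, 256)
print(dual_reconstruction_loss(img, img, feat_s, feat_t))
```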

Summary

By dynamically rebalancing the training focus and enforcing both visual fidelity and semantic consistency, the framework significantly improves the quality of pre-trained representations. These enhancements lay a strong foundation for building object detectors that perform robustly across diverse and imbalanced datasets. In short, it is an elegant and robust way to pre-train both the backbone and the detection head, to balance learning between head and tail classes, and to counter the simplicity bias that long-tailed data encourages.

References

  1. Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual Reconstruction