Photo by Fernando (@cferdophotography) on Unsplash

Long-Tailed Object Detection

In many real-world datasets, object classes follow a long-tailed distribution: a few classes are very common, while many others appear infrequently. For example, you will find a huge amount of data on cats and dogs, but very little on rare or specific species. The frequently occurring classes are called “head” classes and the rare ones are called “tail” classes. Performing object detection on such a dataset poses a clear challenge: the detector is dominated by the head classes and generalises poorly to the tail classes.

So, how can we improve feature learning so the model can perform well across the full spectrum of class frequencies?

Popular methods for mitigating long-tailed distributions in object detection

The most popular ways to mitigate the effects of a long-tailed distribution fall into two categories:

Data-level strategies

  1. Address the class imbalance by resampling: under-sample the majority classes and over-sample the minority classes (see the sampler sketch after this list)
  2. Bring in open-source datasets to add more samples for the minority classes
  3. Expand the dataset with augmentations or synthetic data
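
As a concrete illustration, here is a minimal PyTorch sketch of inverse-frequency resampling. It assumes one class label per sample for simplicity; a real detection dataset would need a per-image heuristic (for example, weighting each image by its rarest class), and the labels tensor below is purely illustrative.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Illustrative per-sample class labels; class 0 is the "head" class.
labels = torch.tensor([0, 0, 0, 0, 0, 1, 1, 2])

# Weight each sample by the inverse frequency of its class, so tail
# classes are drawn more often and head classes less often.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),  # draw one epoch's worth of samples
    replacement=True,         # lets rare samples repeat within an epoch
)
# loader = DataLoader(dataset, batch_size=..., sampler=sampler)
```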

Model-level or architecture-level strategies

  1. Weight the loss function so that wrong predictions on minority classes are penalised more heavily than those on majority classes
  2. Use a loss function designed to handle class imbalance, e.g. focal loss (a minimal sketch follows this list)
  3. Leverage a pre-trained model backbone
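
For reference, below is a minimal sketch of the binary focal loss from “Focal Loss for Dense Object Detection” (Lin et al., 2017) in plain PyTorch. The focal_loss helper and its default alpha/gamma values follow the common formulation; this is a sketch, not any particular library's implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    Down-weights easy examples so training focuses on hard (often
    rare-class) examples.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: four binary predictions against their targets
logits = torch.tensor([2.0, -1.0, 0.5, -2.0])
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])
print(focal_loss(logits, targets))
```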

These are effective strategies for addressing class imbalance and enhancing the model's learning. However, typical pre-training paradigms leave some key detection components randomly initialised, and they tend to overlook the suboptimal representations caused by long-tailed distributions during the pre-training process.

In a recent paper, I came across a nice way to mitigate the long-tailed distribution problem starting from the pre-training phase, using contrastive learning.

Quick Refresher on Contrastive Learning

Contrastive learning is a self-supervised approach that helps models learn useful representations by comparing data samples.

The core idea is to bring similar samples (called positives) closer in the feature space while pushing dissimilar ones (negatives) apart. In computer vision tasks, this often involves taking two augmented views of the same image as positives and treating views from other images as negatives. This encourages the model to focus on underlying semantic features rather than low-level noise.

It’s a foundational idea behind methods like SimCLR and MoCo, and has been widely adopted in pre-training for tasks such as classification and detection.
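
To make this concrete, here is a minimal sketch of the NT-Xent loss popularised by SimCLR, assuming a batch of paired augmented-view embeddings; the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR-style) contrastive loss.

    z1, z2: (N, D) embeddings of two augmented views of the same N
    images. Each pair (z1[i], z2[i]) is a positive; every other sample
    in the batch acts as a negative.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.T / temperature                         # scaled cosine sims
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    n = z1.shape[0]
    # The positive for sample i is sample i + n, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)  # two views, batch of 8
print(nt_xent_loss(z1, z2))
```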

Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual Reconstruction

The paper stresses that pre-training plays a crucial role in object detection, especially when the data is imbalanced. It introduces a framework called Dynamic Rebalancing Contrastive Learning with Dual Reconstruction, which consists of three components: Holistic-Local Contrastive Learning, Dynamic Rebalancing, and Dual Reconstruction.

Figure: the proposed method (source: the paper)

Holistic-Local Contrastive Learning (HLCL)

This module aims to pretrain both the backbone and the detection head by encouraging learning at both global (image) and local (object proposal) levels.
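
The paper's exact formulation isn't reproduced here, but structurally the idea can be sketched as a weighted sum of an image-level term and a proposal-level term, reusing the nt_xent_loss helper from the refresher above. The lam weight and the embedding arguments are assumptions for illustration, not the paper's notation.

```python
def holistic_local_loss(img_emb1, img_emb2, prop_emb1, prop_emb2,
                        lam=0.5, temperature=0.5):
    """Hypothetical holistic-local contrastive objective (structure only).

    The holistic term contrasts whole-image embeddings (a pre-training
    signal for the backbone); the local term contrasts object-proposal
    embeddings (a pre-training signal for the detection head).
    """
    holistic = nt_xent_loss(img_emb1, img_emb2, temperature)
    local = nt_xent_loss(prop_emb1, prop_emb2, temperature)
    return lam * holistic + (1.0 - lam) * local
```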

Dynamic Rebalancing

This part focuses on improving representation learning for tail classes by dynamically resampling the training data.

Unlike traditional methods such as Repeat Factor Sampling (RFS), which considers only how many training images contain each class, this method looks at both image-level and instance-level class frequencies.

It calculates a harmonic mean of the two to get a balanced repeat factor, which shifts over time to focus more on instance-level balancing as training progresses.
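
A hypothetical sketch of what such a schedule could look like, using the standard RFS square-root formula at both levels and a time-varying weighted harmonic mean; the threshold, schedule, and function name are assumptions, not the paper's exact recipe.

```python
import math

def repeat_factor(img_freq, inst_freq, t, thresh=0.001):
    """Hypothetical dynamic repeat factor for one class.

    img_freq:  fraction of training images containing the class
    inst_freq: fraction of all annotated instances belonging to the class
    t:         training progress in [0, 1]
    """
    # RFS-style factors computed at two levels of granularity.
    r_img = max(1.0, math.sqrt(thresh / img_freq))
    r_inst = max(1.0, math.sqrt(thresh / inst_freq))
    # Weighted harmonic mean: early training leans on image-level
    # balance, later training shifts toward instance-level balance.
    return 1.0 / ((1.0 - t) / r_img + t / r_inst)

# Example: a tail class present in 0.05% of images and 0.01% of instances
print(repeat_factor(img_freq=0.0005, inst_freq=0.0001, t=0.5))
```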

Dual Reconstruction

To counter simplicity bias (where models overly rely on shortcuts), this module encourages the model to learn both visual fidelity and semantic consistency.
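
Structurally, this can be sketched as two complementary terms: a pixel-space reconstruction for visual fidelity and a feature-space consistency term for semantic meaning. The loss choices, argument names, and beta weight below are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def dual_reconstruction_loss(decoded_img, target_img,
                             student_feat, teacher_feat, beta=1.0):
    """Hypothetical dual reconstruction objective (structure only)."""
    # Visual fidelity: reconstruct the input image in pixel space.
    appearance = F.l1_loss(decoded_img, target_img)
    # Semantic consistency: match a reference embedding in feature space.
    semantic = 1 - F.cosine_similarity(student_feat, teacher_feat, dim=-1).mean()
    return appearance + beta * semantic

img = torch.rand(2, 3, 64, 64)
feat_s, feat_t = torch.randn(2, 256), torch.randn(2, 256)
print(dual_reconstruction_loss(img, img, feat_s, feat_t))
```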

Summary

By dynamically rebalancing the training focus and enforcing both visual fidelity and semantic consistency, the framework significantly improves the quality of pre-trained representations. These enhancements lay a strong foundation for building object detectors that perform robustly across diverse and imbalanced datasets. In short, it is an elegant and robust way to pre-train both the backbone and the detection head, to balance learning between head and tail classes, and to counter the simplicity bias that long-tailed data encourages.

References

  1. Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual Reconstruction