In many real-world datasets, object classes follow a long-tailed distribution: a few classes are very common, while many others appear infrequently. For example, you will find plenty of images of cats and dogs, but far fewer of a specific rare species. The frequently occurring classes are called “head” classes and the rare ones “tail” classes. Performing object detection on such a dataset poses a set of challenges:
So, how can we improve feature learning so the model can perform well across the full spectrum of class frequencies?
The most popular ways to mitigate the long-tail problem in object detection can be divided into two categories:
These are really great strategies for addressing class imbalance and enhancing the model's learning. However, these pre-training paradigms leave some key detection components (such as the detection head) randomly initialised, and they tend to overlook the suboptimal representations that long-tailed distributions induce during the pre-training process itself.
In a recent paper, I came across a nice way to mitigate the long-tailed distribution starting from the pre-training phase, using contrastive learning.
Contrastive learning is a self-supervised approach that helps models learn useful representations by comparing data samples.
The core idea is to pull similar samples (called positives) closer together in the feature space while pushing dissimilar ones (negatives) apart. In computer vision tasks, this often means treating two augmented views of the same image as positives and views of other images as negatives. This encourages the model to focus on underlying semantic features rather than low-level noise.
It’s a foundational idea behind methods like SimCLR and MoCo, and has been widely adopted in pre-training for tasks such as classification and detection.
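To make the idea concrete, here is a minimal PyTorch sketch of an NT-Xent loss in the style of SimCLR. The function name, argument names, and temperature value are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor,
                 temperature: float = 0.5) -> torch.Tensor:
    """SimCLR-style contrastive loss.

    z1, z2: (n, d) embeddings of two augmented views of the same n images.
    Each sample's positive is the other view of the same image; everything
    else in the batch acts as a negative.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2n, d), unit length
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # never contrast a sample with itself
    # Positive pairs sit at offset n: view i matches view i + n and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```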
The paper stresses the importance of pretraining, as it plays a crucial role in object detection, especially when the data is imbalanced. It introduces a framework called Dynamic Rebalancing Contrastive Learning with Dual Reconstruction, which consists of three components: Holistic-Local Contrastive Learning, Dynamic Rebalancing, and Dual Reconstruction.
This module aims to pretrain both the backbone and the detection head by encouraging learning at both global (image) and local (object proposal) levels.
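Here is one hedged way such a holistic-local objective could look in PyTorch, reusing `nt_xent_loss` from the sketch above. The projection heads, the use of torchvision's `roi_align`, the assumption that proposals are matched one-to-one across the two views, and the `local_weight` value are all my illustrative assumptions, not the paper's exact recipe.

```python
import torch
from torchvision.ops import roi_align

def holistic_local_loss(backbone, img_proj, head_proj,
                        view1, view2, boxes1, boxes2,
                        local_weight: float = 0.5) -> torch.Tensor:
    feat1, feat2 = backbone(view1), backbone(view2)      # feature maps (B, C, H, W)

    # Holistic term: contrast global-pooled image embeddings of the two views.
    g1 = img_proj(feat1.mean(dim=(2, 3)))
    g2 = img_proj(feat2.mean(dim=(2, 3)))
    holistic = nt_xent_loss(g1, g2)

    # Local term: contrast embeddings of corresponding proposals, so the
    # detection head is also pre-trained rather than left randomly initialised.
    r1 = head_proj(roi_align(feat1, boxes1, output_size=7).flatten(1))
    r2 = head_proj(roi_align(feat2, boxes2, output_size=7).flatten(1))
    local = nt_xent_loss(r1, r2)

    return holistic + local_weight * local
```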
This part focuses on improving representation learning for tail classes by dynamically resampling the training data.
Unlike traditional methods such as Repeat Factor Sampling (RFS), which operates purely at the image level, this method looks at both image-level and instance-level class frequencies. It calculates a harmonic mean of the two to get a balanced repeat factor, which shifts over time to focus more on instance-level balancing as training progresses (see the sketch below).
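A minimal sketch of what such a harmonic-mean repeat factor could look like. Here `img_freq` is the fraction of images containing a class, `ins_freq` the fraction of instances belonging to it, and `t` in [0, 1] is training progress; the RFS-style square-root formula is standard, but the threshold value and the linear schedule are illustrative assumptions.

```python
import math

def repeat_factor(img_freq: float, ins_freq: float,
                  t: float, thresh: float = 1e-3) -> float:
    # RFS-style per-level factors: oversample classes rarer than `thresh`.
    r_img = max(1.0, math.sqrt(thresh / img_freq))
    r_ins = max(1.0, math.sqrt(thresh / ins_freq))
    # Weighted harmonic mean of the two levels; the weight drifts toward
    # the instance level as training progresses (t -> 1).
    w = 1.0 - t  # weight on the image-level term
    return 1.0 / (w / r_img + (1.0 - w) / r_ins)
```

As in standard RFS, an image's overall repeat factor would then be the maximum over the factors of the classes it contains.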
To counter simplicity bias (where models overly rely on shortcuts), this module encourages the model to learn both visual fidelity and semantic consistency.
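One hedged way to express such a dual objective in PyTorch: an appearance term that rebuilds pixels (visual fidelity), and a semantic term that re-encodes the reconstruction and keeps its embedding close to the original's (semantic consistency). The `encoder`/`decoder` interfaces and the weighting are my assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_reconstruction_loss(encoder, decoder, images: torch.Tensor,
                             sem_weight: float = 1.0) -> torch.Tensor:
    z = encoder(images)                     # latent features of the input
    recon = decoder(z)                      # reconstructed images

    # Visual fidelity: pixel-level reconstruction error.
    appearance = F.mse_loss(recon, images)

    # Semantic consistency: the reconstruction should embed close to the
    # original, discouraging shortcut solutions that match pixels but
    # lose semantics (simplicity bias).
    z_recon = encoder(recon)
    semantic = 1.0 - F.cosine_similarity(
        z_recon.flatten(1), z.flatten(1), dim=1).mean()

    return appearance + sem_weight * semantic
```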
By dynamically rebalancing the training focus and enforcing both visual fidelity and semantic consistency, the framework significantly improves the quality of pre-trained representations. These enhancements lay a strong foundation for building object detectors that perform robustly across diverse and imbalanced datasets. It is an elegant and robust way to bring the strengths of contrastive pre-training to long-tailed object detection.