
Breaking Through Noisy Correspondence: A Robust Model for
Image-Text Matching
Abstract

Unleashing the power of image-text matching in real-world applications is hampered by noisy correspondences. Manually curating high-quality datasets is expensive and time-consuming, while gathering image-text pairs from the internet introduces noise that significantly degrades model performance. In this paper, we propose a novel model that recasts the noisy-correspondence filtering problem as a similarity-distribution modeling problem. Leveraging the image-text matching capability of CLIP and a Gaussian mixture model, our model filters out most of the noisy correspondences among image-text pairs. To further reduce the impact of noisy correspondences during fine-tuning, we propose a distribution-aware dynamic margin ranking loss that widens the gap between the clean and noisy distributions. Extensive experiments on three datasets, including the challenging Conceptual Captions, demonstrate the effectiveness and robustness of our model even under high noise rates. Our approach opens up new opportunities for improving image-text matching in real-world settings by breaking through the noise.
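
The abstract's core filtering idea can be sketched as follows: score each image-text pair with CLIP cosine similarity, fit a two-component Gaussian mixture to the resulting similarity distribution, and keep the pairs that the high-mean (clean) component claims. This is a minimal illustration, assuming the ViT-B/32 backbone, a two-component split, and a 0.5 posterior threshold; these are illustrative choices, not necessarily the paper's exact configuration.

```python
# Sketch of similarity-distribution modeling for noisy-correspondence filtering.
# Assumptions (not from the paper): ViT-B/32 CLIP, 2 GMM components, 0.5 threshold.
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image
from sklearn.mixture import GaussianMixture

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def pair_similarities(image_paths, captions):
    """Cosine similarity between each image and its paired caption."""
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    texts = clip.tokenize(captions, truncate=True).to(device)
    img_feat = model.encode_image(images)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum(dim=-1).float().cpu().numpy()

def filter_noisy_pairs(sims, threshold=0.5):
    """Fit a 2-component GMM to the 1-D similarity distribution; keep pairs whose
    posterior under the higher-mean (clean) component exceeds `threshold`."""
    gmm = GaussianMixture(n_components=2, covariance_type="full")
    gmm.fit(sims.reshape(-1, 1))
    clean_component = gmm.means_.argmax()  # higher-mean component = clean pairs
    p_clean = gmm.predict_proba(sims.reshape(-1, 1))[:, clean_component]
    return p_clean > threshold, p_clean
```

The returned posterior `p_clean` can also serve as a soft per-pair confidence rather than a hard keep/drop decision, which is how it feeds into the loss sketched below.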
Pipeline
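
For the fine-tuning stage of the pipeline, the sketch below illustrates one way a distribution-aware dynamic margin ranking loss could be realized: a hard-negative triplet ranking loss whose margin is scaled per pair by the GMM posterior of being clean, so confident pairs are pushed further from the noisy distribution. The scaling rule and the `base_margin` default are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a distribution-aware dynamic margin ranking loss.
# Assumption: margin = base_margin * p_clean; the paper's rule may differ.
import torch

def dynamic_margin_ranking_loss(scores, p_clean, base_margin=0.2):
    """scores: (B, B) image-text similarity matrix, diagonal = matched pairs.
    p_clean: (B,) posterior probability that each pair is clean."""
    B = scores.size(0)
    diag = scores.diag().view(B, 1)
    # Confident clean pairs get a larger margin; suspect pairs a softer one.
    margin = base_margin * p_clean.view(B, 1)
    mask = torch.eye(B, dtype=torch.bool, device=scores.device)
    # Caption retrieval: other captions in the row act as negatives.
    cost_s = (margin + scores - diag).clamp(min=0).masked_fill(mask, 0)
    # Image retrieval: other images in the column act as negatives.
    cost_im = (margin.t() + scores - diag.t()).clamp(min=0).masked_fill(mask, 0)
    # Hardest negative per anchor, as in standard hard-negative triplet losses.
    return cost_s.max(dim=1)[0].mean() + cost_im.max(dim=0)[0].mean()
```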
