No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Authors: Manu Gaur, Darshan Singh S, Makarand Tapaswi

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessì et al., 2023) and by +7.6% on ImageCoDe. Additionally, existing metrics for evaluating captioning systems fail to reward diversity or to evaluate a model's fine-grained understanding. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8% to 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
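The SR metric described above scores a captioner by whether its caption retrieves the source image from a bag of distractors. A minimal sketch of that evaluation, assuming L2-normalized caption and image embeddings (e.g., from a CLIP-style encoder; `self_retrieval_recall` and the toy vectors are hypothetical, not the authors' code):

```python
import numpy as np

def self_retrieval_recall(caption_embs, image_embs):
    """Recall@1 for self-retrieval: caption i must rank image i first
    among all images in the bag (its own image plus the distractors).

    Both arguments are (N, D) arrays of L2-normalized embeddings, with
    row i of each corresponding to the same image-caption pair.
    """
    sims = caption_embs @ image_embs.T            # (N, N) cosine similarities
    predicted = sims.argmax(axis=1)               # best-matching image per caption
    return (predicted == np.arange(len(sims))).mean()

# Toy bag of 3 pairs: captions 0 and 1 retrieve their images,
# caption 2 is closer to image 0 than to its own image.
caps = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
imgs = np.array([[1.0, 0.0], [0.0, 1.0], [-0.6, 0.8]])
recall = self_retrieval_recall(caps, imgs)        # 2 of 3 correct
```

With 99 random distractors per image (the RD100 setting quoted above), the bag size N would be 100 and the same recall@1 computation applies.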
Researcher Affiliation | Academia | Manu Gaur (EMAIL), CVIT, IIIT Hyderabad, India; Darshan Singh (EMAIL), CVIT, IIIT Hyderabad, India; Makarand Tapaswi (EMAIL), CVIT, IIIT Hyderabad, India
Pseudocode | Yes | Algorithm 1: Candidate Bag Creation; Algorithm 2: TrueMatch: Automated Bag Curation
Open Source Code | No | The paper does not explicitly provide a link to the authors' source code or a statement of its release for the methodology described. It mentions using third-party tools such as InstructBLIP.
Open Datasets | Yes | Training datasets may be divided into two: curated datasets (COCO (Lin et al., 2014), Flickr30k (Plummer et al., 2015)) or large-scale alt-text data (e.g., CC3M (Sharma et al., 2018))... We use the 10,000 images from COCO's validation and test sets (Karpathy & Fei-Fei, 2015) for our benchmark.
Dataset Splits | Yes | We use the 10,000 images from COCO's validation and test sets (Karpathy & Fei-Fei, 2015) for our benchmark. ... We randomly sample 60 images from MSCOCO and for each image-caption pair, a human is asked to count the number of hallucinations present in the caption. ... We generate captions for all 10,000 images of the MSCOCO test and validation sets and compute object-level hallucinations using CHAIR (Rohrbach et al., 2018).
Hardware Specification | Yes | The model is trained with the AdamW optimizer (Loshchilov & Hutter, 2018) on a single A6000 GPU.
Software Dependencies | No | The paper mentions several models and frameworks, such as Mistral-7B, InstructBLIP, CLIP, GPT-2, the AdamW optimizer, and LoRA adapters, but does not specify exact version numbers or any other versioned software dependencies required for replication.
Experiment Setup | Yes | Further details about the hyperparameters are provided in Appendix C.3. ... We present a brief overview of hyperparameters for different optimization stages: MLE (Table 11) and REINFORCE (Table 12). ... MLE pretraining: batch size 40, linear-decay schedule, learning rate 2e-5, 30,000 total steps, 1,000 warmup steps. ... REINFORCE optimization settings (base params): batch size 100, constant schedule, 23,000 total steps, 0 warmup steps. Learning rates per reward: 9e-8 (SR reward), 1e-6 (CIDEr reward), 1e-7 (CIDEr + SR reward).
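The quoted MLE settings (peak learning rate 2e-5, 1,000 warmup steps, 30,000 total steps, linear decay) can be sketched as a schedule function. The linear-warmup-then-linear-decay shape is an assumption based on the "Linear Decay" and "Warmup steps" entries above, and `lr_at_step` is a hypothetical helper, not the authors' code:

```python
def lr_at_step(step, peak_lr=2e-5, warmup_steps=1_000, total_steps=30_000):
    """Learning rate at a given optimizer step: linear warmup from 0 to
    peak_lr over warmup_steps, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at warmup_steps down to 0 at total_steps.
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For the REINFORCE stage the table instead specifies a constant schedule, so the analogous function would simply return the reward-specific learning rate (e.g., 9e-8 for the SR reward) at every step.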