No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Authors: Manu Gaur, Darshan Singh S, Makarand Tapaswi

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessì et al., 2023) and by +7.6% on ImageCoDe. Additionally, existing metrics for evaluating captioning systems fail to reward diversity or to evaluate a model's fine-grained understanding. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8% to 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
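The SR metric described above scores a captioner by whether its caption retrieves the source image from a bag of distractors. A minimal sketch of that evaluation, assuming L2-normalized caption and image embeddings (e.g., from a CLIP-style encoder; `self_retrieval_recall` and the toy vectors are hypothetical, not the authors' code):

```python
import numpy as np

def self_retrieval_recall(caption_embs, image_embs):
    """Recall@1 for self-retrieval: caption i must rank image i first
    among all images in the bag (its own image plus the distractors).

    Both arguments are (N, D) arrays of L2-normalized embeddings, with
    row i of each corresponding to the same image-caption pair.
    """
    sims = caption_embs @ image_embs.T            # (N, N) cosine similarities
    predicted = sims.argmax(axis=1)               # best-matching image per caption
    return (predicted == np.arange(len(sims))).mean()

# Toy bag of 3 pairs: captions 0 and 1 retrieve their images,
# caption 2 is closer to image 0 than to its own image.
caps = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
imgs = np.array([[1.0, 0.0], [0.0, 1.0], [-0.6, 0.8]])
recall = self_retrieval_recall(caps, imgs)        # 2 of 3 correct
```

With 99 random distractors per image (the RD100 setting quoted above), the bag size N would be 100 and the same recall@1 computation applies.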
Researcher Affiliation | Academia | Manu Gaur (EMAIL), CVIT, IIIT Hyderabad, India; Darshan Singh (EMAIL), CVIT, IIIT Hyderabad, India; Makarand Tapaswi (EMAIL), CVIT, IIIT Hyderabad, India
Pseudocode | Yes | Algorithm 1: Candidate Bag Creation; Algorithm 2: TrueMatch: Automated Bag Curation
Open Source Code | No | The paper does not explicitly provide a link to the authors' source code or a statement of its release for the methodology described. It mentions using third-party tools such as InstructBLIP.
Open Datasets | Yes | Training datasets may be divided into two: curated datasets (COCO (Lin et al., 2014), Flickr30k (Plummer et al., 2015)) or large-scale alt-text data (e.g., CC3M (Sharma et al., 2018))... We use the 10,000 images from COCO's validation and test sets (Karpathy & Fei-Fei, 2015) for our benchmark.
Dataset Splits | Yes | We use the 10,000 images from COCO's validation and test sets (Karpathy & Fei-Fei, 2015) for our benchmark. ... We randomly sample 60 images from MSCOCO and for each image-caption pair, a human is asked to count the number of hallucinations present in the caption. ... We generate captions for all 10,000 images of the MSCOCO test and validation sets and compute object-level hallucinations using CHAIR (Rohrbach et al., 2018).
Hardware Specification | Yes | The model is trained with the AdamW optimizer (Loshchilov & Hutter, 2018) on a single A6000 GPU.
Software Dependencies | No | The paper mentions several models and frameworks, such as Mistral-7B, InstructBLIP, CLIP, GPT-2, the AdamW optimizer, and LoRA adapters, but does not specify exact version numbers or any other versioned software dependencies required for replication.
Experiment Setup | Yes | Further details about the hyperparameters are provided in Appendix C.3. ... We present a brief overview of hyperparameters for different optimization stages: MLE (Table 11) and REINFORCE (Table 12). ... MLE pretraining: batch size 40, linear-decay schedule, learning rate 2e-5, 30,000 total steps, 1,000 warmup steps. ... REINFORCE optimization settings (base params): batch size 100, constant schedule, 23,000 total steps, 0 warmup steps. Learning rates per reward: 9e-8 (SR reward), 1e-6 (CIDEr reward), 1e-7 (CIDEr + SR reward).
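The quoted MLE settings (peak learning rate 2e-5, 1,000 warmup steps, 30,000 total steps, linear decay) can be sketched as a schedule function. The linear-warmup-then-linear-decay shape is an assumption based on the "Linear Decay" and "Warmup steps" entries above, and `lr_at_step` is a hypothetical helper, not the authors' code:

```python
def lr_at_step(step, peak_lr=2e-5, warmup_steps=1_000, total_steps=30_000):
    """Learning rate at a given optimizer step: linear warmup from 0 to
    peak_lr over warmup_steps, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at warmup_steps down to 0 at total_steps.
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For the REINFORCE stage the table instead specifies a constant schedule, so the analogous function would simply return the reward-specific learning rate (e.g., 9e-8 for the SR reward) at every step.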