No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Authors: Manu Gaur, Darshan Singh S, Makarand Tapaswi
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessì et al., 2023) and by +7.6% on ImageCoDe. Additionally, existing metrics for evaluating captioning systems fail to reward diversity or assess a model's fine-grained understanding. Our third contribution addresses this by approaching self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate several state-of-the-art open-source MLLMs on TrueMatch and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% to +7.1% over Cambrian) while having 1–2 orders of magnitude fewer parameters. |
| Researcher Affiliation | Academia | Manu Gaur, CVIT, IIIT Hyderabad, India; Darshan Singh S, CVIT, IIIT Hyderabad, India; Makarand Tapaswi, CVIT, IIIT Hyderabad, India |
| Pseudocode | Yes | Algorithm 1: Candidate Bag Creation; Algorithm 2: TrueMatch: Automated Bag Curation |
| Open Source Code | No | The paper does not provide a link to the authors' source code, nor a statement that it will be released for the described methodology. It mentions using third-party tools such as InstructBLIP. |
| Open Datasets | Yes | Training datasets may be divided into two: curated datasets (COCO (Lin et al., 2014), Flickr30k (Plummer et al., 2015)) or large-scale alt-text data (e.g. CC3M (Sharma et al., 2018))... We use the 10,000 images from COCO’s validation and test sets (Karpathy & Fei-Fei, 2015) for our benchmark. |
| Dataset Splits | Yes | We use the 10,000 images from COCO’s validation and test sets (Karpathy & Fei-Fei, 2015) for our benchmark. ... We randomly sample 60 images from MSCOCO and for each image-caption pair, a human is asked to count the number of hallucinations present in the caption. ... We generate captions for all the 10,000 images of MSCOCO test and validation set and compute object level hallucinations using CHAIR (Rohrbach et al., 2018). |
| Hardware Specification | Yes | The model is trained with the AdamW optimizer (Loshchilov & Hutter, 2018) on a single A6000 GPU. |
| Software Dependencies | No | The paper mentions several models and components, such as Mistral-7B, InstructBLIP, CLIP, GPT-2, the AdamW optimizer, and LoRA adapters, but does not specify their exact version numbers or any other versioned software dependencies required for replication. |
| Experiment Setup | Yes | Further details about the hyperparameters are provided in Appendix C.3. ... We present a brief overview of hyperparameters for the different optimization stages: MLE (Table 11) and REINFORCE (Table 12). ... MLE pretraining: batch size 40; schedule: linear decay; learning rate 2×10⁻⁵; total steps 30,000; warmup steps 1,000. ... REINFORCE optimization settings: batch size 100; schedule: constant; total steps 23,000; warmup steps 0; learning rate 9×10⁻⁸ (SR reward), 1×10⁻⁶ (CIDEr reward), 1×10⁻⁷ (CIDEr + SR reward). |
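The self-retrieval (SR) results quoted above (e.g. recall against 99 random distractors, RD100) amount to text-to-image retrieval in which each generated caption must rank its own image first within a bag. The sketch below shows that recall@1 computation over pre-computed caption and image embeddings (e.g. from CLIP); the function name and toy data are illustrative, not the authors' code.

```python
import numpy as np

def self_retrieval_recall(caption_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """Recall@1 for self-retrieval: the fraction of captions whose most
    similar image (by cosine similarity) is their own source image.
    Row k of each array corresponds to the same image-caption pair."""
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = c @ v.T                      # (N, N) cosine-similarity matrix
    hits = sims.argmax(axis=1) == np.arange(len(c))
    return float(hits.mean())

# Toy example: 4 pairs with perfectly aligned embeddings retrieve themselves.
print(self_retrieval_recall(np.eye(4), np.eye(4)))  # → 1.0
```

A caption that is generic (true of many images in the bag) drags this score down, which is why SR rewards fine-grained, discriminative detail.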
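The MLE pretraining settings in the last row (linear-decay schedule, 2×10⁻⁵ peak learning rate, 30,000 total steps, 1,000 warmup steps) describe a standard warmup-then-linear-decay schedule. The function below is an illustrative reconstruction using the quoted Table 11 values as defaults, not the authors' implementation.

```python
def mle_lr(step: int, base_lr: float = 2e-5,
           warmup_steps: int = 1_000, total_steps: int = 30_000) -> float:
    """Learning rate at a given optimizer step: linear warmup from 0 to
    base_lr over warmup_steps, then linear decay back to 0 at total_steps.
    Defaults mirror the MLE pretraining values quoted from Table 11."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    frac_left = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, frac_left)

print(mle_lr(500))     # halfway through warmup: half of base_lr
print(mle_lr(30_000))  # → 0.0 at the end of training
```

The REINFORCE stage instead uses a constant schedule with no warmup, so its rate is simply the per-reward value quoted above throughout its 23,000 steps.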