LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension

Authors: Amaia Cardiel, Eloi Zablocki, Elias Ramzi, Oriane Siméoni, Matthieu Cord

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LLM-wrapper on multiple datasets using different VLMs and LLMs, demonstrating significant performance improvements and highlighting the versatility of our method.
Researcher Affiliation | Collaboration | Amaia Cardiel 1,2, Eloi Zablocki 1, Elias Ramzi 1, Oriane Siméoni 1, Matthieu Cord 1,3 — 1 Valeo.ai, 2 APTIKAL, LIG, Université Grenoble Alpes, 3 Sorbonne Université
Pseudocode | No | The paper describes the method using natural language and figures, but does not contain a dedicated pseudocode block or algorithm section.
Open Source Code | Yes | The code and the checkpoints are available at https://github.com/valeoai/LLM_wrapper.
Open Datasets | Yes | We experiment with LLM-wrapper on three classic REC datasets: RefCOCO, RefCOCO+ (Kazemzadeh et al., 2014), RefCOCOg (Mao et al., 2016), and on Talk2Car (Deruyttere et al., 2019). Additionally, we evaluate LLM-wrapper on the recent and challenging HC-RefLoCo (Wei et al., 2024) benchmark.
Dataset Splits | Yes | Dataset statistics are given in Table 2, listed as dataset (split): sizes per split; avg. words per query:
RefCOCO (unc): 120,624 / 10,834 / 10,752; 3.5
RefCOCO+ (unc): 120,191 / 10,758 / 10,615; 3.5
RefCOCOg (umd): 80,512 / 4,896 / 9,602; 8.3
Talk2Car: 8,348 / 1,163 / 2,447; 11.0
HC-RefLoCo: 13,360 / 31,378; 84.6
Hardware Specification | Yes | This approach makes the training efficient in terms of compute and very simple to implement in practice. ... trainable on a single 40GB-A100 GPU in less than 7 hours.
Software Dependencies | No | The paper mentions methods and tools like LoRA, Flash Attention, 4-bit quantization, Adam, and Hugging Face's supervised fine-tuning pipeline, but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We train LLM-wrapper with Adam (Kingma, 2014), with a batch size of four, until convergence. ... Unless stated otherwise, we use a learning rate of 1e-5 and a rank of r = 128 for LoRA.
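For readers unfamiliar with the LoRA adaptation mentioned in the setup above (rank r = 128 in the paper), here is a minimal NumPy sketch of the technique: a frozen pretrained weight matrix augmented with a trainable low-rank update. The class name and initialization details are illustrative assumptions, not the authors' implementation (they use Hugging Face's fine-tuning pipeline).

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (LoRA).

    Illustrative sketch only: the paper uses r = 128 on an LLM; here we
    use a tiny rank so the example runs instantly.
    """

    def __init__(self, weight: np.ndarray, r: int = 128, alpha: float = 128.0):
        self.W = weight                      # frozen pretrained weight, shape (out, in)
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(0)
        # A gets small random values, B starts at zero, so the adapted
        # layer is initially identical to the pretrained one.
        self.A = rng.normal(scale=0.01, size=(r, in_dim))
        self.B = np.zeros((out_dim, r))
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # y = x W^T + scale * x A^T B^T; only A and B would be trained.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.ones((8, 16))                 # stand-in for a pretrained weight
layer = LoRALinear(W, r=4)           # tiny rank for the demo
x = np.ones((2, 16))
y = layer(x)
# Because B starts at zero, the LoRA branch contributes nothing at init:
assert np.allclose(y, x @ W.T)
```

Training only A and B (roughly r * (in + out) parameters per layer instead of in * out) is what makes the fine-tuning fit on a single 40GB A100, as the Hardware Specification row notes.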