Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
Authors: Quang-Hung Le, Long Hoang Dang, Ngan Hoang Le, Truyen Tran, Thao Minh Le
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks. |
| Researcher Affiliation | Academia | Quang-Hung Le1, Long Hoang Dang2, Ngan Hoang Le3, Truyen Tran1, Thao Minh Le1; 1Applied Artificial Intelligence Institute (A2I2), Deakin University, Australia; 2Posts and Telecommunications Institute of Technology, Vietnam; 3University of Arkansas, USA; EMAIL, EMAIL, thile@uark.edu, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Progressive Multi-granularity Decoding |
| Open Source Code | No | We will release our code and datasets to support further research in this field. |
| Open Datasets | Yes | Our experiments are reproducible in academia, using only public data and models. We introduce a dataset construction pipeline to create a new dataset of nested compositional V-L pairs curated from Visual Genome, enabling training on multiple complexity levels. |
| Dataset Splits | No | The paper mentions using CompoVL for training and CompoVL-hard for evaluation, and refers to 'val' and 'test' splits for standard benchmarks such as GQA and RefCOCOg, implying standard splits. However, specific percentages or sample counts for training/validation splits within their main CompoVL dataset are not explicitly provided. |
| Hardware Specification | Yes | Fine-tuning takes around 7 hours on a single NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions 'LoRA (Hu et al. 2021) tuning', 'spaCy (Honnibal et al. 2020)', and 'Berkeley Neural Parser (Kitaev, Cao, and Klein 2019)' but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | We perform LoRA (Hu et al. 2021) tuning with r=64, learning rate 1e-4, warm-up ratio 0.1, and batch size 4. |
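The fine-tuning hyperparameters quoted in the Experiment Setup row can be gathered into a minimal configuration sketch. This is a hypothetical illustration, not the authors' released code: the class and field names are assumptions modeled on common LoRA fine-tuning setups, and only the four numeric values come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraFinetuneConfig:
    """Hyperparameters reported in the paper; structure is hypothetical."""
    lora_r: int = 64            # LoRA rank (r=64)
    learning_rate: float = 1e-4
    warmup_ratio: float = 0.1   # fraction of total steps used for LR warm-up
    batch_size: int = 4

    def warmup_steps(self, total_steps: int) -> int:
        """Number of warm-up steps implied by the warm-up ratio."""
        return int(total_steps * self.warmup_ratio)


cfg = LoraFinetuneConfig()
print(cfg.warmup_steps(10_000))  # 1000
```

In a typical Hugging Face setup these values would map onto `LoraConfig(r=...)` from `peft` and `TrainingArguments(learning_rate=..., warmup_ratio=..., per_device_train_batch_size=...)` from `transformers`, though the paper does not state which training stack was used.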