Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG

Authors: Wenbin Wang, Yongcheng Jing, Liang Ding, Yingjie Wang, Li Shen, Yong Luo, Bo Du, Dacheng Tao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on V* Bench and 19% on HR-Bench.
Researcher Affiliation | Academia | 1 School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China; 2 Nanyang Technological University, Singapore 639798; 3 The University of Sydney, Australia; 4 Shenzhen Campus of Sun Yat-sen University, China.
Pseudocode | Yes | Algorithm 1: Spatial-Awareness Layout; Algorithm 2: Retrieval-Augmented Perception.
Open Source Code | Yes | Code is available at https://github.com/DreamMr/RAP.
Open Datasets | Yes | We evaluate our RAP on two HR benchmarks: V* Bench and HR-Bench. V* Bench, derived from SA-1B (Kirillov et al., 2023), averages a resolution of 2246×1582. More details about HR-Bench can be found in Sect. 3.1. HR-Bench 8K, with 8K-resolution images from DIV8K (Gu et al., 2019) and the Internet, includes Fine-grained Single-instance Perception (FSP) and Fine-grained Cross-instance Perception (FCP) tasks.
Dataset Splits | No | The paper describes dataset characteristics and task types (FSP, FCP) for HR-Bench and V* Bench, but does not explicitly state how these datasets are split into training, validation, or test sets, either as percentages or as sample counts.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or computing-cluster specifications used for the experiments.
Software Dependencies | No | The paper mentions various MLLMs (e.g., LLaVA-v1.5, LLaVA-v1.6) and components such as VisRAG and SigLIP, but it does not specify version numbers for general software dependencies or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | We set τ = 0.6 throughout the paper; b is a bias value, set here to 0.2, and d denotes the depth of the image tree.