Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG
Authors: Wenbin Wang, Yongcheng Jing, Liang Ding, Yingjie Wang, Li Shen, Yong Luo, Bo Du, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on V* Bench and 19% on HR-Bench. |
| Researcher Affiliation | Academia | (1) School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China; (2) Nanyang Technological University, Singapore 639798; (3) The University of Sydney, Australia; (4) Shenzhen Campus of Sun Yat-sen University, China. |
| Pseudocode | Yes | Algorithm 1 Spatial-Awareness Layout; Algorithm 2 Retrieval-Augmented Perception |
| Open Source Code | Yes | Code is available at https://github.com/DreamMr/RAP. |
| Open Datasets | Yes | We evaluate our RAP on two HR benchmarks: V* Bench and HR-Bench. V* Bench, derived from SA-1B (Kirillov et al., 2023), averages a resolution of 2246×1582. More details about HR-Bench can be found in Sect. 3.1. HR-Bench 8K, with 8K-resolution images from DIV8K (Gu et al., 2019) and the Internet, includes Fine-grained Single-instance Perception (FSP) and Fine-grained Cross-instance Perception (FCP) tasks. |
| Dataset Splits | No | The paper describes dataset characteristics and task types (FSP, FCP) for HR-Bench and V Bench, but does not explicitly provide information on how these datasets are split into training, validation, or testing sets with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or computing cluster specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions various MLLMs (e.g., LLaVA-v1.5, LLaVA-v1.6) and components like VisRAG and SigLIP, but it does not specify version numbers for general software dependencies or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We set τ = 0.6 throughout the paper. Here b is a bias value, set at 0.2, and d denotes the depth of the image tree. |
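The reported setup fixes a similarity threshold τ = 0.6. A minimal sketch of how such a threshold could gate retrieved image crops in a visual-RAG pipeline is shown below; the function name, data layout, and scores are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: keep only retrieved image crops whose
# query-similarity score meets the paper's threshold tau = 0.6.
# Crop names and scores below are invented for illustration.

TAU = 0.6  # retrieval similarity threshold (value reported in the paper)

def filter_crops(crops_with_scores, tau=TAU):
    """Return the crops whose similarity score is at least tau."""
    return [crop for crop, score in crops_with_scores if score >= tau]

# Example: three candidate crops with cosine-similarity scores
candidates = [("crop_a", 0.82), ("crop_b", 0.55), ("crop_c", 0.61)]
print(filter_crops(candidates))  # ['crop_a', 'crop_c']
```

With τ = 0.6, a stricter threshold retains fewer, more query-relevant crops, trading recall for precision in what the MLLM must process.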