Visual Agents as Fast and Slow Thinkers
Authors: Guangyan Sun, Mingyu Jin, Zhenting Wang, Chenglong Wang, Siqi Ma, Qifan Wang, Tong Geng, Yingnian Wu, Yongfeng Zhang, Dongfang Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that FAST outperforms various well-known baselines, achieving 80.8% accuracy on VQAv2 for visual question answering and a 48.7% gIoU score on ReasonSeg for reasoning segmentation, demonstrating FAST's superior performance. Extensive testing validates the efficacy and robustness of FAST's core components, showcasing its potential to advance the development of cognitive visual agents in AI systems. |
| Researcher Affiliation | Collaboration | Rochester Institute of Technology; University of Rochester; Rutgers University; University of California, Los Angeles; Meta AI; KAUST; Westlake University |
| Pseudocode | Yes | The pseudo-code of FAST is given in Algorithm 1. Algorithm 1: Pseudo-code of FAST in a PyTorch-like style. |
| Open Source Code | Yes | The code is available at the anonymous link https://anonymous.4open.science/r/Sys2-LLaVA-8B0F/ for the review process. |
| Open Datasets | Yes | We utilize eight popular benchmarks to evaluate our framework FAST comprehensively, categorized into general visual question answering (VQA) datasets and multimodal benchmarks. The VQA benchmarks include VQA-v2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), ScienceQA (Lu et al., 2022), and TextVQA (Singh et al., 2019), which focuses on optical character recognition. For multimodal benchmark evaluation, we use the hallucination benchmark POPE (Li et al., 2023c), along with comprehensive benchmarks such as MME (Fu et al., 2024), MM-Vet (Yu et al., 2024), and SEED (Li et al., 2024). We compare our model with the baseline LLaVA-v1.5 (Liu et al., 2023a) and other multimodal large language models. To thoroughly assess our model's understanding of pixel-level instances, we evaluate its performance on referring segmentation and grounding benchmarks, including refCOCO (Kazemzadeh et al., 2014), refCOCO+ (Kazemzadeh et al., 2014), and refCOCOg (Caesar et al., 2018). Further, to examine the reasoning capabilities of the FAST framework, we consider the Reasoning Segmentation benchmark (Lai et al., 2024). |
| Dataset Splits | No | The paper mentions using several datasets for training and evaluation, such as "augmented dataset was combined with LLaVA-v1.5's supervised dataset and trained for one epoch" and evaluation on "VQA datasets and multimodal benchmarks". It also mentions omitting certain datasets during training for unbiased evaluation. However, it does not provide specific percentages or sample counts for training, validation, or test splits within these datasets for their own experimental setup. |
| Hardware Specification | Yes | All experiments are conducted on eight NVIDIA TESLA A100-80GB SXM GPUs. |
| Software Dependencies | No | The paper states: "Our FAST framework is implemented in PyTorch (Paszke et al., 2019). The AdamW optimizer (Loshchilov & Hutter, 2019) is employed with the DeepSpeed ZeRO-2 configuration for fine-tuning the switch, proposal, and segmentation adapters with LoRA (Hu et al., 2022)." While it mentions PyTorch, AdamW, DeepSpeed, and LoRA, it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For the LoRA configuration, we set the rank to 128 and alpha to 256, consistent with the settings of LLaVA-v1.5. Additionally, we adjust the learning rate of the vision encoder projection layer to 2e-5 to achieve better alignment. This fine-tuning process involved 10,000 steps to improve the model's segmentation capabilities. The model is trained for one epoch. |
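The LoRA hyperparameters reported in the setup row (rank 128, alpha 256) can be illustrated with a minimal NumPy sketch of a low-rank adapter update. Only `r` and `alpha` come from the paper; the layer width, batch size, and initialization below are illustrative assumptions, not FAST's actual adapter dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 512, 512   # hypothetical layer width (not from the paper)
r, alpha = 128, 256      # rank and alpha reported in the paper's LoRA config
scaling = alpha / r      # LoRA scales the low-rank update by alpha / r = 2.0

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def lora_forward(x):
    """Frozen path plus the scaled low-rank adapter path."""
    return x @ W.T + scaling * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
# With B initialized to zero, the adapter contributes nothing at step 0,
# so the LoRA output matches the frozen layer exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero-initialized up-projection is the standard LoRA convention: training starts exactly at the pretrained model, and only `A` and `B` receive gradients while `W` stays frozen.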