Visual Agents as Fast and Slow Thinkers
Authors: Guangyan Sun, Mingyu Jin, Zhenting Wang, Chenglong Wang, Siqi Ma, Qifan Wang, Tong Geng, Yingnian Wu, Yongfeng Zhang, Dongfang Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that FAST outperforms various well-known baselines, achieving 80.8% accuracy on VQAv2 for visual question answering and a 48.7% gIoU score on ReasonSeg for reasoning segmentation, demonstrating FAST's superior performance. Extensive testing validates the efficacy and robustness of FAST's core components, showcasing its potential to advance the development of cognitive visual agents in AI systems. |
| Researcher Affiliation | Collaboration | Rochester Institute of Technology; University of Rochester; Rutgers University; University of California, Los Angeles; Meta AI; KAUST; Westlake University |
| Pseudocode | Yes | The pseudo-code of FAST is given in Algorithm 1. Algorithm 1: Pseudo-code of FAST in a PyTorch-like style. |
| Open Source Code | Yes | The code is available at the anonymous link https://anonymous.4open.science/r/Sys2-LLaVA-8B0F/ for the review process. |
| Open Datasets | Yes | We utilize eight popular benchmarks to evaluate our framework FAST comprehensively, categorized into general visual question answering (VQA) datasets and multimodal benchmarks. The VQA benchmarks include VQA-v2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), ScienceQA (Lu et al., 2022), and TextVQA (Singh et al., 2019), which focuses on optical character recognition. For multimodal benchmark evaluation, we use the hallucination benchmark POPE (Li et al., 2023c), along with comprehensive benchmarks such as MME (Fu et al., 2024), MM-Vet (Yu et al., 2024), and SEED (Li et al., 2024). We compare our model with the baseline LLaVA-v1.5 (Liu et al., 2023a) and other multimodal large language models. To thoroughly assess our model's understanding of pixel-level instances, we evaluate its performance on referring segmentation and grounding benchmarks, including refCOCO (Kazemzadeh et al., 2014), refCOCO+ (Kazemzadeh et al., 2014), and refCOCOg (Caesar et al., 2018). Further, to examine the reasoning capabilities of the FAST framework, we consider the Reasoning Segmentation benchmark (Lai et al., 2024). |
| Dataset Splits | No | The paper mentions using several datasets for training and evaluation, such as "augmented dataset was combined with LLaVA-v1.5's supervised dataset and trained for one epoch" and evaluation on "VQA datasets and multimodal benchmarks". It also mentions omitting certain datasets during training for unbiased evaluation. However, it does not provide specific percentages or sample counts for training, validation, or test splits within these datasets for their own experimental setup. |
| Hardware Specification | Yes | All experiments are conducted on eight NVIDIA TESLA A100-80GB SXM GPUs. |
| Software Dependencies | No | The paper states: "Our FAST framework is implemented in PyTorch (Paszke et al., 2019). The AdamW optimizer (Loshchilov & Hutter, 2019) is employed with the DeepSpeed ZeRO-2 configuration for fine-tuning the switch, proposal, and segmentation adapters with LoRA (Hu et al., 2022)." While it mentions PyTorch, AdamW, DeepSpeed, and LoRA, it does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For the LoRA configuration, we set the rank to 128 and alpha to 256, consistent with the settings of LLaVA-v1.5. Additionally, we adjust the learning rate of the vision encoder projection layer to 2e-5 to achieve better alignment. This fine-tuning process involved 10,000 steps to improve the model's segmentation capabilities. The model is trained for one epoch. |
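The LoRA hyperparameters reported in the setup row (rank 128, alpha 256) can be illustrated with a minimal NumPy sketch of a low-rank adapter update. Only `r` and `alpha` come from the paper; the layer width, batch size, and initialization below are illustrative assumptions, not FAST's actual adapter dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 512, 512   # hypothetical layer width (not from the paper)
r, alpha = 128, 256      # rank and alpha reported in the paper's LoRA config
scaling = alpha / r      # LoRA scales the low-rank update by alpha / r = 2.0

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def lora_forward(x):
    """Frozen path plus the scaled low-rank adapter path."""
    return x @ W.T + scaling * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
# With B initialized to zero, the adapter contributes nothing at step 0,
# so the LoRA output matches the frozen layer exactly.
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero-initialized up-projection is the standard LoRA convention: training starts exactly at the pretrained model, and only `A` and `B` receive gradients while `W` stays frozen.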