Re-Aligning Language to Visual Objects with an Agentic Workflow

Authors: Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expressions and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. With automatic VL refinement, our Real-LOD workflow reveals a potential to preserve data quality while scaling up data quantity, further improving LOD performance from a data-alignment perspective.
Researcher Affiliation | Collaboration | 1. VCIP, Nankai University; 2. SenseTime Research
Pseudocode | Yes | Algorithm 1: Pseudocode of Real-LOD. We show the detailed code of our workflow, which flexibly leverages tools to re-align raw expressions to given objects.
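For readers without access to the paper, the cited Algorithm 1 can be illustrated with a generic agentic re-alignment loop (a minimal, hypothetical sketch: the class names, function names, and stub behavior below are assumptions for illustration, not the authors' actual pseudocode or tool set):

```python
from dataclasses import dataclass

# Illustrative sketch of an agentic re-alignment loop: an LLM plans which
# tool to call (e.g., a VLM captioner), acts, then reflects on the result,
# cycling until the expression is judged aligned with the target object.
# All names here are hypothetical.

@dataclass
class Plan:
    tool: str

@dataclass
class Reflection:
    aligned: bool

class StubLLM:
    """Toy stand-in for the agent's LLM (planning / reflection / rewriting)."""
    def plan(self, image, obj, expr):
        return Plan(tool="caption")  # always query the captioner in this toy

    def reflect(self, expr, feedback):
        # "Aligned" here simply means the expression mentions the feedback.
        return Reflection(aligned=feedback in expr)

    def rewrite(self, expr, feedback):
        return f"{expr}, specifically {feedback}"

def realign(image, obj, raw_expr, llm, tools, max_cycles=3):
    """Plan -> act (tool call) -> reflect, cycling until aligned."""
    expr = raw_expr
    for _ in range(max_cycles):
        plan = llm.plan(image, obj, expr)
        feedback = tools[plan.tool](image, obj, expr)
        if llm.reflect(expr, feedback).aligned:
            return expr
        expr = llm.rewrite(expr, feedback)
    return expr

# Toy "VLM" tool that just returns the object category as a caption.
tools = {"caption": lambda image, obj, expr: obj}
result = realign(image=None, obj="red kite", raw_expr="a toy in the sky",
                 llm=StubLLM(), tools=tools)
```

The cycle structure (plan, tool call, reflect, rewrite) is the only part taken from the report's description of the workflow; everything else is stub scaffolding.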
Open Source Code | Yes | The code is available at https://github.com/FishAndWasabi/Real-LOD.
Open Datasets | Yes | We randomly select images from the Objects365 (Shao et al., 2019), Open Images (Kuznetsova et al., 2020), and LVIS (Gupta et al., 2019) datasets with all categories covered.
Dataset Splits | Yes | The benchmarks we use for evaluation are OmniLabel (Schulter et al., 2023), DOD (Xie et al., 2023), RefCOCO/g/+ (i.e., RefCOCO, RefCOCOg, RefCOCO+) (Yu et al., 2016; Mao et al., 2016), and OVDEval (Yao et al., 2024). For all the benchmarks, we follow standard protocols to ensure a fair comparison. ... OmniLabel is collected from three object detection datasets, i.e., Objects365 (Shao et al., 2019), Open Images (Kuznetsova et al., 2020), and COCO (Lin et al., 2014), and is divided into these three subsets for evaluation. There are 12.2k images, 20.4k object bounding boxes, and 15.8k expressions. ... We randomly select 94k images from the O365 and OI datasets covering all categories, which is a subset of Real-Data. These images, together with target objects and raw expressions, constitute our original training data pairs with an amount of 933k (i.e., A form).
Hardware Specification | Yes | The time cost is reported based on 48 V100 32GB GPUs for our workflow execution. ... For the final result, we train on 16 NVIDIA V100 GPUs for better performance. ... In the ablation study, only a single machine with 8 NVIDIA V100 GPUs is used for training to guarantee impartiality.
Software Dependencies | No | The implementation of Real-Model is based on the MMDetection (Chen et al., 2019) framework and PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | The input size for all the experiments is 1333×800, and the batch size is 4 per GPU. ... During training, we employ the AdamW optimizer (Kingma & Ba, 2015) with a momentum of 0.9 and a weight decay of 0.05. The learning rate setting includes a 1000-iteration warm-up with a start factor of 0.1 and a multi-step schedule with an initial value of 4×10⁻⁶ for 10 epochs. To be specific, the weights used for model initialization are taken from the official repository of mm-GDINO (Zhao et al., 2024).
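The hyperparameters reported above can be collected into an MMDetection-style config fragment (a minimal sketch: the dict field names follow common MMDetection/MMEngine conventions and are assumptions, not the authors' released config; in particular, the milestones of the multi-step decay are not reported in the quoted text):

```python
# Hypothetical MMDetection-style config reflecting the reported
# hyperparameters; field names are assumed, not from the released repo.
optim_wrapper = dict(
    optimizer=dict(
        type='AdamW',
        lr=4e-6,             # reported initial learning rate
        betas=(0.9, 0.999),  # reported momentum of 0.9 maps to beta1
        weight_decay=0.05,   # reported weight decay
    ),
)

param_scheduler = [
    # 1000-iteration warm-up with a start factor of 0.1.
    dict(type='LinearLR', start_factor=0.1, by_epoch=False, begin=0, end=1000),
    # A multi-step decay follows; its milestones are not reported.
]

train_cfg = dict(by_epoch=True, max_epochs=10)
train_dataloader = dict(batch_size=4)  # per GPU; input size is 1333x800
```

This fragment only pins down what the report states; anything else (e.g. AdamW betas beyond beta1, decay milestones) would have to come from the released mm-GDINO-based code.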