Re-Aligning Language to Visual Objects with an Agentic Workflow

Authors: Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expressions and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. With automatic VL refinement, our Real-LOD workflow reveals a potential to preserve data quality while scaling up data quantity, further improving LOD performance from a data-alignment perspective.
Researcher Affiliation | Collaboration | 1. VCIP, Nankai University; 2. SenseTime Research
Pseudocode | Yes | Algorithm 1: Pseudocode of Real-LOD. We show the detailed code of our workflow, which flexibly leverages tools to re-align raw expressions to given objects.
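For readers without access to the paper, the cited Algorithm 1 can be illustrated with a generic agentic re-alignment loop (a minimal, hypothetical sketch: the class names, function names, and stub behavior below are assumptions for illustration, not the authors' actual pseudocode or tool set):

```python
from dataclasses import dataclass

# Illustrative sketch of an agentic re-alignment loop: an LLM plans which
# tool to call (e.g., a VLM captioner), acts, then reflects on the result,
# cycling until the expression is judged aligned with the target object.
# All names here are hypothetical.

@dataclass
class Plan:
    tool: str

@dataclass
class Reflection:
    aligned: bool

class StubLLM:
    """Toy stand-in for the agent's LLM (planning / reflection / rewriting)."""
    def plan(self, image, obj, expr):
        return Plan(tool="caption")  # always query the captioner in this toy

    def reflect(self, expr, feedback):
        # "Aligned" here simply means the expression mentions the feedback.
        return Reflection(aligned=feedback in expr)

    def rewrite(self, expr, feedback):
        return f"{expr}, specifically {feedback}"

def realign(image, obj, raw_expr, llm, tools, max_cycles=3):
    """Plan -> act (tool call) -> reflect, cycling until aligned."""
    expr = raw_expr
    for _ in range(max_cycles):
        plan = llm.plan(image, obj, expr)
        feedback = tools[plan.tool](image, obj, expr)
        if llm.reflect(expr, feedback).aligned:
            return expr
        expr = llm.rewrite(expr, feedback)
    return expr

# Toy "VLM" tool that just returns the object category as a caption.
tools = {"caption": lambda image, obj, expr: obj}
result = realign(image=None, obj="red kite", raw_expr="a toy in the sky",
                 llm=StubLLM(), tools=tools)
```

The cycle structure (plan, tool call, reflect, rewrite) is the only part taken from the report's description of the workflow; everything else is stub scaffolding.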
Open Source Code | Yes | The code is available at https://github.com/FishAndWasabi/Real-LOD.
Open Datasets | Yes | We randomly select images from the Objects365 (Shao et al., 2019), Open Images (Kuznetsova et al., 2020), and LVIS (Gupta et al., 2019) datasets with all categories covered.
Dataset Splits | Yes | The benchmarks we use for evaluation are OmniLabel (Schulter et al., 2023), DOD (Xie et al., 2023), RefCOCO/g/+ (i.e., RefCOCO, RefCOCOg, RefCOCO+) (Yu et al., 2016; Mao et al., 2016), and OVDEval (Yao et al., 2024). For all the benchmarks, we follow standard protocols to ensure a fair comparison. ... OmniLabel is collected from three object detection datasets, i.e., Objects365 (Shao et al., 2019), Open Images (Kuznetsova et al., 2020), and COCO (Lin et al., 2014), and is divided into these three subsets for evaluation. There are 12.2k images, 20.4k object bounding boxes, and 15.8k expressions. ... We randomly select 94k images from the O365 and OI datasets covering all categories, which is a subset of Real-Data. These images, together with target objects and raw expressions, constitute our original training data pairs with an amount of 933k (i.e., A form).
Hardware Specification | Yes | The time cost is reported based on 48 V100 32GB GPUs for our workflow execution. ... For the final result, we train on 16 NVIDIA V100 GPUs for better performance. ... In the ablation study, only a single machine with 8 NVIDIA V100 GPUs is used for training to guarantee impartiality.
Software Dependencies | No | The implementation of Real-Model is based on the MMDetection (Chen et al., 2019) framework and PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | The input size for all the experiments is 1333×800, and the batch size is 4 per GPU. ... During training, we employ the AdamW optimizer (Kingma & Ba, 2015) with a momentum of 0.9 and a weight decay of 0.05. The learning rate setting includes a 1000-iteration warm-up with a start factor of 0.1 and a multi-step schedule with an initial value of 4×10⁻⁶ for 10 epochs. To be specific, the weights used for model initialization are taken from the official repository of mm-GDINO (Zhao et al., 2024).
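The hyperparameters reported above can be collected into an MMDetection-style config fragment (a minimal sketch: the dict field names follow common MMDetection/MMEngine conventions and are assumptions, not the authors' released config; in particular, the milestones of the multi-step decay are not reported in the quoted text):

```python
# Hypothetical MMDetection-style config reflecting the reported
# hyperparameters; field names are assumed, not from the released repo.
optim_wrapper = dict(
    optimizer=dict(
        type='AdamW',
        lr=4e-6,             # reported initial learning rate
        betas=(0.9, 0.999),  # reported momentum of 0.9 maps to beta1
        weight_decay=0.05,   # reported weight decay
    ),
)

param_scheduler = [
    # 1000-iteration warm-up with a start factor of 0.1.
    dict(type='LinearLR', start_factor=0.1, by_epoch=False, begin=0, end=1000),
    # A multi-step decay follows; its milestones are not reported.
]

train_cfg = dict(by_epoch=True, max_epochs=10)
train_dataloader = dict(batch_size=4)  # per GPU; input size is 1333x800
```

This fragment only pins down what the report states; anything else (e.g. AdamW betas beyond beta1, decay milestones) would have to come from the released mm-GDINO-based code.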