ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions

Authors: Yunjie Tian, Tianren Ma, Lingxi Xie, Qixiang Ye

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions. We conduct both quantitative and qualitative studies, affirming ChatterBox's superiority over existing models in MCQ.
Researcher Affiliation | Collaboration | Yunjie Tian1*, Tianren Ma1*, Lingxi Xie2, Qixiang Ye1; 1University of Chinese Academy of Sciences, 2Huawei Inc.
Pseudocode | No | The paper describes methods in paragraph text and uses figures to illustrate the architecture, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/sunsmarterjie/ChatterBox
Open Datasets | Yes | The newly collected dataset, named CB-300K, spans challenges including probing dialogues with spatial relationships among multiple objects, consistent reasoning, and complex question chains. We will release the CB-300K data to facilitate research in this direction. We leverage the Visual Genome dataset (Krishna et al. 2017) due to its richness of instance-level relationship annotations, and get assistance from GPT-4.
Dataset Splits | Yes | We extracted 800 threads in CB-RGB and 200 threads in CB-CoQ for testing, and the remaining threads are used for training.
Hardware Specification | Yes | We utilize 8 NVIDIA A800 GPUs (80GB) for training, making use of DeepSpeed to improve computational efficiency.
Software Dependencies | No | The paper mentions several software components (DeepSpeed, the AdamW optimizer, the LoRA algorithm, the CLIP-L/14 model, the LLaVA-13B model, LLaMA, and the DINO detector) but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | In the first stage, we employ the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 0.00005, zero weight decay, a batch size of 6, and a gradient accumulation step of 5. We integrate the WarmupDecayLR learning rate scheduler initialized with a warm-up iteration count of 50. In the second stage, the learning rate is adjusted to 0.00003, while the other training parameters remain unchanged. The data from Groups A, B, and C are sampled at a ratio of 2 : 1 : 10, which aims to maximally preserve the ability of visual grounding established in the first stage.
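The reported schedule (base LR 0.00005, 50 warm-up iterations, WarmupDecayLR) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the total step count and the linear-decay shape after warm-up are assumptions (DeepSpeed's WarmupDecayLR decays linearly toward zero after the warm-up phase).

```python
def warmup_decay_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=10_000):
    """Learning rate at a given step under linear warm-up then linear decay.

    base_lr (0.00005) and warmup_steps (50) are taken from the quoted setup;
    total_steps and the linear decay are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to base_lr over the first warmup_steps steps.
        return base_lr * step / warmup_steps
    # Linear decay from base_lr to 0 over the remaining steps.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)


# Effective samples per optimizer step, from the quoted setup:
# per-GPU batch 6 x gradient accumulation 5 x 8 GPUs.
effective_batch = 6 * 5 * 8  # 240
```

Note that with gradient accumulation, the scheduler is typically stepped once per optimizer update (every 5 micro-batches here), not once per forward pass.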