ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions

Authors: Yunjie Tian, Tianren Ma, Lingxi Xie, Qixiang Ye

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions. We conduct both quantitative and qualitative studies, affirming ChatterBox's superiority over existing models in MCQ.
Researcher Affiliation | Collaboration | Yunjie Tian1*, Tianren Ma1*, Lingxi Xie2, Qixiang Ye1; 1University of Chinese Academy of Sciences, 2Huawei Inc.
Pseudocode | No | The paper describes methods in paragraph text and uses figures to illustrate the architecture, but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/sunsmarterjie/ChatterBox
Open Datasets | Yes | The newly collected dataset, named CB-300K, spans challenges including probing dialogues with spatial relationships among multiple objects, consistent reasoning, and complex question chains. We will release the CB-300K data to facilitate research in this direction. We leverage the Visual Genome dataset (Krishna et al. 2017) due to its richness of instance-level relationship annotations, and get assistance from GPT-4.
Dataset Splits | Yes | We extracted 800 threads in CB-RGB and 200 threads in CB-CoQ for testing, and the remaining threads are used for training.
Hardware Specification | Yes | We utilize 8 NVIDIA A800 GPUs (80GB) for training, making use of DeepSpeed to improve computational efficiency.
Software Dependencies | No | The paper mentions several software components (DeepSpeed, the AdamW optimizer, the LoRA algorithm, the CLIP-L/14 model, the LLaVA-13B model, LLaMA, and the DINO detector) but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | In the first stage, we employ the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 0.00005, zero weight decay, a batch size of 6, and a gradient accumulation step of 5. We integrate the WarmupDecayLR learning rate scheduler initialized with a warm-up iteration count of 50. In the second stage, the learning rate is adjusted to 0.00003, while the other training parameters remain unchanged. The data from Groups A, B, and C are sampled at a ratio of 2 : 1 : 10, which aims to maximally preserve the ability of visual grounding established in the first stage.
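The reported schedule (base LR 0.00005, 50 warm-up iterations, WarmupDecayLR) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the total step count and the linear-decay shape after warm-up are assumptions (DeepSpeed's WarmupDecayLR decays linearly toward zero after the warm-up phase).

```python
def warmup_decay_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=10_000):
    """Learning rate at a given step under linear warm-up then linear decay.

    base_lr (0.00005) and warmup_steps (50) are taken from the quoted setup;
    total_steps and the linear decay are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to base_lr over the first warmup_steps steps.
        return base_lr * step / warmup_steps
    # Linear decay from base_lr to 0 over the remaining steps.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - frac)


# Effective samples per optimizer step, from the quoted setup:
# per-GPU batch 6 x gradient accumulation 5 x 8 GPUs.
effective_batch = 6 * 5 * 8  # 240
```

Note that with gradient accumulation, the scheduler is typically stepped once per optimizer update (every 5 micro-batches here), not once per forward pass.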