ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions
Authors: Yunjie Tian, Tianren Ma, Lingxi Xie, Qixiang Ye
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that ChatterBox demonstrates superiority in MCQ both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with logical interactions. We conduct both quantitative and qualitative studies, affirming ChatterBox's superiority over existing models in MCQ. |
| Researcher Affiliation | Collaboration | Yunjie Tian¹*, Tianren Ma¹*, Lingxi Xie², Qixiang Ye¹; ¹University of Chinese Academy of Sciences, ²Huawei Inc. |
| Pseudocode | No | The paper describes methods in paragraph text and uses figures to illustrate architecture, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/sunsmarterjie/ChatterBox |
| Open Datasets | Yes | The newly collected dataset, named CB-300K, spans challenges including probing dialogues with spatial relationship among multiple objects, consistent reasoning, and complex question chains. We will release the CB-300K data to facilitate the research in this direction. We leverage the Visual Genome dataset (Krishna et al. 2017) due to its richness of instance-level relationship annotations, and get assistance from GPT-4. |
| Dataset Splits | Yes | We extracted 800 threads in CB-RGB and 200 threads in CB-CoQ for testing, and the remaining threads are used for training. |
| Hardware Specification | Yes | We utilize 8 NVIDIA A800 GPUs (80GB) for training, making use of DeepSpeed to improve computational efficiency. |
| Software Dependencies | No | The paper mentions several software components, including DeepSpeed, the AdamW optimizer, the LoRA algorithm, the CLIP-L/14 model, the LLaVA-13B model, LLaMA, and the DINO detector, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | In the first stage, we employ the AdamW optimizer (Loshchilov and Hutter 2017) with a learning rate of 0.00005, zero weight decay, a batch size of 6, and a gradient accumulation step of 5. We integrate the WarmupDecayLR learning rate scheduler initialized with a warm-up iteration count of 50. In the second stage, the learning rate is adjusted to 0.00003, while the other training parameters remain unchanged. The data from Groups A, B, and C are sampled at a ratio of 2 : 1 : 10, which aims to maximally preserve the ability of visual grounding that we have established in the first stage. |
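The experiment-setup row can be made concrete with a minimal sketch of the stage-1 hyperparameters. The paper uses DeepSpeed's WarmupDecayLR scheduler; the pure-Python approximation below assumes linear warm-up followed by linear decay, and `TOTAL_ITERS` is an assumed value for illustration (the paper does not report the total iteration count).

```python
# Sketch of the stage-1 optimization schedule described above.
# Assumptions: linear warm-up/decay shape, TOTAL_ITERS = 1000.

BASE_LR = 5e-5       # stage-1 learning rate (stage 2 lowers this to 3e-5)
WARMUP_ITERS = 50    # warm-up iteration count from the paper
TOTAL_ITERS = 1000   # assumed for illustration; not stated in the paper

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_ITERS:
        # Linear warm-up from 0 to BASE_LR over the first 50 iterations.
        return BASE_LR * step / WARMUP_ITERS
    # Linear decay from BASE_LR back to 0 over the remaining iterations.
    return BASE_LR * max(0.0, (TOTAL_ITERS - step) / (TOTAL_ITERS - WARMUP_ITERS))

# Batch size 6 with 5 gradient-accumulation steps gives an
# effective batch size of 30 per optimizer update.
EFFECTIVE_BATCH = 6 * 5

# Group sampling weights A : B : C = 2 : 1 : 10, normalized to probabilities.
RATIO = {"A": 2, "B": 1, "C": 10}
total = sum(RATIO.values())
SAMPLING_PROBS = {k: v / total for k, v in RATIO.items()}
```

This reproduces only the schedule's shape and the data-mixing arithmetic, not the authors' actual training loop; a faithful reproduction would configure `WarmupDecayLR` through a DeepSpeed config instead.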