ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Authors: Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators as BLIP-2 alone for providing the most image information. |
| Researcher Affiliation | Academia | Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny. King Abdullah University of Science and Technology |
| Pseudocode | No | The paper describes the ChatCaptioner method, its components, and prompting strategies in detail within sections 3, 3.1, 3.2, and 3.3. However, it does not present this information in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Code is available at https://github.com/Vision-CAIR/ChatCaptioner. |
| Open Datasets | Yes | We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators as BLIP-2 alone for providing the most image information. ... We randomly selected 100 photos from the COCO (Lin et al., 2014) validation set, 100 artworks from the WikiArt (Saleh & Elgammal, 2015) dataset with ground truth captions from ArtEmis (Achlioptas et al., 2021), 100 internet images from the Conceptual Captions (CC) (Sharma et al., 2018) validation dataset, and 100 images with detailed and long ground truth captions from the Open Image Localized Narratives (OI-LN) (Pont-Tuset et al., 2020) dataset. |
| Dataset Splits | Yes | We randomly selected 100 photos from the COCO (Lin et al., 2014) validation set, 100 artworks from the WikiArt (Saleh & Elgammal, 2015) dataset with ground truth captions from ArtEmis (Achlioptas et al., 2021), 100 internet images from the Conceptual Captions (CC) (Sharma et al., 2018) validation dataset, and 100 images with detailed and long ground truth captions from the Open Image Localized Narratives (OI-LN) (Pont-Tuset et al., 2020) dataset. |
| Hardware Specification | No | The paper mentions using specific models like "gpt-3.5-turbo" and "BLIP-2...containing a FLAN-T5... and a ViT-G/14 model" and discusses API costs, but it does not specify any hardware details such as GPU models, CPU types, or memory amounts used to run the experiments. |
| Software Dependencies | No | The paper specifies using "ChatGPT model gpt-3.5-turbo" and "BLIP-2...containing a FLAN-T5... with 11 billion parameters and a ViT-G/14 model from EVA-CLIP". While these are specific models/APIs, the paper does not list common ancillary software dependencies like Python, PyTorch, CUDA, or other libraries with their respective version numbers, which are typically required for replication. |
| Experiment Setup | Yes | To activate the questioning ability of ChatGPT, we design a prompting system that enables ChatGPT to generate questions based on previous chat logs. Our prompting system for ChatGPT contains three components: a task instruction ρ_task^Q for explaining the task, a chat log ρ_chat to store previous questions and answers, and a question instruction ρ_q for generating high-quality questions. ... In all experiments, BLIP-2 answers 10 questions per image, with the first question being hard-coded as "Describe the image in detail." The remaining 9 questions are from ChatGPT, unless otherwise specified. ... In our BLIP-2 task instruction ρ_task^A. ... We explicitly add a prompt "Avoid asking yes/no questions" in the task instruction ρ_task^Q and the question instruction ρ_q. |
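The experiment setup quoted above describes a simple control flow: a hard-coded first question, then alternating rounds in which ChatGPT asks and BLIP-2 answers, with the growing chat log fed back to the questioner. A minimal sketch of that loop follows; the real system calls the OpenAI API ("gpt-3.5-turbo") and a BLIP-2 checkpoint, but here both are replaced by hypothetical stub functions so the control flow is runnable, and the prompt strings are paraphrases, not the paper's exact wording.

```python
# Paraphrased prompt components (the paper's rho_task^Q and rho_q).
TASK_INSTRUCTION_Q = (
    "I have an image. Ask me questions about the content of this image. "
    "Avoid asking yes/no questions."
)
QUESTION_INSTRUCTION = "Next question. Avoid asking yes/no questions. Question:"

def ask_chatgpt(chat_log):
    """Stand-in for the ChatGPT questioner (real system: gpt-3.5-turbo API)."""
    return f"Placeholder question #{len(chat_log) + 1}?"

def answer_blip2(image, question):
    """Stand-in for the BLIP-2 answerer (FLAN-T5 XXL + ViT-G/14 in the paper)."""
    return f"Placeholder answer to: {question}"

def chat_captioner(image, n_rounds=10):
    """Run the question-answer loop; per the paper, 10 questions per image."""
    chat_log = []
    # The first question is hard-coded, per the experiment setup.
    question = "Describe the image in detail."
    for _ in range(n_rounds):
        answer = answer_blip2(image, question)
        chat_log.append((question, answer))
        # Remaining questions are generated by ChatGPT from the chat log.
        question = ask_chatgpt(chat_log)
    # In the full system, a summarizer prompt would compress this dialog
    # into a single enriched caption; here we return the raw chat log.
    return chat_log

log = chat_captioner("example.jpg")
```

The separation between the questioner and answerer stubs mirrors the paper's design: swapping in the real API calls changes only those two functions, not the loop itself.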