ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Authors: Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators as BLIP-2 alone for providing the most image information. |
| Researcher Affiliation | Academia | Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, Mohamed Elhoseiny. King Abdullah University of Science and Technology |
| Pseudocode | No | The paper describes the ChatCaptioner method, its components, and prompting strategies in detail within sections 3, 3.1, 3.2, and 3.3. However, it does not present this information in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Code is available at https://github.com/Vision-CAIR/ChatCaptioner. |
| Open Datasets | Yes | We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators as BLIP-2 alone for providing the most image information. ... We randomly selected 100 photos from the COCO (Lin et al., 2014) validation set, 100 artworks from the WikiArt (Saleh & Elgammal, 2015) dataset with ground truth captions from ArtEmis (Achlioptas et al., 2021), 100 internet images from the Conceptual Captions (CC) (Sharma et al., 2018) validation dataset, and 100 images with detailed and long ground truth captions from the Open Image Localized Narratives (OI-LN) (Pont-Tuset et al., 2020) dataset. |
| Dataset Splits | Yes | We randomly selected 100 photos from the COCO (Lin et al., 2014) validation set, 100 artworks from the WikiArt (Saleh & Elgammal, 2015) dataset with ground truth captions from ArtEmis (Achlioptas et al., 2021), 100 internet images from the Conceptual Captions (CC) (Sharma et al., 2018) validation dataset, and 100 images with detailed and long ground truth captions from the Open Image Localized Narratives (OI-LN) (Pont-Tuset et al., 2020) dataset. |
| Hardware Specification | No | The paper mentions using specific models like "gpt-3.5-turbo" and "BLIP-2...containing a FLAN-T5... and a ViT-G/14 model" and discusses API costs, but it does not specify any hardware details such as GPU models, CPU types, or memory amounts used to run the experiments. |
| Software Dependencies | No | The paper specifies using "ChatGPT model gpt-3.5-turbo" and "BLIP-2...containing a FLAN-T5... with 11 billion parameters and a ViT-G/14 model from EVA-CLIP". While these are specific models/APIs, the paper does not list common ancillary software dependencies like Python, PyTorch, CUDA, or other libraries with their respective version numbers, which are typically required for replication. |
| Experiment Setup | Yes | To activate the questioning ability of ChatGPT, we design a prompting system that enables ChatGPT to generate questions based on previous chat logs. Our prompting system for ChatGPT contains three components: a task instruction ρ_task^Q for explaining the task, a chat log ρ_chat to store previous questions and answers, and a question instruction ρ_q for generating high-quality questions. ... In all experiments, BLIP-2 answers 10 questions per image, with the first question being hard-coded as "Describe the image in detail." The remaining 9 questions are from ChatGPT, unless otherwise specified. ... In our BLIP-2 task instruction ρ_task^A. ... We explicitly add a prompt "Avoid asking yes/no questions" in the task instruction ρ_task^Q and the question instruction ρ_q. |
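The experiment setup quoted above describes a simple control flow: a hard-coded first question, then alternating rounds in which ChatGPT asks and BLIP-2 answers, with the growing chat log fed back to the questioner. A minimal sketch of that loop follows; the real system calls the OpenAI API ("gpt-3.5-turbo") and a BLIP-2 checkpoint, but here both are replaced by hypothetical stub functions so the control flow is runnable, and the prompt strings are paraphrases, not the paper's exact wording.

```python
# Paraphrased prompt components (the paper's rho_task^Q and rho_q).
TASK_INSTRUCTION_Q = (
    "I have an image. Ask me questions about the content of this image. "
    "Avoid asking yes/no questions."
)
QUESTION_INSTRUCTION = "Next question. Avoid asking yes/no questions. Question:"

def ask_chatgpt(chat_log):
    """Stand-in for the ChatGPT questioner (real system: gpt-3.5-turbo API)."""
    return f"Placeholder question #{len(chat_log) + 1}?"

def answer_blip2(image, question):
    """Stand-in for the BLIP-2 answerer (FLAN-T5 XXL + ViT-G/14 in the paper)."""
    return f"Placeholder answer to: {question}"

def chat_captioner(image, n_rounds=10):
    """Run the question-answer loop; per the paper, 10 questions per image."""
    chat_log = []
    # The first question is hard-coded, per the experiment setup.
    question = "Describe the image in detail."
    for _ in range(n_rounds):
        answer = answer_blip2(image, question)
        chat_log.append((question, answer))
        # Remaining questions are generated by ChatGPT from the chat log.
        question = ask_chatgpt(chat_log)
    # In the full system, a summarizer prompt would compress this dialog
    # into a single enriched caption; here we return the raw chat log.
    return chat_log

log = chat_captioner("example.jpg")
```

The separation between the questioner and answerer stubs mirrors the paper's design: swapping in the real API calls changes only those two functions, not the loop itself.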