Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection
Authors: Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin, Prayag Tiwari
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Commander-GPT on the MMSD and MMSD 2.0 benchmarks, comparing five prompting strategies. Experimental results show that our framework achieves 4.4% and 8.5% improvements in F1 score over state-of-the-art (SoTA) baselines on average, demonstrating its effectiveness. |
| Researcher Affiliation | Academia | Yazhou Zhang EMAIL School of Computer Science and Technology, Tianjin University Chunwang Zou EMAIL Software Engineering College, Zhengzhou University of Light Industry Bo Wang EMAIL School of Computer Science and Technology, Tianjin University Jing Qin* EMAIL Center for Smart Health, School of Nursing, The Hong Kong Polytechnic University Prayag Tiwari EMAIL School of Information Technology, Halmstad University |
| Pseudocode | Yes | A.2 Algorithm The algorithm is shown in Alg. 1. Algorithm 1 Commander-GPT: Modular Multimodal Sarcasm Understanding |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | In this section, we conduct comprehensive experiments on two widely-used multimodal sarcasm detection benchmarks, MMSD (Cai et al., 2019) and MMSD 2.0 (Qin et al., 2023). |
| Dataset Splits | Yes | Table 1: Statistics of the MMSD and MMSD 2.0 datasets. MMSD — Train: 19,816, Validation: 2,410, Test: 2,409, Sarcastic: 10,560, Non-sarcastic: 14,075, Source: Twitter. MMSD 2.0 — Train: 19,816, Validation: 2,410, Test: 2,409, Sarcastic: 11,651, Non-sarcastic: 12,980, Source: Twitter. |
| Hardware Specification | Yes | All experiments were conducted on a server equipped with two NVIDIA RTX 4090 GPUs and 256GB RAM. |
| Software Dependencies | No | Commander-GPT was implemented using PyTorch, Hugging Face Transformers, and the OpenMMLab toolkit. The paper mentions software tools but does not provide specific version numbers for them. |
| Experiment Setup | Yes | For supervised components (e.g., the BERT-based commander and the routing classifier), we fine-tuned for 10 epochs (approximately 12 hours in total). We used the AdamW optimizer with an initial learning rate of 2×10⁻⁵, batch size of 64, maximum sequence length of 512, and weight decay of 0.01. A linear warm-up and decay scheduler was applied. Early stopping was triggered if the validation F1 score did not improve for 3 consecutive epochs. |
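The early-stopping rule reported in the setup above (halt if validation F1 fails to improve for 3 consecutive epochs) can be sketched as a small stdlib-only helper. This is an illustrative sketch, not the authors' code; the class name and API are assumptions.

```python
class EarlyStopper:
    """Stop training when a monitored metric fails to improve for `patience` epochs."""

    def __init__(self, patience: int = 3):
        self.patience = patience          # consecutive non-improving epochs allowed
        self.best = float("-inf")         # best validation F1 seen so far
        self.bad_epochs = 0               # epochs since the last improvement

    def step(self, val_f1: float) -> bool:
        """Record one epoch's validation F1; return True if training should stop."""
        if val_f1 > self.best:
            self.best = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop one would call `stopper.step(f1)` once per epoch after evaluation and break when it returns `True`, matching the patience-3 criterion the paper describes.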