Is Sarcasm Detection a Step-by-Step Reasoning Process in Large Language Models?

Authors: Ben Yao, Yazhou Zhang, Qiuchi Li, Jing Qin

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a comprehensive empirical comparison on four benchmarks, we highlight three key findings: (1) CoC and GoC show superior performance with more advanced models like GPT-4 and Claude 3.5, with an improvement of 3.5%. (2) ToC significantly outperforms other methods when smaller LLMs are evaluated, boosting the F1 score by 29.7% over the best baseline. (3) Our proposed framework consistently pushes the state of the art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 scores across the four datasets. This demonstrates the effectiveness and stability of the proposed framework.
Researcher Affiliation | Academia | University of Copenhagen; Tianjin University; The Hong Kong Polytechnic University
Pseudocode | No | The paper describes the CoC, GoC, BoC, and ToC methods using descriptive text and some mathematical formulations, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | Code: https://github.com/qiuchili/llm_sarcasm_detection.git
Open Datasets | Yes | Four benchmarking datasets are selected as the experimental beds, viz. IAC-V1 (Lukin and Walker 2013), IAC-V2 (Oraby et al. 2016), SemEval-2018 Task 3 (Van Hee, Lefever, and Hoste 2018), and MUStARD (Castro et al. 2019).
Dataset Splits | No | The paper uses four benchmarking datasets: IAC-V1 (Lukin and Walker 2013), IAC-V2 (Oraby et al. 2016), SemEval-2018 Task 3 (Van Hee, Lefever, and Hoste 2018), and MUStARD (Castro et al. 2019). While these are standard benchmark datasets, the paper does not explicitly state the training, validation, or test split percentages or sample counts needed for reproduction.
Hardware Specification | No | The paper mentions implementing methods for GPT-4o, Claude 3.5 Sonnet, Llama 3-8B, and Qwen2-7B using their respective APIs or Hugging Face Transformers. However, it does not provide specific details about the hardware (e.g., GPU models, CPU models, memory) used for running the experiments or training.
Software Dependencies | No | The paper mentions using the official Python API libraries for OpenAI and Anthropic, and the Hugging Face Transformers library for the Llama and Qwen models. However, it does not specify version numbers for any of these software dependencies, nor does it mention the Python version used.
Experiment Setup | Yes | For ToC, during training, the original LLM (Llama 3-8B and Qwen2-7B) weights are frozen, while the projection layers are trainable (lr = 0.0001, epochs = 20).
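The reported setup (frozen LLM weights, trainable projection layers, lr = 0.0001, 20 epochs) can be sketched in PyTorch. This is a minimal illustration, not the paper's code: the tiny `backbone` module, layer sizes, optimizer choice, and dummy data are all assumptions standing in for the actual Llama 3-8B / Qwen2-7B backbone and the ToC projection layers.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the frozen LLM backbone; sizes are illustrative.
backbone = nn.Linear(16, 16)
for p in backbone.parameters():
    p.requires_grad = False  # freeze the original LLM weights

# Trainable projection layer mapping hidden states to a binary sarcasm label.
projection = nn.Linear(16, 2)

# Only the projection parameters are passed to the optimizer (lr = 0.0001 as reported).
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch of features and labels, purely for the sketch.
x = torch.randn(8, 16)
y = torch.randint(0, 2, (8,))

for _ in range(20):  # epochs = 20 as reported
    optimizer.zero_grad()
    logits = projection(backbone(x))
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

# Sanity check: the frozen backbone never accumulates gradients.
assert all(p.grad is None for p in backbone.parameters())
assert all(p.grad is not None for p in projection.parameters())
```

Freezing via `requires_grad = False` and optimizing only the projection parameters is the standard way to express this setup; the paper does not detail the exact shape or number of its projection layers.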