Is Sarcasm Detection a Step-by-Step Reasoning Process in Large Language Models?

Authors: Ben Yao, Yazhou Zhang, Qiuchi Li, Jing Qin

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a comprehensive empirical comparison on four benchmarks, we highlight three key findings: (1) CoC and GoC show superior performance with more advanced models like GPT-4 and Claude 3.5, with an improvement of 3.5%. (2) ToC significantly outperforms other methods when smaller LLMs are evaluated, boosting the F1 score by 29.7% over the best baseline. (3) Our proposed framework consistently pushes the state of the art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 scores across the four datasets. This demonstrates the effectiveness and stability of the proposed framework.
Researcher Affiliation | Academia | University of Copenhagen; Tianjin University; The Hong Kong Polytechnic University
Pseudocode | No | The paper describes the CoC, GoC, BoC, and ToC methods using descriptive text and some mathematical formulations, but it does not include explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | Code: https://github.com/qiuchili/llm_sarcasm_detection.git
Open Datasets | Yes | Four benchmarking datasets are selected as the experimental beds, viz. IAC-V1 (Lukin and Walker 2013), IAC-V2 (Oraby et al. 2016), SemEval-2018 Task 3 (Van Hee, Lefever, and Hoste 2018), and MUStARD (Castro et al. 2019).
Dataset Splits | No | The paper uses four benchmarking datasets: IAC-V1 (Lukin and Walker 2013), IAC-V2 (Oraby et al. 2016), SemEval-2018 Task 3 (Van Hee, Lefever, and Hoste 2018), and MUStARD (Castro et al. 2019). While these are standard benchmark datasets, the paper does not explicitly state the training, validation, or test split percentages or sample counts needed for reproduction.
Hardware Specification | No | The paper mentions implementing methods for GPT-4o, Claude 3.5 Sonnet, Llama 3-8B, and Qwen2-7B using their respective APIs or Hugging Face Transformers. However, it does not provide specific details about the hardware (e.g., GPU models, CPU models, memory) used for running the experiments or training.
Software Dependencies | No | The paper mentions using the official Python API libraries for OpenAI and Anthropic, and the Hugging Face Transformers library for the Llama and Qwen models. However, it does not specify version numbers for any of these software dependencies, nor does it mention the Python version used.
Experiment Setup | Yes | For ToC, during training, the original LLM (Llama 3-8B and Qwen2-7B) weights are frozen, while the projection layers are trainable (lr = 0.0001, epochs = 20).
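The reported setup (frozen LLM weights, trainable projection layers, lr = 0.0001, 20 epochs) can be sketched in PyTorch. This is a minimal illustration, not the paper's code: the tiny `backbone` module, layer sizes, optimizer choice, and dummy data are all assumptions standing in for the actual Llama 3-8B / Qwen2-7B backbone and the ToC projection layers.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the frozen LLM backbone; sizes are illustrative.
backbone = nn.Linear(16, 16)
for p in backbone.parameters():
    p.requires_grad = False  # freeze the original LLM weights

# Trainable projection layer mapping hidden states to a binary sarcasm label.
projection = nn.Linear(16, 2)

# Only the projection parameters are passed to the optimizer (lr = 0.0001 as reported).
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch of features and labels, purely for the sketch.
x = torch.randn(8, 16)
y = torch.randint(0, 2, (8,))

for _ in range(20):  # epochs = 20 as reported
    optimizer.zero_grad()
    logits = projection(backbone(x))
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

# Sanity check: the frozen backbone never accumulates gradients.
assert all(p.grad is None for p in backbone.parameters())
assert all(p.grad is not None for p in projection.parameters())
```

Freezing via `requires_grad = False` and optimizing only the projection parameters is the standard way to express this setup; the paper does not detail the exact shape or number of its projection layers.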