Utterance-level Emotion Recognition in Conversation with Conversation-level Supervision
Authors: Ximing Li, Yuanchao Dai, Zhiyao Yang, Jinjin Chi, Wanfu Gao, Lin Yuanbo Wu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that the proposed DERC-PL can be on par with existing weakly-supervised learning baselines and supervised learning ERC methods. We conduct extensive experiments to validate the effectiveness of DERC-PL and provide empirical evidence that coarse-grained DERC can be a strong candidate for fine-grained ERC. In this section, we conduct experiments to evaluate DERC-PL, and attempt to answer the following questions: Q1: Can DERC-PL compete with the existing weakly-supervised learning methods in DERC settings? Q2: Can DERC-PL compete with the existing supervised learning ERC methods? |
| Researcher Affiliation | Academia | 1 College of Computer Science and Technology, Jilin University, China; 2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China; 3 Swansea University, United Kingdom |
| Pseudocode | Yes | Algorithm 1: Computation of DERC-PL |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We employ three benchmark ERC datasets: MELD (Poria et al. 2019), IEMOCAP (Busso et al. 2008), and EmoryNLP (Zahiri and Choi 2018). Statistics of these datasets are listed in Table 3. |
| Dataset Splits | Yes | Statistics of these datasets are listed in Table 3. For each dataset, we generate its DERC version by directly adding the utterance-level emotions into the corresponding conversation-level emotion sets. Table 3 (statistics of the benchmark datasets): IEMOCAP — 151 conversations (120 train / 31 test) and 7,433 utterances (5,810 train / 1,623 test); EmoryNLP — 827 conversations (659 train / 89 validation / 79 test) and 9,489 utterances (7,551 train / 954 validation / 984 test); MELD — 1,432 conversations (1,039 train / 114 validation / 280 test) and 13,708 utterances (9,989 train / 1,109 validation / 2,610 test). |
| Hardware Specification | Yes | Our experiments are conducted on Ubuntu 20.04 with a single RTX-4090 GPU with 24G memory. |
| Software Dependencies | No | The paper mentions "Ubuntu 20.04" and the "AdamW optimizer" but does not provide specific version numbers for other key software components like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries used in the implementation. |
| Experiment Setup | Yes | For BERT-based methods (i.e., BERT-base+MLP, RGAT (Ishiwatari et al. 2020)) / RoBERTa-based methods (i.e., SACL (Hu et al. 2023), DualGAT (Zhang, Chen, and Chen 2023)), we use the AdamW optimizer (Loshchilov and Hutter 2019), with learning rates of 2e-5 and 1e-4, respectively. The layer dropout rate, batch size, and the number of epochs T are configured to 0.1/0.2, 16/16, and 30/20, respectively. The hyperparameter α is adjusted to 0.8 for IEMOCAP (Busso et al. 2008), 0.3 for EmoryNLP (Zahiri and Choi 2018), and 0.4 for MELD (Poria et al. 2019). |
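For reference, the hyperparameters reported in the Experiment Setup row can be collected into a small configuration helper. This is a sketch: the function name `get_config` and the dictionary layout are our own illustration, and only the numeric values come from the paper's description.

```python
# Sketch of the reported DERC-PL training configuration.
# `get_config`, the dict layout, and key names are illustrative assumptions;
# the values (learning rate, dropout, batch size, epochs, alpha) are the
# ones quoted from the paper's Experiment Setup description.

def get_config(backbone: str, dataset: str) -> dict:
    """Return the reported hyperparameters for a backbone family and dataset.

    backbone: "bert" (BERT-base+MLP, RGAT) or "roberta" (SACL, DualGAT).
    dataset:  "IEMOCAP", "EmoryNLP", or "MELD".
    """
    # Per-backbone settings: learning rate, layer dropout, batch size, epochs T.
    by_backbone = {
        "bert":    {"lr": 2e-5, "dropout": 0.1, "batch_size": 16, "epochs": 30},
        "roberta": {"lr": 1e-4, "dropout": 0.2, "batch_size": 16, "epochs": 20},
    }
    # Dataset-specific hyperparameter alpha.
    alpha = {"IEMOCAP": 0.8, "EmoryNLP": 0.3, "MELD": 0.4}

    cfg = dict(by_backbone[backbone])
    cfg["optimizer"] = "AdamW"  # AdamW is used for both backbone families
    cfg["alpha"] = alpha[dataset]
    return cfg

print(get_config("roberta", "MELD")["lr"])  # 0.0001
```

This keeps the "0.1/0.2, 16/16, and 30/20" shorthand from the paper explicit: the first value of each pair applies to the BERT-based methods, the second to the RoBERTa-based ones.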