Structural Entropy Guided Probabilistic Coding
Authors: Xiang Huang, Hao Peng, Li Sun, Hui Lin, Chunyang Liu, Jiang Cao, Philip S. Yu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC compared to other state-of-the-art models in terms of effectiveness, generalization capability, and robustness to label noise. |
| Researcher Affiliation | Collaboration | Xiang Huang (1), Hao Peng (1,2)*, Li Sun (3), Hui Lin (4), Chunyang Liu (5), Jiang Cao (6), Philip S. Yu (7). 1: Beihang University; 2: Guangdong Laboratory of Artificial Intelligence and Digital Economy; 3: North China Electric Power University; 4: China Academy of Electronics and Information Technology; 5: Didi Chuxing; 6: Academy of Military Sciences; 7: University of Illinois Chicago |
| Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/SELGroup/SEPC |
| Open Datasets | Yes | Following Hu et al. (2024), we evaluate SEPC on 10 classification task datasets and 2 regression task datasets. For classification tasks, 7 datasets about tweet semantic analysis are used: Emoji (Barbieri et al. 2018), Emotion (Mohammad et al. 2018), Hate (Basile et al. 2019), Irony (Van Hee, Lefever, and Hoste 2018), Offensive (Zampieri et al. 2019), Sentiment (Rosenthal, Farra, and Nakov 2017), and Stance (Mohammad et al. 2016). Additionally, we also experiment on three emotion-related datasets from different domains: ISEAR (Scherer and Wallbott 1994), MELD (Poria et al. 2019), and Go Emotions (Demszky et al. 2020). For regression tasks, we utilize STS-B (Cer et al. 2017) and Claire (Roth, Anthonio, and Sauer 2022) for evaluation. |
| Dataset Splits | No | The paper mentions using specific datasets and a 'test set' in its evaluation. It also describes varying the training data percentage (e.g., 'randomly select 90%, 70%, 50%, and 30% of the training data'). However, it does not provide explicit details about the initial train/validation/test splits (e.g., '80/10/10 split' or specific sample counts) for each of the listed datasets, nor does it explicitly reference standard splits for all of them within this text. |
| Hardware Specification | Yes | All experiments are conducted on two NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The training epoch number is 20, and the maximum patience for early stopping is 5 epochs. The learning rate is 5e-5 in all datasets. A linear learning rate warm-up is applied over the first 10% of the training data. The batch size is uniformly set to 128. The trade-off parameter ω and the weight parameter ϱ are searched from {1e-2, 1e-1, 1, 10}. |
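The reported experiment setup can be sketched as a configuration plus a grid search over the two tuned parameters. This is a minimal illustrative sketch: the `TrainConfig` dataclass and variable names are hypothetical (not from the paper's released code); only the values come from the setup described above.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class TrainConfig:
    # Values reported in the paper's experiment setup; the class itself
    # is an illustrative container, not the authors' actual code.
    epochs: int = 20              # total training epochs
    patience: int = 5             # early-stopping patience, in epochs
    learning_rate: float = 5e-5   # same for all datasets
    warmup_fraction: float = 0.10 # linear LR warm-up over first 10% of training
    batch_size: int = 128

# The trade-off parameter omega and weight parameter rho are each
# searched over the same four values.
SEARCH_GRID = [1e-2, 1e-1, 1, 10]

configs = [(TrainConfig(), omega, rho)
           for omega, rho in product(SEARCH_GRID, SEARCH_GRID)]
print(len(configs))  # 16 (omega, rho) combinations per dataset
```

A full sweep therefore trains 16 runs per dataset, with early stopping capping each run at 5 stagnant epochs.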