Tuning-Free Accountable Intervention for LLM Deployment – a Metacognitive Approach

Authors: Zhen Tan, Jie Peng, Song Wang, Lijie Hu, Tianlong Chen, Huan Liu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments on real-world datasets with LLM backbones of various sizes and architectures, and the results demonstrate that our intervention consistently improves inference-time predictions.
Researcher Affiliation Academia Zhen Tan (1), Jie Peng (2), Song Wang (3), Lijie Hu (4), Tianlong Chen (5), Huan Liu (1); (1) Arizona State University, (2) University of Science and Technology of China, (3) University of Virginia, (4) King Abdullah University of Science and Technology, (5) University of North Carolina at Chapel Hill
Pseudocode No The paper describes the methodology in Section 3 and illustrates it with Figure 2 and Figure 3, but does not include a dedicated pseudocode or algorithm block.
Open Source Code Yes Code: https://github.com/Zhen-Tan-dmml/CLEAR.git
Open Datasets Yes Our experiments are conducted on three datasets, including two widely-used real-world datasets, CEBaB (Abraham et al. 2022) and IMDB-C (Tan et al. 2023b), and a self-curated dataset ASAP-C.
Dataset Splits Yes Table 1: Statistics of experimented datasets and concepts.
CEBaB (5-way classification): Train / Dev / Test = 1755 / 1673 / 1685
IMDB-C (2-way classification): Train / Dev / Test = 100 / 50 / 50
ASAP-C (regression): Train / Dev / Test = 1005 / 281 / 283
Hardware Specification No The paper does not provide specific hardware details used for running its experiments. It mentions training models but gives no information on GPUs, CPUs, or other computing resources.
Software Dependencies No The paper mentions LLM backbones like BERT (Devlin et al. 2018), OPT (Zhang et al. 2022), and T5 (Raffel et al. 2020) with citations, but it does not provide specific version numbers for these or any other software dependencies, libraries, or programming languages used.
Experiment Setup No The paper states "We adopt an early stopping strategy, as per Abraham et al. (2022), to mitigate overfitting, with further details provided in Appendix B and G." but does not provide specific hyperparameters such as learning rate, batch size, or optimizer settings in the main text.