Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study
Authors: Lili Zhao, Yang Wang, Qi Liu, Mengyun Wang, Wei Chen, Zhichao Sheng, Shijin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on fine-tuning open-source LLMs demonstrate the effectiveness of the double-calibrated strategy in mitigating the reliance of LLMs on local information. For a thorough comparison, we not only employ the public JEC-QA and OpenBookQA datasets, but also construct EG-QA, which contains English-grammar multiple-choice question answering and 14 key knowledge points for assessing self-knowledge and logical reasoning. |
| Researcher Affiliation | Collaboration | Lili Zhao1, Yang Wang1,3, Qi Liu1,2, Mengyun Wang3, Wei Chen1, Zhichao Sheng3, Shijin Wang1,3. 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd |
| Pseudocode | No | The paper describes methods and strategies in prose and provides prompt examples in figures, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections. |
| Open Source Code | Yes | https://github.com/LiliizZ/RoSe |
| Open Datasets | Yes | For a thorough comparison, we not only employ the public JEC-QA and OpenBookQA datasets, but also construct EG-QA, which contains English-grammar multiple-choice question answering and 14 key knowledge points for assessing self-knowledge and logical reasoning. [...] we adopt the legal multiple-choice QA dataset (JEC-QA) (Zhong et al., 2020) [...] we also employ the publicly available OpenBookQA dataset (Mihaylov et al., 2018) |
| Dataset Splits | Yes | We adopt 5 tasks as the training set, the sub-knowledge points of those tasks as the In-Distribution (ID) set, and 4 other knowledge points outside of the training tasks as the Out-Of-Distribution (OOD) sets. The detailed statistics of EG-QA are shown in Table 1. Across the whole dataset, there are 26,458 multiple-choice questions in total. In this paper, we mainly adopt EG-QA for full evaluation and fine-tuning. In the evaluation stage, we choose object clauses, which contain 1,645 samples; for fine-tuning, we obtain 18,598 well-calibrated examples through the double-calibrated strategy from GPT-4 Turbo. |
| Hardware Specification | Yes | We fine-tune LLaMA3-8B and Qwen-7B on 4 A100-80G GPUs using parallelization, leveraging the Low-Rank Adapters (LoRA) parameter-efficient tuning method (Hu et al., 2022) with rank 8 and alpha 32 for 10 epochs. For Spark-13B, we update all weights on 8 Ascend 910B 64G NPUs for 10 epochs, adapting to the Ascend development environment (Liao et al., 2021). |
| Software Dependencies | No | The paper mentions models like LLaMA3-8B, Qwen-7B, Spark-13B, and methods like LoRA and AdamW optimizer, but it does not provide specific version numbers for general software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We fine-tune LLaMA3-8B and Qwen-7B on 4 A100-80G GPUs using parallelization, leveraging the Low-Rank Adapters (LoRA) parameter-efficient tuning method (Hu et al., 2022) with rank 8 and alpha 32 for 10 epochs. To balance training costs, we employ fp16 precision, a gradient accumulation strategy, and limit the maximum sequence length to 2048. The AdamW optimizer (Loshchilov & Hutter, 2019), a 0.1 dropout, and a cosine-annealed learning rate of 1e-4 are used. |
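The hyperparameters reported above can be collected into a single configuration for a reproduction attempt. The sketch below is an illustrative reconstruction, not the authors' released training script: the `CONFIG` keys and the `cosine_annealed_lr` helper are assumptions that implement a standard cosine annealing schedule decaying from the reported 1e-4 to zero over training.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative; the paper does not publish a config file.
CONFIG = {
    "lora_rank": 8,          # LoRA rank (Hu et al., 2022)
    "lora_alpha": 32,        # LoRA alpha
    "epochs": 10,
    "max_length": 2048,      # maximum sequence length
    "dropout": 0.1,
    "learning_rate": 1e-4,   # peak LR, cosine-annealed
    "optimizer": "AdamW",
    "precision": "fp16",
}

def cosine_annealed_lr(step: int, total_steps: int,
                       base_lr: float = CONFIG["learning_rate"]) -> float:
    """Standard cosine annealing from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Example: LR at the start, midpoint, and end of a 100-step schedule.
print(cosine_annealed_lr(0, 100))    # peak: 1e-4
print(cosine_annealed_lr(50, 100))   # midpoint: ~5e-5
print(cosine_annealed_lr(100, 100))  # end: ~0
```

In a real run these values would be passed to a LoRA/PEFT trainer; the helper only makes the reported schedule concrete.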