Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study
Authors: Lili Zhao, Yang Wang, Qi Liu, Mengyun Wang, Wei Chen, Zhichao Sheng, Shijin Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on fine-tuning open-source LLMs demonstrate the effectiveness of the double-calibrated strategy in mitigating the reliance of LLMs on local information. For a thorough comparison, we not only employ the public JEC-QA and OpenBookQA datasets, but also construct EG-QA, which contains English-grammar multiple-choice question answering and 14 key knowledge points for assessing self-knowledge and logical reasoning. |
| Researcher Affiliation | Collaboration | Lili Zhao1, Yang Wang1,3, Qi Liu1,2, Mengyun Wang3, Wei Chen1, Zhichao Sheng3, Shijin Wang1,3. 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd |
| Pseudocode | No | The paper describes methods and strategies in prose and provides prompt examples in figures, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or sections. |
| Open Source Code | Yes | https://github.com/LiliizZ/RoSe |
| Open Datasets | Yes | For a thorough comparison, we not only employ the public JEC-QA and OpenBookQA datasets, but also construct EG-QA, which contains English-grammar multiple-choice question answering and 14 key knowledge points for assessing self-knowledge and logical reasoning. [...] we adopt the legal multiple-choice QA dataset (JEC-QA) (Zhong et al., 2020) [...] we also employ the publicly available OpenBookQA dataset (Mihaylov et al., 2018) |
| Dataset Splits | Yes | We adopt 5 tasks as the training set, the sub-knowledge points of those tasks as the In-Distribution (ID) set, and 4 other knowledge points outside of the training tasks as the Out-Of-Distribution (OOD) sets. The detailed statistics of EG-QA are shown in Table 1. Across the whole dataset, there are 26,458 multiple-choice questions in total. In this paper, we mainly adopt EG-QA for full evaluation and fine-tuning. In the evaluation stage, we choose object clauses, which contain 1,645 samples; for fine-tuning, we obtain 18,598 well-calibrated examples through the double-calibrated strategy from GPT-4 Turbo. |
| Hardware Specification | Yes | We fine-tune LLaMA3-8B and Qwen-7B on 4 A100-80G GPUs using parallelization, leveraging the Low-Rank Adapters (LoRA) parameter-efficient tuning method (Hu et al., 2022) with rank 8 and alpha 32 for 10 epochs. For Spark-13B, we update all weights on 8 Ascend 910B 64G NPUs for 10 epochs, adapting to the Ascend development environment (Liao et al., 2021). |
| Software Dependencies | No | The paper mentions models like LLaMA3-8B, Qwen-7B, Spark-13B, and methods like LoRA and AdamW optimizer, but it does not provide specific version numbers for general software libraries or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We fine-tune LLaMA3-8B and Qwen-7B on 4 A100-80G GPUs using parallelization, leveraging the Low-Rank Adapters (LoRA) parameter-efficient tuning method (Hu et al., 2022) with rank 8 and alpha 32 for 10 epochs. To balance training costs, we employ fp16 precision, a gradient accumulation strategy, and limit the maximum sequence length to 2048. The AdamW optimizer (Loshchilov & Hutter, 2019), a 0.1 dropout, and a cosine-annealed learning rate of 1e-4 are used. |
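The hyperparameters reported above can be collected into a single configuration for a reproduction attempt. The sketch below is an illustrative reconstruction, not the authors' released training script: the `CONFIG` keys and the `cosine_annealed_lr` helper are assumptions that implement a standard cosine annealing schedule decaying from the reported 1e-4 to zero over training.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative; the paper does not publish a config file.
CONFIG = {
    "lora_rank": 8,          # LoRA rank (Hu et al., 2022)
    "lora_alpha": 32,        # LoRA alpha
    "epochs": 10,
    "max_length": 2048,      # maximum sequence length
    "dropout": 0.1,
    "learning_rate": 1e-4,   # peak LR, cosine-annealed
    "optimizer": "AdamW",
    "precision": "fp16",
}

def cosine_annealed_lr(step: int, total_steps: int,
                       base_lr: float = CONFIG["learning_rate"]) -> float:
    """Standard cosine annealing from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Example: LR at the start, midpoint, and end of a 100-step schedule.
print(cosine_annealed_lr(0, 100))    # peak: 1e-4
print(cosine_annealed_lr(50, 100))   # midpoint: ~5e-5
print(cosine_annealed_lr(100, 100))  # end: ~0
```

In a real run these values would be passed to a LoRA/PEFT trainer; the helper only makes the reported schedule concrete.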