reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Empowering Self-Learning of LLMs: Inner Knowledge Explicitation as a Catalyst

Authors: Shijue Huang, Wanjun Zhong, Deng Cai, Fanqi Wan, Chengyi Wang, Mingxuan Wang, Mu Qiao, Ruifeng Xu

AAAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results from six benchmarks demonstrate that Inner Knowledge Explicitation improves reasoning by serving as a more effective prompting method. Additionally, SKE-Learn, based on the verifiability of explicit knowledge, shows consistent performance improvements over multiple self-training iterations, with an average performance increase from 52.79% to 56.54% across all benchmarks. Extensive experiments demonstrate that while LLMs possess extensive knowledge, explicit extraction of inner knowledge still significantly enhances their reasoning performance. Comprehensive experiments across six benchmarks reveal that both the Inner Knowledge Explicitation mechanism and the SKE-Learn self-learning approach elicit reasoning abilities, and provide better interpretability in model knowledge utilization.
Researcher Affiliation	Collaboration	(1) Harbin Institute of Technology, Shenzhen, China; (2) Bytedance Seed, China; (3) Sun Yat-sen University, China; (4) Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies; (5) Peng Cheng Laboratory, Shenzhen, China
Pseudocode	No	The paper describes the methodology with equations for meta-skills (e.g., k = M(q, p_extract)) and training objectives, and illustrates the overall workflow in Figure 2, but does not contain a clearly labeled "Pseudocode" or "Algorithm" block, nor structured steps formatted like code.
Open Source Code	Yes	Code: https://github.com/JoeYing1019/SKE-Learn
Open Datasets	Yes	We fine-tune Llama3-8B (Dubey et al. 2024) on 100,000 instances of Magpie data^1 (Xu et al. 2024b), and derived an instruct model as the backbone model of whole experiments, namely Llama3-8B-Magpie. All models are trained with full parameters for 2 epochs, using batch size of 32, learning rate of 2e-5, with 100 warmup steps. In inference phase, we set all temperatures as 0 to ensure better reproducibility. All experiments are performed on eight NVIDIA A100-SXM4-80GB GPUs. Comprising meta-skill training, we conduct totally four rounds of iterative training. Following Yuan et al. (2024), we mix a proportion of general data^2 and the scoring data from the meta-skill training stage at each round to maintain both general responsiveness and self-assessment capability. Training data details for each round are provided in Table 1. The unsupervised knowledge corpora are sourced from a collection based on Wikipedia^3, and the small proportion of existing questions leveraged in this stage are drawn from a dataset focused on STEM^4. In meta-skill training, we use GPT-4 with version gpt-4-0125-preview as judge model. ... ^1 The first 100,000 entries from https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered. ... ^3 https://huggingface.co/datasets/fmars/wiki_stem. ^4 https://huggingface.co/datasets/cfahlgren1/swti-stem-20k. ... We conducted experiments across a wide range of popular benchmarks, including general examination benchmarks such as MMLU (Hendrycks et al. 2021), AGIEval (Zhong et al. 2024), and ARC (encompassing ARC-E and ARC-C) (Clark et al. 2018). Additionally, we evaluated on the comprehensive reasoning benchmark BBH (Suzgun et al. 2023) and the knowledge question-answering benchmark Natural Questions (NQ)^5 (Kwiatkowski et al. 2019). All evaluation metrics are accuracy, with corresponding evaluation scripts derived from OpenCompass (Contributors 2023). ... ^5 We use the standard open-domain splits as per previous studies (Lewis, Stenetorp, and Riedel 2021; Wang et al. 2024a).
Dataset Splits	Yes	To investigate the development of meta-skills during iterative self-learning, we randomly select 500 data points from six benchmarks, and collect the generated knowledge and reasoning process at each iteration, resulting in a total of 15,000 data points (3,000 per iteration). Then these data are evaluated for quality using GPT-4 via LLM-as-a-Judge prompting (Zheng et al. 2023), in alignment with the meta-skill training stage. ... Table 1: Training data details in our self-learning approach. (This table provides specific numerical details for Knowledge, Reasoning, Scoring, and General data used for M1 meta, M2, M3, M4, e.g., M1 meta: 5,000 Knowledge, 5,000 Reasoning, 4,237 Scoring, 25,000 General).
Hardware Specification	Yes	All experiments are performed on eight NVIDIA A100-SXM4-80GB GPUs.
Software Dependencies	No	The paper mentions using Llama3-8B and GPT-4 with version gpt-4-0125-preview as a judge model, and evaluation scripts derived from Open Compass. However, it does not provide specific version numbers for software libraries or dependencies used for implementation. For instance, it doesn't specify versions for PyTorch, TensorFlow, or other common machine learning frameworks.
Experiment Setup	Yes	All models are trained with full parameters for 2 epochs, using batch size of 32, learning rate of 2e-5, with 100 warmup steps. In inference phase, we set all temperatures as 0 to ensure better reproducibility. ... The score threshold for selecting data is 8.
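
For readers attempting a reproduction, the setup values quoted in the rows above can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code: the names `ReportedSetup` and `select_by_score`, the record layout, and the sample scores are all hypothetical, and the inclusive comparison (score >= 8) is an assumption, since the quoted text states only that the selection threshold is 8. The last lines check the arithmetic in the Dataset Splits row, assuming the 500 points are drawn from each of the six benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup and Hardware rows above."""
    epochs: int = 2
    batch_size: int = 32
    learning_rate: float = 2e-5
    warmup_steps: int = 100
    inference_temperature: float = 0.0
    score_threshold: int = 8
    num_gpus: int = 8


def select_by_score(samples, threshold):
    """Keep samples whose judge score meets the threshold.

    Assumption: the comparison is inclusive (score >= threshold).
    """
    return [s for s in samples if s["score"] >= threshold]


setup = ReportedSetup()

# Hypothetical records, used only to exercise the filter.
samples = [
    {"id": "a", "score": 9},
    {"id": "b", "score": 8},
    {"id": "c", "score": 5},
]
kept_ids = [s["id"] for s in select_by_score(samples, setup.score_threshold)]

# Dataset Splits arithmetic: 500 points per benchmark across six benchmarks,
# and the number of iterations implied by the 15,000 total.
points_per_iteration = 500 * 6
implied_iterations = 15_000 // points_per_iteration
```

Under these assumptions, 3,000 points per iteration and a 15,000-point total imply five evaluated iterations, which would be consistent with the backbone model plus the four training rounds, although the quoted text does not spell this out.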