Enhancing LLMs via High-Knowledge Data Selection

Authors: Feiyu Duan, Xuemiao Zhang, Sirui Wang, Haoran Que, Yuqi Liu, Wenge Rong, Xunliang Cai

AAAI 2025

Reproducibility assessment: each row gives a variable, its result, and the supporting LLM response.
Research Type: Experimental. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance on knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model. (Section 3, Experiments)
Researcher Affiliation: Collaboration. (1) Sino-French Engineer School, Beihang University, Beijing, China; (2) Peking University, Beijing, China; (3) Department of Automation, Tsinghua University, Beijing, China; (4) School of Computer Science and Engineering, Beihang University, Beijing, China; (5) Meituan, Beijing, China.
Pseudocode: No. The paper describes the methodology, including definitions and formulas (e.g., score(x) = d(x) · ln(c(x) + 1)), and outlines steps in prose and diagrams (Figure 1), but does not contain a clearly labeled pseudocode or algorithm block.
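The quoted scoring rule combines a density-like term d(x) with a log-damped count term c(x). A minimal sketch of that formula (the meanings of d and c beyond the formula itself, and the example values, are assumptions, not the paper's implementation):

```python
import math

def knowledge_score(d_x: float, c_x: float) -> float:
    """score(x) = d(x) * ln(c(x) + 1).

    The +1 inside the log keeps the score finite (zero) when a
    sample has c(x) = 0, i.e., contains no counted elements.
    """
    return d_x * math.log(c_x + 1)

# Illustrative values only: a sample with d(x) = 0.5 and c(x) = 10.
score = knowledge_score(0.5, 10)
```

The log damping means doubling c(x) adds less to the score than doubling d(x), so the density term dominates ranking for samples with many counted elements.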
Open Source Code: No. The paper does not provide an explicit statement about releasing source code, a link to a code repository, or any mention of code in supplementary materials.
Open Datasets: Yes. We utilize the Pile (Gao et al. 2020) and Wudao (Yuan et al. 2021) datasets as our pre-training dataset for training a bilingual language model.
Dataset Splits: Yes. For the Pile dataset, we extract 10K samples from each subset to serve as a validation dataset, ensuring these samples are not encountered during training. Since the Wudao dataset has no predefined subset split, we divide it according to the categories of its knowledge elements and then apply the same validation procedure as for the Pile dataset, extracting samples for evaluation.
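The described split (hold out 10K samples per subset, never shown during training) can be sketched as below; the function name and seed handling are illustrative, not taken from the paper:

```python
import random

def split_subset(samples, n_valid=10_000, seed=0):
    """Hold out n_valid samples for validation; return (train, valid).

    Shuffling with a fixed seed keeps the split reproducible, and the
    held-out samples never appear in the training portion.
    """
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    valid = [samples[i] for i in idx[:n_valid]]
    train = [samples[i] for i in idx[n_valid:]]
    return train, valid

# Applied per Pile subset (or per Wudao knowledge category):
train, valid = split_subset(list(range(50_000)), n_valid=10_000)
```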
Hardware Specification: Yes. We use the Megatron framework to train our model on 16 A100 GPUs with fp16, which takes 21 hours to complete training.
Software Dependencies: No. The paper mentions using the Megatron framework, GPT-4, and a BERT-based model but does not provide specific version numbers for these software dependencies.
Experiment Setup: Yes. We train a 1.1B-parameter model with the same architecture as Bloom (Le Scao et al. 2023). We train for one epoch with a cosine learning rate scheduler, a global batch size of 2048 achieved via gradient accumulation, and a maximum context window of 2048 tokens.
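The quoted setup (one epoch, cosine schedule, global batch 2048 via gradient accumulation) implies a schedule like the sketch below. The peak learning rate, warmup length, micro-batch size, and per-GPU count are assumptions; the excerpt does not state them:

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-4, min_lr=0.0, warmup=0):
    """Cosine decay from peak_lr to min_lr over total_steps,
    with optional linear warmup (all rates here are assumed values)."""
    if warmup and step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Reaching a global batch of 2048 on 16 GPUs via gradient accumulation
# (micro_batch = 8 is a hypothetical per-GPU value):
micro_batch, gpus = 8, 16
accum_steps = 2048 // (micro_batch * gpus)
```

With these assumed numbers, each optimizer step accumulates gradients over 16 micro-batches per GPU before updating.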