Enhancing LLMs via High-Knowledge Data Selection

Authors: Feiyu Duan, Xuemiao Zhang, Sirui Wang, Haoran Que, Yuqi Liu, Wenge Rong, Xunliang Cai

AAAI 2025

Reproducibility assessment: each row gives a variable, its result, and the supporting LLM response.
Research Type: Experimental. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance on knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model. (Section 3, Experiments)
Researcher Affiliation: Collaboration. (1) Sino-French Engineer School, Beihang University, Beijing, China; (2) Peking University, Beijing, China; (3) Department of Automation, Tsinghua University, Beijing, China; (4) School of Computer Science and Engineering, Beihang University, Beijing, China; (5) Meituan, Beijing, China.
Pseudocode: No. The paper describes the methodology, including definitions and formulas (e.g., score(x) = d(x) · ln(c(x) + 1)), and outlines steps in prose and diagrams (Figure 1), but does not contain a clearly labeled pseudocode or algorithm block.
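The quoted scoring rule combines a density-like term d(x) with a log-damped count term c(x). A minimal sketch of that formula (the meanings of d and c beyond the formula itself, and the example values, are assumptions, not the paper's implementation):

```python
import math

def knowledge_score(d_x: float, c_x: float) -> float:
    """score(x) = d(x) * ln(c(x) + 1).

    The +1 inside the log keeps the score finite (zero) when a
    sample has c(x) = 0, i.e., contains no counted elements.
    """
    return d_x * math.log(c_x + 1)

# Illustrative values only: a sample with d(x) = 0.5 and c(x) = 10.
score = knowledge_score(0.5, 10)
```

The log damping means doubling c(x) adds less to the score than doubling d(x), so the density term dominates ranking for samples with many counted elements.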
Open Source Code: No. The paper does not provide an explicit statement about releasing source code, a link to a code repository, or any mention of code in supplementary materials.
Open Datasets: Yes. We utilize the Pile (Gao et al. 2020) and Wudao (Yuan et al. 2021) datasets as our pre-training dataset for training a bilingual language model.
Dataset Splits: Yes. For the Pile dataset, we extract 10K samples from each subset to serve as a validation dataset, ensuring these samples are not encountered during training. Since the Wudao dataset has no predefined subset split, we divide it according to the categories of its knowledge elements and then apply the same validation procedure as for the Pile dataset, extracting samples for evaluation.
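The described split (hold out 10K samples per subset, never shown during training) can be sketched as below; the function name and seed handling are illustrative, not taken from the paper:

```python
import random

def split_subset(samples, n_valid=10_000, seed=0):
    """Hold out n_valid samples for validation; return (train, valid).

    Shuffling with a fixed seed keeps the split reproducible, and the
    held-out samples never appear in the training portion.
    """
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    valid = [samples[i] for i in idx[:n_valid]]
    train = [samples[i] for i in idx[n_valid:]]
    return train, valid

# Applied per Pile subset (or per Wudao knowledge category):
train, valid = split_subset(list(range(50_000)), n_valid=10_000)
```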
Hardware Specification: Yes. We use the Megatron framework to train our model on 16 A100 GPUs with fp16, which takes 21 hours to complete training.
Software Dependencies: No. The paper mentions using the Megatron framework, GPT-4, and a BERT-based model but does not provide specific version numbers for these software dependencies.
Experiment Setup: Yes. We train a 1.1B-parameter model with the same architecture as Bloom (Le Scao et al. 2023). We train for one epoch with a cosine learning rate scheduler, a global batch size of 2048 achieved via gradient accumulation, and a maximum context window of 2048 tokens.
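The quoted setup (one epoch, cosine schedule, global batch 2048 via gradient accumulation) implies a schedule like the sketch below. The peak learning rate, warmup length, micro-batch size, and per-GPU count are assumptions; the excerpt does not state them:

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-4, min_lr=0.0, warmup=0):
    """Cosine decay from peak_lr to min_lr over total_steps,
    with optional linear warmup (all rates here are assumed values)."""
    if warmup and step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Reaching a global batch of 2048 on 16 GPUs via gradient accumulation
# (micro_batch = 8 is a hypothetical per-GPU value):
micro_batch, gpus = 8, 16
accum_steps = 2048 // (micro_batch * gpus)
```

With these assumed numbers, each optimizer step accumulates gradients over 16 micro-batches per GPU before updating.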