Enhancing LLMs via High-Knowledge Data Selection
Authors: Feiyu Duan, Xuemiao Zhang, Sirui Wang, Haoran Que, Yuqi Liu, Wenge Rong, Xunliang Cai
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves the model's performance in knowledge-intensive and general comprehension tasks, and is effective in enhancing both the generic and domain-specific capabilities of the model. (Section 3, Experiments) |
| Researcher Affiliation | Collaboration | 1Sino-French Engineer School, Beihang University, Beijing, China 2Peking University, Beijing, China 3Department of Automation, Tsinghua University, Beijing, China 4School of Computer Science and Engineering, Beihang University, Beijing, China 5Meituan, Beijing, China |
| Pseudocode | No | The paper describes the methodology, including definitions and formulas (e.g., score(x) = d(x) · ln(c(x) + 1)), and outlines steps in prose and diagrams (Figure 1), but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, a link to a code repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | We utilize the Pile (Gao et al. 2020) and Wudao (Yuan et al. 2021) datasets as our pre-training dataset for training a bilingual language model. |
| Dataset Splits | Yes | For the Pile dataset, we extract 10K samples from each subset to serve as a validation dataset, ensuring these samples are not encountered during the training process. Since the Wudao dataset does not have a predefined subset split, we divide it according to categories of the included knowledge elements. We then apply the same validation process as with the Pile dataset, extracting samples for evaluation. |
| Hardware Specification | Yes | We use the Megatron framework to train our model on 16 A100 GPUs, with an fp16 setting, which takes 21 hours to finish our training. |
| Software Dependencies | No | The paper mentions using the 'Megatron framework,' 'GPT4,' and a 'BERT-based model' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We train a model of 1.1B parameters, which has the same architecture as Bloom (Le Scao et al. 2023). We train our model for one epoch, with a cosine learning rate scheduler. We use a global batch size of 2048 with gradient accumulation and a max context window length of 2048. |
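The scoring formula quoted in the Pseudocode row, score(x) = d(x) · ln(c(x) + 1), can be sketched as a selection criterion. This is a minimal sketch, not the authors' implementation: how d(x) and c(x) are computed per sample is not specified in the table, so they are taken here as precomputed inputs, and the sample tuples and the `knowledge_score` helper name are hypothetical.

```python
import math

def knowledge_score(d: float, c: float) -> float:
    """Score a sample as d(x) * ln(c(x) + 1), per the formula quoted above.

    d and c stand for the paper's per-sample statistics d(x) and c(x);
    their extraction is outside this sketch, so both are assumed to be
    precomputed for each sample.
    """
    return d * math.log(c + 1)

# Hypothetical (sample_id, d, c) tuples; rank by score, descending.
samples = [("a", 0.9, 3), ("b", 0.5, 10), ("c", 0.2, 1)]
ranked = sorted(samples, key=lambda s: knowledge_score(s[1], s[2]), reverse=True)

# Keep the highest-scoring samples for pre-training (top-2 here for illustration).
selected = ranked[:2]
```

In this toy ranking, sample "a" (high d, modest c) edges out sample "b" (lower d, higher c), illustrating how the logarithm dampens the contribution of c(x) relative to d(x).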