SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Authors: Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SciLitLLM achieves promising performance in scientific literature understanding benchmarks. In this section, we perform experiments to answer the following research questions: (Q1) How does SciLitLLM perform on scientific literature understanding tasks? (Q2) Can CPT with domain-specific corpora aid in scientific knowledge injection? (Q3) Can SFT with SciLitIns improve performance on scientific literature understanding tasks? The performance comparison of base models is shown in Table 2. |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology of China, ²DP Technology |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described in natural language and illustrated with flowcharts like Figure 5. |
| Open Source Code | Yes | We release the data processing codes¹ and model weights². ¹https://github.com/dptech-corp/Uni-SMART ²https://huggingface.co/collections/Uni-SMART/scilitllm15-67283353ada975ba995629ef |
| Open Datasets | Yes | To maintain the model's general knowledge, we also include a similar scale of general corpus tokens from Redpajama (Computer, 2023). Our SFT training dataset consists of three parts: SciLitIns, SciRIFF (Wadden et al., 2024) and Infinity-Instruct³, as shown in Table 1. Infinity-Instruct is a collection of more than twenty open-source instruction datasets, covering various general domains. ³https://huggingface.co/datasets/BAAI/Infinity-Instruct |
| Dataset Splits | Yes | Instruction model benchmarks. We evaluate the instruct models on scientific literature understanding benchmarks: SciRIFF (Wadden et al., 2024) and SciAssess (Cai et al., 2024). |
| Hardware Specification | Yes | Llama3-8B-Instruct can process approximately 2.52 million tokens per Nvidia A100 GPU hour. The process takes over 5,000 A100 GPU hours to handle all the textbooks and research papers. The CPT training took approximately 3 days on 32 Nvidia A100 GPUs for SciLitLLM-7B-Base and about 7 days for the 14B model. The SFT training takes approximately 32 hours for the 7B and 70 hours for the 14B model on 32 A100 GPUs, resulting in SciLitLLM-7B-Instruct and SciLitLLM-14B-Instruct. |
| Software Dependencies | No | No explicit software dependencies with specific version numbers (e.g., Python 3.8, PyTorch 1.9) are provided. The paper mentions tools and models like PyPDF2, Llama3-8B-Instruct, Llama3-70B-Instruct, BERT, and fineweb-edu-classifier, but without corresponding version numbers for software dependencies. |
| Experiment Setup | Yes | CPT on Qwen2.5-Base (Qwen, 2024) for one epoch, encompassing 23.7 billion tokens (cf. Table 1), with a sequence length of 2,048 tokens. ... we gradually decrease the learning rate from 1×10⁻⁵ to 0 with a cosine scheduler. To address overfitting, we apply a weight decay of 0.1 and gradients are clipped at a maximum value of 1.0. The training is conducted with a sequence length of 4,096, a maximum learning rate of 1×10⁻⁵, and a cosine scheduler. |
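The throughput and GPU-hour figures quoted in the Hardware Specification row imply a total token count for the corpus-processing pass. A quick back-of-the-envelope check (pairing the two figures this way is an assumption; the quote does not multiply them out):

```python
# Assumption: "2.52 million tokens per A100 GPU hour" and "over 5,000 A100
# GPU hours" both describe the same Llama3-8B-Instruct corpus-processing pass.
tokens_per_gpu_hour = 2.52e6
gpu_hours = 5_000

total_tokens = tokens_per_gpu_hour * gpu_hours
print(f"{total_tokens / 1e9:.1f}B tokens")  # prints "12.6B tokens"
```

So the quoted figures correspond to roughly 12.6 billion tokens processed, consistent with "over 5,000 GPU hours" being a lower bound.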
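The learning-rate schedule quoted in the Experiment Setup row (cosine decay from a maximum of 1×10⁻⁵ to 0, weight decay 0.1, gradient clipping at 1.0) can be sketched as below. This is an illustrative reimplementation of the stated schedule, not the authors' training code; the step count is a placeholder.

```python
import math

# Hyperparameters quoted in the Experiment Setup row; `cosine_lr` and the
# 1,000-step horizon are illustrative, not from the paper.
MAX_LR = 1e-5
WEIGHT_DECAY = 0.1
GRAD_CLIP = 1.0

def cosine_lr(step: int, total_steps: int, max_lr: float = MAX_LR) -> float:
    """Cosine schedule decaying from max_lr at step 0 to 0 at total_steps."""
    progress = step / total_steps
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

total = 1_000
print(cosine_lr(0, total))           # max_lr (1e-05) at the start
print(cosine_lr(total // 2, total))  # half of max_lr at the midpoint
print(cosine_lr(total, total))       # ~0 at the end
```

In a real training loop these values would be fed to the optimizer each step alongside `WEIGHT_DECAY` and gradient clipping at `GRAD_CLIP`, matching the regularization settings quoted above.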