SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Authors: Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | SciLitLLM achieves promising performance in scientific literature understanding benchmarks. In this section, we perform experiments to answer the following research questions: (Q1) How does SciLitLLM perform on scientific literature understanding tasks? (Q2) Can CPT with domain-specific corpora aid in scientific knowledge injection? (Q3) Can SFT with SciLitIns improve performance on scientific literature understanding tasks? The performance comparison of base models is shown in Table 2. |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology of China, ²DP Technology |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described in natural language and illustrated with flowcharts like Figure 5. |
| Open Source Code | Yes | We release the data processing codes¹ and model weights². ¹https://github.com/dptech-corp/Uni-SMART ²https://huggingface.co/collections/Uni-SMART/scilitllm15-67283353ada975ba995629ef |
| Open Datasets | Yes | To maintain the model's general knowledge, we also include a similar scale of general corpus tokens from Redpajama (Computer, 2023). Our SFT training dataset consists of three parts: SciLitIns, SciRIFF (Wadden et al., 2024) and Infinity-Instruct³, as shown in Table 1. Infinity-Instruct is a collection of more than twenty open-source instruction datasets, covering various general domains. ³https://huggingface.co/datasets/BAAI/Infinity-Instruct |
| Dataset Splits | Yes | Instruction model benchmarks. We evaluate the instruct models on scientific literature understanding benchmarks: SciRIFF (Wadden et al., 2024) and SciAssess (Cai et al., 2024). |
| Hardware Specification | Yes | Llama3-8B-Instruct can process approximately 2.52 million tokens per Nvidia A100 GPU hour. The process takes over 5,000 A100 GPU hours to handle all the textbooks and research papers. The CPT training took approximately 3 days on 32 Nvidia A100 GPUs for SciLitLLM-7B-Base and about 7 days for the 14B model. The SFT training takes approximately 32 hours for the 7B and 70 hours for the 14B model on 32 A100 GPUs, resulting in SciLitLLM-7B-Instruct and SciLitLLM-14B-Instruct. |
| Software Dependencies | No | No explicit software dependencies with specific version numbers (e.g., Python 3.8, PyTorch 1.9) are provided. The paper mentions tools and models like PyPDF2, Llama3-8B-Instruct, Llama3-70B-Instruct, BERT, and fineweb-edu-classifier, but without corresponding version numbers for software dependencies. |
| Experiment Setup | Yes | CPT on Qwen2.5-Base (Qwen, 2024) for one epoch, encompassing 23.7 billion tokens (cf. Table 1), with a sequence length of 2,048 tokens. ... we gradually decrease the learning rate from 1×10⁻⁵ to 0 with a cosine scheduler. To address overfitting, we apply a weight decay of 0.1 and gradients are clipped at a maximum value of 1.0. The training is conducted with a sequence length of 4,096, a maximum learning rate of 1×10⁻⁵, and a cosine scheduler. |
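The throughput and GPU-hour figures quoted in the Hardware Specification row imply a total token count for the corpus-processing pass. A quick back-of-the-envelope check (pairing the two figures this way is an assumption; the quote does not multiply them out):

```python
# Assumption: "2.52 million tokens per A100 GPU hour" and "over 5,000 A100
# GPU hours" both describe the same Llama3-8B-Instruct corpus-processing pass.
tokens_per_gpu_hour = 2.52e6
gpu_hours = 5_000

total_tokens = tokens_per_gpu_hour * gpu_hours
print(f"{total_tokens / 1e9:.1f}B tokens")  # prints "12.6B tokens"
```

So the quoted figures correspond to roughly 12.6 billion tokens processed, consistent with "over 5,000 GPU hours" being a lower bound.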
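The learning-rate schedule quoted in the Experiment Setup row (cosine decay from a maximum of 1×10⁻⁵ to 0, weight decay 0.1, gradient clipping at 1.0) can be sketched as below. This is an illustrative reimplementation of the stated schedule, not the authors' training code; the step count is a placeholder.

```python
import math

# Hyperparameters quoted in the Experiment Setup row; `cosine_lr` and the
# 1,000-step horizon are illustrative, not from the paper.
MAX_LR = 1e-5
WEIGHT_DECAY = 0.1
GRAD_CLIP = 1.0

def cosine_lr(step: int, total_steps: int, max_lr: float = MAX_LR) -> float:
    """Cosine schedule decaying from max_lr at step 0 to 0 at total_steps."""
    progress = step / total_steps
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

total = 1_000
print(cosine_lr(0, total))           # max_lr (1e-05) at the start
print(cosine_lr(total // 2, total))  # half of max_lr at the midpoint
print(cosine_lr(total, total))       # ~0 at the end
```

In a real training loop these values would be fed to the optimizer each step alongside `WEIGHT_DECAY` and gradient clipping at `GRAD_CLIP`, matching the regularization settings quoted above.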