Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

Authors: David Grangier, Simin Fan, Skyler Seto, Pierre Ablin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CRISP performs favorably compared to other methods that adjust the training distribution of the generalist data with guidance from the limited domain-specific data. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes.
Researcher Affiliation | Industry | David Grangier, Simin Fan, Skyler Seto, Pierre Ablin (Apple)
Pseudocode | Yes | Algorithm 1: CRISP Training
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, nor does it provide a direct link to a code repository for the described methodology. It only mentions using third-party tools and datasets such as Redpj2 and the LM eval harness.
Open Datasets | Yes | Our generalist training set is Redpj2 (Together AI Team, 2023). For LM, we use Pile subsets from different domains (Gao et al., 2021): medical (Pubmed Central), programming Q&A (Stackexchange), and encyclopedic (Wikipedia). For MCQ answering, we use the AI2 Reasoning Challenge (Clark et al., 2018, ARC), Massive Multitask Language Understanding (Hendrycks et al., 2021, MMLU), and Reward Bench Reasoning (Lambert et al., 2024, RWDB-R).
Dataset Splits | Yes | To provide a representative specialist train set Ds, we split the questions into a train and test split; see Table 5 in Appendix C. For the MCQ data, we split each evaluation set into equal-sized train and test sets uniformly at random.
Hardware Specification | Yes | GPUh are measured in training hours per graphics processor (Nvidia H100).
Software Dependencies | No | The paper mentions several tools and models, including 'transformer LMs (Vaswani et al., 2017)', 'Adam (Kingma & Ba, 2015)', a 'byte-pair encoding tokenizer (Sennrich et al., 2016b)', and 'SBERT Mini LM-L6-v2 (Reimers & Gurevych, 2019)', and refers to Brown et al. (2020) for architectures and optimization settings. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other common libraries.
Experiment Setup | Yes | Our architecture configurations are borrowed from Brown et al. (2020) and described in Table 6. We report the data selection hyperparameters in Table 7 and the clustering hyperparameters in Table 8. For the classifier, the classification threshold is the main parameter; a threshold accepting 2.5% of Dg worked best for the runs with 1.3B models over 120B tokens. For importance sampling, the results presented in this section rely on 260k clusters. Table 6: Model Hyperparameters for the 350M model (embedding dim. 1,024, latent dim. 4,096, 16 heads, depth 24, context limit 1,024, batch size 96k, learning rate 1e-4, grad clipping 5.0, 400k steps, 40B training tokens); similar details are provided for the 1.3B and 6.7B models.
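The clustered-importance-sampling idea that Algorithm 1 (CRISP Training) refers to can be sketched in a few lines: cluster the generalist corpus, estimate how the small specialist set distributes over those clusters, then reweight (or resample) generalist examples by the ratio of specialist to generalist cluster mass. The sketch below is illustrative only, not the authors' implementation: it uses toy 2-D points in place of SBERT embeddings, a small hand-rolled k-means instead of the paper's 260k-cluster setup over Redpj2, and all function names (`kmeans`, `crisp_weights`) are assumptions.

```python
import random
from collections import Counter

def nearest(x, centroids):
    # Index of the closest centroid (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))

def kmeans(points, k, iters=20, seed=0):
    # Minimal Lloyd's algorithm; stands in for large-scale clustering.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(dim) / len(g) for dim in zip(*g))
    return centroids

def crisp_weights(generalist, specialist, k=4):
    """Per-example sampling weight for the generalist set:
    weight(x) ∝ P_specialist(cluster(x)) / P_generalist(cluster(x))."""
    centroids = kmeans(generalist, k)
    g_assign = [nearest(p, centroids) for p in generalist]
    s_assign = [nearest(p, centroids) for p in specialist]
    g_hist, s_hist = Counter(g_assign), Counter(s_assign)
    n_g, n_s = len(generalist), len(specialist)
    # p_g is always > 0 here because we only evaluate clusters that
    # generalist examples were assigned to.
    return [(s_hist.get(c, 0) / n_s) / (g_hist[c] / n_g) for c in g_assign]

# Toy usage: the specialist "domain" concentrates near (1, 1), so
# generalist points in that region should receive higher weights.
rng = random.Random(1)
generalist = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)] \
           + [(rng.gauss(1, 0.2), rng.gauss(1, 0.2)) for _ in range(200)]
specialist = [(rng.gauss(1, 0.2), rng.gauss(1, 0.2)) for _ in range(50)]
w = crisp_weights(generalist, specialist, k=4)
```

Training would then draw generalist examples in proportion to these weights, shifting the pretraining distribution toward the specialist domain without ever training directly on the small specialist set.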