Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

Authors: David Grangier, Simin Fan, Skyler Seto, Pierre Ablin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CRISP performs favorably compared to other methods that adjust the training distribution of the generalist data with guidance from the limited domain-specific data. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes.
Researcher Affiliation | Industry | David Grangier, Simin Fan, Skyler Seto, Pierre Ablin (Apple)
Pseudocode | Yes | Algorithm 1: CRISP Training
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, nor does it provide a direct link to a code repository for the described methodology. It only mentions using third-party tools and datasets such as Redpj2 and the LM eval harness.
Open Datasets | Yes | Our generalist training set is Redpj2 (Together AI Team, 2023). For LM, we use Pile subsets from different domains (Gao et al., 2021): medical (Pubmed Central), programming Q&A (Stackexchange), and encyclopedic (Wikipedia). For MCQ answering, we use the AI2 Reasoning Challenge (Clark et al., 2018, ARC), Massive Multitask Language Understanding (Hendrycks et al., 2021, MMLU), and Reward Bench Reasoning (Lambert et al., 2024, RWDB-R).
Dataset Splits | Yes | To provide a representative specialist train set Ds, we split the questions into a train and test split; see Table 5 in Appendix C. For the MCQ data, we split each evaluation set into equal-sized train and test sets uniformly at random.
Hardware Specification | Yes | GPUh are measured in training hours per graphics processor (Nvidia H100).
Software Dependencies | No | The paper mentions several tools and models, including 'transformer LMs (Vaswani et al., 2017)', 'Adam (Kingma & Ba, 2015)', a 'byte-pair encoding tokenizer (Sennrich et al., 2016b)', and 'SBERT Mini LM-L6-v2 (Reimers & Gurevych, 2019)', and refers to Brown et al. (2020) for architectures and optimization settings. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other common libraries.
Experiment Setup | Yes | Our architecture configurations are borrowed from Brown et al. (2020) and described in Table 6. We report the data selection hyperparameters in Table 7 and the clustering hyperparameters in Table 8. For the classifier, the classification threshold is the main parameter; a threshold accepting 2.5% of Dg worked best for the runs with 1.3B models over 120B tokens. For importance sampling, the results presented in this section rely on 260k clusters. Table 6: Model Hyperparameters for the 350M model (embedding dim. 1,024, latent dim. 4,096, 16 heads, depth 24, context limit 1,024, batch size 96k, learning rate 1e-4, grad clipping 5.0, 400k steps, 40B training tokens); similar details are provided for the 1.3B and 6.7B models.
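The clustered-importance-sampling idea that Algorithm 1 (CRISP Training) refers to can be sketched in a few lines: cluster the generalist corpus, estimate how the small specialist set distributes over those clusters, then reweight (or resample) generalist examples by the ratio of specialist to generalist cluster mass. The sketch below is illustrative only, not the authors' implementation: it uses toy 2-D points in place of SBERT embeddings, a small hand-rolled k-means instead of the paper's 260k-cluster setup over Redpj2, and all function names (`kmeans`, `crisp_weights`) are assumptions.

```python
import random
from collections import Counter

def nearest(x, centroids):
    # Index of the closest centroid (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))

def kmeans(points, k, iters=20, seed=0):
    # Minimal Lloyd's algorithm; stands in for large-scale clustering.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(dim) / len(g) for dim in zip(*g))
    return centroids

def crisp_weights(generalist, specialist, k=4):
    """Per-example sampling weight for the generalist set:
    weight(x) ∝ P_specialist(cluster(x)) / P_generalist(cluster(x))."""
    centroids = kmeans(generalist, k)
    g_assign = [nearest(p, centroids) for p in generalist]
    s_assign = [nearest(p, centroids) for p in specialist]
    g_hist, s_hist = Counter(g_assign), Counter(s_assign)
    n_g, n_s = len(generalist), len(specialist)
    # p_g is always > 0 here because we only evaluate clusters that
    # generalist examples were assigned to.
    return [(s_hist.get(c, 0) / n_s) / (g_hist[c] / n_g) for c in g_assign]

# Toy usage: the specialist "domain" concentrates near (1, 1), so
# generalist points in that region should receive higher weights.
rng = random.Random(1)
generalist = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)] \
           + [(rng.gauss(1, 0.2), rng.gauss(1, 0.2)) for _ in range(200)]
specialist = [(rng.gauss(1, 0.2), rng.gauss(1, 0.2)) for _ in range(50)]
w = crisp_weights(generalist, specialist, k=4)
```

Training would then draw generalist examples in proportion to these weights, shifting the pretraining distribution toward the specialist domain without ever training directly on the small specialist set.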