Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
Authors: David Grangier, Simin Fan, Skyler Seto, Pierre Ablin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | CRISP performs favorably compared to other methods that adjust the training distribution of the generalist data with guidance from the limited domain-specific data. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes. |
| Researcher Affiliation | Industry | David Grangier, Simin Fan, Skyler Seto, Pierre Ablin Apple |
| Pseudocode | Yes | Algorithm 1 CRISP Training |
| Open Source Code | No | The paper does not contain an explicit statement about releasing their source code, nor does it provide a direct link to a code repository for the methodology described. It only mentions using third-party tools/datasets like Redpj2 and LM eval harness. |
| Open Datasets | Yes | Our generalist training set is Redpj2 (Together AI Team, 2023). For LM, we use Pile subsets from different domains (Gao et al., 2021): medical (Pubmed Central), programming Q&A (Stackexchange), and encyclopedic (Wikipedia). For MCQ answering, we use AI2 Reasoning Challenge (Clark et al., 2018, ARC), Massive Multitask Language Understanding (Hendrycks et al., 2021, MMLU), and Reward Bench Reasoning (Lambert et al., 2024, RWDB-R). |
| Dataset Splits | Yes | To provide a representative specialist train set Ds, we split the questions into a train and test split, see Table 5 in Appendix C. For the MCQ data, we split each evaluation set into an equal sized train and test set uniformly at random. |
| Hardware Specification | Yes | GPUh are measured in training hours per graphics processor (Nvidia H100). |
| Software Dependencies | No | The paper mentions several tools and models like 'transformer LMs (Vaswani et al., 2017)', 'Adam (Kingma & Ba, 2015)', 'byte-pair encoding tokenizer (Sennrich et al., 2016b)', 'SBERT Mini LM-L6-v2 Reimers & Gurevych (2019)', and refers to 'Brown et al. (2020)' for architectures and optimization settings. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other common libraries. |
| Experiment Setup | Yes | Our architecture configurations are borrowed from Brown et al. (2020) and described in Table 6. We report the data selection hyperparameters in Table 7 and the clustering hyperparameters in Table 8. For the classifier, the classification threshold is the main parameter. A threshold accepting 2.5% of Dg worked best for the runs with 1.3B models over 120B tokens. For importance sampling, the results presented in this section rely on 260k clusters. Table 6: Model Hyperparameters (Embedding dim. 1,024, Latent dim. 4,096, Num. heads 16, Depth 24, Context limit 1,024, Batch size 96k, Learning rate 1e-4, Grad clipping 5.0, Steps 400k, Num. train tokens 40B for 350m model). Similar details are provided for 1.3B and 6.7B models. |
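The core idea the table describes (clustered importance sampling: cluster the generalist set, reweight clusters by their frequency in the small specialist set, then resample generalist documents) can be illustrated with a minimal sketch. This is not the paper's released code; the function names, the toy nearest-centroid assignment, and the 1-D embeddings are illustrative assumptions, standing in for the SBERT embeddings and the 260k-cluster setup the paper reports.

```python
import random
from collections import Counter

def assign_clusters(embeddings, centroids):
    """Assign each embedding to its nearest centroid (squared L2 distance).

    Stand-in for the paper's large-scale clustering of SBERT embeddings.
    """
    def nearest(e):
        return min(range(len(centroids)),
                   key=lambda c: sum((x - y) ** 2 for x, y in zip(e, centroids[c])))
    return [nearest(e) for e in embeddings]

def crisp_weights(spec_clusters, gen_clusters, num_clusters):
    """Per-cluster importance weights: specialist frequency / generalist frequency.

    Clusters absent from the generalist set get weight 0 (they cannot be sampled).
    """
    spec = Counter(spec_clusters)
    gen = Counter(gen_clusters)
    return [
        (spec[c] / len(spec_clusters)) / (gen[c] / len(gen_clusters)) if gen[c] else 0.0
        for c in range(num_clusters)
    ]

def resample(gen_docs, gen_clusters, weights, n, seed=0):
    """Draw n generalist docs with probability proportional to their cluster weight."""
    rng = random.Random(seed)
    doc_w = [weights[c] for c in gen_clusters]
    return rng.choices(gen_docs, weights=doc_w, k=n)

# Toy example: two clusters; the specialist data falls entirely in cluster 1,
# so resampling draws only generalist docs from that cluster.
centroids = [[0.0], [10.0]]
gen_clusters = assign_clusters([[0.1], [0.2], [9.9], [10.1]], centroids)
spec_clusters = assign_clusters([[9.8], [10.0]], centroids)
weights = crisp_weights(spec_clusters, gen_clusters, num_clusters=2)
batch = resample(["a", "b", "c", "d"], gen_clusters, weights, n=5)
```

In the paper's setting the resampled stream feeds generalist pretraining, so the specialist set only steers the training distribution rather than being trained on directly.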