TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining

Authors: Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We applied TapWeight to both molecular property prediction and natural language processing tasks, significantly surpassing baseline methods. Experimental results validate the effectiveness and generalizability of TapWeight.
Researcher Affiliation | Academia | Ruiyi Zhang (EMAIL), UC San Diego; Sai Ashish Somayajula (EMAIL), UC San Diego; Pengtao Xie (EMAIL), UC San Diego
Pseudocode | Yes | Algorithm 1: TapWeight optimization
Open Source Code | Yes | Our code is available at https://github.com/ruz048/TapWeight.
Open Datasets | Yes | We perform continued pretraining of a pretrained ImageMol model on a dataset D, consisting of 1 million molecules from PubChem (Kim et al., 2023). For downstream tasks, we employ the MoleculeNet benchmark, which includes 8 classification datasets focused on predicting biophysical and physiological properties essential for drug discovery (Wu et al., 2017). ... For downstream evaluation, we use RCT (Dernoncourt & Lee, 2017), AGNews (Zhang et al., 2015) and IMDB (Maas et al., 2011) datasets, which are widely used for evaluation of TAP methods (Gururangan et al., 2020; Shi & Lipani, 2023). We also use the GLUE benchmark for evaluation, which comprises 8 natural language understanding tasks, including sentiment analysis, semantic similarity prediction, and grammaticality classification (Wang et al., 2019).
Dataset Splits | Yes | We generate the training, validation and test splits of these downstream datasets by applying scaffold splitting with an 8:1:1 ratio. ... Following standard practice, we use the original GLUE development set as the test set in our experiments, and randomly split the original training set into a training set and validation set with an 8:1 ratio.
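For illustration only, an 8:1:1 split of this kind can be sketched as below. This is a plain random split with a hypothetical helper name (`split_811`); the paper's molecular experiments use scaffold splitting, which groups molecules by chemical scaffold rather than shuffling uniformly.

```python
import random

def split_811(items, seed=0):
    """Randomly partition a dataset into train/val/test with an 8:1:1 ratio.

    Note: a sketch of a generic random split, not the scaffold
    splitting procedure used for the MoleculeNet experiments.
    """
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    train = [items[i] for i in indices[:n_train]]
    val = [items[i] for i in indices[n_train:n_train + n_val]]
    test = [items[i] for i in indices[n_train + n_val:]]
    return train, val, test
```

The same helper covers the GLUE case in the report (training set re-split 8:1 into train/validation) by merging the returned `val` and `test` partitions.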
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU.
Software Dependencies | No | The complete algorithm is implemented using the Betty library (Choe et al., 2023c;b). ... As there is no analytical solution for θ*(λ), it is difficult to directly compute the gradient ∇λθ*(λ). To tackle this challenge, we compute this gradient using the implicit function theorem (IFT), following previous literature (Lorraine et al., 2020): ... Nevertheless, directly computing the red term, which is the inverse of the Hessian matrix ∇²Lpt(θ), is computationally expensive due to its O(n³) complexity. Various methods have been proposed to approximate the inverse Hessian, including the Neumann series (Lorraine et al., 2020), conjugate gradients (Rajeswaran et al., 2019) and finite differences (Zhang et al., 2021). ... A Hessian-vector product (HVP) has O(n) complexity as implemented in modern automatic differentiation (AD) libraries (e.g., torch.autograd.functional.hvp in PyTorch (Paszke et al., 2019)), which is far more efficient than materializing the Hessian.
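The Neumann-series trick mentioned above can be sketched in a few lines: for a positive-definite Hessian H and a small enough step size α (so that the spectral radius of I − αH is below 1), the truncated series α Σ_{k=0}^{K} (I − αH)^k v converges to H⁻¹v while only ever needing Hessian-vector products. A minimal pure-Python illustration with an explicit matrix (in practice, `H @ p` would be replaced by an HVP call such as `torch.autograd.functional.hvp`):

```python
def mat_vec(H, v):
    """Explicit matrix-vector product; stands in for an HVP in this sketch."""
    return [sum(H[i][j] * v[j] for j in range(len(v))) for i in range(len(H))]

def neumann_inv_hvp(H, v, alpha, num_terms):
    """Approximate H^{-1} v with a truncated Neumann series.

    Computes alpha * sum_{k=0}^{num_terms} (I - alpha*H)^k v, which
    converges to H^{-1} v when the spectral radius of (I - alpha*H) < 1.
    """
    p = list(v)    # current term (I - alpha*H)^k v, starting at k = 0
    acc = list(v)  # running sum of the series
    for _ in range(num_terms):
        Hp = mat_vec(H, p)
        p = [p[i] - alpha * Hp[i] for i in range(len(p))]
        acc = [acc[i] + p[i] for i in range(len(p))]
    return [alpha * a for a in acc]
```

Each iteration costs one HVP, so the whole approximation stays O(K·n) instead of the O(n³) of an explicit inverse.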
Experiment Setup | Yes | We set the number of clusters for the loss terms Lmg1, Lmg2, and Lmg3 to 100, 1,000, and 10,000, respectively. During the continued pretraining, we set the unrolling step in the MLO framework to be 1. We use the SGD optimizer with a step learning rate scheduler across all three optimization levels. ... We set the global learning steps to be 30,000 for the MUV dataset, 20,000 for the HIV dataset, 10,000 for the Tox21 and ToxCast datasets, and 3,000 for all other datasets. We set the batch size in level I to be 1024, and that in levels II and III to be 64 for all datasets. We set the learning rate to be 0.02 in level I, 0.05 in level II, and 200 in level III for all datasets. We set the γ value in Equation 3 to be 0.001. ... When applying TapWeight on the RoBERTa encoder, we set the unrolling step in the MLO framework to 1. We use an Adam optimizer with a step learning rate scheduler across all three optimization levels. ... We set the global learning steps to 20,000 for the QQP and MNLI datasets, and 10,000 for all other datasets. The batch size for level I is set to 512, while for levels II and III, it is set to 32 across all datasets. The learning rate for levels I and II is 2e-5, and for level III, it is set to 1 for all datasets. We set the γ value in Equation 3 to be 0.005.
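The RoBERTa-side hyperparameters reported above can be collected into a single config sketch. The dict layout and key names here are hypothetical (the paper does not specify a config schema); only the numeric values come from the reported setup.

```python
# Hypothetical config layout; values taken from the reported NLP setup.
roberta_tapweight_config = {
    "unroll_steps": 1,            # MLO unrolling steps
    "optimizer": "adam",          # Adam with a step LR scheduler at all levels
    "global_steps": {             # per-dataset total training steps
        "QQP": 20_000,
        "MNLI": 20_000,
        "default": 10_000,        # all other GLUE/TAP datasets
    },
    "batch_size": {"level_I": 512, "level_II": 32, "level_III": 32},
    "learning_rate": {"level_I": 2e-5, "level_II": 2e-5, "level_III": 1.0},
    "gamma": 0.005,               # γ in Equation 3
}

def steps_for(dataset, cfg=roberta_tapweight_config):
    """Look up the global step budget for a dataset, falling back to default."""
    return cfg["global_steps"].get(dataset, cfg["global_steps"]["default"])
```

The notably large level-III learning rates (200 for molecules, 1 for RoBERTa) apply to the task-reweighting variables λ rather than model weights, which is why they sit on a different scale from the level-I/II rates.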