Selective Pre-training for Private Fine-tuning

Authors: Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zinan Lin, Saurabh Naik, Tomasz Lukasz Religa, Jian Yin, Huishuai Zhang

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to demonstrate the superiority of selective pre-training over standard pre-training. The representative results are presented in Figure 2. ... We empirically validate the proposed framework using the Enron email dataset (Cohen, 2015) and the GLUE benchmark (Wang et al., 2018).
Researcher Affiliation | Collaboration | Da Yu (EMAIL, Sun Yat-sen University); Sivakanth Gopi (EMAIL, Microsoft Research); Janardhan Kulkarni (EMAIL, Microsoft Research); Zinan Lin (EMAIL, Microsoft Research); Saurabh Naik (EMAIL, Microsoft); Tomasz Lukasz Religa (EMAIL, Microsoft); Jian Yin (EMAIL, Sun Yat-sen University); Huishuai Zhang (EMAIL, Peking University)
Pseudocode | No | The paper describes the algorithmic framework in Sections 2 and 3 and uses figures such as Figure 1 and Figure 3 to illustrate the process with numbered steps. However, these are high-level flowcharts and descriptions, not structured pseudocode blocks with code-like formatting (e.g., loops, conditionals, detailed function calls).
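The high-level steps the paper's figures describe (privately train a domain classifier on the target data, score every source sequence with it, keep the top-scoring subset for pre-training, then privately fine-tune) can be sketched in Python. Here `score_fn` stands in for the trained classifier's confidence and `keep_fraction` is an illustrative parameter; neither is a value taken from the paper:

```python
def select_source_data(source_seqs, score_fn, keep_fraction=0.2):
    """Selection step of selective pre-training (sketch).

    Ranks source sequences by a domain classifier's confidence that they
    resemble the target domain, and keeps the top-scoring fraction; the
    selected subset is then used for pre-training before private
    fine-tuning. `score_fn` and `keep_fraction` are assumptions for
    illustration, not details specified in the paper.
    """
    ranked = sorted(source_seqs, key=score_fn, reverse=True)
    n_keep = max(1, int(keep_fraction * len(ranked)))
    return ranked[:n_keep]
```

For example, `select_source_data(seqs, classifier_confidence, 0.1)` would keep the 10% of source sequences the classifier rates as most target-like.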
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide links to any code repositories. It mentions, 'Our framework was recently used in training an industry grade differentially private text prediction language model that now serves many NLP applications,' which implies internal use rather than public release.
Open Datasets | Yes | We run experiments with the Enron Email dataset (Cohen, 2015) as the target and the Open Web Text dataset (Gokaslan & Cohen, 2019) as the source. ... Our first target task is causal language modeling on the Enron email dataset. The dataset contains approximately 0.5 million (M) emails written by employees of the Enron Corporation and is publicly available for research use. ... The source data is Open Web Text (Gokaslan & Cohen, 2019), which contains 4 billion tokens. ... We conduct experiments on the GLUE benchmark (Wang et al., 2018), a common benchmark for fine-tuning language models with DP (Yu et al., 2021; Li et al., 2022c; Bu et al., 2022b). ... The source data for GLUE tasks is the pre-training corpus of BERT (Devlin et al., 2019); it consists of a subset of Wikipedia and the entire Bookcorpus.
Dataset Splits | Yes | Target and Source Data: We divide the text into sequences of length 256 and treat each sequence as a datapoint, which constitutes the granularity of our privacy guarantees. ... There are 70K sequences in total. We use 80% of them for training and split the remaining 20% evenly between validation and testing. ... Our target tasks in this section are MNLI and SST-2, which have respectively the largest and smallest number of examples among the four tasks studied in previous work (Yu et al., 2021; Li et al., 2022c; Bu et al., 2022b; Mireshghallah et al., 2022). The numbers of training examples (N) in MNLI and SST-2 are 393K and 67K.
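The 80/10/10 split over length-256 sequences could be reproduced roughly as follows. The chunking helper, the shuffle, and the seed are assumptions for illustration; the review does not say how the split was drawn:

```python
import random

def chunk_and_split(tokens, seq_len=256, seed=0):
    """Chunk a token stream into fixed-length sequences, then split 80/10/10.

    Mirrors the review's description: sequences of length 256, 80% used for
    training, and the remaining 20% split evenly into validation and test.
    The shuffle and seed are illustrative assumptions.
    """
    seqs = [tokens[i:i + seq_len]
            for i in range(0, len(tokens) - seq_len + 1, seq_len)]
    random.Random(seed).shuffle(seqs)
    n_train = int(0.8 * len(seqs))
    n_val = (len(seqs) - n_train) // 2
    return seqs[:n_train], seqs[n_train:n_train + n_val], seqs[n_train + n_val:]
```

With the paper's 70K sequences this yields 56K training, 7K validation, and 7K test sequences.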
Hardware Specification | Yes | All models are pre-trained on nodes with 8× Nvidia Tesla V100 GPUs. ... With a single Tesla A100 GPU, it takes approximately one hour to fine-tune the domain classifier. With eight Tesla V100 GPUs, it takes less than two hours to compute the confidence scores for all sequences in Open Web Text.
Software Dependencies | No | The paper references various methods and models (e.g., GPT, BERT, LoRA, DP-Adam, Abadi et al. (2016) for DP) but does not provide specific version numbers for software libraries, programming languages (like Python), or frameworks (like PyTorch or TensorFlow) used for its implementation.
Experiment Setup | Yes | The overall privacy budget is (7.3, 1×10⁻⁷)-DP, similar to previous works on this topic (Li et al., 2022c; Yu et al., 2022). To reduce the privacy cost of hyperparameter tuning (Liu & Talwar, 2019; Papernot & Steinke, 2022; Mohapatra et al., 2022), we follow the findings in previous work to set most of the hyperparameters and only tune the learning rate to adapt to models of different sizes. The hyperparameters for private learning are listed in Table 4 in Appendix C. ... The pre-training process uses common hyperparameters in the literature. For pre-training models from the BERT family, we follow the hyperparameters in Devlin et al. (2019). The hyperparameters for pre-training models from the GPT family are as follows. We use a dropout probability of 0.1 and a weight decay of 0.01. The β1 and β2 of Adam are 0.9 and 0.999, respectively. All models are pre-trained from scratch for 100K iterations with a batch size of 128. The initial learning rate is 5×10⁻⁴ and follows a linear decay schedule. ... Table 4: Hyperparameters for private fine-tuning (N denotes the size of the target dataset):
- Noise multiplier (Enron): 1.00 / 1.03
- Noise multiplier (SST-2): 1.36 / 1.38
- Noise multiplier (MNLI): 1.44 / 1.46
- Train steps (domain classifier): N/A / 100
- Train steps (target task): [150, 500, 1000]
- Clipping norm: 1
- Learning rate: [1e-4, 5e-4, 1e-3, 3e-3]
- Weight decay: 0
- Batch size: 0.03N
- Privacy budget: (7.3, 1×10⁻⁷) for Enron; (4, 1/(10N)) for GLUE
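The Table 4 values above can be collected into a single configuration, which also makes the N-dependent settings (batch size 0.03N, GLUE δ = 1/(10N)) concrete. This is a minimal sketch: it assumes the two noise-multiplier values per dataset correspond to the table's two unlabeled columns, and the helper name and dict layout are illustrative, not from the paper:

```python
def dp_finetune_config(dataset, n_examples):
    """Assemble the private fine-tuning hyperparameters reported in Table 4.

    Assumes the two noise-multiplier values per dataset correspond to the
    table's two (unlabeled) columns, so both are kept. The function name
    and dict layout are illustrative assumptions.
    """
    noise = {"enron": (1.00, 1.03), "sst-2": (1.36, 1.38), "mnli": (1.44, 1.46)}
    # Enron uses a fixed (7.3, 1e-7) budget; GLUE tasks use (4, 1/(10N)).
    eps_delta = (7.3, 1e-7) if dataset == "enron" else (4.0, 1.0 / (10 * n_examples))
    return {
        "noise_multipliers": noise[dataset],
        "batch_size": round(0.03 * n_examples),  # batch size is 3% of N
        "clipping_norm": 1.0,
        "weight_decay": 0.0,
        "learning_rate_grid": [1e-4, 5e-4, 1e-3, 3e-3],
        "train_steps_grid": [150, 500, 1000],
        "privacy_budget": eps_delta,
    }
```

For SST-2 (N = 67K), for instance, this gives a batch size of 2,010 and a privacy budget of (4, 1/670000).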