W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models

Authors: Shang Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space.
Researcher Affiliation | Academia | Shang Wang, ShanghaiTech University, EMAIL
Pseudocode | Yes | Algorithm 1: Crossover operation in genetic algorithm
    Input: Parent 1 encoding p1, Parent 2 encoding p2
    Output: Offspring encoding
    Function Crossover(p1, p2):
        Create an empty offspring encoding child
        for i ← 1 to length(p1) do
            if random number < 0.5 then
                // Add the gene from the corresponding position in the parent 1 encoding to the offspring encoding
                child[i] ← p1[i]
            else
                // Add the gene from the corresponding position in the parent 2 encoding to the offspring encoding
                child[i] ← p2[i]
        return child
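The crossover in Algorithm 1 is a standard uniform crossover: each gene of the offspring is copied from either parent with equal probability. A minimal Python sketch (the function name `crossover` is ours, not from the paper's released code):

```python
import random

def crossover(p1, p2):
    """Uniform crossover as in Algorithm 1: each gene is taken from
    parent 1 or parent 2 with probability 0.5."""
    assert len(p1) == len(p2), "parent encodings must have equal length"
    child = []
    for g1, g2 in zip(p1, p2):
        # With probability 0.5 copy the gene from parent 1, otherwise from parent 2.
        child.append(g1 if random.random() < 0.5 else g2)
    return child
```

Each offspring gene therefore always comes from one of the two parents at the same position, so the offspring stays inside the search space defined by the encoding.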
Open Source Code | Yes | Our implementation of W-PCA is available: https://github.com/ra225/W-PCA
Open Datasets | Yes | We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. ... The training dataset comprised 8,013,769 documents sourced from the OpenWebText (Gokaslan et al., 2019) corpus, amounting to a total of 38GB. ... We pretrain the model using the complete English Wikipedia (Devlin et al., 2019) and BooksCorpus (Zhu et al., 2015).
Dataset Splits | Yes | We reserve the remaining 10% of the MNLI task to evaluate the accuracy of architectures in the search. ... DollyEval: This is a 500-sample test set that we extracted from the databricks-dolly-15k dataset. SelfInst (Wang et al., 2022a): A user-oriented instruction-following set comprising 252 samples. VicunaEval (Chiang et al., 2023): The evaluation includes 80 challenging questions used in the Vicuna project. S-NI: The test set of Super-Natural Instructions (Wang et al., 2022b), which consists of 9,000 samples across 119 tasks. UnNI: The core set of Unnatural Instructions (Honovich et al., 2022), which contains 60,000 samples.
Hardware Specification | Yes | Latency measurements of the models are conducted using the NVIDIA A100 GPU. ... All transformer architectures within the search space were trained on TPUv2s with 8 cores and 64 GB of memory using Google Colaboratory. ... For the evaluation of training-free metrics, 2.8 GHz Intel Cascade Lake processors with either 16 or 32 cores and 32 GB of memory were employed. ... For this task, we train the models generated by each zero-shot proxy on 8 NVIDIA V100 GPUs, with a total batch size of 64 for 3 epochs.
Software Dependencies | No | The paper mentions the use of an "Adam optimizer" and various loss functions (e.g., "MSE", "soft cross-entropy (CE)"), but it does not specify any software libraries or frameworks with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | During pretraining, the network is trained with a batch size set to 256. For the fine-tuning phase of the downstream tasks, the network is trained with a batch size set to 32. The CoLA task is trained for 50 epochs, while the other tasks are trained for 10 epochs. The learning rate is set at 0.0001 during pretraining. In the fine-tuning phase, the learning rate is set at 0.00005 for GLUE tasks and 0.0001 for SQuAD tasks. The training process utilizes the Adam optimizer with β1 and β2 values set at 0.9 and 0.999, respectively. The weight decay is set to 0.01. The learning rate decays linearly with a warm-up ratio set to 0.1.
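The reported schedule (linear warm-up over the first 10% of steps, then linear decay to zero) can be sketched as a pure function; `lr_at_step` is a hypothetical helper we wrote to illustrate the described setup, not code from the paper:

```python
def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.1):
    """Linear warm-up followed by linear decay, matching the described
    setup: warm-up ratio 0.1, learning rate decays linearly to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warm-up: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)
```

For the paper's fine-tuning configuration one would call this with `base_lr=5e-5` for GLUE tasks or `base_lr=1e-4` for SQuAD, alongside Adam with β1=0.9, β2=0.999 and weight decay 0.01.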