ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
Authors: Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive experiments for ComPEFT to evaluate: (1) the performance of the compressed model on its original tasks, (2) the number of bits needed to store the models, (3) the mergeability and composability of the compressed checkpoints, and (4) how ComPEFT compares to other existing PEFT methods. |
| Researcher Affiliation | Collaboration | 1 UNC-Chapel Hill, 2 MIT, 3 MIT-IBM Watson AI Lab, 4 University of Toronto, 5 Vector Institute. Correspondence Email: {EMAIL} |
| Pseudocode | Yes | Algorithm 1 ComPEFT Compression Procedure. Input: Task vector τt, k, and a scaling value α. Output: Compressed task vector τt |
| Open Source Code | Yes | 1Code is available at https://github.com/prateeky2806/ComPEFT. |
| Open Datasets | Yes | We follow the experimental setting from the QLoRA paper (Dettmers et al., 2023) and experiment with 8 recent instruction-following datasets that are diverse in terms of languages and dataset sizes. This collection includes datasets generated by language models (Alpaca (Taori et al., 2023), self-instruct (Wang et al., 2022), and unnatural-instructions (Honovich et al., 2022)), a multitask dataset (FLAN-v2 (Chung et al., 2022a)), two datasets created via human annotation and feedback (OASST1 (Köpf et al., 2023) and HH-RLHF (Bai et al., 2022)), and two hybrid datasets (Chip2 (LAION, 2023) and Longform (Köksal et al., 2023)). |
| Dataset Splits | Yes | For the experiments in 3.2 and 3.3 on the 7 GLUE (Wang et al., 2018a) tasks, we trained the large datasets (mnli, qnli, sst2, qqp) for 1 epoch and the small datasets (rte, mrpc, wnli) for 10 epochs. Whereas for the experiment in 3.5, we followed most of the hyperparameter configuration from the (IA)3 (Liu et al., 2022) paper and trained for 2500 steps with a batch size of 8. For each of the 11 datasets in 3.5, we selected 200 examples from the training set to be used as the validation set for best model selection as well as selecting the hyperparameters for ComPEFT. |
| Hardware Specification | Yes | We used a single 48GB NVIDIA A6000 GPU for these experiments. |
| Software Dependencies | No | The paper mentions software components like bfloat16 (data type) and refers to using code from original authors for merging methods, but does not specify version numbers for any key software libraries or packages used for their own implementation. |
| Experiment Setup | Yes | In all experiments, we sweep both α and k in the following ranges: k ∈ {5, 10, 20, 30, 50} and α ∈ {0.5, 1, 2, 3, 4, 5, 6, 8, 10}... For training (IA)3 models we selected the learning rate from {1e-2, 1e-3, 1e-4, 1e-5}, for LoRA from {5e-2, 5e-3, 5e-4, 5e-5}, and for full model finetuning from {5e-3, 5e-4, 5e-5, 5e-6}. During the training process, bfloat16 was adopted to curtail GPU memory expenditure. |
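
The compression procedure quoted in the Pseudocode row (Algorithm 1: sparsify a task vector keeping the top-k% entries by magnitude, then quantize the survivors to their sign with a scaling value α) can be sketched as below. This is a hypothetical illustration, not the authors' released code: the function name `compeft_compress` is invented, and using α times the standard deviation of the original task vector as the quantized magnitude is an assumption about the scale definition.

```python
import statistics

def compeft_compress(task_vector, k_percent, alpha):
    """Hypothetical sketch of a ComPEFT-style compression step:
    keep the top-k% entries of the task vector by magnitude,
    replace each survivor with its sign, and scale by alpha times
    the standard deviation of the original vector (assumed scale)."""
    n_keep = max(1, int(len(task_vector) * k_percent / 100))
    # Rank indices by magnitude; the first n_keep survive sparsification.
    ranked = sorted(range(len(task_vector)),
                    key=lambda i: abs(task_vector[i]), reverse=True)
    keep = set(ranked[:n_keep])
    # Single shared magnitude -> survivors need only 1 bit (the sign) each.
    scale = alpha * statistics.pstdev(task_vector)
    return [scale * (1.0 if v > 0 else -1.0) if i in keep else 0.0
            for i, v in enumerate(task_vector)]

# Example: k=50 keeps the two largest-magnitude entries of a length-4 vector.
compressed = compeft_compress([0.9, -0.4, 0.05, -0.02], 50, 1.0)
```

Because every retained entry shares one magnitude, a checkpoint reduces to a sparse sign pattern plus a single scalar, which is what makes the α and k sweep in the Experiment Setup row the only compression hyperparameters.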