PEARL: Towards Permutation-Resilient LLMs
Authors: Liang CHEN, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate that PEARL effectively mitigates permutation attacks and enhances performance. |
| Researcher Affiliation | Academia | (1) The Chinese University of Hong Kong; (2) Shenzhen Campus of Sun Yat-sen University; (3) SMU |
| Pseudocode | Yes | Algorithm 1: Adversarial Optimization Algorithm for PEARL |
| Open Source Code | Yes | The code is available at https://github.com/ChanLiang/PEARL. |
| Open Datasets | Yes | We validate our method in two scenarios: (1) pretraining a transformer to in-context learn linear functions (Garg et al., 2022), and (2) instruction tuning of LLMs on the Super-Natural Instructions (Wang et al., 2022). |
| Dataset Splits | Yes | We selected 17 representative tasks, comprising 9 natural language generation (NLG) tasks and 8 natural language understanding (NLU) tasks. Following the methodology of Wang et al. (2022), we randomly designated 4 datasets as held-out test sets and used the remaining 13 datasets for training. Each training dataset contains 150 examples, and each test dataset contains 100 examples, resulting in a training set of 1,950 examples and a test set of 400 examples, as summarized in Table 2. |
| Hardware Specification | Yes | We train the models on the instruction dataset for two epochs using a single NVIDIA A40 GPU, with a batch size of 16, resulting in a total of 246 training steps. |
| Software Dependencies | No | The paper mentions models and optimizers like GPT-2, AdamW, BERT-base, LLaMA3-8B, FLAN-large, and LoRA, but does not provide specific version numbers for any underlying software libraries (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | Key training parameters include a batch size of 128 and 500k training steps. In the PEARL framework, the P-Net is initialized as a BERT-base (Devlin et al., 2019a) and also trained from scratch. ... We train the models on the instruction dataset for two epochs using a single NVIDIA A40 GPU, with a batch size of 16, resulting in a total of 246 training steps. The optimizer used was AdamW. The learning rates for the P-Net and the LLM are set to 1×10⁻⁴ and 3×10⁻⁴, respectively. For the Sinkhorn algorithm, we use 80 iterations, a temperature parameter of 0.1, and an entropy constraint coefficient β = 1.0. Table 6 also lists hyperparameter settings. |