Improving Single-round Active Adaptation: A Prediction Variability Perspective

Authors: Xiaoyang Wang, Yibo Jacky Zhang, Olawale Elijah Salaudeen, Mingyuan Wu, Hongpeng Guo, Chaoyang He, Klara Nahrstedt, Sanmi Koyejo

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results further demonstrate that our approach consistently outperforms the random selection baseline by up to 1.26% for various vision and language tasks, while other competitors often underperform the random selection baseline. We present the empirical verification of our approach and make comparisons with strong baselines. We conduct an ablation study to evaluate the performance of our approach under metrics other than exact match.
Researcher Affiliation | Collaboration | Xiaoyang Wang, University of Illinois Urbana-Champaign; Yibo Jacky Zhang, Stanford University; Olawale Salaudeen, Massachusetts Institute of Technology; Mingyuan Wu, University of Illinois Urbana-Champaign; Hongpeng Guo, University of Illinois Urbana-Champaign; Chaoyang He, TensorOpera, Inc.; Klara Nahrstedt, University of Illinois Urbana-Champaign; Sanmi Koyejo, Stanford University
Pseudocode | Yes | Algorithm 1 (Algorithm framework)
Open Source Code | No | The paper does not contain any explicit statement about releasing code or a link to a code repository for the described methodology.
Open Datasets | Yes | Tasks and datasets. In the image classification task, we use the VLCS dataset (Gulrajani & Lopez-Paz, 2021) and the VisDA dataset (Peng et al., 2017). The sentiment classification task operates over the Amazon and Yelp review datasets (McAuley et al., 2015; Zhang et al., 2015). The span-based question-answering (QA) task employs the SQuAD and News datasets (Rajpurkar et al., 2016; Trischler et al., 2017). The reward modeling task utilizes the Anthropic-hh-rlhf dataset (Bai et al., 2022). For the span-QA task, we directly use a fine-tuned distilled-Bert on the SQuAD dataset [1] and use News as the target domain. ([1] https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad)
Dataset Splits | No | The paper mentions splitting data evenly across sources during training, using a 50% split for initial fine-tuning, and reporting results on a hold-out test set. However, it does not provide specific train/validation/test split percentages or sample counts for the overall datasets that would be required to reproduce the data partitioning of the experiments.
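The 50% split referenced above can be sketched as follows. This is an illustrative reconstruction only: the function name and the seeded shuffle are assumptions, since the paper does not specify its exact partitioning procedure.

```python
import random

def fifty_fifty_split(examples, seed=0):
    """Shuffle a dataset and split it in half, mirroring the reported
    50% split used for initial fine-tuning (illustrative only; the
    paper's actual partitioning is unspecified)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    half = len(idx) // 2
    return ([examples[i] for i in idx[:half]],
            [examples[i] for i in idx[half:]])
```

Without a fixed seed and a stated ordering, such a split is not reproducible, which is the gap the evaluation above flags.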
Hardware Specification | No | The paper does not provide specific hardware details such as CPU or GPU models, or memory specifications used for running the experiments. It only mentions a speedup benchmark without specifying the hardware used.
Software Dependencies | No | The paper mentions using SGD optimization, the Adam optimizer, and various models (Resnet-50, distilled-Bert, GPT-2). However, it does not specify version numbers for any software libraries, programming languages, or specific frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed to replicate the experiment environment.
Experiment Setup | Yes | Hyper-parameters. We use SGD optimization for the Resnet-50 model and the Adam optimizer for the distilled-Bert and GPT-2 models. The initial learning rate is 1e-4 for all adaptation tasks, and we use linear decay scheduling for the GPT-2 model in the reward modeling task. The number of epochs for adaptation tasks is 4, and we train the reward model for 1 epoch. The batch size is 64 for the Resnet-50 model, 16 for the distilled-Bert model, and 4 for the GPT-2 model.
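The hyper-parameters quoted above can be collected into a single configuration sketch. The dictionary layout and key names below are assumptions for illustration, not taken from the authors' code; only the values come from the quoted setup.

```python
# Illustrative config mirroring the reported hyper-parameters.
# Key names are hypothetical; values are from the quoted experiment setup.
ADAPTATION_CONFIG = {
    "resnet-50":      {"optimizer": "SGD",  "lr": 1e-4, "epochs": 4, "batch_size": 64},
    "distilled-bert": {"optimizer": "Adam", "lr": 1e-4, "epochs": 4, "batch_size": 16},
    "gpt-2":          {"optimizer": "Adam", "lr": 1e-4, "epochs": 1, "batch_size": 4,
                       "lr_schedule": "linear decay"},  # reward model: 1 epoch
}
```

Even with these values recorded, full replication would still require the unreported software versions and hardware noted in the rows above.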