Uncertainty-Based Active Learning for Reading Comprehension

Authors: Jing Wang, Jie Shen, Xiaofei Ma, Andrew Arnold

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate on benchmark datasets that 25% fewer labeled samples suffice to guarantee comparable, or even improved, performance. Our results show strong evidence that for label-demanding scenarios, the proposed approach offers a practical guide on data collection and model training. Section 4 is titled "Experiments" and contains detailed empirical results, tables, and figures comparing performance on the datasets.
Researcher Affiliation | Collaboration | Jing Wang (Amazon), Jie Shen (Stevens Institute of Technology), Xiaofei Ma (Amazon Web Services), Andrew O. Arnold (Delphia)
Pseudocode | Yes | Algorithm 1 Albus: Active Learning By Uncertainty-based Sampling
Require: a set of unlabeled instances U = {x1, . . . , xn}; an initial MRC model w0; a maximum iteration number T; thresholds {τ1, . . . , τT}; the number of instances to be labeled per iteration, n0.
Ensure: a new MRC model wT.
1: U1 ← U
2: for t = 1, . . . , T do
3:     Compute wt−1(x) for all x ∈ Ut
4:     Bt ← {x ∈ Ut : wt−1(x) ≤ τt}
5:     Compute the sampling probability Pr(x) for all x ∈ Bt
6:     St ← randomly choose n0 instances from Bt according to the distribution {Pr(x)} over x ∈ Bt, and query their labels
7:     Update the model: wt ← arg min_w L(w; St)
8:     Ut+1 ← Ut \ St
9: end for
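The reported Algorithm 1 can be sketched as a short Python loop. This is a minimal illustration, not the authors' implementation: `fit`, `score`, and `query_labels` are hypothetical stand-ins for model training, per-instance uncertainty scoring, and oracle labeling, and uniform sampling over Bt is used as a placeholder for the paper's sampling probability Pr(x).

```python
import random

def albus(unlabeled, model, T, taus, n0, fit, score, query_labels):
    """Sketch of the uncertainty-based sampling loop from Algorithm 1 (Albus).

    `fit`, `score`, and `query_labels` are hypothetical stand-ins for
    model updating, uncertainty scoring, and human annotation.
    """
    U = set(unlabeled)                       # U_1 <- U
    for t in range(T):
        # Keep instances whose score falls at or below the threshold tau_t.
        scores = {x: score(model, x) for x in U}
        B = [x for x in U if scores[x] <= taus[t]]
        if not B:
            break
        # Placeholder: uniform sampling instead of the paper's Pr(x).
        S = random.sample(B, min(n0, len(B)))
        labeled = query_labels(S)            # query labels for S_t
        model = fit(model, labeled)          # w_t <- argmin_w L(w; S_t)
        U -= set(S)                          # U_{t+1} <- U_t \ S_t
    return model
```

With a trivial `fit` that increments a counter, two iterations of the loop perform two model updates and shrink the unlabeled pool by n0 each round.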
Open Source Code | No | The paper mentions "BERT-base is used as the pretrained model and fine-tuned for 2 epochs with a learning rate of 3e−5 and a batch size of 12, the default setting of Huggingface", with a footnote linking to "https://github.com/huggingface/transformers/tree/master/examples/question-answering". This refers to a third-party library, not the authors' implementation of their proposed algorithm.
Open Datasets | Yes | We focus on span-based datasets, namely the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017).
Dataset Splits | Yes | SQuAD consists of over 100,000 questions posed by crowdworkers on a set of 536 Wikipedia articles. We use the original split: 87,599 questions for training and 10,570 questions for testing. NewsQA is a machine comprehension dataset of over 100,000 human-generated question-answer pairs from over 10,000 CNN news articles. The dataset is composed of 74,160 questions for training and 4,212 questions for validation.
Hardware Specification | No | The paper mentions training parameters and software (BERT-base, Huggingface) but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions "BERT-base" as the pretrained model and "Huggingface" for fine-tuning, but does not provide specific version numbers for these or any other software libraries or programming languages used in the implementation.
Experiment Setup | Yes | BERT-base is used as the pretrained model and fine-tuned for 2 epochs with a learning rate of 3e−5 and a batch size of 12, the default setting of Huggingface. To ensure a comprehensive comparison among state-of-the-art approaches, we simulate the annotation process with human experts in the loop by selecting a fixed number of examples n0 from the training set to query their labels in each iteration (we set n0 = 2,000 for SQuAD and n0 = 5,000 for NewsQA). The MRC model is initialized with 1,000 labeled samples for SQuAD and 10,000 for NewsQA. The parameter τ0 is chosen from the range [0.01, 0.1] based on the training set and decreases at the rate of 1.1.
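The annotation budget and threshold schedule described above can be expressed with two small helper functions. This is an illustrative sketch of the reported setup, not the authors' code; in particular, interpreting "decreases at the rate of 1.1" as division by 1.1 per iteration is an assumption.

```python
def annotation_budget(n_init, n0, T):
    """Labeled-set size after each querying round, given n_init seed
    labels and n0 labels queried per iteration (per the reported setup)."""
    return [n_init + n0 * t for t in range(T + 1)]

def threshold_schedule(tau0, T, decay=1.1):
    """Threshold tau_t per iteration, assumed to decrease by a factor
    of 1.1 each round (interpretation of "the rate of 1.1")."""
    return [tau0 / decay ** t for t in range(T)]
```

For SQuAD (1,000 seed labels, n0 = 2,000), three querying rounds give labeled-set sizes of 1,000, 3,000, 5,000, and 7,000.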