Sequential Best-Arm Identification with Application to P300 Speller
Authors: Xin Zhou, Botao Hao, Tor Lattimore, Jian Kang, Lexin Li
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the theoretical property of the proposed algorithm, and demonstrate its substantial empirical improvement through both simulations as well as the data generated from a P300 speller simulator that was built upon the real BCI experiments. ... We conduct intensive experiments using both simulations and data generated from a P300 speller simulator that was built based on real BCI experiments (Ma et al., 2022). |
| Researcher Affiliation | Collaboration | Xin Zhou (Division of Biostatistics, University of California at Berkeley); Botao Hao (Google DeepMind); Tor Lattimore (Google DeepMind); Jian Kang (Department of Biostatistics, University of Michigan); Lexin Li (Division of Biostatistics, University of California at Berkeley) |
| Pseudocode | Yes | Algorithm 1 Sequential best-arm identification. |
| Open Source Code | No | The paper does not provide concrete access to source code. It does not include an unambiguous statement of code release, nor a direct link to a code repository for the methodology described in the paper. |
| Open Datasets | No | The paper utilizes a P300 speller simulator (Ma et al., 2022) to generate data, rather than using a publicly available dataset directly. It also uses GPT-3 to generate words and references a benchmark phrase set (MacKenzie & Soukoreff, 2003) and a news article (Wong, 2024) as sources of text, but does not provide access to a specific dataset for evaluation. |
| Dataset Splits | No | The paper describes simulation setups where experiments are replicated (e.g., "We replicate each experiment B = 200 times" and "We repeat each prompt B = 100 times"), but it does not specify traditional training, test, or validation dataset splits, as the data is primarily generated through simulation rather than being a static pre-existing dataset. |
| Hardware Specification | No | All computations were done using CPUs on Google Colab Cluster. This statement mentions the type of processor (CPUs) and the computing environment (Google Colab Cluster) but lacks specific details such as CPU models, memory, or GPU specifications. |
| Software Dependencies | No | The paper mentions tools and models like GPT-3, GPT-2, and stepwise linear discriminant analysis, but it does not provide specific version numbers for any software libraries, frameworks, or languages used to implement the methodology or run the experiments. |
| Experiment Setup | Yes | We define the prior for the mean reward θ_m through (2.2), which requires the specification of the prior of the optimal arm and the prior of the conditional mean reward. We assume the prior of the optimal arms satisfies the Markov property... we set the prior parameters as µ = 0, σ₀² = 0.2, = 2. Moreover, when there is an oracle or external resource that reveals the identity of the optimal arm at the end of each task, STTS can start with an exact prior, which we call STTS-Oracle. We set p_{m,j} ∈ [0, 1] for different algorithms as follows. ... We vary the number of arms J ∈ {10, 20}, and set the confidence level δ = 0.1. ... For the fixed-confidence setting, we stop the algorithm when the Chernoff stopping rule is satisfied; i.e., at the first t such that min_{j: a_j ≠ ψ_{t,m}} (µ_{m,t,supp(ψ_{t,m})} − µ_{m,t,j}) / √(σ²_{m,t,supp(ψ_{t,m})} + σ²_{m,t,j}) ≥ γ_t, where ψ_{t,m} = argmax_a aᵀµ_{m,t}, and γ_t = [2 log{log(t)M/δ}]^{1/2}. Here, we approximate the KL-divergence of two Gaussian mixture distributions by the KL-divergence of two Gaussian distributions with the same mean and variance (Hershey & Olsen, 2007). Moreover, as the theoretical stopping rule of BR is conservative, we multiply its range by a factor of 0.25. ... We vary t_max ∈ {5, 10, 15, . . . , 100}, and p ∈ {J⁻¹, 0.5, 0.8, 1.0}. ... In our experiment, we set the number of electrodes to 16, the noise variance σ²_EEG ∈ {1, 2.5}, the noise spatial correlation based on a Gaussian kernel function, the noise temporal correlation from an AR(1) model with an autocorrelation of 0.9, and the mean magnitude of the target stimulus to five times that of the non-target stimulus. ... We truncate the vocabulary size of GPT-2 from the original size of 50,257 to 100, following top-K sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019). |
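The Chernoff stopping rule quoted above reduces, under the paper's Gaussian approximation, to checking whether the smallest standardized gap between the empirical best arm and every competitor exceeds the threshold γ_t = [2 log{log(t)M/δ}]^{1/2}. The sketch below illustrates that check; the function name and array-based interface are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def chernoff_stop(mu, sigma2, best, M, delta, t):
    """Gaussian-approximate Chernoff stopping check (illustrative sketch).

    mu     : posterior means of the J arms at time t
    sigma2 : posterior variances of the J arms at time t
    best   : index of the current empirical best arm psi_{t,m}
    M      : number of tasks, delta : confidence level, t : time step
    Returns True when the smallest standardized gap between the best
    arm and every other arm exceeds gamma_t, i.e. the rule fires.
    """
    gamma_t = np.sqrt(2.0 * np.log(np.log(t) * M / delta))
    gaps = [(mu[best] - mu[j]) / np.sqrt(sigma2[best] + sigma2[j])
            for j in range(len(mu)) if j != best]
    return min(gaps) >= gamma_t
```

With a clearly separated best arm and small posterior variances the rule fires early; when the top arms remain close relative to their uncertainty, sampling continues.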