Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Blind Spot Detection for Safe Sim-to-Real Transfer
Authors: Ramya Ramakrishnan, Ece Kamar, Debadeepta Dey, Eric Horvitz, Julie Shah
JAIR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach across two domains and demonstrate that it achieves higher predictive performance than baseline methods, and also that the learned model can be used to selectively query an oracle at execution time to prevent errors. We also empirically analyze the biases of various feedback types and how these biases influence the discovery of blind spots. |
| Researcher Affiliation | Collaboration | Ramya Ramakrishnan EMAIL, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139; Ece Kamar EMAIL; Debadeepta Dey EMAIL; Eric Horvitz EMAIL, Microsoft Research, 14865 NE 36th St, Redmond, WA 98052; Julie Shah EMAIL, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139 |
| Pseudocode | Yes | Algorithm 1 Dawid-Skene |
| Open Source Code | Yes | Code for the experiments in this paper is available at https://github.com/ramya-ram/discovering-blind-spots. |
| Open Datasets | No | The paper uses modified versions of the Catcher and Flappy Bird games. These are custom environments/tasks described within the paper, and no specific access information (URL, DOI, repository, or citation to a public dataset) is provided for the exact datasets or environments used in their experiments. The paper does not refer to established benchmark datasets. |
| Dataset Splits | No | The paper mentions "runs threefold cross-validation with oversampled data" and "we reserve 30% of the full training data for calibration." While cross-validation is a splitting strategy, the explicit training/testing split percentages for the primary model evaluation or predefined standard splits are not provided. The 30% calibration split is for a specific part of the model learning process, not the overall dataset partitioning for evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU models, or memory specifications. It only discusses the experimental methodology and results. |
| Software Dependencies | No | The paper mentions the use of a "random forest (RF) classifier" and the "Dawid-Skene algorithm," which are methods. However, it does not specify any software libraries or frameworks used (e.g., scikit-learn, PyTorch, TensorFlow) nor their specific version numbers. A programming language or specific version is also not mentioned. |
| Experiment Setup | Yes | To simulate different acceptable functions, we first trained an agent on the true real-world environment to obtain the optimal real-world Q-value function, $Q_{real}$. We then computed, for each state $s_{real} \in S_{real}$, the difference in Q-values between the optimal action and every other action: $\Delta Q_i^{s_{real}} = Q_{real}(s_{real}, a^*) - Q_{real}(s_{real}, a_i), \forall a_i \in A$. The set of all Q-value deltas, $\{\Delta Q_i^{s_{real}}\}$, quantifies all possible mistakes the agent could make. The deltas are sorted in ascending order from least-dangerous to costliest mistakes, and the model chooses a cutoff delta value $\delta$ based on a specified percentile $p$ at which to separate acceptable and unacceptable actions. This cutoff value is used to define the acceptable function in an experimental setting and, consequently, the set of blind spots (agent observations with at least one unacceptable action in a real-world state mapping to it) for the task. ... For the lenient oracle, we used $p = 0.95$ for the Catcher domain and $p = 0.7$ for the Flappy Bird domain. ... All of these results are based on a high budget of 50,000 labels from the oracle. ... We set a constant budget of 250 labels obtained from demonstration data ... While keeping the amount of demonstration data constant at 1,000 labels, we increased the amount of combined corrections data... |
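The percentile-based construction quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's released code: the Q-table here is random, and all variable names (`Q_real`, `delta_cutoff`, `blind_spot_states`) are illustrative assumptions.

```python
import numpy as np

# Stand-in for the optimal real-world Q-value function Q_real(s, a):
# a |S| x |A| array, filled with random values purely for illustration.
rng = np.random.default_rng(0)
n_states, n_actions = 100, 4
Q_real = rng.normal(size=(n_states, n_actions))

# Delta Q_i^{s} = Q_real(s, a*) - Q_real(s, a_i) >= 0 for every (state, action):
# the cost of choosing a_i instead of the optimal action a* in state s.
deltas = Q_real.max(axis=1, keepdims=True) - Q_real

# Sort implicitly and pick the cutoff delta at percentile p of all deltas
# (ascending, least-dangerous to costliest mistakes); p = 0.95 matches the
# lenient oracle in the Catcher domain.
p = 0.95
delta_cutoff = np.quantile(deltas, p)

# An action is unacceptable if its delta exceeds the cutoff; a state is a
# blind spot if it has at least one unacceptable action.
unacceptable = deltas > delta_cutoff
blind_spot_states = np.flatnonzero(unacceptable.any(axis=1))

print(f"cutoff delta = {delta_cutoff:.3f}, "
      f"blind-spot states: {len(blind_spot_states)}/{n_states}")
```

By construction roughly $1 - p$ of all state-action deltas fall above the cutoff, so a lower $p$ (e.g. the 0.7 used for Flappy Bird) yields a stricter acceptable function and more blind-spot states.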