Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Blind Spot Detection for Safe Sim-to-Real Transfer
Authors: Ramya Ramakrishnan, Ece Kamar, Debadeepta Dey, Eric Horvitz, Julie Shah
JAIR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach across two domains and demonstrate that it achieves higher predictive performance than baseline methods, and also that the learned model can be used to selectively query an oracle at execution time to prevent errors. We also empirically analyze the biases of various feedback types and how these biases influence the discovery of blind spots. |
| Researcher Affiliation | Collaboration | Ramya Ramakrishnan EMAIL, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139; Ece Kamar EMAIL; Debadeepta Dey EMAIL; Eric Horvitz EMAIL, Microsoft Research, 14865 NE 36th St, Redmond, WA 98052; Julie Shah EMAIL, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139 |
| Pseudocode | Yes | Algorithm 1 Dawid-Skene |
| Open Source Code | Yes | Code for the experiments in this paper is available at https://github.com/ramya-ram/discovering-blind-spots. |
| Open Datasets | No | The paper uses modified versions of the Catcher and Flappy Bird games. These are custom environments/tasks described within the paper, and no specific access information (URL, DOI, repository, or citation to a public dataset) is provided for the exact datasets or environments used in their experiments. The paper does not refer to established benchmark datasets. |
| Dataset Splits | No | The paper mentions "runs threefold cross-validation with oversampled data" and "we reserve 30% of the full training data for calibration." While cross-validation is a splitting strategy, the explicit training/testing split percentages for the primary model evaluation or predefined standard splits are not provided. The 30% calibration split is for a specific part of the model learning process, not the overall dataset partitioning for evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU models, or memory specifications. It only discusses the experimental methodology and results. |
| Software Dependencies | No | The paper mentions the use of a "random forest (RF) classifier" and the "Dawid-Skene algorithm," which are methods. However, it does not specify any software libraries or frameworks used (e.g., scikit-learn, PyTorch, TensorFlow) nor their specific version numbers. A programming language or specific version is also not mentioned. |
| Experiment Setup | Yes | To simulate different acceptable functions, we first trained an agent on the true real-world environment to obtain the optimal real-world Q-value function, $Q_{real}$. We then computed, for each state $s_{real} \in S_{real}$, the difference in Q-values between the optimal action and every other action: $\Delta Q_i^{s_{real}} = Q_{real}(s_{real}, a^*) - Q_{real}(s_{real}, a_i), \forall a_i \in A$. The set of all Q-value deltas, $\{\Delta Q_i^{s_{real}}\}$, quantifies all possible mistakes the agent could make. The deltas are sorted in ascending order from least-dangerous to costliest mistakes, and the model chooses a cutoff delta value $\delta$ based on a specified percentile $p$ at which to separate acceptable and unacceptable actions. This cutoff value is used to define the acceptable function in an experimental setting and, consequently, the set of blind spots (agent observations with at least one unacceptable action in a real-world state mapping to it) for the task. ... For the lenient oracle, we used $p = 0.95$ for the Catcher domain and $p = 0.7$ for the Flappy Bird domain. ... All of these results are based on a high budget of 50,000 labels from the oracle. ... We set a constant budget of 250 labels obtained from demonstration data ... While keeping the amount of demonstration data constant at 1,000 labels, we increased the amount of combined corrections data... |
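The percentile-based construction quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's released code: the Q-table here is random, and all variable names (`Q_real`, `delta_cutoff`, `blind_spot_states`) are illustrative assumptions.

```python
import numpy as np

# Stand-in for the optimal real-world Q-value function Q_real(s, a):
# a |S| x |A| array, filled with random values purely for illustration.
rng = np.random.default_rng(0)
n_states, n_actions = 100, 4
Q_real = rng.normal(size=(n_states, n_actions))

# Delta Q_i^{s} = Q_real(s, a*) - Q_real(s, a_i) >= 0 for every (state, action):
# the cost of choosing a_i instead of the optimal action a* in state s.
deltas = Q_real.max(axis=1, keepdims=True) - Q_real

# Sort implicitly and pick the cutoff delta at percentile p of all deltas
# (ascending, least-dangerous to costliest mistakes); p = 0.95 matches the
# lenient oracle in the Catcher domain.
p = 0.95
delta_cutoff = np.quantile(deltas, p)

# An action is unacceptable if its delta exceeds the cutoff; a state is a
# blind spot if it has at least one unacceptable action.
unacceptable = deltas > delta_cutoff
blind_spot_states = np.flatnonzero(unacceptable.any(axis=1))

print(f"cutoff delta = {delta_cutoff:.3f}, "
      f"blind-spot states: {len(blind_spot_states)}/{n_states}")
```

By construction roughly $1 - p$ of all state-action deltas fall above the cutoff, so a lower $p$ (e.g. the 0.7 used for Flappy Bird) yields a stricter acceptable function and more blind-spot states.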