reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Avoiding Negative Side Effects of Autonomous Systems in the Open World

Authors: Sandhya Saisubramanian, Ece Kamar, Shlomo Zilberstein

JAIR 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations demonstrate the trade-offs in the performance of different approaches in mitigating NSE in different settings. ... We perform extensive evaluation of the different feedback mechanisms for mitigating avoidable and unavoidable NSE. ... Values averaged over 100 trials of planning and execution, along with their standard errors, are reported for the following domains. ... The effectiveness of shaping is evaluated in terms of the average NSE penalty incurred and the expected value of o_P after shaping.
Researcher Affiliation	Collaboration	Sandhya Saisubramanian EMAIL School of Electrical Engineering and Computer Science Oregon State University ... Ece Kamar EMAIL Microsoft Research ... Shlomo Zilberstein EMAIL College of Information and Computer Sciences University of Massachusetts Amherst
Pseudocode	Yes	Algorithm 1 Slack Estimation (M, N, E) ... Algorithm 2 Environment shaping to mitigate NSE ... Algorithm 3 Diverse modifications(b, Ω, Md, E0)
Open Source Code	No	The paper does not provide any explicit statement about releasing its own source code, nor does it include a link to a code repository. It mentions using 'sklearn Python package' but this refers to a third-party library, not the authors' implementation.
Open Datasets	No	The paper uses 'Boxpushing' and 'Driving' domains for its experiments, which are described as custom simulation environments. It does not provide specific links, DOIs, repository names, or formal citations for publicly available datasets used in the experiments. References like '(Seuken & Zilberstein, 2007)' and '(Saisubramanian, Kamar, & Zilberstein, 2020a; Wray et al., 2015)' are for related methodological papers, not specific dataset access.
Dataset Splits	No	The paper describes generating 'five instances with grid size 15 × 15' for the Boxpushing domain and 'Five test instances are generated with grid size 15 × 15' for the Driving domain. It also states that 'Values averaged over 100 trials of planning and execution' are reported. However, it does not specify any training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined splits) for these instances or trials.
Hardware Specification	No	The paper states: 'The algorithms are implemented in Python and tested on a computer with 16GB of RAM.' This provides information about RAM but lacks specific details such as CPU models, GPU models, or processor types, which are necessary for a comprehensive hardware specification.
Software Dependencies	No	The paper mentions: 'Random forest regression from sklearn Python package is used for model learning.' and 'A random forest classifier from the sklearn Python package is used for learning a predictive model.' While it names the 'sklearn Python package' and implicitly 'Python', it does not provide specific version numbers for either of these software dependencies.
Experiment Setup	Yes	We tested with β ∈ [0.1, 0.9] since o1 is prioritized in our formulation and report results with β = 0.8 as it achieved the best trade-off in training. ... The slack is computed using Algorithm 1 and γ = 0.95. ... Conservative where the agent explores an action with probability 0.1 or follows its primary policy, moderate where the agent either explores an action with probability 0.5 or follows its primary policy, and radical where the agent predominantly explores with probability 0.9... Pushing the box on a surface type c = 1 results in severe NSE with a penalty of 10, pushing the box on a surface c = 2 results in mild NSE and a penalty of 5... The cost of navigating at a low speed is two and that of high speed is one... We vary δA between 0-25% of V_P(s0|E0) and δD between 0-25% of the NSE penalty of the actor's policy in E0.
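
The three exploration strategies quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. Only the probabilities 0.1, 0.5, and 0.9 come from the quoted text. The function names, the toy state and action labels, and the choice to sample uniformly over actions when exploring are assumptions made for illustration and are not taken from the paper.

```python
import random

# Exploration probabilities quoted in the Experiment Setup row.
EXPLORATION_PROB = {"conservative": 0.1, "moderate": 0.5, "radical": 0.9}


def choose_action(state, primary_policy, actions, strategy, rng):
    """Return (action, explored) for one decision step.

    With the strategy's probability the agent explores; otherwise it
    follows its primary policy. Uniform sampling over `actions` during
    exploration is an assumption, not a detail from the paper.
    """
    if rng.random() < EXPLORATION_PROB[strategy]:
        return rng.choice(actions), True
    return primary_policy[state], False


def exploration_rate(strategy, trials=10_000, seed=0):
    """Empirical fraction of exploratory steps on a hypothetical toy problem."""
    rng = random.Random(seed)
    policy = {"s0": "right"}  # hypothetical primary policy
    actions = ["up", "down", "left", "right"]  # hypothetical action set
    explored = sum(
        choose_action("s0", policy, actions, strategy, rng)[1]
        for _ in range(trials)
    )
    return explored / trials
```

Calling `exploration_rate("moderate")` should return a value close to 0.5, with the other two strategies landing near 0.1 and 0.9 respectively.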