Self-Explaining Deviations for Coordination
Authors: Hengyuan Hu, Samuel Sokota, David Wu, Anton Bakhtin, Andrei Lupu, Brandon Cui, Jakob Foerster
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we evaluate IMPROVISED both in an illustrative toy setting and in the popular benchmark setting Hanabi, where we show that it can produce so-called finesse plays. We test IMPROVISED in two different settings. The first setting is the trampoline-tiger game explained before. Secondly, we apply IMPROVISED to three-player Hanabi, where we start from a blueprint trained on human data. |
| Researcher Affiliation | Collaboration | Hengyuan Hu (Stanford University), Samuel Sokota (Carnegie Mellon University), David Wu (Meta AI), Anton Bakhtin (Meta AI), Andrei Lupu (Meta AI & FLAIR, University of Oxford), Brandon Cui (MosaicML), Jakob N. Foerster (FLAIR, University of Oxford) |
| Pseudocode | Yes | Please refer to Appendix A for the detailed pseudocode. |
| Open Source Code | Yes | We provide the code for our Hanabi experiments at https://github.com/facebookresearch/off-belief-learning/blob/main/pyhanabi/finesse.py. |
| Open Datasets | Yes | Lastly, we present experiments on the large-scale benchmark Hanabi [1], where we show that IMPROVISED is able to produce finesse plays, which is one of the most interesting techniques that human experts perform frequently. To implement IMPROVISED in Hanabi, we first need a belief function from which we can sample game states given either public or private knowledge of the game to perform Monte Carlo rollouts. Luckily, the belief over possible hands in Hanabi can be computed analytically [8]. We use a blueprint policy to generate self-play games over a range of decks (game seeds). |
| Dataset Splits | No | The paper describes how specific experimental situations (finesse-able and finesse-complete) are generated for evaluation, but it does not provide explicit training, validation, or test dataset splits with percentages, counts, or specific pre-defined split methodologies for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or specific computing infrastructure) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'pyhanabi' for its Hanabi experiments and refers to various prior works for agents (e.g., MAPPO, QMIX, SAD, Other-Play, OBL), but it does not list specific version numbers for any key software components or libraries used in its own experimental setup. |
| Experiment Setup | Yes | The detailed hyper-parameters and computational cost are in Section C. |