reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

How to talk so AI will learn: Instructions, descriptions, and autonomy

Authors: Theodore Sumers, Robert Hawkins, Mark K. Ho, Tom Griffiths, Dylan Hadfield-Menell

NeurIPS 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate our models with a behavioral experiment, demonstrating that (1) our speaker model predicts human behavior, and (2) our pragmatic listener successfully recovers humans' reward functions.
Researcher Affiliation	Academia	Theodore R. Sumers (Computer Science, Princeton University); Robert D. Hawkins (Princeton Neuroscience Institute, Princeton University); Mark K. Ho (Computer Science, Princeton University); Thomas L. Griffiths (Computer Science, Psychology, Princeton University); Dylan Hadfield-Menell (EECS, CSAIL, MIT)
Pseudocode	No	No pseudocode or algorithm blocks were found in the paper.
Open Source Code	Yes	Code and data are available at https://github.com/tsumers/how-to-talk.
Open Datasets	Yes	Code and data are available at https://github.com/tsumers/how-to-talk.
Dataset Splits	No	The paper describes calibrating model parameters (e.g., "To calibrate our pragmatic listeners, we tested βS1 ∈ [1, 10] and found that βS1 = 3 optimized Known H and Latent H listeners"), but it does not give explicit training/validation/test splits for the human behavioral dataset collected in the experiment, so the data partitioning cannot be reproduced.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments or simulations.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers.
Experiment Setup	Yes	"To calibrate our pragmatic listeners, we tested βS1 ∈ [1, 10] and found that βS1 = 3 optimized Known H and Latent H listeners (see Appendix B.3 for details)" and "we fix βL0 = 3 throughout this work".