Improving Cooperation in Language Games with Bayesian Inference and the Cognitive Hierarchy
Authors: Joseph Bills, Christopher Archibald, Diego Blaylock
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test this approach by constructing Bayesian agents for the game of Codenames, and show that they perform better in experiments where semantics is uncertain. Experimental evaluation of the agents will be given in Section 6. Table 2 shows the win-rate performance of the two Bayesian spymasters against different groups of guessers. |
| Researcher Affiliation | Academia | Joseph Bills, Christopher Archibald, Diego Blaylock. Computer Science Department, Brigham Young University, Provo, UT, USA |
| Pseudocode | Yes | Figure 1: Bayesian Guesser evaluation of board card c |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository, an explicit statement of code release, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | Word2Vec (w2v) trained using word context windows (Mikolov et al. 2013b). Dict2Vec (d2v) similar to w2v but trained on cleaned dictionary entries, with an improvement on semantic similarity tasks (Tissier, Gravier, and Habrard 2017). FastText (ftxt) uses bags of character n-grams with weighting by position (Mikolov et al. 2018). GloVe (g1, g3) trained on pre-computed statistical co-occurrence probabilities for words in a corpus (Pennington, Socher, and Manning 2014). ConceptNet Numberbatch (cnnb) uses retrofitting to incorporate the ConceptNet knowledge graph into an embedding (Speer, Chin, and Havasi 2017). ELMo (elmo) a 1024-dimensional de-contextualized embedding derived from 3 layers of a trained contextual model (Peters et al. 2018). |
| Dataset Splits | No | The paper describes playing '500 games' for each pairing, which implies simulation runs rather than a traditional dataset with specific training/test/validation splits in the machine learning sense. No explicit dataset splits are mentioned. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several word embedding models (e.g., Word2Vec, GloVe) but does not provide specific version numbers for any software libraries, programming languages, or solvers used in its implementation. |
| Experiment Setup | Yes | The Bayesian spymasters used 10 samples, and the Bayesian guessers used 1000 or 10,000 samples. Each pairing played 500 games. To more efficiently calculate the set of possible clues, the 300 nearest neighbors of each word were precomputed. The probability that a perturbed vector would fall in the Voronoi region for any clue was precomputed using 1000 samples at each noise level. |
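The precomputation described in the last row — estimating, via sampling, the probability that a noise-perturbed word vector falls in a given clue's Voronoi region — can be sketched as a simple Monte Carlo routine. This is an illustrative reconstruction, not the authors' code: the function name, the isotropic Gaussian noise model, and the Euclidean nearest-clue rule are assumptions; the paper restricts clue candidates to each word's 300 precomputed nearest neighbors.

```python
import numpy as np

def voronoi_probability(word_vec, clue_vecs, clue_index,
                        noise_std, n_samples=1000, seed=None):
    """Monte Carlo estimate of the probability that a perturbed copy of
    `word_vec` lands in the Voronoi region of clue `clue_index`, i.e. is
    closer to that clue vector than to any other candidate clue.

    Assumes isotropic Gaussian perturbation noise (a hypothetical choice;
    the paper only specifies sampling at several noise levels)."""
    rng = np.random.default_rng(seed)
    # Draw n_samples perturbed copies of the word vector at this noise level.
    samples = word_vec + rng.normal(scale=noise_std,
                                    size=(n_samples, word_vec.shape[0]))
    # Distance from every sample to every candidate clue vector.
    dists = np.linalg.norm(samples[:, None, :] - clue_vecs[None, :, :], axis=2)
    # A sample is "in" a clue's Voronoi region when that clue is its nearest.
    nearest = dists.argmin(axis=1)
    return float((nearest == clue_index).mean())
```

With 1000 samples per word, clue, and noise level (as the setup row reports), these estimates can be cached in a lookup table so the Bayesian guesser never re-samples during play.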