Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Playing Codenames with Language Graphs and Word Embeddings
Authors: Divya Koyyalagunta, Anna Sun, Rachel Lea Draelos, Cynthia Rudin
JAIR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with human evaluators demonstrate that our proposed innovations yield state-of-the-art performance, with up to 102.8% improvement in precision@2 in some cases. |
| Researcher Affiliation | Academia | Divya Koyyalagunta EMAIL Anna Sun EMAIL Rachel Lea Draelos EMAIL Cynthia Rudin EMAIL Department of Computer Science, Duke University, Durham, NC 27708, USA |
| Pseudocode | Yes | Appendix B (Algorithms): Algorithm 1: Extracting single-word clues for a synset; Algorithm 2: Querying BabelNet Edges |
| Open Source Code | Yes | All of our code for these proposed methods is available for public use: https://github.com/divyakoyy/codenames |
| Open Datasets | Yes | The Dict2Vec component of DETECT uses dictionary definitions from Cambridge, Oxford, Collins, and dictionary.com, and an embedding method to summarize this data. The FREQ(w) component of DETECT uses a cleaned subset of Wikipedia (Mahoney, 2020). Mahoney, M. (2011). About the test data. http://mattmahoney.net/dc/textdata.html (accessed October 3, 2020). |
| Dataset Splits | Yes | The Codenames boards used to tune the parameters were randomly sampled from the total of C(208, 20) ≈ 3.68e+27 Codenames boards (208 being the possible board words, and 20 being the board size). The Codenames boards used for human evaluation on AMT were different boards, randomly sampled from all possible boards with each having probability 1 / 3.68e+27. ... To compare the algorithms, 60 unique Codenames boards of 20 words each were randomly generated from a list of 208 words obtained from the official Codenames cards. |
| Hardware Specification | No | The paper mentions using pre-trained models like BERT and references software packages, but it does not specify any hardware details (e.g., GPU models, CPU types, memory amounts) used for running their own experiments. |
| Software Dependencies | No | The word2vec, GloVe, and fastText vectors were obtained from the gensim library (Řehůřek & Sojka, 2010), and BERT contextualized embeddings were obtained from a pre-trained BERT model (bert_12_768_12, book_corpus_wiki_en_uncased) made available in the GluonNLP package (Guo et al., 2020). An approximate nearest neighbor graph was produced from the final embeddings using the Annoy library. ... The paper mentions specific libraries and packages (gensim, GluonNLP, Annoy) but does not provide version numbers for them. |
| Experiment Setup | Yes | We found λB = 1 and λR = 0.5 to be effective values empirically across all word representations. ... α was chosen empirically based on the distribution of document frequencies in a cleaned subset of the Wikipedia corpus as shown in Figure 5... We found λF = 2, λD = 1 for GloVe filtered on the top 10k English words, and λD = 2 for all other representations, to be most effective empirically. |
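The board-sampling setup quoted under Dataset Splits is easy to check: the 3.68e+27 figure is the number of distinct 20-word subsets of the 208-word pool, i.e. C(208, 20). A minimal sketch (the word list below is a placeholder, not the official Codenames card list):

```python
import math
import random

# Number of distinct 20-word boards drawn from a 208-word pool.
# C(208, 20) ~= 3.68e+27, matching the figure quoted in the paper.
n_boards = math.comb(208, 20)
print(f"{n_boards:.2e}")  # ~3.68e+27

# Sampling one board uniformly at random; each board then has
# probability 1 / C(208, 20). Placeholder vocabulary -- the paper
# uses the 208 words from the official Codenames cards.
word_pool = [f"word{i}" for i in range(208)]
board = random.sample(word_pool, k=20)
assert len(set(board)) == 20
```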
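The λF and λD weights quoted under Experiment Setup suggest a weighted combination of DETECT's two components (document frequency and Dict2Vec dictionary-definition distance). As an illustration only — the function name, arguments, and additive form below are assumptions, not the paper's exact formula:

```python
def detect_score(freq_term: float, dict2vec_term: float,
                 lam_f: float = 2.0, lam_d: float = 1.0) -> float:
    """Hypothetical weighted combination of DETECT's two components.

    lam_f and lam_d stand in for the lambda_F and lambda_D weights
    reported in the paper (lambda_F = 2, lambda_D = 1 for GloVe
    filtered on the top 10k English words; lambda_D = 2 elsewhere).
    The additive form is an illustrative sketch, not the paper's
    published scoring function.
    """
    return lam_f * freq_term + lam_d * dict2vec_term

# Example: combining a frequency penalty of 0.1 with a
# dictionary-distance penalty of 0.3 under the default weights.
print(detect_score(0.1, 0.3))  # 2 * 0.1 + 1 * 0.3 = 0.5
```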