Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Playing Codenames with Language Graphs and Word Embeddings
Authors: Divya Koyyalagunta, Anna Sun, Rachel Lea Draelos, Cynthia Rudin
JAIR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with human evaluators demonstrate that our proposed innovations yield state-of-the-art performance, with up to 102.8% improvement in precision@2 in some cases. |
| Researcher Affiliation | Academia | Divya Koyyalagunta EMAIL Anna Sun EMAIL Rachel Lea Draelos EMAIL Cynthia Rudin EMAIL Department of Computer Science, Duke University, Durham, NC 27708, USA |
| Pseudocode | Yes | Appendix B (Algorithms): Algorithm 1: Extracting single-word clues for a synset; Algorithm 2: Querying BabelNet Edges |
| Open Source Code | Yes | All of our code for these proposed methods is available for public use: https://github.com/divyakoyy/codenames |
| Open Datasets | Yes | The Dict2Vec component of DETECT uses dictionary definitions from Cambridge, Oxford, Collins, and dictionary.com, and an embedding method to summarize this data. The FREQ(w) component of DETECT uses a cleaned subset of Wikipedia (Mahoney, 2020). Mahoney, M. (2011). About the test data. http://mattmahoney.net/dc/textdata.html (accessed October 3, 2020). |
| Dataset Splits | Yes | The Codenames boards used to tune the parameters were randomly sampled from the total of C(208, 20) ≈ 3.68e+27 Codenames boards (208 being the possible board words, and 20 being the board size). The Codenames boards used for human evaluation on AMT were different boards, randomly sampled from all possible boards with each having probability 1 / 3.68e+27. ... To compare the algorithms, 60 unique Codenames boards of 20 words each were randomly generated from a list of 208 words obtained from the official Codenames cards. |
| Hardware Specification | No | The paper mentions using pre-trained models like BERT and references software packages, but it does not specify any hardware details (e.g., GPU models, CPU types, memory amounts) used for running their own experiments. |
| Software Dependencies | No | The word2vec, GloVe, and fastText vectors were obtained from the gensim library (Řehůřek & Sojka, 2010), and BERT contextualized embeddings were obtained from a pre-trained BERT model (bert_12_768_12, book_corpus_wiki_en_uncased) made available in the GluonNLP package (Guo et al., 2020). An approximate nearest neighbor graph was produced from the final embeddings using the Annoy library. ... The paper mentions specific libraries and packages (gensim, GluonNLP, Annoy) but does not provide version numbers for them. |
| Experiment Setup | Yes | We found λB = 1 and λR = 0.5 to be effective values empirically across all word representations. ... α was chosen empirically based on the distribution of document frequencies in a cleaned subset of the Wikipedia corpus as shown in Figure 5... We found λF = 2, λD = 1 for GloVe filtered on the top 10k English words, and λD = 2 for all other representations, to be most effective empirically. |
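The board-sampling setup quoted under Dataset Splits is easy to check: the 3.68e+27 figure is the number of distinct 20-word subsets of the 208-word pool, i.e. C(208, 20). A minimal sketch (the word list below is a placeholder, not the official Codenames card list):

```python
import math
import random

# Number of distinct 20-word boards drawn from a 208-word pool.
# C(208, 20) ~= 3.68e+27, matching the figure quoted in the paper.
n_boards = math.comb(208, 20)
print(f"{n_boards:.2e}")  # ~3.68e+27

# Sampling one board uniformly at random; each board then has
# probability 1 / C(208, 20). Placeholder vocabulary -- the paper
# uses the 208 words from the official Codenames cards.
word_pool = [f"word{i}" for i in range(208)]
board = random.sample(word_pool, k=20)
assert len(set(board)) == 20
```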
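The λF and λD weights quoted under Experiment Setup suggest a weighted combination of DETECT's two components (document frequency and Dict2Vec dictionary-definition distance). As an illustration only — the function name, arguments, and additive form below are assumptions, not the paper's exact formula:

```python
def detect_score(freq_term: float, dict2vec_term: float,
                 lam_f: float = 2.0, lam_d: float = 1.0) -> float:
    """Hypothetical weighted combination of DETECT's two components.

    lam_f and lam_d stand in for the lambda_F and lambda_D weights
    reported in the paper (lambda_F = 2, lambda_D = 1 for GloVe
    filtered on the top 10k English words; lambda_D = 2 elsewhere).
    The additive form is an illustrative sketch, not the paper's
    published scoring function.
    """
    return lam_f * freq_term + lam_d * dict2vec_term

# Example: combining a frequency penalty of 0.1 with a
# dictionary-distance penalty of 0.3 under the default weights.
print(detect_score(0.1, 0.3))  # 2 * 0.1 + 1 * 0.3 = 0.5
```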