Predicting Strategic Behavior from Free Text
Authors: Omer Ben-Porat, Sharon Hirsch, Lital Kuchy, Guy Elad, Roi Reichart, Moshe Tennenholtz
JAIR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments with three well-studied games, our algorithm compares favorably with strong alternative approaches. In ablation analysis, we demonstrate the importance of our modeling choices, namely the representation of the text with the commonsensical personality attributes and our classifier, to the predictive power of our model. |
| Researcher Affiliation | Academia | Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Israel; Faculty of Computer Science, Technion - Israel Institute of Technology, Israel |
| Pseudocode | No | The paper describes the clustering algorithm used in text, stating: "Particularly, we cluster the example set X with a bottom-up agglomerative clustering algorithm using Ward's minimum variance criterion for cluster merging (Ward Jr., 1963), which is the default linkage method in the package we employed, Scikit-learn (Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, et al., 2011)." However, it does not present this algorithm, or any other procedure, in a structured pseudocode block or a clearly labeled algorithm section. |
| Open Source Code | Yes | Our data set and all the information relevant for the data collection crowd-sourcing tasks are publicly available here: https://github.com/omerbp/Predicting-NLPGT. |
| Open Datasets | Yes | Our data set and all the information relevant for the data collection crowd-sourcing tasks are publicly available here: https://github.com/omerbp/Predicting-NLPGT. |
| Dataset Splits | Yes | We randomly sample train and test sets such that the training set is comprised of 90% of the data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions "Scikit-learn" as the package employed for clustering but does not provide a specific version number. It also references "IBM Personality Insights service" and "Linguistic Inquiry and Word Count (LIWC)" as tools, but again, no specific version numbers for their usage within the experiments are given. Citations for underlying theories or components (like Glove word embeddings or Scikit-learn's original publication) are provided but do not specify the version used in the experiment. |
| Experiment Setup | Yes | For TAC we focus our evaluation on the range of 2-30 clusters. For K-NN we consider K ∈ {1, ..., 5}. For the clustering with tf-idf representation, we consider 2-30 clusters, as for TAC, and compute tf-idf for the 1904 vocabulary words after removing stop words and punctuation marks. Hyper-parameter values that give the best results (upper table) are: K-NN: (1, 1, 1) neighbors, TAC: (13, 30, 26) clusters, TAC-IBM-13: (30, 25, 8) clusters, TAC-IBM-37: (4, 23, 19) clusters, TAC-LIWC-19: (17, 11, 14) clusters, TAC-LIWC-43: (28, 20, 30) clusters, and Trans Text Cluster: (28, 25, 28) clusters, for (Chicken, Box, Door), respectively. |
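The clustering step quoted above (bottom-up agglomerative clustering with Ward's minimum-variance linkage, via Scikit-learn) can be sketched as follows. This is not the authors' code: the feature matrix here is random toy data standing in for the paper's text representations, and the cluster count 13 is just one value from the 2-30 range the paper searches over.

```python
# Hedged sketch: agglomerative clustering with Ward linkage, as the
# paper reports using via scikit-learn. Toy features, illustrative only.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))  # stand-in for per-player text features

# linkage="ward" is scikit-learn's default for AgglomerativeClustering,
# matching the paper's stated choice of Ward's minimum-variance criterion.
model = AgglomerativeClustering(n_clusters=13, linkage="ward")
labels = model.fit_predict(X)
assert labels.shape == (40,) and len(set(labels)) == 13
```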
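The tf-idf baseline and the 90/10 split described in the table can likewise be sketched. Again a hypothetical illustration, not the paper's pipeline: the corpus below is invented, and stop-word removal is approximated with `TfidfVectorizer`'s built-in English stop-word list rather than the paper's exact preprocessing.

```python
# Hedged sketch: tf-idf representation plus a 90%/10% train/test split,
# mirroring the quoted setup. Texts are toy stand-ins for players'
# free-text answers in the games.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = [
    "I usually swerve to avoid a crash.",
    "I would open the box if it looked safe.",
    "Opening the door seems risky to me.",
    "I cooperate when the other player does.",
] * 5  # 20 toy documents

# Stop words and punctuation are dropped by the vectorizer's tokenizer
# and stop-word list (an approximation of the paper's preprocessing).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# 90% train / 10% test, as stated in the Dataset Splits row.
X_train, X_test = train_test_split(X, test_size=0.1, random_state=0)
assert X_train.shape[0] == 18 and X_test.shape[0] == 2
```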