Adaptive Elicitation of Latent Information Using Natural Language
Authors: Jimmy Wang, Thomas P Zollo, Richard Zemel, Hongseok Namkoong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments focus on three applications: the Twenty Questions game (using our novel and publicly available dataset, described below), opinion polling, and student assessment. In each scenario, the objective is to adaptively select questions that reveal as much information as possible with respect to a separate (though potentially overlapping) set of target questions. ... Overall results for our method and 2 baselines across all 3 datasets are shown in Figure 2. The top row of plots records accuracy on the target questions, while the bottom row records perplexity (or negative log-likelihood loss). |
| Researcher Affiliation | Academia | 1Columbia University. Correspondence to: Jimmy Wang <EMAIL>, Thomas Zollo <EMAIL>. |
| Pseudocode | No | The paper describes algorithms and procedures like 'Greedy Selection' and 'Lookahead / Monte Carlo Planning' in narrative text within Section 2.4, but it does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Our code is available at https://github.com/namkoong-lab/adaptive-elicitation. |
| Open Datasets | Yes | To operationalize this game for benchmarking, we construct a novel Twenty Questions dataset from a curated set of objects in the THINGS database (Hebart et al., 2019)... Our dataset is publicly available, including the complete set of objects, curated questions, generated answers, and relevant metadata. OpinionQA (Santurkar et al., 2023) Originally created to evaluate the alignment of LLM opinions... EEDI Tutoring Dataset (Wang et al., 2020) EEDI is an online educational and tutoring platform... |
| Dataset Splits | Yes | We first split the training datasets by entity into train, validation, and test with a 70%, 15%, 15% split. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used for running the experiments. It mentions using 'Llama-3.1-8B model in FP16 precision' but no hardware specifications. |
| Software Dependencies | No | The paper mentions using a 'pre-trained Llama-3.1-8B model' and 'LoRA (Hu et al., 2021) to finetune our model', and 'AdamW (Loshchilov & Hutter, 2019) optimizer', along with 'Alibaba-NLP/gte-large-en-v1.5 as our embedding model'. While these refer to specific models and techniques, the paper does not provide version numbers for underlying software libraries or programming languages (e.g., Python, PyTorch/TensorFlow, CUDA). |
| Experiment Setup | Yes | We initialize a pre-trained Llama-3.1-8B model in FP16 precision and use LoRA (Hu et al., 2021) to finetune our model with parameters α = 24, rank = 8, and dropout = 0.1. Additional details are shown in Appendix C.1. ... We initialize the AdamW (Loshchilov & Hutter, 2019) optimizer with learning rate of 0.0001 and β = (0.9, 0.95), weight decay of 0.1, and we use a linear warmup for the learning rate after which we use a cosine scheduler. We train our model for 10,000 epochs with a batch size of 4 and block size of 1024, after which we take the checkpoint with the lowest validation loss. |
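The quoted setup describes a linear warmup followed by a cosine learning-rate schedule with AdamW. The paper excerpt does not report the warmup length, so the sketch below is only an illustration of that schedule shape: `warmup_steps` and `total_steps` are assumed values, while the base learning rate of 1e-4 comes from the quoted setup.

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=100, total_steps=10_000):
    """Linear warmup to base_lr, then cosine decay toward zero.

    warmup_steps is an assumption; the paper excerpt only says a
    'linear warmup ... after which we use a cosine scheduler'.
    """
    if step < warmup_steps:
        # Linear ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine decay: base_lr at progress=0, zero at progress=1.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this shape is commonly realized by chaining `LinearLR` and `CosineAnnealingLR` with `SequentialLR` around an `AdamW` optimizer constructed with the quoted hyperparameters (`lr=1e-4`, `betas=(0.9, 0.95)`, `weight_decay=0.1`).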