reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Retrieve-and-Edit Framework for Predicting Structured Outputs

Authors: Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, Percy S. Liang

NeurIPS 2018

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that on a new autocomplete task for GitHub Python code and the Hearthstone cards benchmark, retrieve-and-edit significantly boosts the performance of a vanilla sequence-to-sequence model on both tasks.
Researcher Affiliation	Academia	Tatsunori B. Hashimoto, Department of Computer Science, Stanford University; Kelvin Guu, Department of Statistics, Stanford University; Yonatan Oren, Department of Computer Science, Stanford University; Percy Liang, Department of Computer Science, Stanford University
Pseudocode	No	The paper describes the overall procedure in text (Section 3.1.4) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	Reproducibility. Data and code used to generate the results of this paper are available on the CodaLab Worksheets platform at https://worksheets.codalab.org/worksheets/0x1ad3f387005c492ea913cf0f20c9bb89/.
Open Datasets	Yes	Our Python autocomplete dataset is a representative sample of Python code from GitHub, obtained from Google Bigquery by retrieving Python code containing at least one block comment with restructured text (reST) formatting (See Appendix C for details). ... The Hearthstone cards benchmark consists of 533 cards in a computer card game, where each card is associated with a code snippet. The Hearthstone cards benchmark [22]
Dataset Splits	Yes	We also removed any duplicate function/docstring pairs and split the train and test set at the repository level. ... obtained by evaluating BLEU scores on the development set of both datasets.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch or TensorFlow).
Experiment Setup	Yes	Both the retriever and editor were trained for 1000 iterations on Hearthstone and 3000 on GitHub via ADAM minibatch gradient descent, with batch size 16 and a learning rate of 0.001.
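
The hyperparameters quoted in the Experiment Setup row can be written down as a small configuration sketch. This is only an illustration of the reported values, not the authors' code: the names `TrainConfig`, `CONFIGS`, and `examples_seen` are hypothetical, and anything not stated in the row (model architecture, Adam betas, seeds) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Reported training settings; field names are hypothetical."""
    dataset: str
    iterations: int              # number of minibatch updates
    batch_size: int = 16         # reported for both datasets
    learning_rate: float = 1e-3  # reported as 0.001
    optimizer: str = "adam"      # reported as ADAM minibatch gradient descent


# Per-dataset iteration counts as quoted in the table row.
CONFIGS = {
    "hearthstone": TrainConfig(dataset="hearthstone", iterations=1000),
    "github": TrainConfig(dataset="github", iterations=3000),
}


def examples_seen(cfg: TrainConfig) -> int:
    """Upper bound on training examples processed (with repetition)."""
    return cfg.iterations * cfg.batch_size


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, cfg, examples_seen(cfg))
```

Under these settings, the retriever and editor would each process at most 16,000 examples on Hearthstone and 48,000 on GitHub, counting repeats.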