ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

Authors: Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments using both CONVCODEWORLD and CONVCODEBENCH across 21 different open- and closed-source models, including R1-Distill (DeepSeek-AI et al., 2025; Appendix A), we have gathered several key insights."
Researcher Affiliation | Collaboration | Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He; Snowflake AI Research and Seoul National University
Pseudocode | No | The paper describes its methodology and provides code examples in the appendices, but does not include any clearly labeled pseudocode or algorithm blocks for its primary method.
Open Source Code | Yes | "All implementations and benchmarks are publicly available at https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld."
Open Datasets | Yes | "To implement CONVCODEWORLD, we extended BigCodeBench-Full-Instruct (Zhuo et al., 2024), a single-turn Python code generation benchmark."
Dataset Splits | No | The paper bases its evaluation on the 1,140 BigCodeBench problems but does not specify training/validation/test splits for its experiments; it primarily evaluates pre-existing LLMs on this problem set.
Hardware Specification | No | The paper does not provide details about the hardware used to run the experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies | No | The paper mentions using DSPy but does not give a version number. Python 3.9 appears in traceback logs, but it is not explicitly stated as a versioned dependency of the implementation.
Experiment Setup | Yes | "Hyperparameters are set as follows: We used greedy decoding (temperature = 0) in all experiments, following Chen et al. (2023). The total number of turns n = 10, with a maximum token length of 8K for all code generation models. For models with a lower token limit, we use their respective maximum length. For verbal feedback generation, we use GPT-4o-2024-05-13 with a token limit of 2K."
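The experiment-setup row above can be sketched as a small configuration helper. This is an illustrative reconstruction only, not the paper's released code: the names `make_gen_config`, `MAX_NEW_TOKENS`, `FEEDBACK_MAX_TOKENS`, and `TOTAL_TURNS` are assumptions, while the values (temperature 0, n = 10 turns, 8K generation cap, 2K feedback cap, clamping to a model's lower limit) come from the quoted setup.

```python
# Hedged sketch of the decoding setup described in the paper's
# "Experiment Setup" quote. Helper and constant names are illustrative.

MAX_NEW_TOKENS = 8_192       # 8K cap for all code generation models
FEEDBACK_MAX_TOKENS = 2_048  # 2K cap for GPT-4o verbal feedback generation
TOTAL_TURNS = 10             # total number of conversation turns n = 10


def make_gen_config(model_token_limit=None):
    """Greedy decoding (temperature = 0); if a model's own token limit
    is lower than the 8K default, use the model's limit instead."""
    limit = MAX_NEW_TOKENS
    if model_token_limit is not None:
        limit = min(limit, model_token_limit)
    return {"temperature": 0.0, "max_tokens": limit}


print(make_gen_config())      # {'temperature': 0.0, 'max_tokens': 8192}
print(make_gen_config(4096))  # {'temperature': 0.0, 'max_tokens': 4096}
```

The clamping mirrors the paper's note that models with a lower token limit use their respective maximum length rather than the 8K default.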