ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

Authors: Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments using both CONVCODEWORLD and CONVCODEBENCH across 21 different open- and closed-source models, including R1-Distill (DeepSeek-AI et al., 2025; Appendix A), we have gathered several key insights."
Researcher Affiliation | Collaboration | Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He; Snowflake AI Research and Seoul National University
Pseudocode | No | The paper describes its methodology and provides code examples in the appendices, but does not include any clearly labeled pseudocode or algorithm blocks for its primary method.
Open Source Code | Yes | "All implementations and benchmarks are publicly available at https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld."
Open Datasets | Yes | "To implement CONVCODEWORLD, we extended BigCodeBench-Full-Instruct (Zhuo et al., 2024), a single-turn Python code generation benchmark."
Dataset Splits | No | The paper bases its evaluation on the 1,140 BigCodeBench problems but does not specify training/validation/test splits for its experiments; it primarily evaluates pre-existing LLMs on this problem set.
Hardware Specification | No | The paper does not provide details about the hardware used to run the experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies | No | The paper mentions using DSPy but does not give a version number. Python 3.9 appears in traceback logs, but it is not explicitly stated as a versioned dependency of the implementation.
Experiment Setup | Yes | "Hyperparameters are set as follows: We used greedy decoding (temperature = 0) in all experiments, following Chen et al. (2023). The total number of turns n = 10, with a maximum token length of 8K for all code generation models. For models with a lower token limit, we use their respective maximum length. For verbal feedback generation, we use GPT-4o-2024-05-13 with a token limit of 2K."
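The experiment-setup row above can be sketched as a small configuration helper. This is an illustrative reconstruction only, not the paper's released code: the names `make_gen_config`, `MAX_NEW_TOKENS`, `FEEDBACK_MAX_TOKENS`, and `TOTAL_TURNS` are assumptions, while the values (temperature 0, n = 10 turns, 8K generation cap, 2K feedback cap, clamping to a model's lower limit) come from the quoted setup.

```python
# Hedged sketch of the decoding setup described in the paper's
# "Experiment Setup" quote. Helper and constant names are illustrative.

MAX_NEW_TOKENS = 8_192       # 8K cap for all code generation models
FEEDBACK_MAX_TOKENS = 2_048  # 2K cap for GPT-4o verbal feedback generation
TOTAL_TURNS = 10             # total number of conversation turns n = 10


def make_gen_config(model_token_limit=None):
    """Greedy decoding (temperature = 0); if a model's own token limit
    is lower than the 8K default, use the model's limit instead."""
    limit = MAX_NEW_TOKENS
    if model_token_limit is not None:
        limit = min(limit, model_token_limit)
    return {"temperature": 0.0, "max_tokens": limit}


print(make_gen_config())      # {'temperature': 0.0, 'max_tokens': 8192}
print(make_gen_config(4096))  # {'temperature': 0.0, 'max_tokens': 4096}
```

The clamping mirrors the paper's note that models with a lower token limit use their respective maximum length rather than the 8K default.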