ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments
Authors: Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments using both CONVCODEWORLD and CONVCODEBENCH across 21 different open and closed-source models including R1-Distill (DeepSeek-AI et al. (2025); Appendix A), we have gathered several key insights |
| Researcher Affiliation | Collaboration | Hojae Han Seung-won Hwang Rajhans Samdani Yuxiong He Snowflake AI Research Seoul National University |
| Pseudocode | No | The paper describes methodologies and provides code examples in appendices, but does not include any clearly labeled pseudocode or algorithm blocks for its primary method. |
| Open Source Code | Yes | All implementations and benchmarks are publicly available at https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld. |
| Open Datasets | Yes | To implement CONVCODEWORLD, we extended BigCodeBench-Full-Instruct (Zhuo et al., 2024), a single-turn Python code generation benchmark |
| Dataset Splits | No | The paper mentions BigCodeBench with 1,140 problems as the basis for evaluation but does not specify any training/test/validation splits for these problems in the context of their experiments. It primarily describes evaluating pre-existing LLMs on this set of problems. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU specifications, or memory. |
| Software Dependencies | No | The paper mentions using DSPy but does not provide a specific version number. Python 3.9 appears in traceback logs, but it is not explicitly stated as a versioned dependency of the implementation. |
| Experiment Setup | Yes | Hyperparameters are set as follows: We used greedy decoding (temperature = 0) in all experiments, following Chen et al. (2023). The total number of turns n = 10, with a maximum token length of 8K for all code generation models. For models with a lower token limit, we use their respective maximum length. For verbal feedback generation, we use GPT-4o-2024-05-13 with a token limit of 2K. |
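The reported setup can be collected into a single configuration object. The sketch below is a hypothetical illustration of those values, not the authors' code; the class and field names are assumptions.

```python
# Hypothetical configuration mirroring the experiment setup reported in the paper.
# All names here are illustrative; only the values come from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvCodeWorldConfig:
    temperature: float = 0.0                    # greedy decoding (Chen et al., 2023)
    max_turns: int = 10                         # total number of turns n
    code_gen_max_tokens: int = 8192             # 8K limit for code generation models
    feedback_model: str = "gpt-4o-2024-05-13"   # verbal feedback generator
    feedback_max_tokens: int = 2048             # 2K limit for verbal feedback


config = ConvCodeWorldConfig()
```

Models with a lower native token limit would override `code_gen_max_tokens` with their respective maximum, as the paper notes.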