reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Position: Future Research and Challenges Remain Towards AI for Software Engineering

Authors: Alex Gu, Naman Jain, Wen-Ding Li, Manish Shetty, Kevin Ellis, Koushik Sen, Armando Solar-Lezama

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this paper, our goal is threefold. First, we provide a taxonomy of measures and tasks to categorize work towards AI software engineering. Second, we outline key bottlenecks permeating today's approaches. Finally, we call for large open-source community efforts and lay out a collection of promising research directions to address these challenges, hoping that we can all come together to advance and shape the future of AI for code. This paper provides an opinionated view of the tasks, challenges, and promising directions towards achieving this goal.
Researcher Affiliation	Academia	(1) MIT, (2) UC Berkeley, (3) Cornell.
Pseudocode	No	The paper does not provide any pseudocode or algorithm blocks for its own methodology. It includes code listings (e.g., Listing 1, Listing 2, Listing C.7, Listing C.8) as examples or from other works, not as pseudocode for its own contribution.
Open Source Code	No	The paper does not provide any statement or link for open-source code for its own methodology.
Open Datasets	Yes	Function-level scope refers to single, self-contained functions such as in HumanEval (Chen et al., 2021a) and MBPP (Austin et al., 2021). Self-contained unit scope goes beyond singular functions and to larger chunks of code such as entire files and classes, such as FullStack Bench (Liu et al., 2024d) and BigCodeBench (Zhuo et al., 2024). Finally, project-level scope refers to larger codebases such as entire repositories, such as in Commit0 (Zhao et al., 2024) and SWE-Bench (Jimenez et al., 2024). When developing LLMs for code, the open-source community relies on datasets like the Stack (Lozhkov et al., 2024), consisting of trillions of GitHub tokens.
Dataset Splits	No	The paper does not describe any experiments or new datasets with specific training/test/validation splits, as it is a position paper.
Hardware Specification	No	The paper does not describe any experiments that would require specific hardware specifications for its own methodology, as it is a position paper.
Software Dependencies	No	The paper does not describe any software implementation of its own methodology that would have specific software dependencies with version numbers, as it is a position paper.
Experiment Setup	No	The paper is a position paper and does not describe its own experimental setup details.