SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering
Authors: Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agents' 3.33% to Claude-3.5-Sonnet's 28.18%), their consistently low collaboration willingness (4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync ... |
| Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign 2All Hands AI 3Northwestern University. Work done during internship at UIUC. Correspondence to: Xuehang Guo <EMAIL>, Xingyao Wang <EMAIL, EMAIL>, Yangyi Chen <EMAIL>, Sha Li <EMAIL>, Chi Han <EMAIL>, Manling Li <EMAIL>, Heng Ji <EMAIL>. |
| Pseudocode | No | The paper describes frameworks and methodologies (SyncMind, SyncBench) and provides figures illustrating processes (Fig. 3, Fig. 5), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/. |
| Open Datasets | Yes | Our code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/. SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. |
| Dataset Splits | Yes | In constructing our evaluation subset with 300 representative instances across 21 repositories, we downsample each repository's data to less than 15 instances while maintaining the original patch distribution over all sampled data, thereby applying the same task complexity distribution to all downsampled instances. As such, we finalize our evaluation samples as 300 instances with evenly distributed Caller and Callee samples (150 each). |
| Hardware Specification | No | The paper discusses 'computing resources for debugging and testing' and 'costly expenditure of extensive model evaluations', but does not provide specific hardware details such as GPU/CPU models or processor types. |
| Software Dependencies | Yes | We employ Docker (Founadi et al., 2013) to configure isolated, reproducible, and executable testing environments... platform linux, Python 3.11.9, pytest-8.3.2, pluggy-1.5.0 |
| Experiment Setup | Yes | Recovery Protocol. For baselines, each agent is allowed up to 30 turns to achieve Bn = Sn, which is then extended to 50 turns to assess agents' temporal resource awareness and exploitation. Financial resources are mapped similarly to each resource-aware recovery task. Provided with different action options interacting with Env, proposing a solution, or proactively seeking collaborator assistance (§2.2) both independent and collaborative agents take each of their moves autonomously. ... Setting the balanced cost of both solution-proposal and assistance-seeking as $100, we encourage agents to take these two recovery actions by providing them with an initial budget of $300, $1000, and $3000, respectively. Meanwhile, all experiments are conducted with the maximum time limit set to 30 turns |
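The "Dataset Splits" row describes downsampling each repository to fewer than 15 instances while preserving the original patch distribution. A minimal sketch of such proportional stratified sampling is shown below; the field name `patch_bucket` and the function name are hypothetical, not taken from the paper's released code.

```python
import random
from collections import defaultdict

def downsample_repo(instances, cap=15, seed=0):
    """Downsample one repository's instances to at most `cap`,
    keeping each patch-complexity bucket's proportional share
    (hypothetical sketch of the paper's stratified downsampling)."""
    rng = random.Random(seed)
    if len(instances) <= cap:
        return list(instances)
    # Stratify by a coarse patch-complexity bucket.
    strata = defaultdict(list)
    for inst in instances:
        strata[inst["patch_bucket"]].append(inst)
    sampled = []
    for bucket, members in sorted(strata.items()):
        # Each bucket keeps its share of the capped total (at least 1).
        share = max(1, round(cap * len(members) / len(instances)))
        sampled.extend(rng.sample(members, min(share, len(members))))
    return sampled[:cap]
```

With 20 "small" and 10 "large" instances and `cap=15`, the small bucket keeps 10 and the large bucket keeps 5, matching the original 2:1 ratio.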
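The "Experiment Setup" row specifies a turn limit (30) plus a dollar budget ($300/$1000/$3000), with solution-proposal and assistance-seeking each costing $100. The bookkeeping can be sketched as below; the class and action names are hypothetical illustrations of that accounting, not the paper's implementation.

```python
from dataclasses import dataclass

# Costs per the paper: $100 for either recovery action;
# treating Env interaction as free is an assumption of this sketch.
ACTION_COSTS = {
    "interact_env": 0,
    "propose_solution": 100,
    "seek_assistance": 100,
}

@dataclass
class RecoveryEpisode:
    budget: float = 300.0   # paper evaluates $300, $1000, and $3000
    max_turns: int = 30     # baseline turn limit from the paper

    turns_used: int = 0

    def can_act(self, action: str) -> bool:
        """An action is allowed only while turns and budget remain."""
        return (self.turns_used < self.max_turns
                and self.budget >= ACTION_COSTS[action])

    def step(self, action: str) -> None:
        """Consume one turn and the action's cost."""
        if not self.can_act(action):
            raise RuntimeError("episode over: out of turns or budget")
        self.turns_used += 1
        self.budget -= ACTION_COSTS[action]
```

Under the $300 setting, an agent can afford at most three paid actions before the budget is exhausted, even though turns remain.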