SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering
Authors: Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agents' 3.33% to Claude-3.5-Sonnet's 28.18%), their consistently low collaboration willingness (4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync ... |
| Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign 2All Hands AI 3Northwestern University. Work done during internship at UIUC. Correspondence to: Xuehang Guo <EMAIL>, Xingyao Wang <EMAIL, EMAIL>, Yangyi Chen <EMAIL>, Sha Li <EMAIL>, Chi Han <EMAIL>, Manling Li <EMAIL>, Heng Ji <EMAIL>. |
| Pseudocode | No | The paper describes frameworks and methodologies (SyncMind, SyncBench) and provides figures illustrating processes (Fig. 3, Fig. 5), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/. |
| Open Datasets | Yes | Our code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/. SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. |
| Dataset Splits | Yes | In constructing our evaluation subset with 300 representative instances across 21 repositories, we downsample each repository's data to less than 15 instances while maintaining the original patch distribution over all sampled data, thereby applying the same task complexity distribution to all downsampled instances. As such, we finalize our evaluation samples as 300 instances with evenly distributed Caller and Callee samples (150 each). |
| Hardware Specification | No | The paper discusses 'computing resources for debugging and testing' and 'costly expenditure of extensive model evaluations', but does not provide specific hardware details such as GPU/CPU models or processor types. |
| Software Dependencies | Yes | We employ Docker (Founadi et al., 2013) to configure isolated, reproducible, and executable testing environments... platform linux, Python 3.11.9, pytest-8.3.2, pluggy-1.5.0 |
| Experiment Setup | Yes | Recovery Protocol. For baselines, each agent is allowed up to 30 turns to achieve Bn = Sn, which is then extended to 50 turns to assess agents' temporal resource awareness and exploitation. Financial resources are mapped similarly to each resource-aware recovery task. Provided with different action options interacting with Env, proposing a solution, or proactively seeking collaborator assistance (§2.2) both independent and collaborative agents take each of their moves autonomously. ... Setting the balanced cost of both solution-proposal and assistance-seeking as $100, we encourage agents to take these two recovery actions by providing them with an initial budget of $300, $1000, and $3000, respectively. Meanwhile, all experiments are conducted with the maximum time limit set to 30 turns |
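The "Dataset Splits" row describes downsampling each repository to fewer than 15 instances while preserving the original patch distribution. A minimal sketch of such proportional stratified sampling is shown below; the field name `patch_bucket` and the function name are hypothetical, not taken from the paper's released code.

```python
import random
from collections import defaultdict

def downsample_repo(instances, cap=15, seed=0):
    """Downsample one repository's instances to at most `cap`,
    keeping each patch-complexity bucket's proportional share
    (hypothetical sketch of the paper's stratified downsampling)."""
    rng = random.Random(seed)
    if len(instances) <= cap:
        return list(instances)
    # Stratify by a coarse patch-complexity bucket.
    strata = defaultdict(list)
    for inst in instances:
        strata[inst["patch_bucket"]].append(inst)
    sampled = []
    for bucket, members in sorted(strata.items()):
        # Each bucket keeps its share of the capped total (at least 1).
        share = max(1, round(cap * len(members) / len(instances)))
        sampled.extend(rng.sample(members, min(share, len(members))))
    return sampled[:cap]
```

With 20 "small" and 10 "large" instances and `cap=15`, the small bucket keeps 10 and the large bucket keeps 5, matching the original 2:1 ratio.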
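The "Experiment Setup" row specifies a turn limit (30) plus a dollar budget ($300/$1000/$3000), with solution-proposal and assistance-seeking each costing $100. The bookkeeping can be sketched as below; the class and action names are hypothetical illustrations of that accounting, not the paper's implementation.

```python
from dataclasses import dataclass

# Costs per the paper: $100 for either recovery action;
# treating Env interaction as free is an assumption of this sketch.
ACTION_COSTS = {
    "interact_env": 0,
    "propose_solution": 100,
    "seek_assistance": 100,
}

@dataclass
class RecoveryEpisode:
    budget: float = 300.0   # paper evaluates $300, $1000, and $3000
    max_turns: int = 30     # baseline turn limit from the paper

    turns_used: int = 0

    def can_act(self, action: str) -> bool:
        """An action is allowed only while turns and budget remain."""
        return (self.turns_used < self.max_turns
                and self.budget >= ACTION_COSTS[action])

    def step(self, action: str) -> None:
        """Consume one turn and the action's cost."""
        if not self.can_act(action):
            raise RuntimeError("episode over: out of turns or budget")
        self.turns_used += 1
        self.budget -= ACTION_COSTS[action]
```

Under the $300 setting, an agent can afford at most three paid actions before the budget is exhausted, even though turns remain.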