CursorCore: Assist Programming through Aligning Anything
Authors: Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose a new framework that comprehensively integrates these information sources, and collect data to train models and evaluate their performance. Firstly, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. |
| Researcher Affiliation | Collaboration | 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center 3iFLYTEK Co., Ltd. |
| Pseudocode | No | The paper describes methods and processes in narrative text and uses code examples in figures (e.g., Figure 1, Figure 2) to illustrate concepts, but does not contain any structured pseudocode or algorithm blocks with step-by-step procedures. |
| Open Source Code | Yes | Code, models and data are freely available at https://github.com/TechxGenus/CursorCore. |
| Open Datasets | Yes | For AI Programmer, we gather code snippets from datasets such as the Stack (Kocetkov et al., 2023) and OSS-Instruct (Wei et al., 2023b), then prompt LLMs to generate the programming process. For Git Commit data, we collect relevant information from EditPackFT (Cassano et al., 2023b) (a filtered version of CommitPackFT (Muennighoff et al., 2024)) and further refine it through post-processing and filtering. Regarding Online Judge Submission data, we source the programming process from the CodeNet dataset (Puri et al., 2021). ... we also incorporate the Evol-Instruct dataset (ISE-UIUC, 2023) collected using the GPT series (Ouyang et al., 2022) |
| Dataset Splits | No | The paper states that 219K samples are generated for training data, and a new benchmark APEval is introduced for evaluation. While APEval's collection process is described, there is no explicit mention of how the 219K training samples were split into training, validation, or test sets for the CursorCore models' development. The paper mentions evaluating on APEval's Python version using the test set created by EvalPlus (Liu et al., 2023), but this refers to the evaluation benchmark, not the internal splits of their model's training data. |
| Hardware Specification | Yes | For Mistral-Large-Instruct, we quantize the model using the GPTQ (Frantar et al., 2022) algorithm and deploy it locally with SGLang (Zheng et al., 2023a) and Marlin kernel (Frantar et al., 2024) on 4 Nvidia RTX 4090 GPUs. |
| Software Dependencies | Yes | Our models are trained for 2 epochs using the Transformers library (Wolf et al., 2020). We enhance memory efficiency and speed with techniques such as DeepSpeed ZeRO-3 (Rajbhandari et al., 2019), ZeRO-Offload (Ren et al., 2021), FlashAttention-2 (Dao, 2024), and Triton kernels (Hsu et al., 2024). ... The training process employs the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate of 5e-5, coupled with a cosine scheduler featuring 15 warm-up steps. |
| Experiment Setup | Yes | Our models are trained for 2 epochs using the Transformers library (Wolf et al., 2020). ... The training process employs the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate of 5e-5, coupled with a cosine scheduler featuring 15 warm-up steps. |
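The learning-rate schedule quoted above (peak 5e-5, cosine decay, 15 warm-up steps) can be sketched as a standalone function. This is an illustrative reconstruction, not the authors' code: the paper does not publish the schedule implementation, the `total_steps` value here is a hypothetical placeholder, and in practice the Transformers library's built-in cosine scheduler would compute this internally.

```python
import math

# Hyperparameters reported in the paper's experiment setup.
PEAK_LR = 5e-5
WARMUP_STEPS = 15

def lr_at(step: int, total_steps: int) -> float:
    """Linear warm-up for WARMUP_STEPS steps, then cosine decay to zero.

    `total_steps` is a placeholder; the actual value depends on the
    dataset size, batch size, and the 2 training epochs.
    """
    if step < WARMUP_STEPS:
        # Linear ramp from PEAK_LR/WARMUP_STEPS up to PEAK_LR.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the post-warm-up schedule completed, in [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(14, 1000)` returns the peak rate 5e-5 at the end of warm-up, and the rate decays smoothly toward zero as `step` approaches `total_steps`.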