CycleResearcher: Improving Automated Research via Automated Review
Authors: Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, Linyi Yang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that CycleReviewer achieves promising performance with a 26.89% reduction in mean absolute error (MAE) compared to individual human reviewers in predicting paper scores... Our experiments show that CycleReviewer demonstrates promising capabilities in supporting the peer review process, while CycleResearcher exhibits consistent performance in research ideation and experimental design compared to API-based agents (Lu et al., 2024). |
| Researcher Affiliation | Academia | 1Research Center for Industries of the Future, Westlake University 2School of Engineering, Westlake University 3Zhejiang University 4William & Mary 5University College London |
| Pseudocode | No | The paper describes methods and processes through narrative text and figures (e.g., Figure 2: Iterative Training Framework, Section 3: ITERATIVE TRAINING FRAMEWORK) and mathematical equations, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code, dataset and model weight are released at https://wengsyx.github.io/Researcher/. ... We have made extensive efforts to ensure the reproducibility of all results presented in this paper. Firstly, the models discussed in this work, including Cycle Researcher and Cycle Reviewer, will be made available as open-source, along with detailed documentation for setup and usage (See in Section 3.1, Section 3.2, and Appendix F). |
| Open Datasets | Yes | To train these models, we develop two new datasets, Review-5k and Research-14k, reflecting real-world machine learning research and peer review dynamics. ... We release two large-scale datasets, Review-5k and Research-14k, which are publicly available and designed to capture the complexity of both peer review and research paper generation in machine learning. |
| Dataset Splits | Yes | Finally, we split our dataset into mutually exclusive training/testing sets: we keep 4,189 paper reviews for training and 782 samples for testing. ... After filtering papers that do not meet the requirements, the final dataset, Research-14k, includes 12,696 training samples and 802 test samples. ... The training and test sets are split chronologically, with test papers published later than the training ones. |
| Hardware Specification | Yes | We use the Mistral-Large-2 model with LoRA-GA (Wang et al., 2024a) on an 8x H100 80G cluster, with a learning rate of 1e-5 and a batch size of 4x8, for 12 epochs on the Reviewer-5k dataset. ... All models are trained using 8x H100 GPUs and DeepSpeed + ZeRO2 (Rajbhandari et al., 2020; Rasley et al., 2020). ... costing approximately $20 and taking 6 hours on a single A100 GPU server. |
| Software Dependencies | No | The paper mentions specific frameworks and techniques like "LoRA-GA" and "DeepSpeed + ZeRO2". However, it does not provide specific version numbers for these, nor does it list core software components like Python, PyTorch, or TensorFlow with their respective versions. |
| Experiment Setup | Yes | We use the Mistral-Large-2 model with LoRA-GA (Wang et al., 2024a) on an 8x H100 80G cluster, with a learning rate of 1e-5 and a batch size of 4x8, for 12 epochs on the Reviewer-5k dataset. ... We maximized context length by setting the 12B model to 32K tokens, while the 72B and 123B models were set to 24K tokens. Given memory constraints, samples exceeding the preset context length are randomly truncated. We use a batch size of 2x8, a learning rate of 4e-5, and train for a total of 12,000 steps. ... In our experiments, we utilize a uniform architecture for all networks consisting of an embedding layer with an input size equal to the input dataset value and a three-layer MLP with a hidden dimension of size 50. ... The default learning rates for our experiments were 1e-4 for random initialization (tuned from 1e-1 to 1e-5) and 5e-4 for structured initialization (tuned from 1e-3 to 1e-5). Consistent across runs, we use an L2 regularization coefficient (α) of 0.05, a dropout rate (pdropout) of 0.3, gradient accumulation of 40, and a batch size of 50000. We set the first momentum (β1) and second momentum weights (β2) to 0.9 and 0.999 respectively. Each network was trained for 7000 update steps. |
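The chronological train/test split quoted in the Dataset Splits row (test papers published strictly later than all training papers) can be sketched as follows. This is a minimal illustration only: the record fields, paper IDs, and cutoff date are invented for the example and are not taken from the released Review-5k or Research-14k datasets.

```python
from datetime import date

# Hypothetical records standing in for dataset entries; the real datasets'
# schema is not specified here, only that the split is chronological.
papers = [
    {"id": "p1", "published": date(2022, 3, 1)},
    {"id": "p2", "published": date(2023, 6, 15)},
    {"id": "p3", "published": date(2024, 1, 10)},
    {"id": "p4", "published": date(2024, 5, 2)},
]

def chronological_split(records, cutoff):
    """Split records by publication date so that every test paper is
    published after every training paper, as the paper describes."""
    train = [r for r in records if r["published"] <= cutoff]
    test = [r for r in records if r["published"] > cutoff]
    return train, test

train, test = chronological_split(papers, cutoff=date(2023, 12, 31))
print([r["id"] for r in train])  # papers at or before the cutoff
print([r["id"] for r in test])   # papers after the cutoff
```

Splitting by a date cutoff (rather than randomly) guards against temporal leakage: a reviewer model evaluated this way cannot have seen later papers, or reviews referencing them, during training.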