AAAR-1.0: Assessing AI’s Potential to Assist Research

Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in three fundamental, expertise-intensive research tasks: (i) EQUATIONINFERENCE, assessing the correctness of equations based on the contextual information in paper submissions; (ii) EXPERIMENTDESIGN, designing experiments to validate research ideas and solutions; and (iii) PAPERWEAKNESS, identifying weaknesses in paper submissions. An evaluation of both open-source and closed-source LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We conduct extensive experiments across numerous mainstream LLMs.
Researcher Affiliation | Collaboration | (1) Pennsylvania State University; (2) Netflix; (3) University of California, Davis; (4) University of Illinois Chicago; (5) Individual Researcher; (6) University of Alabama at Birmingham; (7) Ohio State University. Correspondence to: Renze Lou <EMAIL>, Wenpeng Yin <EMAIL>.
Pseudocode | No | The paper describes tasks and evaluation metrics, but does not present any structured pseudocode or algorithm blocks for a specific method or procedure. The text only refers to algorithms in a general sense within the context of the tasks, for example, 'This paper proposes an algorithm [ ], the result z is defined as below: z = where W is the parameter, a and b are the [ ]'.
Open Source Code | No | The paper provides a project webpage: 'Project Webpage: https://renzelou.github.io/AAAR-1.0/'. While this is a project page, it does not explicitly state that the source code for the methodology described in the paper is being released or provide a direct link to a code repository.
Open Datasets | Yes | In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in three fundamental, expertise-intensive research tasks: (i) EQUATIONINFERENCE... (ii) EXPERIMENTDESIGN... and (iii) PAPERWEAKNESS... We are committed to the careful distribution of data collected in our research, ensuring it is used solely for research purposes. Project Webpage: https://renzelou.github.io/AAAR-1.0/
Dataset Splits | No | For EQINFER, it mentions '# of positive equations 1,049 # of negative equations 3,147'. For WEAKNESS, it says 'we then uniformly sample papers from different research tracks to improve the domain diversity. Meanwhile, during sampling, we also keep the accept/reject papers distributed equally to avoid data bias. In a word, we finally collect a total of 1,000 papers (500 accepted; 500 rejected), uniformly covering all 13 tracks.' While these describe the dataset composition and collection, the paper does not explicitly provide training/validation/test splits for any of the datasets (EQINFER, EXPDESIGN, or WEAKNESS) for model evaluation.
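The WEAKNESS sampling procedure quoted above (uniform over 13 tracks, balanced accept/reject) can be sketched as stratified sampling over (track, decision) buckets. This is a minimal sketch under an assumed record schema ('track', 'accepted'); the function name and data format are hypothetical, not the paper's actual pipeline:

```python
import random
from collections import defaultdict

def sample_papers(papers, per_label=500, tracks=13, seed=0):
    """Stratified sample: equal quota per (track, accept/reject) cell.

    `papers` is assumed to be a list of dicts with a 'track' key
    (0..tracks-1) and an 'accepted' bool -- an illustrative schema.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for p in papers:
        buckets[(p["track"], p["accepted"])].append(p)
    quota = per_label // tracks  # 500 // 13 = 38 papers per cell
    picked = []
    for _, group in sorted(buckets.items()):
        rng.shuffle(group)
        picked.extend(group[:quota])
    return picked
```

Note the integer quota means 13 tracks cannot divide 500 exactly, so a real pipeline would need a remainder-handling rule the paper does not specify.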
Hardware Specification | Yes | We use Pytorch 2.4.0 with CUDA 12.1, and use 8 NVIDIA A100 GPUs for the LLMs inference.
Software Dependencies | Yes | We use Pytorch 2.4.0 with CUDA 12.1, and use 8 NVIDIA A100 GPUs for the LLMs inference.
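A reproduction attempt would start by checking the environment against the reported stack (PyTorch 2.4.0, CUDA 12.1, 8x A100). The sketch below only uses standard `torch` introspection APIs; the helper name is illustrative, and it degrades gracefully when PyTorch is absent:

```python
import importlib.util

def report_env():
    """Summarize the local torch/CUDA/GPU setup as a short string."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    parts = [f"torch {torch.__version__}", f"cuda {torch.version.cuda}"]
    if torch.cuda.is_available():
        parts.append(f"gpus {torch.cuda.device_count()}")
    return "; ".join(parts)
```

Comparing this output against "torch 2.4.0; cuda 12.1; gpus 8" would confirm a matching setup.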
Experiment Setup | Yes | Settings. As different LLMs have distinct context windows, to ensure a fair comparison, we fix the maximum input length for all models. According to Table 7, we empirically use 1,000 words for both contexts before and after equations, i.e., 2,000 surrounding words. ... Settings. Similarly, we unify the input context length of different LLMs to ensure a fair comparison. According to Table 8, we set 2,000 and 3,000 input words for open- and closed-source LLMs, respectively. ... For the length of each small piece, we set 2,000 and 3,000 words for open- and closed-source LLMs, respectively.
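The fixed-context settings quoted above reduce to two operations: clipping a window of words around an equation, and splitting a long paper into fixed-length pieces. The sketch below assumes word-level tokenization via whitespace splitting; the function names and signatures are illustrative, not the paper's implementation:

```python
def clip_context(words_before, words_after, max_each=1000):
    """Keep at most `max_each` words immediately before and after an
    equation (1,000 + 1,000 = the quoted 2,000 surrounding words)."""
    return words_before[-max_each:], words_after[:max_each]

def chunk_words(words, piece_len):
    """Split a long document into consecutive pieces of `piece_len`
    words (2,000 for open-source LLMs, 3,000 for closed-source, per
    the quoted settings); the last piece may be shorter."""
    return [words[i:i + piece_len] for i in range(0, len(words), piece_len)]
```

Fixing the word budget rather than the token budget, as described here, sidesteps tokenizer differences across model families at the cost of slightly uneven token counts.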