CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning
Authors: Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Michael Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael Brenner, Sameera Ponda, Subhashini Venugopalan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a range of closed and open LLMs on tasks in CURIE which require domain expertise, comprehension of long in-context information, and multi-step reasoning. |
| Researcher Affiliation | Collaboration | 1Google, 2Harvard, 3University of Zurich, 4NIST, 5UMD College Park, 6Rutgers, 7FU Berlin, 8Modelyst, 9Cornell |
| Pseudocode | No | The paper describes tasks for LLMs that involve generating code (e.g., 'DFT-C: Write python code for DFT calculations' and Figure 13 is a 'Prompt for generating Python code'). However, the paper does not contain pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | Evaluation code and data links in: https://github.com/google/curie |
| Open Datasets | Yes | We make the data, prompts, and code available in https://github.com/google/curie under the Apache 2.0 license. Our dataset is available under a CC-BY license. |
| Dataset Splits | Yes | The CURIE benchmark introduces 10 tasks, with a total of 580 input and solution pairs based on 429 research documents across six diverse scientific disciplines: materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins, covering both experimental and theoretical aspects of scientific research. We follow a standard zero-shot prompt template across tasks... Performance is reported for each model on each task using a single run. |
| Hardware Specification | No | The paper evaluates several state-of-the-art LLMs but does not specify the hardware (e.g., CPU, GPU models) used to run these evaluations for the experiments described in the paper. |
| Software Dependencies | No | The paper mentions using the 'Atomic Simulation Environment (ASE) library' and the 'Biopython library', but it does not specify any version numbers for these software dependencies. |
| Experiment Setup | Yes | We follow a standard zero-shot prompt template across tasks, first describing the task the model needs to perform and the desired output format, and then providing the text of the full paper (except for LongLLaMA-3B, where the full paper was provided first). In the case of DFT and MPV tasks, we provide the output format in the context of an additional hand-crafted excerpt to clarify the expected format for each field. The BIOGR task is multimodal, and for this we provide just the image and caption as input, rather than the full paper. Performance is reported for each model on each task using a single run. |
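The zero-shot setup described above (task description first, then the desired output format, then the full paper text) can be sketched as a simple prompt builder. This is an illustrative reconstruction, not the authors' actual code; the function name, placeholder task, and output format string are all hypothetical.

```python
# Minimal sketch of the zero-shot prompt template described in the paper:
# task description, then output format, then the full paper text.
# (For LongLLaMA-3B the paper reports placing the full paper first instead.)

def build_zero_shot_prompt(task_description: str, output_format: str, paper_text: str) -> str:
    """Assemble a single zero-shot prompt in the order described: task, format, paper."""
    return (
        f"Task: {task_description}\n\n"
        f"Output format: {output_format}\n\n"
        f"Paper:\n{paper_text}"
    )

# Hypothetical usage with placeholder content.
prompt = build_zero_shot_prompt(
    task_description="Extract the material properties reported in the paper.",
    output_format="Return a JSON list of {property, value, unit} objects.",
    paper_text="(full text of the research paper goes here)",
)
```

A single template like this, applied uniformly across tasks, is what makes the per-task comparisons in the benchmark meaningful: only the task description and output format vary, not the prompting strategy.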