CURIE: Evaluating LLMs on Multitask Scientific Long-Context Understanding and Reasoning
Authors: Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Michael Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael Brenner, Sameera Ponda, Subhashini Venugopalan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a range of closed and open LLMs on tasks in CURIE which require domain expertise, comprehension of long in-context information, and multi-step reasoning. |
| Researcher Affiliation | Collaboration | 1Google, 2Harvard, 3University of Zurich, 4NIST, 5UMD College Park, 6Rutgers, 7FU Berlin, 8Modelyst, 9Cornell |
| Pseudocode | No | The paper describes tasks for LLMs that involve generating code (e.g., 'DFT-C: Write python code for DFT calculations' and Figure 13 is a 'Prompt for generating Python code'). However, the paper does not contain pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | Evaluation code and data links in: https://github.com/google/curie |
| Open Datasets | Yes | We make the data, prompts, and code available in https://github.com/google/curie under the Apache 2.0 license. Our dataset is available under a CC-BY license. |
| Dataset Splits | Yes | The CURIE benchmark introduces 10 tasks, with a total of 580 input and solution pairs based on 429 research documents across six diverse scientific disciplines: materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins, covering both experimental and theoretical aspects of scientific research. We follow a standard zero-shot prompt template across tasks... Performance is reported for each model on each task using a single run. |
| Hardware Specification | No | The paper evaluates several state-of-the-art LLMs but does not specify the hardware (e.g., CPU, GPU models) used to run these evaluations for the experiments described in the paper. |
| Software Dependencies | No | The paper mentions using the 'Atomic Simulation Environment (ASE) library' and the 'Biopython library', but it does not specify any version numbers for these software dependencies. |
| Experiment Setup | Yes | We follow a standard zero-shot prompt template across tasks, first describing the task the model needs to perform and the desired output format, and then providing the text of the full paper (except for LongLLaMA-3B, where the full paper was provided first). In the case of DFT and MPV tasks, we provide the output format in the context of an additional hand-crafted excerpt to clarify the expected format for each field. The BIOGR task is multimodal, and for this we provide just the image and caption as input, rather than the full paper. Performance is reported for each model on each task using a single run. |
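The zero-shot setup described above (task description first, then the desired output format, then the full paper text) can be sketched as a simple prompt builder. This is an illustrative reconstruction, not the authors' actual code; the function name, placeholder task, and output format string are all hypothetical.

```python
# Minimal sketch of the zero-shot prompt template described in the paper:
# task description, then output format, then the full paper text.
# (For LongLLaMA-3B the paper reports placing the full paper first instead.)

def build_zero_shot_prompt(task_description: str, output_format: str, paper_text: str) -> str:
    """Assemble a single zero-shot prompt in the order described: task, format, paper."""
    return (
        f"Task: {task_description}\n\n"
        f"Output format: {output_format}\n\n"
        f"Paper:\n{paper_text}"
    )

# Hypothetical usage with placeholder content.
prompt = build_zero_shot_prompt(
    task_description="Extract the material properties reported in the paper.",
    output_format="Return a JSON list of {property, value, unit} objects.",
    paper_text="(full text of the research paper goes here)",
)
```

A single template like this, applied uniformly across tasks, is what makes the per-task comparisons in the benchmark meaningful: only the task description and output format vary, not the prompting strategy.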