Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models
Authors: Veeramakali Vignesh Manivannan, Yasaman Jafari, Srikar Eranky, Spencer Ho, Rose Yu, Duncan Watson-Parris, Yian Ma, Leon Bergen, Taylor Berg-Kirkpatrick
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this issue, we develop ClimaGen (Climate QA Generator), an adaptive learning framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. As a result, we present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science. Finally, we develop evaluation strategies and compare different LLMs on our benchmarks. Our results offer novel insights into various approaches used to enhance knowledge of climate LLMs. |
| Researcher Affiliation | Academia | Veeramakali Vignesh Manivannan, Yasaman Jafari, Srikar Eranky, Spencer Ho, Rose Yu, Duncan Watson-Parris, Yian Ma, Leon Bergen & Taylor Berg-Kirkpatrick University of California, San Diego EMAIL |
| Pseudocode | No | The paper describes the ClimaGen framework and QA generation pipeline in natural language and figures (e.g., Figure 1), but it does not include explicitly labeled pseudocode or algorithm blocks. The process steps are explained within the main text of sections like '4.2 QA GENERATION FRAMEWORK'. |
| Open Source Code | Yes | ClimaQA's source code is publicly available at https://github.com/Rose-STL-Lab/genie-climaqa |
| Open Datasets | Yes | The ClimaQA dataset, both gold and silver, is publicly available at Hugging Face. While we cannot release the scraped textbook data due to copyright restrictions, the references to all textbooks used are provided in Appendix A.1, allowing for reconstruction of this dataset. |
| Dataset Splits | Yes | Table 2: Contents of the ClimaQA dataset. Both ClimaQA-Gold and ClimaQA-Silver include 3 task-forms with varying levels of complexity for MCQ and Freeform. ClimaQA-Gold: MCQ (Base 126, Reasoning 72, Hypothetical 47, Total 245); Freeform (Base 54, Reasoning 52, Hypothetical 55, Total 161); Cloze (160). ClimaQA-Silver: MCQ (Base 501, Reasoning 264, Hypothetical 235, Total 1000); Freeform (Base 507, Reasoning 241, Hypothetical 252, Total 1000); Cloze (1000). |
| Hardware Specification | No | The paper mentions using specific LLMs for inference (e.g., gemma-27b, llama3-70b, mixtral-8x22b, gpt-3.5-turbo, gpt-4o) and platforms (Together AI, OpenAI), but does not specify the underlying hardware (GPU models, CPU types, etc.) used for running their experiments or training their models (e.g., for continued pre-training or fine-tuning). |
| Software Dependencies | No | The paper mentions using specific LLM models (e.g., gemma-27b, llama3-70b, mixtral-8x22b, gpt-3.5-turbo, gpt-4o) and the LoRA technique, but it does not specify software dependencies with version numbers (e.g., specific Python libraries like PyTorch, Transformers, or their versions). |
| Experiment Setup | Yes | We evaluate each of these models in 3 settings: default, few-shot prompting (FS) (Brown, 2020), and Retrieval Augmented Generation (RAG) (Lewis et al., 2020). For the MCQs, the models were prompted to output a single letter representing the correct option, and the top-most token was chosen as the answer. For Freeform QA, the models were prompted to output concise answers with a maximum of 2 sentences. For Cloze QA, the models were prompted to output a single scientific word that best fits the blank with respect to the context around it. A.3 TRAINING DETAILS: We used Llama3.1-8B and Mistral-7B-v0.3 as our base models and performed continued pre-training and fine-tuning on them. We utilized the Low-Rank Adaptation (LoRA) (Hu et al., 2021) technique for efficient continued pre-training and fine-tuning. A.3.1 CONTINUED PRE-TRAINING ON GRADUATE TEXTBOOK DATA. Table 8: Parameters used for graduate textbook continued pre-training: Mistral-7b-v0.3 (LoRA Rank 64, LoRA Alpha 16, 1 epoch, learning rate 5e-5); Llama-3.1-8b (LoRA Rank 16, LoRA Alpha 16, 2 epochs, learning rate 2e-5). A.3.2 FINE-TUNING ON CLIMAQA-SILVER. Table 9: Parameters used for question fine-tuning: Mistral-7b-v0.3 (LoRA Rank 16, LoRA Alpha 16, 3 epochs, learning rate 5e-5); Llama-3.1-8b (LoRA Rank 16, LoRA Alpha 16, 3 epochs, learning rate 5e-5). |
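The MCQ decoding rule quoted in the Experiment Setup row (prompt for a single option letter, then take the top-most token as the answer) can be sketched as below. This is a minimal illustration, not the authors' code: the `pick_mcq_answer` helper, the restriction to options A–D, and the logit values are all assumptions for demonstration.

```python
# Sketch of the MCQ scoring rule described in the setup: the model is
# prompted to emit one option letter, and the answer is the option whose
# next-token logit is highest. The logits dict below is a hypothetical
# stand-in for a real model's next-token scores.

def pick_mcq_answer(first_token_logits, options=("A", "B", "C", "D")):
    """Return the option letter with the highest next-token logit."""
    return max(options, key=lambda opt: first_token_logits.get(opt, float("-inf")))

# Hypothetical logits for the four option letters:
logits = {"A": -1.3, "B": 2.7, "C": 0.4, "D": -0.9}
print(pick_mcq_answer(logits))  # prints "B", the option with the highest logit
```

Scoring from the first token's logits rather than parsing free-form output avoids ambiguity when a model pads its answer with extra text.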