OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models
Authors: Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, Jingbo Shang, Julian McAuley
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs' general abilities in downstream tasks or their internal knowledge. In this section, we evaluate our proposed method, OCEAN, by conducting chain-of-thought alignment on four LLM backbone models and evaluating on several downstream tasks. |
| Researcher Affiliation | Collaboration | 1UC San Diego, 2The University of New South Wales, 3East China Normal University, 4Adobe Research, 5CSIRO's Data61 |
| Pseudocode | No | The paper describes methods like the KG-IPS estimator and policy gradient optimization using mathematical formulations and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. |
| Open Datasets | Yes | For knowledge-intensive reasoning, we use datasets that require deep domain understanding. ARC (Clark et al., 2018) tests advanced reasoning with grade-school science questions, PubMedQA (Jin et al., 2019) assesses biomedical reasoning from abstracts, and SciQA (Auer et al., 2023) challenges models using the Open Research Knowledge Graph. For multi-hop reasoning, where models combine multiple sources, we use HotpotQA (Yang et al., 2018) (reasoning across Wikipedia articles), MuSiQue (Trivedi et al., 2022) (requiring 2-4 inference hops), and StrategyQA (Geva et al., 2021) (testing implicit reasoning). For commonsense reasoning, we evaluate using three commonsense QA benchmarks (CSQA (Talmor et al., 2021), CSQA2 (Saha et al., 2018), and CSQA-COT1000 (Li et al., 2024a)), along with OpenBookQA (Mihaylov et al., 2018) and WinoGrande (Sakaguchi et al., 2021). These tasks test models' general commonsense question-answering abilities. [...] For chain-of-thought alignment in OCEAN, we use the CWQ question-answering dataset (Talmor & Berant, 2018) as the source data, in which the question-answering pairs are developed from knowledge graphs. [...] Wikidata5M (Wang et al., 2021) knowledge graph. |
| Dataset Splits | Yes | We also use each test/validation split for each dataset and report policy evaluation V̂(θ) results. [...] We also use the test/validation split for each dataset to report estimated policy values V̂(θ). |
| Hardware Specification | No | The paper lists the LLM backbone models used (e.g., Gemma-2, Llama-3, Phi-3.5-mini, Mistral-0.2) but does not provide any specific details about the hardware (GPUs, CPUs, etc.) used for conducting their experiments or fine-tuning. |
| Software Dependencies | No | The paper mentions applying LoRA (Hu et al., 2021) for instruction tuning and using the pre-trained GPT2-Medium model (Radford et al., 2019), but it does not provide specific version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For the instruction tuning experiments, we apply LoRA (Hu et al., 2021) to the pre-trained model and fine-tune it on each dataset for 10 epochs. Throughout these experiments, the rank parameter in LoRA is fixed at 16, and we set α in LoRA to 32 across all tasks. [...] The model is then fine-tuned with a base learning rate of 1e-4 for 10 epochs with a linear learning rate scheduler. |
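The policy values V̂(θ) reported above rest on inverse propensity scoring (IPS), of which the paper's KG-IPS estimator is a knowledge-graph-aware variant. The paper gives no pseudocode, so the sketch below shows only the generic IPS form (an average of importance-weighted rewards); the function and variable names are illustrative, not from the paper.

```python
def ips_value_estimate(rewards, target_probs, behavior_probs):
    """Generic IPS estimate of a target policy's value V(theta):
    the mean reward reweighted by target/behavior probability ratios.
    Illustrative only; the paper's KG-IPS estimator adds
    knowledge-graph-based corrections not shown here."""
    assert len(rewards) == len(target_probs) == len(behavior_probs)
    total = 0.0
    for r, p_target, p_behavior in zip(rewards, target_probs, behavior_probs):
        total += (p_target / p_behavior) * r
    return total / len(rewards)

# Toy example: three logged samples with binary rewards.
v_hat = ips_value_estimate(
    rewards=[1.0, 0.0, 1.0],
    target_probs=[0.9, 0.2, 0.5],
    behavior_probs=[0.5, 0.4, 0.5],
)
print(round(v_hat, 2))  # (1.8 + 0.0 + 1.0) / 3 ≈ 0.93
```

Samples the target policy favors more than the logging policy (ratio > 1) are up-weighted, which is what lets an offline estimator score a new policy from old data.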
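The LoRA hyperparameters in the setup above (rank r = 16, α = 32) parameterize a low-rank weight update of the form W + (α/r)·BA. A minimal NumPy sketch of that update follows; the layer dimensions are chosen arbitrarily for illustration, since the paper does not specify them.

```python
import numpy as np

# Hypothetical layer shape; the paper does not report layer dimensions.
d_out, d_in = 64, 64
r, lora_alpha = 16, 32  # rank and alpha as reported in the paper

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base weight plus the scaled low-rank update (alpha/r) * B @ A @ x.
    return W @ x + (lora_alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the update contributes nothing at the start
# of fine-tuning, so the adapted layer matches the frozen one exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B (r·(d_in + d_out) parameters) would be trained, which is why LoRA keeps instruction tuning cheap relative to full fine-tuning.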