OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models
Authors: Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, Jingbo Shang, Julian McAuley
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs' general abilities in downstream tasks or their internal knowledge. In this section, we evaluate our proposed method, OCEAN, by conducting chain-of-thought alignment on four LLM backbone models and evaluating on several downstream tasks. |
| Researcher Affiliation | Collaboration | 1UC San Diego, 2The University of New South Wales, 3East China Normal University, 4Adobe Research, 5CSIRO's Data61 |
| Pseudocode | No | The paper describes methods like the KG-IPS estimator and policy gradient optimization using mathematical formulations and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. |
| Open Datasets | Yes | For knowledge-intensive reasoning, we use datasets that require deep domain understanding. ARC (Clark et al., 2018) tests advanced reasoning with grade-school science questions, PubMedQA (Jin et al., 2019) assesses biomedical reasoning from abstracts, and SciQA (Auer et al., 2023) challenges models using the Open Research Knowledge Graph. For multi-hop reasoning, where models combine multiple sources, we use HotpotQA (Yang et al., 2018) (reasoning across Wikipedia articles), MuSiQue (Trivedi et al., 2022) (requiring 2-4 inference hops), and StrategyQA (Geva et al., 2021) (testing implicit reasoning). For commonsense reasoning, we evaluate using three commonsense QA benchmarks (CSQA (Talmor et al., 2021), CSQA2 (Saha et al., 2018), and CSQA-COT1000 (Li et al., 2024a)), along with OpenBookQA (Mihaylov et al., 2018) and WinoGrande (Sakaguchi et al., 2021). These tasks test models' general commonsense question-answering abilities. [...] For chain-of-thought alignment in OCEAN, we use the CWQ question-answering dataset (Talmor & Berant, 2018) as the source data, in which the question-answering pairs are developed from knowledge graphs. [...] Wikidata5M (Wang et al., 2021) knowledge graph. |
| Dataset Splits | Yes | We also use each test/validation split for each dataset and report policy evaluation V̂(θ) results. [...] We also use the test/validation split for each dataset to report estimated policy values V̂(θ). |
| Hardware Specification | No | The paper lists the LLM backbone models used (e.g., Gemma-2, Llama-3, Phi-3.5-mini, Mistral-0.2) but does not provide any specific details about the hardware (GPUs, CPUs, etc.) used for conducting their experiments or fine-tuning. |
| Software Dependencies | No | The paper mentions applying LoRA (Hu et al., 2021) for instruction tuning and using the pre-trained GPT2-Medium model (Radford et al., 2019), but it does not provide specific version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For the instruction tuning experiments, we apply LoRA (Hu et al., 2021) to the pre-trained model and fine-tune it on each dataset for 10 epochs. Throughout these experiments, the rank parameter in LoRA is fixed at 16, and we set α in LoRA to 32 across all tasks. [...] The model is then fine-tuned with a base learning rate of 1e-4 for 10 epochs with a linear learning rate scheduler. |
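The policy values V̂(θ) reported above rest on inverse propensity scoring (IPS), of which the paper's KG-IPS estimator is a knowledge-graph-aware variant. The paper gives no pseudocode, so the sketch below shows only the generic IPS form (an average of importance-weighted rewards); the function and variable names are illustrative, not from the paper.

```python
def ips_value_estimate(rewards, target_probs, behavior_probs):
    """Generic IPS estimate of a target policy's value V(theta):
    the mean reward reweighted by target/behavior probability ratios.
    Illustrative only; the paper's KG-IPS estimator adds
    knowledge-graph-based corrections not shown here."""
    assert len(rewards) == len(target_probs) == len(behavior_probs)
    total = 0.0
    for r, p_target, p_behavior in zip(rewards, target_probs, behavior_probs):
        total += (p_target / p_behavior) * r
    return total / len(rewards)

# Toy example: three logged samples with binary rewards.
v_hat = ips_value_estimate(
    rewards=[1.0, 0.0, 1.0],
    target_probs=[0.9, 0.2, 0.5],
    behavior_probs=[0.5, 0.4, 0.5],
)
print(round(v_hat, 2))  # (1.8 + 0.0 + 1.0) / 3 ≈ 0.93
```

Samples the target policy favors more than the logging policy (ratio > 1) are up-weighted, which is what lets an offline estimator score a new policy from old data.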
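The LoRA hyperparameters in the setup above (rank r = 16, α = 32) parameterize a low-rank weight update of the form W + (α/r)·BA. A minimal NumPy sketch of that update follows; the layer dimensions are chosen arbitrarily for illustration, since the paper does not specify them.

```python
import numpy as np

# Hypothetical layer shape; the paper does not report layer dimensions.
d_out, d_in = 64, 64
r, lora_alpha = 16, 32  # rank and alpha as reported in the paper

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base weight plus the scaled low-rank update (alpha/r) * B @ A @ x.
    return W @ x + (lora_alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the update contributes nothing at the start
# of fine-tuning, so the adapted layer matches the frozen one exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B (r·(d_in + d_out) parameters) would be trained, which is why LoRA keeps instruction tuning cheap relative to full fine-tuning.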