Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

Authors: Yiming Wang, Pei Zhang, Baosong Yang, Derek Wong, Rui Wang

ICLR 2025

Reproducibility checklist (variable, result, and supporting evidence from the paper):
Research Type: Experimental
Evidence: "Experiments in four diverse domains and seven LLMs fully demonstrate the effectiveness of our method. Meanwhile, its label-free design without any training and millisecond-level computational cost ensures real-time feedback in large-scale scenarios." (Section 4, Experimental Verification; Table 1: AUROC, FPR95, and AUPR results of all methods in four diverse domains with different LLMs; Figure 2: CoE feature distribution of correct and incorrect sample sets in four diverse domains; Figure 3: CoE trajectory visualization of correct and incorrect sample sets in four diverse domains)
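The row above reports AUROC, FPR95, and AUPR. These are standard threshold-free metrics for scoring how well a continuous self-evaluation score separates correct from incorrect answers; the dependency-free helpers below are an illustrative sketch (the names `auroc`, `fpr_at_95_tpr`, and `aupr` are ours, not from the paper's code), where `labels` marks correct (1) vs. incorrect (0) samples and `scores` holds the evaluation scores.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (Mann-Whitney formulation of the ROC area; ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fpr_at_95_tpr(scores, labels):
    """False-positive rate at the loosest threshold reaching TPR >= 0.95."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    for t in sorted(set(scores), reverse=True):  # sweep thresholds, high to low
        tpr = sum(s >= t for s in pos) / len(pos)
        if tpr >= 0.95:
            return sum(s >= t for s in neg) / len(neg)
    return 1.0

def aupr(scores, labels):
    """Average precision (a common estimator of the PR-curve area)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, 1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at each positive's rank
    return ap / sum(labels)

scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auroc(scores, labels), fpr_at_95_tpr(scores, labels), aupr(scores, labels))
```

The paper cites the Python sklearn library, whose `roc_auc_score` and `average_precision_score` compute the first and last of these directly; the sketch above only makes the definitions explicit.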
Researcher Affiliation: Collaboration
Evidence: α Department of Computer Science and Engineering, Shanghai Jiao Tong University; β Tongyi Lab, Alibaba Group Inc.; γ NLP2CT Lab, University of Macau
Pseudocode: Yes
Evidence: "Figure 4 shows the Computation Sketch of the two CoE scores, and the complete Algorithmic Process of the two CoE scores is shown in Appendix B.1." (Algorithm 1: CoE-R Computation; Algorithm 2: CoE-C Computation)
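The exact CoE-R and CoE-C procedures are given in the paper's Figure 4 and Appendix B.1. As a rough, hypothetical illustration of the underlying idea only (a "chain of embedding" is the layer-wise trajectory of hidden states, which can be summarized by how far, and in what direction, consecutive layer embeddings move), one might compute features like the following. The function name and the final averaging step are placeholders, not the authors' formulas.

```python
import math

def coe_trajectory_features(layer_embeddings):
    """Illustrative sketch: summarize a layer-wise hidden-state trajectory
    by (a) the magnitude of each step and (b) the angle between
    consecutive layer embeddings. Not the paper's exact CoE-R/CoE-C."""
    mags, angs = [], []
    for h_prev, h_next in zip(layer_embeddings, layer_embeddings[1:]):
        # step magnitude: Euclidean norm of the layer-to-layer difference
        diff = [b - a for a, b in zip(h_prev, h_next)]
        mags.append(math.sqrt(sum(d * d for d in diff)))
        # step direction change: angle between the two embeddings
        dot = sum(a * b for a, b in zip(h_prev, h_next))
        norm = math.sqrt(sum(a * a for a in h_prev)) * math.sqrt(sum(b * b for b in h_next))
        angs.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    # placeholder aggregation: mean step magnitude and mean angle
    return sum(mags) / len(mags), sum(angs) / len(angs)
```

With hidden states exposed (e.g., a Hugging Face model called with `output_hidden_states=True`), each element of `layer_embeddings` would be one layer's representation of the input.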
Open Source Code: Yes
Evidence: "The code is public at: https://github.com/Alsace08/Chain-of-Embedding."
Open Datasets: Yes
Evidence: "Dataset. We select six datasets across four domains for our self-evaluation experiments. These domains reflect the four critical dimensions of LLM capabilities (Zheng et al., 2024; Huang et al., 2024): (1) GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) for the Mathematics domain; (2) CommonsenseQA (Talmor et al., 2019) and TheoremQA (Chen et al., 2023) for the Reasoning domain; (3) MMLU (Hendrycks et al., 2020) for the Knowledge domain; (4) Belebele (Bandarkar et al., 2023) for the Understanding domain. Dataset details are shown in Appendix C.1."
Dataset Splits: No
Evidence: The paper does not explicitly provide dataset split information (e.g., exact train/validation/test percentages); it only states the number of test problems for some datasets: "GSM8K ... It contains 1318 test problems", "MATH ... It contains 5000 test problems", "CommonsenseQA ... It contains 1221 test problems", "TheoremQA ... It contains 800 test problems". It does not detail how each dataset is partitioned, nor does it reference predefined splits for all components (train, val, test).
Hardware Specification: Yes
Evidence: "Additionally, for a 7B+ model, we deploy it using two 32G V100 GPUs, while for a 70B+ model, we deploy it using four 80G A100 GPUs."
Software Dependencies: No
Evidence: The paper mentions the "Python sklearn library (Pedregosa et al., 2011)" and that models are taken from the "official Hugging Face repository", but it does not provide version numbers for Python, sklearn, or the Hugging Face transformers library.
Experiment Setup: Yes
Evidence: "Considering the inconsistent difficulty of different tasks, especially since some mathematical tasks may produce longer outputs, we set the maximum output length to 2048 tokens and used the <eos_token> for truncation. The inference process employs greedy decoding without random sampling." (Section C.2.3, Instruction) "We select instructions followed by LLMs from two open-source projects: OPENCOMPASS and SIMPLE-EVALS. They can ensure the professionalism of instructions. Specifically, all instructions used for each dataset are as follows: [detailed prompt templates]"
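The decoding configuration quoted above (greedy decoding without sampling, a 2048-token cap, truncation at <eos_token>) can be sketched as a toy loop. Here `next_token_scores` is a hypothetical stand-in callable for the model's next-token distribution, not a real LLM interface; with Hugging Face models the equivalent is `model.generate(..., do_sample=False, max_new_tokens=2048)`.

```python
def greedy_decode(next_token_scores, prompt, eos_token, max_new_tokens=2048):
    """Toy greedy decoding loop.

    next_token_scores: callable mapping a token sequence to a
    {token: score} dict (stand-in for an LLM forward pass).
    """
    out = list(prompt)
    for _ in range(max_new_tokens):          # hard cap on output length
        scores = next_token_scores(out)
        tok = max(scores, key=scores.get)    # greedy: argmax, no sampling
        if tok == eos_token:                 # truncate at the eos token
            break
        out.append(tok)
    return out
```

Greedy decoding makes the generated output deterministic for a fixed prompt, which matters here because the self-evaluation scores are computed over a single fixed response per input.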