Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

Authors: Yiming Wang, Pei Zhang, Baosong Yang, Derek Wong, Rui Wang

ICLR 2025

Reproducibility checklist (variable, result, and supporting evidence from the paper):
Research Type: Experimental
Evidence: "Experiments in four diverse domains and seven LLMs fully demonstrate the effectiveness of our method. Meanwhile, its label-free design without any training and millisecond-level computational cost ensures real-time feedback in large-scale scenarios." (Section 4, Experimental Verification; Table 1: AUROC, FPR95, and AUPR results of all methods in four diverse domains with different LLMs; Figure 2: CoE feature distribution of correct and incorrect sample sets in four diverse domains; Figure 3: CoE trajectory visualization of correct and incorrect sample sets in four diverse domains)
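The row above reports AUROC, FPR95, and AUPR. These are standard threshold-free metrics for scoring how well a continuous self-evaluation score separates correct from incorrect answers; the dependency-free helpers below are an illustrative sketch (the names `auroc`, `fpr_at_95_tpr`, and `aupr` are ours, not from the paper's code), where `labels` marks correct (1) vs. incorrect (0) samples and `scores` holds the evaluation scores.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (Mann-Whitney formulation of the ROC area; ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fpr_at_95_tpr(scores, labels):
    """False-positive rate at the loosest threshold reaching TPR >= 0.95."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    for t in sorted(set(scores), reverse=True):  # sweep thresholds, high to low
        tpr = sum(s >= t for s in pos) / len(pos)
        if tpr >= 0.95:
            return sum(s >= t for s in neg) / len(neg)
    return 1.0

def aupr(scores, labels):
    """Average precision (a common estimator of the PR-curve area)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, 1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at each positive's rank
    return ap / sum(labels)

scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auroc(scores, labels), fpr_at_95_tpr(scores, labels), aupr(scores, labels))
```

The paper cites the Python sklearn library, whose `roc_auc_score` and `average_precision_score` compute the first and last of these directly; the sketch above only makes the definitions explicit.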
Researcher Affiliation: Collaboration
Evidence: α Department of Computer Science and Engineering, Shanghai Jiao Tong University; β Tongyi Lab, Alibaba Group Inc.; γ NLP2CT Lab, University of Macau
Pseudocode: Yes
Evidence: "Figure 4 shows the Computation Sketch of the two CoE scores, and the complete Algorithmic Process of the two CoE scores is shown in Appendix B.1." (Algorithm 1: CoE-R Computation; Algorithm 2: CoE-C Computation)
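The exact CoE-R and CoE-C procedures are given in the paper's Figure 4 and Appendix B.1. As a rough, hypothetical illustration of the underlying idea only (a "chain of embedding" is the layer-wise trajectory of hidden states, which can be summarized by how far, and in what direction, consecutive layer embeddings move), one might compute features like the following. The function name and the final averaging step are placeholders, not the authors' formulas.

```python
import math

def coe_trajectory_features(layer_embeddings):
    """Illustrative sketch: summarize a layer-wise hidden-state trajectory
    by (a) the magnitude of each step and (b) the angle between
    consecutive layer embeddings. Not the paper's exact CoE-R/CoE-C."""
    mags, angs = [], []
    for h_prev, h_next in zip(layer_embeddings, layer_embeddings[1:]):
        # step magnitude: Euclidean norm of the layer-to-layer difference
        diff = [b - a for a, b in zip(h_prev, h_next)]
        mags.append(math.sqrt(sum(d * d for d in diff)))
        # step direction change: angle between the two embeddings
        dot = sum(a * b for a, b in zip(h_prev, h_next))
        norm = math.sqrt(sum(a * a for a in h_prev)) * math.sqrt(sum(b * b for b in h_next))
        angs.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    # placeholder aggregation: mean step magnitude and mean angle
    return sum(mags) / len(mags), sum(angs) / len(angs)
```

With hidden states exposed (e.g., a Hugging Face model called with `output_hidden_states=True`), each element of `layer_embeddings` would be one layer's representation of the input.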
Open Source Code: Yes
Evidence: "The code is public at: https://github.com/Alsace08/Chain-of-Embedding."
Open Datasets: Yes
Evidence: "Dataset. We select six datasets across four domains for our self-evaluation experiments. These domains reflect the four critical dimensions of LLM capabilities (Zheng et al., 2024; Huang et al., 2024): (1) GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) for the Mathematics domain; (2) CommonsenseQA (Talmor et al., 2019) and TheoremQA (Chen et al., 2023) for the Reasoning domain; (3) MMLU (Hendrycks et al., 2020) for the Knowledge domain; (4) Belebele (Bandarkar et al., 2023) for the Understanding domain. Dataset details are shown in Appendix C.1."
Dataset Splits: No
Evidence: The paper does not explicitly provide dataset split information (e.g., exact train/validation/test percentages); it only states the number of test problems for some datasets: "GSM8K ... It contains 1318 test problems", "MATH ... It contains 5000 test problems", "CommonsenseQA ... It contains 1221 test problems", "TheoremQA ... It contains 800 test problems". It does not detail how each dataset is partitioned, nor does it reference predefined splits for all components (train, val, test).
Hardware Specification: Yes
Evidence: "Additionally, for a 7B+ model, we deploy it using two 32G V100 GPUs, while for a 70B+ model, we deploy it using four 80G A100 GPUs."
Software Dependencies: No
Evidence: The paper mentions the "Python sklearn library (Pedregosa et al., 2011)" and that models are taken from the "official Hugging Face repository", but it does not provide version numbers for Python, sklearn, or the Hugging Face transformers library.
Experiment Setup: Yes
Evidence: "Considering the inconsistent difficulty of different tasks, especially since some mathematical tasks may produce longer outputs, we set the maximum output length to 2048 tokens and used the <eos_token> for truncation. The inference process employs greedy decoding without random sampling." (Section C.2.3, Instruction) "We select instructions followed by LLMs from two open-source projects: OPENCOMPASS and SIMPLE-EVALS. They can ensure the professionalism of instructions. Specifically, all instructions used for each dataset are as follows: [detailed prompt templates]"
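The decoding configuration quoted above (greedy decoding without sampling, a 2048-token cap, truncation at <eos_token>) can be sketched as a toy loop. Here `next_token_scores` is a hypothetical stand-in callable for the model's next-token distribution, not a real LLM interface; with Hugging Face models the equivalent is `model.generate(..., do_sample=False, max_new_tokens=2048)`.

```python
def greedy_decode(next_token_scores, prompt, eos_token, max_new_tokens=2048):
    """Toy greedy decoding loop.

    next_token_scores: callable mapping a token sequence to a
    {token: score} dict (stand-in for an LLM forward pass).
    """
    out = list(prompt)
    for _ in range(max_new_tokens):          # hard cap on output length
        scores = next_token_scores(out)
        tok = max(scores, key=scores.get)    # greedy: argmax, no sampling
        if tok == eos_token:                 # truncate at the eos token
            break
        out.append(tok)
    return out
```

Greedy decoding makes the generated output deterministic for a fixed prompt, which matters here because the self-evaluation scores are computed over a single fixed response per input.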