EAVIT: Efficient and Accurate Human Value Identification From Text Data via LLMs

Authors: Wenhao Zhu, Yuhang Xie, Guojie Song, Xin Zhang

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental "5 Experiments. 5.1 Value Identification on Public Datasets. Datasets and Methods. We conducted experiments on three public and manually labelled datasets: ValueNet (Augmented) [Qiu et al., 2022], Webis-ArgValues-22 [Kiesel et al., 2022], and Touché23-ValueEval [Kiesel et al., 2023]. ... For all datasets, we report the accuracy and the officially recommended F1-score on the validation and test data."
Researcher Affiliation Academia Wenhao Zhu1, Yuhang Xie1, Guojie Song1 and Xin Zhang2. 1State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University. 2School of Psychological and Cognitive Sciences, Peking University. EMAIL, EMAIL
Pseudocode No The paper describes the three stages of the EAVIT method: (1) Training value detector; (2) Generating candidate value set; (3) Final value identification using LLMs, and provides prompt templates. However, it does not include structured pseudocode or algorithm blocks with numbered steps.
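Since the paper provides no pseudocode, the three stages it describes can be sketched roughly as follows. This is an illustrative outline only, not the paper's implementation: the function names are invented, the detector is replaced by a toy keyword heuristic, and the final LLM call is stubbed out.

```python
# Hedged sketch of the three-stage EAVIT pipeline described in the review.
# All names here are illustrative; the real method trains a value detector
# and prompts an LLM in stage 3.

def detect_value_probs(text, value_names):
    """Stage 1 stand-in: a trained value detector would return one
    probability per human value. Here, a toy keyword heuristic."""
    return {v: (1.0 if v.lower() in text.lower() else 0.1) for v in value_names}

def candidate_value_set(probs, p_low=0.2, p_high=0.8):
    """Stage 2: values above p_high are kept, values below p_low are
    dropped, and the uncertain middle band is deferred to the LLM."""
    certain = [v for v, p in probs.items() if p >= p_high]
    uncertain = [v for v, p in probs.items() if p_low < p < p_high]
    return certain, uncertain

def identify_values(text, value_names):
    """Stage 3 stand-in: a real system would prompt an LLM with the
    candidate set; here we simply accept the high-confidence values."""
    probs = detect_value_probs(text, value_names)
    certain, uncertain = candidate_value_set(probs)
    # The real method would ask the LLM to adjudicate `uncertain`.
    return sorted(certain)

print(identify_values("A story about loyalty and tradition.",
                      ["Loyalty", "Tradition", "Hedonism"]))
# → ['Loyalty', 'Tradition']
```

The point of the staging is that the cheap detector filters most values up front, so the expensive LLM call only has to reason over a small candidate set.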
Open Source Code No The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide a link to a code repository. It only refers to an extended version of the paper on arXiv.
Open Datasets Yes "We conducted experiments on three public and manually labelled datasets: ValueNet (Augmented) [Qiu et al., 2022], Webis-ArgValues-22 [Kiesel et al., 2022], and Touché23-ValueEval [Kiesel et al., 2023]. ... Our experiments will use these public, human-annotated datasets as the basis for training and validation."
Dataset Splits No The paper mentions using 'validation and test data' and refers to the 'original Touché23-ValueEval train dataset', but it does not provide specific details on the dataset splits, such as exact percentages, sample counts, or the methodology used for splitting, within the provided text. It states 'Details can be found in Appendix', which is not included.
Hardware Specification Yes "With QLoRA [Dettmers et al., 2023; Hu et al., 2021], fine-tuning Llama2-13b-chat can be executed on 4 Nvidia RTX 4090 GPUs with 24GB VRAM."
Software Dependencies No The paper mentions specific language models used (Llama2-13b-chat, GPT-4o-mini, GPT-4o, GPT-4) and techniques like QLoRA and Alpaca format. However, it does not provide specific version numbers for software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python version) used for implementation.
Experiment Setup Yes In Section 4.2, it specifies parameters for candidate set generation: 'Usually we set L = 5 to achieve the balance of reducing randomness and efficiency. Next, we set two thresholds 0 < p_low < p_high < 1.' In Section 5.1, it further clarifies: 'For EAVIT, we set p_low = 0.2, p_high = 0.8 and report the results of both the value detector and the entire method.' It also mentions reporting 'the average and std of 3 random individual runs'.
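The quoted setup (averaging over L = 5 runs, then banding by p_low = 0.2 and p_high = 0.8) can be illustrated with a short sketch. This is an assumed reading of the thresholding, not code from the paper; the function name and input format are invented for the example.

```python
# Illustrative sketch (not the paper's code): average detector probabilities
# over L runs to smooth sampling randomness, then split values into bands
# using the two thresholds quoted above.
from statistics import mean

def partition_values(prob_runs, p_low=0.2, p_high=0.8):
    """prob_runs maps each value name to its per-run probabilities
    (the review quotes L = 5; fewer runs shown here for brevity)."""
    avg = {v: mean(ps) for v, ps in prob_runs.items()}
    definite = [v for v, p in avg.items() if p >= p_high]
    candidates = [v for v, p in avg.items() if p_low < p < p_high]
    # Values with average probability <= p_low are discarded outright.
    return definite, candidates

runs = {"Security": [0.9, 0.85, 0.95],
        "Power": [0.5, 0.4, 0.6],
        "Hedonism": [0.1, 0.05, 0.1]}
print(partition_values(runs))
# → (['Security'], ['Power'])
```

Widening the (p_low, p_high) band sends more values to the LLM for adjudication (more accurate, more expensive); narrowing it makes the cheap detector decide more cases on its own.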