Large Language Models Often Say One Thing and Do Another

Authors: Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixiang Zhou, Le Sun, Yingfei Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To quantitatively explore this consistency, we developed a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains, including opinion vs. action, non-ethical value vs. action, ethical value vs. action, and theory vs. application. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains. Subsequently, we conducted experiments with either word alignment or deed alignment to observe their impact on the other aspect.
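The paired word/deed evaluation described above can be illustrated with a small sketch. This is a hypothetical simplification, not the released WDCT evaluation code: assume each test item pairs a word question with a deed question whose options are aligned, and the consistency score is the fraction of items where the model's two choices express the same underlying stance.

```python
# Illustrative word-deed consistency score. The item format
# (keys "word_choice", "deed_choice", "deed_to_word") is an
# assumption made for this sketch; the paper's released code
# may represent items and compute the metric differently.

def consistency_score(items):
    """Fraction of items whose deed choice maps to the same
    stance as the word choice via the option alignment."""
    consistent = 0
    for item in items:
        # Translate the deed option into its aligned word option.
        aligned = item["deed_to_word"][item["deed_choice"]]
        if aligned == item["word_choice"]:
            consistent += 1
    return consistent / len(items)

items = [
    # Word option "A" aligns with deed option "B" and vice versa.
    {"word_choice": "A", "deed_choice": "B",
     "deed_to_word": {"A": "B", "B": "A"}},  # consistent
    {"word_choice": "A", "deed_choice": "A",
     "deed_to_word": {"A": "B", "B": "A"}},  # inconsistent
]
print(consistency_score(items))  # 0.5
```

The explicit deed-to-word option mapping avoids a subtle bug: option letters are shuffled between the paired questions, so comparing raw letters would mis-score aligned answers.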
Researcher Affiliation | Academia | (1) University of Chinese Academy of Sciences; (2) Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; (3) State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
Pseudocode | No | The paper describes a "Construction Pipeline" for deed questions with a diagram (Figure 2), but it does not contain any explicit sections or figures labeled as "Pseudocode" or "Algorithm" with structured code-like formatting.
Open Source Code | Yes | Dataset and code are available at https://github.com/icipcas/Word-Deed-Consistency-Test.
Open Datasets | Yes | Dataset and code are available at https://github.com/icipcas/Word-Deed-Consistency-Test. We have collected topics from various domains to ensure the generalizability of the results.
Opinion: For this domain, we collect topics from debate datasets, where both pro and con opinions hold certain validity. Specifically, from the Argument Annotated Essays (Stab & Gurevych, 2014) dataset, we retain 115 topics out of 402 debate topics. Similarly, we obtain 276 topics from the Recorded Debating (Ein-Dor et al., 2020) dataset and 118 topics from the Evidences Sentences (Orbach et al., 2020) dataset.
Non-ethical Value: For this domain, we collect topics from universal values theories. Specifically, we get 9 topics from Kluckhohn and Strodtbeck's values orientation theory (Hills, 2002) and 106 topics from World Values Survey Wave 7 (Haerpfer et al., 2020).
Ethical Value: For this domain, we collect topics from established moral datasets. Specifically, we randomly sample 500 fine-grained value principles from the Moral Story dataset (Emelin et al., 2021).
Theory: For this domain, we collect topics from textbooks. Specifically, we collected 101 topics from the KEY CONCEPTS section at the end of each chapter in Mankiw's Principles of Macroeconomics (Mankiw et al., 2007).
Dataset Splits | No | The paper refers to the WDCT as an "evaluation benchmark" consisting of 1,225 test items in total. While it mentions training models on the Alpaca dataset with a mixing ratio, it does not provide specific training, validation, and test splits for the WDCT dataset itself, nor does it detail how the Alpaca dataset was split.
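The Alpaca mixing mentioned above can be sketched as follows. The actual ratio and sampling scheme are not specified in this report, so both the `ratio` value and the function shape below are assumptions for illustration only:

```python
import random

def mix_datasets(task_data, alpaca_data, ratio, seed=0):
    """Mix alignment training examples with Alpaca examples.

    ratio: hypothetical number of Alpaca examples per task example;
    the paper's actual mixing ratio is not reproduced here.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    n_alpaca = min(len(alpaca_data), int(len(task_data) * ratio))
    mixed = task_data + rng.sample(alpaca_data, n_alpaca)
    rng.shuffle(mixed)
    return mixed

train = mix_datasets(["t1", "t2"], ["a1", "a2", "a3", "a4"], ratio=1.0)
print(len(train))  # 4
```

Mixing general instruction data (Alpaca) into alignment training is a common guard against catastrophic forgetting of instruction-following ability.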
Hardware Specification | Yes | The models underwent separate training on three A100 80GB GPUs for three hours each.
Software Dependencies | Yes | Word questions are constructed by directly inquiring about models' views on specific topics... For the theory segment, we use GPT-4 to identify multiple-choice questions... These questions are subsequently double-checked by two graduate students... To construct corresponding deed questions, we use the powerful LLM, GPT-4, to incorporate vivid characters... (Footnote 3: We used gpt-4-0613 in word and deed question construction.)
Experiment Setup | Yes | We experimented with learning rates of [1e-6, 5e-6, 1e-5, 5e-7, 1e-7], presenting the results using the best-performing learning rate of 1e-5, except for Mistral-7B-Instruct, which used 1e-6, and Llama-2-7B, which used 1e-7. In the DPO phase, multiple-choice questions were transformed into preference data pairs... Similarly, we set a learning rate of 5e-6, except for Mistral-7B and Mistral-7B-Instruct, which used 5e-7. A β of 0.1 was set. Four rounds of SFT and DPO were completed.
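The transformation of multiple-choice questions into DPO preference pairs can be sketched as below. The field names (`prompt`, `chosen`, `rejected`, the standard DPO triple) and the answer formatting are assumptions; the report only states that such a transformation was performed:

```python
def mc_to_dpo_pair(question, options, preferred):
    """Turn one multiple-choice question into a DPO preference triple.

    options: dict mapping option labels to option texts, e.g.
    {"A": "Yes", "B": "No"}. preferred: label of the aligned answer.
    The exact formatting here is a hypothetical simplification.
    """
    prompt = question + "\n" + "\n".join(
        f"{label}. {text}" for label, text in sorted(options.items())
    )
    # Pick a non-preferred option as the rejected completion.
    rejected = next(l for l in sorted(options) if l != preferred)
    return {
        "prompt": prompt,
        "chosen": f"{preferred}. {options[preferred]}",
        "rejected": f"{rejected}. {options[rejected]}",
    }

pair = mc_to_dpo_pair(
    "Should the city ban cars downtown?",
    {"A": "Yes", "B": "No"},
    preferred="B",
)
print(pair["chosen"])    # B. No
print(pair["rejected"])  # A. Yes
```

With pairs in this shape, the β = 0.1 mentioned above would be the DPO temperature controlling how strongly the trained policy is pulled away from the reference model toward the chosen answers.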