Unlearning Misalignment for Personalized LLM Adaptation via Instance-Response-Dependent Discrepancies

Authors: Cheng Chen, Atsushi Nitanda, Ivor Tsang

TMLR 2025

Reproducibility Assessment

Variable | Result | LLM Response
Research Type | Experimental | Evaluated across a diverse range of domain-specific datasets and model architectures, CM yields notable improvements in response alignment and robustness. We believe Consistent Marginalization represents a valuable step toward enabling LLMs to become genuinely personable and adaptive conversational agents by understanding user preferences and generating responses that are better aligned with individual user expectations.
Researcher Affiliation | Academia | Cheng Chen (EMAIL), University of Technology Sydney, Australian Artificial Intelligence Institute, FEIT, Australia, and Center for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; Atsushi Nitanda (EMAIL), Center for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore, and College of Computing and Data Science, Nanyang Technological University, Singapore; Ivor W. Tsang (EMAIL), Center for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore, and College of Computing and Data Science, Nanyang Technological University, Singapore
Pseudocode | No | The paper describes the Consistent Marginalization (CM) framework and its pipeline conceptually, with mathematical formulations (e.g., Equations 1, 2, and 3) and flow diagrams (Figure 4), but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, a link to a code repository, or a mention of code being available in supplementary materials.
Open Datasets | Yes | We validate our method on multiple user preference-related large-scale datasets. The experimental results demonstrate its effectiveness in enhancing the user personalised response alignment of LLMs without fine-tuning, indicating broad applicability. ... on five diverse, real-world datasets: Stack Exchange: a multi-domain QA corpus... CLINC150: 150 intent categories... BANK77: banking-themed user queries... MOTE: a multilingual dataset... Massive Scenario: a multilingual natural language understanding dataset covering 51 typologically diverse languages...
Dataset Splits | Yes | More specifically, we can define D_User as the distribution of the cleanly labelled small sample set, denoted as the user-preference sample, which contains triples (X, Y, 𝒴), where 𝒴 is a candidate label set encompassing all responses. This can be expressed as {(X_i, Y_i, 𝒴)}_{i=1}^{s}, with s being the total number of clean samples. ... The learning objective is to design a prompt strategy that leverages D_User, which constitutes about 5% of the total training samples, to allow large language models (LLMs) to accurately annotate the large unsupervised dataset D_large. ... We perform an ablation using Llama-8b-Instruct and two annotation budgets, 1% and 5% of the training set, across all five datasets (Table 2).
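As context for the split described above, a minimal sketch of carving out the small clean user-preference subset (about 5% of the training data, per the paper) while treating the remainder as the large unsupervised pool. The function and variable names are hypothetical; the paper releases no code.

```python
import random

def split_user_preference(samples, budget=0.05, seed=0):
    """Split a labelled training set into a small clean 'user-preference'
    subset (D_User, roughly `budget` fraction) and a large pool treated
    as unsupervised (D_large), matching the 5% budget described above.
    Names here are illustrative, not from the paper's code."""
    rng = random.Random(seed)  # seeded for reproducible splits
    shuffled = samples[:]
    rng.shuffle(shuffled)
    s = max(1, int(len(shuffled) * budget))  # s = number of clean samples
    d_user = shuffled[:s]    # kept with gold labels
    d_large = shuffled[s:]   # labels withheld; to be annotated by the LLM
    return d_user, d_large
```

For example, a 1,000-example training set with `budget=0.05` yields 50 clean user-preference samples and a 950-example unsupervised pool.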
Hardware Specification | No | The paper states that experiments were conducted on 'Chatgpt-4o-mini, Chatgpt-3.5, and Llama-8b-Instruct' but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run these LLMs or the experiments.
Software Dependencies | No | The paper mentions various LLM models (e.g., Chatgpt-4o-mini, Chatgpt-3.5, Llama-8b-Instruct) and prompt methods (e.g., CoT, FoT) but does not specify any programming languages, libraries, or their respective version numbers used for implementation.
Experiment Setup | No | The paper mentions running experiments with 'two random seeds for robustness' and using '5% of user-preference samples' (plus an ablation with 1% and 5% budgets), but it does not provide specific hyperparameters such as learning rates, batch sizes, number of epochs, or optimizer settings, which are essential details for the experimental setup.
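The setup details that are reported (two random seeds, 1% and 5% annotation budgets, five datasets) can be sketched as a configuration grid. This is a hypothetical enumeration of the ablation's run matrix under those stated assumptions, not the authors' experiment runner.

```python
from itertools import product

# Reported protocol: two random seeds and two annotation budgets (1%, 5%)
# across the five datasets named in the paper. Seed values are assumed;
# the paper only states that two seeds were used.
DATASETS = ["StackExchange", "CLINC150", "BANK77", "MOTE", "MassiveScenario"]
BUDGETS = [0.01, 0.05]
SEEDS = [0, 1]

def enumerate_runs():
    """Yield every (dataset, budget, seed) configuration in the ablation grid."""
    yield from product(DATASETS, BUDGETS, SEEDS)
```

With 5 datasets, 2 budgets, and 2 seeds, the grid comprises 20 runs, which is consistent with reporting results averaged over two seeds per (dataset, budget) cell in Table 2.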