Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs

Authors: Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency. ... In this section, we conduct extensive experiments to evaluate Amulet with various combinations of LLMs, datasets, and user preferences. Our results demonstrate that our framework significantly improves LLM alignment performance, indicating its great potential for real-time user preference adaptation.
Researcher Affiliation | Academia | 1 Institute for Artificial Intelligence, Peking University; 2 State Key Laboratory of General Artificial Intelligence, BIGAI; 3 Shanghai Jiao Tong University; 4 Zhongguancun Academy
Pseudocode | Yes | We have further provided the pseudocode for showing the details of the full decoding process with Amulet in Algorithm 1.
Open Source Code | No | The paper does not provide a specific link or explicit statement about releasing the source code for the Amulet framework. The links provided are for external tools/benchmarks used in the evaluation (e.g., Hugging Face, OpenAI GPT-4o, the choix Python library).
Open Datasets | Yes | HelpSteer (Wang et al., 2023)... UltraFeedback (Cui et al., 2023)... TruthfulQA (Lin et al., 2021)... UltraChat (Ding et al., 2023)... Personal Preference Eval (Personal) (Gao et al., 2024)
Dataset Splits | Yes | HelpSteer is a QA dataset... We extracted the question part, focusing on single-sentence questions to create a dataset of 1,236 testing instances. ... TruthfulQA (Lin et al., 2021), which includes 811 testing problems... UltraChat (Ding et al., 2023), from which we applied similar extraction and filtering as with HelpSteer, resulting in 3,845 testing problems. ... Personal Preference Eval (Personal) (Gao et al., 2024) ... containing 548 testing instances.
Hardware Specification | Yes | We conducted experiments on an Ubuntu 20.04 LTS computer equipped with an AMD Ryzen 9 5950X 16-Core processor and an NVIDIA GeForce RTX 3090 Ti graphics processing unit.
Software Dependencies | No | The paper mentions using the transformers library and the Python library choix but does not specify their version numbers.
Experiment Setup | Yes | Iteration Number T. We conduct experiments using 0, 20, 40, 60, 80, and 100 iterations. ... Learning Rate η. We conduct the experiments ranging from 2, 4, ..., 20. ... Parameter α and λ. We conduct experiments of both the parameters ranging from 1, 2, ..., 10.
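The hyperparameter ranges quoted in the Experiment Setup row can be enumerated as a simple sweep. The sketch below is an illustration only, assuming a full grid over the reported values: `evaluate_amulet` is a hypothetical placeholder (the paper's actual decoding procedure is in its Algorithm 1, and its ablations vary one parameter at a time rather than running the full cross product).

```python
# Illustrative sweep over the hyperparameter ranges reported in the paper.
# `evaluate_amulet` is a hypothetical stand-in for one decoding run; it is
# NOT the authors' API.
from itertools import product

iteration_numbers = [0, 20, 40, 60, 80, 100]  # iteration number T
learning_rates = list(range(2, 21, 2))        # learning rate eta: 2, 4, ..., 20
alphas = list(range(1, 11))                   # parameter alpha: 1, 2, ..., 10
lambdas = list(range(1, 11))                  # parameter lambda: 1, 2, ..., 10

def evaluate_amulet(T, eta, alpha, lam):
    """Placeholder for one Amulet decoding run; returns a dummy score."""
    return 0.0

# A full grid would contain 6 * 10 * 10 * 10 = 6000 configurations.
configs = list(product(iteration_numbers, learning_rates, alphas, lambdas))
print(len(configs))  # 6000
```

In practice one would replace `evaluate_amulet` with the actual decoding loop and sweep each parameter independently, as the paper's ablations do, to keep the run count tractable.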