PersonalLLM: Tailoring LLMs to Individual Preferences

Authors: Thomas P. Zollo, Andrew Wei Tung Siah, Naimeng Ye, Ang Li, Hongseok Namkoong

ICLR 2025

Reproducibility checklist (Variable: Result, followed by the supporting quote or explanation):
Research Type: Experimental
    "We explore basic in-context learning and meta-learning baselines to illustrate the utility of PersonalLLM and highlight the need for future methodological development."
Researcher Affiliation: Academia
    Thomas P. Zollo (Columbia University, EMAIL); Andrew Wei Tung Siah (Columbia University, EMAIL); Naimeng Ye (Columbia University, EMAIL); Ang Li (Columbia University, EMAIL); Hongseok Namkoong (Columbia University, EMAIL)
Pseudocode: Yes
    "C.1 PSEUDOCODE. Below is the pseudocode for the baselines in Section 4. Actual code is available at ... Algorithm 1: MetaLearnKShotICL"
Open Source Code: Yes
    "Our data [1] and code [2] are publicly available, and full documentation for our dataset is available in Appendix A." ... [2] https://github.com/namkoong-lab/PersonalLLM
Open Datasets: Yes
    "Our dataset is available at https://huggingface.co/datasets/namkoong-lab/PersonalLLM."
Dataset Splits: Yes
    "We split the resulting dataset into 9,402 training examples and 1,000 test examples."
Hardware Specification: No
    The paper does not describe the specific hardware (GPU/CPU models, memory, etc.) used to run its experiments; it names the LLMs used for response generation but not the compute infrastructure.
Software Dependencies: No
    "Semantic features were captured using pre-trained classifiers, while syntactic features were engineered using nltk (Bird and Loper, 2004). ... Our linear regression models are built using sklearn (Pedregosa et al., 2011), with default parameter settings." The paper cites nltk and sklearn but does not give version numbers for either dependency.
Experiment Setup: Yes
    "Inference is performed using 1, 3, and 5 such examples (see Appendix C.1 for exact templates), and evaluated by scoring with each user's (weighted-ensembled) preference model. We also compare to a zero-shot baseline, with no personalization. ... C.2 PROMPT TEMPLATE. Below is a prompt template we used in our experiments, with winning and losing responses appended during inference."
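The k-shot in-context personalization baseline described above can be sketched as a small prompt-building step: condition the model on a user's past winning and losing responses before the new query. This is a minimal sketch under assumptions; the template wording, field names, and `build_kshot_prompt` helper are illustrative, not the paper's exact Appendix C.2 template.

```python
# Hedged sketch of the k-shot ICL personalization baseline: build a prompt
# from a user's past pairwise preferences (winning vs. losing responses).
# Field names and template wording are assumptions for illustration only.

def build_kshot_prompt(history, new_prompt, k=3):
    """Build a few-shot prompt from a user's preference history.

    history: list of dicts with keys 'prompt', 'winning', 'losing',
             drawn from the user's past pairwise choices.
    new_prompt: the query to personalize a response for.
    k: number of in-context examples (the paper evaluates k = 1, 3, 5).
    """
    parts = []
    for ex in history[:k]:
        parts.append(
            f"Prompt: {ex['prompt']}\n"
            f"Response the user preferred: {ex['winning']}\n"
            f"Response the user rejected: {ex['losing']}\n"
        )
    # Append the new query; the LLM completes the final "Response:".
    parts.append(f"Prompt: {new_prompt}\nResponse:")
    return "\n".join(parts)

# Toy usage with a single in-context example.
history = [
    {"prompt": "Explain gravity.",
     "winning": "Short, plain answer.",
     "losing": "Long, jargon-heavy answer."},
]
prompt = build_kshot_prompt(history, "Explain magnetism.", k=1)
```

The resulting string would then be sent to the LLM being personalized; the generated response is scored by the user's preference model, as in the evaluation described above.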
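The evaluation step quoted above scores each generated response with a user's weighted-ensembled preference model. A minimal sketch of that scoring, assuming each ensemble member is simply a callable returning a scalar reward (the toy reward models and the `ensemble_score` helper below are stand-ins, not the paper's actual learned scorers):

```python
# Hedged sketch of weighted-ensemble preference scoring: a user's score for
# a response is a weighted sum over reward models. The reward models and
# weights here are illustrative stand-ins.

def ensemble_score(response, reward_models, weights):
    """Weighted-ensemble preference score for one response.

    reward_models: callables mapping a response string to a float reward.
    weights: per-user mixture weights (same length; assumed to sum to 1).
    """
    return sum(w * rm(response) for rm, w in zip(reward_models, weights))

# Toy stand-in reward models (real ones would be learned scorers).
length_rm = lambda r: min(len(r) / 100, 1.0)          # favors longer text
brevity_rm = lambda r: 1.0 - min(len(r) / 100, 1.0)   # favors shorter text

# A user who mostly prefers brevity (weights 0.3 / 0.7).
score = ensemble_score("A concise answer.", [length_rm, brevity_rm], [0.3, 0.7])
```

Varying the weight vector per user is what makes each simulated user's preferences distinct while reusing the same underlying reward models.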