Aligning Language Models with Demonstrated Feedback

Authors: Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Hyundong Cho, Michael Bernstein, Diyi Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants (N = 16). Across our benchmarks and user study, we find that winrates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19 percentage points.
Researcher Affiliation | Academia | Omar Shaikh (Stanford University), Michelle S. Lam (Stanford University), Joey Hejna (Stanford University), Yijia Shao (Stanford University), Hyundong Cho (USC), Michael S. Bernstein (Stanford University), Diyi Yang (Stanford University)
Pseudocode | Yes | Algorithm 1: DITTO
  Input: LM π_ref, demos D_E = {(x_i, y_i^E)}_{i=1}^N, sample size M, sample frequency K
  Init: π_0 ← SFT(π_ref, D_E), t ← 0
  while not converged do
      D_t ← ∪_{i=1}^N {(x_i, y_j ~ π_t(·|x_i))}_{j=1}^M
      for k = 1, 2, 3, ..., K do
          Sample batch B = {(x, y_w, y_l)} of comparisons from induced ranking: D_E ≻ D_t ≻ D_{t-1} ≻ ... ≻ D_0
          π_t ← DPO(π_t, B)  # update policy
      t ← t + 1
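The comparison-sampling step of Algorithm 1 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `induced_comparisons` and the toy prompt data below are hypothetical, and the key assumption shown is only the induced ranking (expert demos beat any policy sample; later iterations beat earlier ones).

```python
import random

def induced_comparisons(expert, policy_gens, batch_size, rng):
    # DITTO's induced ranking: D_E > D_t > D_{t-1} > ... > D_0.
    # `expert` and each dict in `policy_gens` map prompt -> completion list;
    # policy_gens[t] holds samples drawn from the policy at iteration t.
    tiers = [expert] + policy_gens[::-1]  # index 0 = highest-ranked tier
    prompts = list(expert)
    batch = []
    while len(batch) < batch_size:
        x = rng.choice(prompts)
        hi, lo = sorted(rng.sample(range(len(tiers)), 2))
        y_w = rng.choice(tiers[hi][x])  # winner from the higher-ranked tier
        y_l = rng.choice(tiers[lo][x])  # loser from the lower-ranked tier
        batch.append((x, y_w, y_l))
    return batch

# Toy example: one expert demo plus two rounds of policy samples.
expert = {"write an email": ["expert draft"]}
gens = [{"write an email": ["round-0 sample"]},
        {"write an email": ["round-1 sample"]}]
batch = induced_comparisons(expert, gens, batch_size=8, rng=random.Random(0))
```

Each sampled triple would then feed a DPO update as in the algorithm's inner loop, with the expert demonstrations always ranked above policy samples.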
Open Source Code | Yes | Code: https://github.com/SALT-NLP/demonstrated-feedback
Open Datasets | Yes | We collect data from 20 distinct authors from two sources: (1) emails and blog posts from the CMCC dataset (Goldstein et al., 2008) that contain only one author and (2) news articles from the CCAT dataset (Lewis et al., 2004).
Dataset Splits | Yes | We randomly select 10 authors from each dataset, use 7 samples to train, and split the remainder into test and validation. Table 4 in the Appendix describes the finalized train/val/test counts across each benchmark.
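The per-author split described above can be sketched as follows. `author_split` is a hypothetical helper, and splitting the remainder evenly between validation and test is an assumption for illustration (the paper's exact counts are in its Table 4).

```python
import random

def author_split(samples, n_train=7, rng=random):
    # Shuffle one author's samples, take n_train for training,
    # and divide the remainder between validation and test.
    samples = samples[:]
    rng.shuffle(samples)
    train, rest = samples[:n_train], samples[n_train:]
    val, test = rest[:len(rest) // 2], rest[len(rest) // 2:]
    return train, val, test

# Toy example: an author with 12 samples.
train, val, test = author_split(list(range(12)), rng=random.Random(0))
```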
Hardware Specification | Yes | All training was conducted on 1 A100 80GB GPU.
Software Dependencies | No | The paper mentions using Mistral Instruct v0.2 7B, LoRA, DPO, and AdamW, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We run a random hyperparameter sweep over a single, randomly selected author from each corpus, using lr ∈ {1e-4, 3e-4, 1e-5, 3e-5, 1e-6, 3e-6}, epochs ∈ {10, 15, 20, 25, 30}, and β ∈ {0.01, 0.05, 0.1}. We additionally tune how frequently DITTO samples negatives (K ∈ {1, 5, 10}) and how many negatives DITTO samples (M ∈ {1, 5, 10}). Finally, we tune the replay / expert / intermodel fractions, selecting between 0.2 / 0.7 / 0.1, 0.25 / 0.5 / 0.25, and 0.1 / 0.7 / 0.2.
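A random sweep over these grids amounts to sampling each hyperparameter uniformly from its reported set. A minimal sketch, assuming only the grids quoted above (`SPACE` and `random_configs` are illustrative names, not the authors' code):

```python
import random

# Search space mirroring the reported grids.
SPACE = {
    "lr": [1e-4, 3e-4, 1e-5, 3e-5, 1e-6, 3e-6],
    "epochs": [10, 15, 20, 25, 30],
    "beta": [0.01, 0.05, 0.1],
    "K": [1, 5, 10],   # how frequently negatives are sampled
    "M": [1, 5, 10],   # how many negatives are sampled
    "fractions": [(0.2, 0.7, 0.1), (0.25, 0.5, 0.25), (0.1, 0.7, 0.2)],
}

def random_configs(n_trials, rng):
    # Draw each hyperparameter independently and uniformly from its grid.
    return [{k: rng.choice(v) for k, v in SPACE.items()}
            for _ in range(n_trials)]

configs = random_configs(20, random.Random(0))
```

Each sampled configuration would then be trained and scored on the held-out validation split before picking the best setting.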