Aligning Language Models with Demonstrated Feedback
Authors: Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Hyundong Cho, Michael Bernstein, Diyi Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants (N = 16). Across our benchmarks and user study, we find that win rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19 percentage points. |
| Researcher Affiliation | Academia | Omar Shaikh Stanford University EMAIL Michelle S. Lam Stanford University EMAIL Joey Hejna Stanford University EMAIL Yijia Shao Stanford University Hyundong Cho USC Michael S. Bernstein Stanford University Diyi Yang Stanford University |
| Pseudocode | Yes | Algorithm 1: DITTO. Input: LM π_ref, demos D^E = {(x_i, y^E_i)}_{i=1}^N, sample size M, sample frequency K. Init: π_0 ← SFT(π_ref, D^E), t = 0. While not converged: D_t ← ∪_{i=1}^N {(x_i, y_j ∼ π_t(·\|x_i))}_{j=1}^M; for k = 1, 2, 3, ..., K: sample batch B = {(x, y_w, y_l)} of comparisons from the induced ranking D^E ≻ D_t ≻ D_{t-1} ≻ ... ≻ D_0, then π_t ← DPO(π_t, B) # update policy; t ← t + 1. |
| Open Source Code | Yes | Equal Contribution 1Code: https://github.com/SALT-NLP/demonstrated-feedback |
| Open Datasets | Yes | We collect data from 20 distinct authors from two sources: (1) emails and blog posts from the CMCC dataset (Goldstein et al., 2008) that contain only one author and (2) news articles from the CCAT dataset (Lewis et al., 2004). |
| Dataset Splits | Yes | We randomly select 10 authors from each dataset, use 7 samples to train, and split the remainder into test and validation. Table 4 in the Appendix describes the finalized train/val/test counts across each benchmark. |
| Hardware Specification | Yes | All training was conducted on 1 A100 80GB GPU. |
| Software Dependencies | No | The paper mentions using Mistral Instruct v0.2 7B, LoRA, DPO, and AdamW, but does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We run a random hyperparameter sweep over a single, randomly selected author from each corpus, using lr = {1e-4, 3e-4, 1e-5, 3e-5, 1e-6, 3e-6}, epoch = {10, 15, 20, 25, 30}, and β = {0.01, 0.05, 0.1}. We additionally tune how frequently DITTO samples negatives (K = {1, 5, 10}) and how many negatives DITTO samples (M = {1, 5, 10}). Finally, we tuned the replay / expert / intermodel fractions, selecting between 0.2 / 0.7 / 0.1, 0.25 / 0.5 / 0.25, and 0.1 / 0.7 / 0.2. |
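The Algorithm 1 excerpt in the table can be sketched as a minimal Python loop. This is a schematic, not the authors' implementation: `sample_fn` and `dpo_update_fn` are hypothetical stand-ins for sampling from the current policy π_t and performing a DPO update, and the helper names are ours. It illustrates the one detail that is easy to miss in the flattened pseudocode, namely that comparison batches are drawn from the full induced ranking D^E ≻ D_t ≻ D_{t-1} ≻ ... ≻ D_0.

```python
import random


def induced_comparisons(expert, iterates):
    """Build (winner, loser) pairs from the induced ranking
    D^E > D_t > D_{t-1} > ... > D_0: expert demos beat all policy
    samples, and later iterates beat earlier ones."""
    pools = [expert] + list(reversed(iterates))  # best pool first
    pairs = []
    for hi in range(len(pools)):
        for lo in range(hi + 1, len(pools)):
            for y_w in pools[hi]:
                for y_l in pools[lo]:
                    pairs.append((y_w, y_l))
    return pairs


def ditto(sample_fn, dpo_update_fn, expert_demos,
          rounds=3, M=2, K=2, batch_size=4):
    """Schematic DITTO outer loop (hypothetical callables for the
    LM sampler and DPO update)."""
    iterates = []  # D_0, D_1, ..., D_t
    for t in range(rounds):
        # D_t: sample M completions from the current policy pi_t
        iterates.append([sample_fn() for _ in range(M)])
        for _ in range(K):
            pairs = induced_comparisons(expert_demos, iterates)
            batch = random.sample(pairs, min(batch_size, len(pairs)))
            dpo_update_fn(batch)  # pi_t <- DPO(pi_t, B)
    return iterates
```

In the real system the replay / expert / intermodel fractions tuned in the experiment setup would weight how often each tier of this ranking is sampled; here batches are drawn uniformly for simplicity.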
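The random sweep described in the experiment setup row amounts to drawing one configuration uniformly from each listed set. A minimal sketch, assuming independent uniform sampling over exactly the values quoted above (the dictionary keys are our labels, not the paper's):

```python
import random

# Search space quoted from the paper's experiment setup.
SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-5, 3e-5, 1e-6, 3e-6],
    "epochs": [10, 15, 20, 25, 30],
    "beta": [0.01, 0.05, 0.1],
    "K": [1, 5, 10],  # how frequently DITTO samples negatives
    "M": [1, 5, 10],  # how many negatives DITTO samples
    # replay / expert / intermodel fractions
    "fractions": [(0.2, 0.7, 0.1), (0.25, 0.5, 0.25), (0.1, 0.7, 0.2)],
}


def sample_config(rng=random):
    """Draw one random configuration from the sweep's search space."""
    return {name: rng.choice(choices) for name, choices in SEARCH_SPACE.items()}
```

Each trial of the sweep would call `sample_config()` once and train on the single held-out author with the resulting settings.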