Learning from Natural Language Feedback

Authors: Angelica Chen, Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, Ethan Perez

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the target distribution and demonstrate proof-of-concepts on text summarization and program synthesis tasks. For code generation, ILF improves a CodeGen-Mono 6.1B model's pass@1 rate from 22% to 36% on the MBPP benchmark, outperforming both fine-tuning on MBPP and fine-tuning on human-written repaired programs. For summarization, we show that ILF can be combined with learning from human preferences to improve a GPT-3 model's summarization performance to be comparable to human quality, outperforming fine-tuning on human-written summaries.
Researcher Affiliation Collaboration Angelica Chen EMAIL New York University Jérémy Scheurer EMAIL Apollo Research Jon Ander Campos EMAIL New York University; HiTZ Center, University of the Basque Country UPV/EHU Tomasz Korbak EMAIL New York University; FAR AI; University of Sussex Jun Shern Chan EMAIL New York University; FAR AI Samuel R. Bowman EMAIL New York University; Anthropic PBC Kyunghyun Cho EMAIL New York University; Genentech; CIFAR LMB Ethan Perez EMAIL New York University; FAR AI; Anthropic PBC
Pseudocode Yes Algorithm 1 A single round of imitation learning from natural language feedback for code generation.
1: Input: Dataset D, initial LLM πθ, unit test verification function Eval : V → {0, 1}, LLM πRefine trained to incorporate feedback into code
2: C ← {(x0, t, u) | x0 ∼ πθ(· | t), Eval(x0, t) = 0, (t, u) ∈ D}
3: Cannotated ← {(x0, f, t) | (x0, t, u) ∈ C} ▷ Humans write feedback f for x0 ∈ C.
4: DRefined ← {(t, x1) | x1 ∼ πRefine(· | t, x0, f), Eval(x1, t) = 1, (x0, f, t) ∈ Cannotated} ▷ πRefine generates refinements x1 that incorporate feedback f into x0.
5: πθ* ← Finetune(πθ, DRefined)
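The round described in Algorithm 1 can be sketched in plain Python. Here `sample`, `refine`, `finetune`, and `collect_feedback` are hypothetical stand-ins for the paper's initial policy πθ, refinement model πRefine, fine-tuning step, and human annotation, and `eval_code` is a toy substitute for running a task's unit tests:

```python
def eval_code(code, task):
    # Toy stand-in for unit-test verification Eval: 1 if the program
    # "passes" (here: merely contains a return statement), else 0.
    return 1 if "return" in code else 0

def ilf_round(dataset, sample, refine, finetune, collect_feedback):
    """One ILF round over (task, unit_tests) pairs, mirroring Algorithm 1."""
    # Step 2: collect initial completions x0 that fail the unit tests.
    incorrect = []
    for task, tests in dataset:
        x0 = sample(task)
        if eval_code(x0, task) == 0:
            incorrect.append((x0, task))
    # Step 3: humans write feedback f for each failing program.
    annotated = [(x0, collect_feedback(x0, task), task)
                 for x0, task in incorrect]
    # Step 4: keep only refinements x1 that now pass the unit tests.
    refined = [(task, x1)
               for x0, f, task in annotated
               for x1 in [refine(task, x0, f)]
               if eval_code(x1, task) == 1]
    # Step 5: fine-tune the policy on the correct refinements.
    return finetune(refined)
```

Filtering refinements through `eval_code` before fine-tuning is the key design choice: only feedback that demonstrably fixed the program reaches the training set.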
Open Source Code Yes Our data and code are open-sourced at https://github.com/nyu-mll/ILF-for-code-generation.
Open Datasets Yes Our data and code are open-sourced at https://github.com/nyu-mll/ILF-for-code-generation. Dataset We train and evaluate our models on the Mostly Basic Python Problems (MBPP) dataset (Odena et al., 2021). We evaluate the effectiveness of ILF on the task of text summarization using the TL;DR dataset (Völske et al., 2017), which consists of Reddit titles, posts, and their corresponding summaries. We then hire experienced annotators through Surge AI to create our language feedback dataset, which we open source along with our code. Data: https://huggingface.co/datasets/JeremyAlain/SLF5K
Dataset Splits Yes MBPP includes a designated prompt/training/validation/test split of the dataset, but we re-split the dataset into the following splits:
MBPPRefine: Tasks with IDs in the range 111-310 for which CodeGen-Mono 6.1B did not generate any correct completions. For the experiments where πRefine is a fine-tuned model, this split is used to train πRefine.
MBPPTrain: Tasks with IDs in the range 311-974 for which CodeGen-Mono 6.1B did not generate any correct completions. This split is first used to evaluate the correctness of refinements generated by πRefine. Then, the correct refinements in this split are used to train πθ to obtain πθ* (step 5 in Algorithm 1).
MBPPTest: Tasks with IDs in the range 11-110 that we use to evaluate the final performance of πθ*.
To ensure the quality of our dataset, we follow the same preprocessing steps as outlined in Stiennon et al. (2020) and extract a train dataset with 5000 samples, a development dataset with 200 samples, a validation dataset with 500 samples, and a test dataset with 698 samples.
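The ID-based re-split above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: each task is assumed to be a dict with a `task_id` field, and `solved_ids` (a hypothetical input) holds the IDs for which CodeGen-Mono 6.1B already produced a correct completion:

```python
def resplit_mbpp(tasks, solved_ids):
    """Re-split MBPP tasks by ID range, dropping already-solved tasks
    from the two training-side splits (MBPPTest keeps all IDs 11-110)."""
    splits = {"MBPPRefine": [], "MBPPTrain": [], "MBPPTest": []}
    for task in tasks:
        tid = task["task_id"]
        if 11 <= tid <= 110:
            splits["MBPPTest"].append(task)       # final evaluation of the policy
        elif 111 <= tid <= 310 and tid not in solved_ids:
            splits["MBPPRefine"].append(task)     # training data for pi_Refine
        elif 311 <= tid <= 974 and tid not in solved_ids:
            splits["MBPPTrain"].append(task)      # refinements here fine-tune pi_theta
    return splits
```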
Hardware Specification Yes We selected this model because it is open-source, can be fine-tuned on a single 4×A100 (80 GB) node, and demonstrated pass@k scores comparable to Codex-12B (Chen et al., 2021; Nijkamp et al., 2022).
Software Dependencies Yes We implement all experimental pipelines with the Hugging Face transformers (v4.12.5) (Wolf et al., 2020), Hugging Face datasets (v2.7.1) (Lhoest et al., 2021), and PyTorch (v1.11) (Paszke et al., 2019) libraries.
Experiment Setup Yes For the experiments in Section 3.3, we run a hyperparameter sweep for all methods except for ILF. The hyperparameter value ranges that we sweep include learning rate ∈ {1.0e-6, 5.0e-6, 1.0e-5}, batch size ∈ {32, 64, 128}, and number of epochs ∈ {1, 2, 5}. For all summarization experiments we sample up to 48 tokens (as in Stiennon et al., 2020) with nucleus sampling (Holtzman et al., 2019) with p = 0.95 and temperature t = 1.0. To determine the optimal hyperparameters, we perform a sweep over a range of values for the following parameters: epochs ∈ {1, 2, 3, 4}, prompt loss weight ∈ {0, 0.01, 0.05, 0.1}, and learning rates ∈ {0.02, 0.05, 0.1, 0.2}. We first sweep over epochs and select the best value, then perform a sweep using that value for the prompt loss weight, and so on. Our empirical observations indicate that the number of epochs has the greatest impact on perplexity, with training for more than one epoch resulting in overfitting. The selected hyperparameters can be found in Table 16.
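The one-parameter-at-a-time sweep described above (sweep epochs first, carry the best value into the prompt-loss-weight sweep, and so on) can be sketched as a small coordinate-descent loop. `evaluate` is a hypothetical stand-in for a fine-tuning run that returns validation perplexity (lower is better):

```python
def sequential_sweep(grid, defaults, evaluate):
    """Sweep hyperparameters one at a time, in the order given by `grid`,
    carrying each parameter's best value forward into later sweeps."""
    best = dict(defaults)
    for name, values in grid.items():
        # Try each candidate value with all other parameters fixed at
        # their current best, then keep the lowest-scoring value.
        scores = {v: evaluate(dict(best, **{name: v})) for v in values}
        best[name] = min(scores, key=scores.get)
    return best
```

This is cheaper than a full grid search (sum of the grid sizes rather than their product) but can miss optima that require changing two parameters at once, which is the trade-off implied by the paper's sequential procedure.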