Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
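For illustration, the sketch below shows what such a pipeline can look like: a model is prompted for a verdict on each reproducibility variable, and the predictions are checked against a manually labeled validation set. The client setup, model name, prompt, and helper functions are illustrative assumptions, not the actual pipeline described in [1].

```python
# Minimal sketch of an LLM-based reproducibility classifier with validation
# against a manually labeled set. Assumes an OpenAI-compatible API; the
# model name, prompt, and label scheme are illustrative, not those of [1].
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VARIABLES = [
    "Research Type", "Researcher Affiliation", "Pseudocode",
    "Open Source Code", "Open Datasets", "Dataset Splits",
    "Hardware Specification", "Software Dependencies", "Experiment Setup",
]

def classify(paper_text: str, variable: str) -> str:
    """Prompt the model for a verdict on one reproducibility variable."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical model choice
        messages=[
            {"role": "system", "content": (
                "Classify the given reproducibility variable for the paper. "
                "Reply with a one-word verdict followed by a justification "
                "quoting the paper."
            )},
            {"role": "user",
             "content": f"Variable: {variable}\n\nPaper:\n{paper_text}"},
        ],
    )
    return response.choices[0].message.content

def validation_accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of variables where the LLM verdict matches a human label."""
    return sum(predicted[v] == gold[v] for v in gold) / len(gold)
```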

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Authors: Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric J Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Chenyu Zhang, Ruiqi Zhong, Sean O hEigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Aleksandar Petrov, Christian Schroeder de Witt, Sumeet Ramesh Motwani, Yoshua Bengio, Danqi Chen, Philip Torr, Samuel Albanie, Tegan Maharaj, Jakob Nicolaus Foerster, Florian Tramèr, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The second category, Development and Deployment Methods (Section 3), presents the known limitations of existing techniques in assuring safety and alignment in LLMs. We identify opportunities to help improve model alignment by modifying the pretraining process to produce more aligned models, survey several limitations of finetuning in assuring alignment and safety, discuss issues underlying the evaluation crisis, review challenges in interpreting and explaining model behavior, and finally provide an appraisal of security challenges like jailbreaks, prompt-injections, and data poisoning. On the whole, this section pertains to researching empirical techniques that may help improve the alignment, safety, and security of LLMs. From Section 3.3 (LLM Evaluations Are Confounded and Biased): Sound and fair empirical evaluations are necessary to develop a calibrated understanding of the capabilities of LLMs, as well as their risks. Evaluation has historically been a sore point in the fields of machine learning and natural language processing (Raji et al., 2021; Bowman and Dahl, 2021; Liao et al., 2021; Hutchinson et al., 2022; Kapoor and Narayanan, 2023; McIntosh et al., 2024); however, the evaluation crisis is considerably more acute for LLMs (Mitchell, 2023).
Researcher Affiliation | Collaboration | 1 University of Cambridge; 2 New York University; 3 ETH Zurich; 4 UNC Chapel Hill; 5 University of Michigan; 6 University of California, Berkeley; 7 Massachusetts Institute of Technology; 8 University of Oxford; 9 Harvard University; 10 Peking University; 11 LMU Munich; 12 University of Virginia; 13 Universitat Politècnica de València; 14 University of Sussex; 15 Stanford University; 16 Modulo Research; 17 Center for the Governance of AI; 18 Newcastle University; 19 Mila Quebec AI Institute, Université de Montréal; 20 Princeton University; 21 University of Toronto; 22 University of Edinburgh; 23 University of Washington, Allen Institute for AI
Pseudocode | No | The paper describes various concepts and challenges but does not include any explicitly labeled pseudocode or algorithm blocks. Methodologies are discussed in narrative form.
Open Source Code | No | The paper is a survey and agenda-setting document. It identifies challenges and research questions but does not describe a specific methodology for which source code would be provided. There are no explicit statements about releasing code for the work described in this paper, nor any links to code repositories.
Open Datasets | No | The paper cites existing datasets like "The Pile (Gao et al., 2020)" and "Red Pajama (Together Computer, 2023)" as examples of data used in other LLM research. However, this paper itself is a survey and agenda, and does not conduct its own experiments that would utilize a specific dataset for which it provides access information.
Dataset Splits | No | The paper is a survey and agenda-setting document that does not describe its own experiments or data collection. Therefore, it does not provide specific details about training, validation, or test dataset splits.
Hardware Specification | No | The paper is a survey and agenda-setting document. It does not describe any experiments conducted by the authors that would require specific hardware specifications.
Software Dependencies | No | The paper is a survey and agenda-setting document. It does not describe any experiments conducted by the authors that would require specific software dependencies or their version numbers.
Experiment Setup | No | The paper is a survey and agenda-setting document. It identifies challenges and poses research questions but does not detail a specific experimental setup, hyperparameters, or training configurations for its own methodology.
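The notice above describes these variables as "underlying each score". The exact scoring formula is specified in [1]; purely as a hedged illustration, the sketch below assumes an unweighted fraction of satisfied binary variables, omitting the two categorical rows (Research Type and Researcher Affiliation).

```python
# Illustrative only: how the binary verdicts above could roll up into a
# score. The real formula is defined in [1]; this assumes an unweighted
# average over the seven Yes/No variables.
verdicts = {
    "Pseudocode": "No",
    "Open Source Code": "No",
    "Open Datasets": "No",
    "Dataset Splits": "No",
    "Hardware Specification": "No",
    "Software Dependencies": "No",
    "Experiment Setup": "No",
}

score = sum(v == "Yes" for v in verdicts.values()) / len(verdicts)
print(f"Estimated reproducibility score: {score:.2f}")  # 0.00 for this survey
```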