Clone-Robust AI Alignment

Authors: Ariel D. Procaccia, Benjamin Schiffer, Shirley Zhang

ICML 2025

Reproducibility assessment (variable: result, followed by the supporting LLM response):
Research Type: Experimental. "Although our contributions are primarily theoretical, we supplement our results with a synthetic case study that highlights an instance where the weighted MLE is more robust than the standard regularized MLE under diverse preferences. This case study moves our theory closer to practice in a few different ways. First, LLMs typically generate the responses seen by human annotators, and so our case study considers textual responses generated by the gpt-4o-mini model. ... In this experiment, we show that the output of the standard MLE is significantly more affected by the presence of approximate clones than the output of the weighted MLE, which supports our theoretical results."
Researcher Affiliation: Academia. "Department of Computer Science, Harvard University; Department of Statistics, Harvard University."
Pseudocode: No. The paper describes its algorithms (the standard MLE and the weighted MLE) and their mathematical formulations (Equation 1 and Equation 2), but it does not present them in a structured pseudocode block or algorithm box.
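Since the excerpt does not reproduce Equation 1 or Equation 2, the sketch below assumes the conventional form of the standard MLE for learning rewards from pairwise preferences: a regularized Bradley-Terry negative log-likelihood, minimized by gradient descent. The paper's exact objectives, and the weights used by its weighted MLE, may differ; this is illustrative only.

```python
import numpy as np

def fit_mle(n_items, prefs, steps=500, lr=0.1, reg=0.01):
    """Fit per-item rewards by gradient descent on the regularized
    Bradley-Terry negative log-likelihood (an assumed stand-in for
    the paper's Equation 1, which is not shown in this excerpt).

    prefs: list of (winner, loser) index pairs.
    """
    r = np.zeros(n_items)
    for _ in range(steps):
        g = 2.0 * reg * r  # gradient of the L2 regularizer
        for w, l in prefs:
            # P(w beats l) under Bradley-Terry with rewards r
            p = 1.0 / (1.0 + np.exp(-(r[w] - r[l])))
            # gradient of -log p with respect to r[w] and r[l]
            g[w] -= (1.0 - p)
            g[l] += (1.0 - p)
        r -= lr * g
    return r

# Ten identical preferences for item 0 over item 1 push r[0] above r[1].
rewards = fit_mle(2, [(0, 1)] * 10)
```

A weighted MLE in this style would multiply each pair's loss term by a per-datapoint weight before summing, which is how the paper's approach discounts approximate clones.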
Open Source Code: No. The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials.
Open Datasets: No. "In the case study, our goal is to train a reward function which evaluates answers to a single question: Describe Paris. We use OpenAI's gpt-4o-mini model both to generate textual descriptions of Paris and to simulate human annotators with diverse preferences. ... We then construct two preference datasets for this population, one which includes additional approximate clones (Clones) and one which does not (Original)."
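The dataset-construction step described above can be mimicked without API calls by simulating annotators with diverse utilities who choose between sampled response pairs with Bradley-Terry noise. Everything here (the utilities, the number of annotator types, the sampling scheme) is an illustrative assumption, not the paper's actual prompts or annotator models.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_prefs(n_responses, annotator_utils, n_pairs=1000):
    """Build a synthetic preference dataset: each data point samples two
    distinct responses and one annotator type, and the annotator prefers
    the higher-utility response with Bradley-Terry noise."""
    data = []
    for _ in range(n_pairs):
        i, j = rng.choice(n_responses, size=2, replace=False)
        u = annotator_utils[rng.integers(len(annotator_utils))]
        p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))  # P(i beats j)
        w, l = (i, j) if rng.random() < p else (j, i)
        data.append((int(w), int(l)))
    return data

# Three responses, two annotator types with diverse (conflicting) utilities.
utils = np.array([[2.0, 0.0, 1.0],
                  [0.0, 2.0, 1.0]])
original = simulate_prefs(3, utils, n_pairs=1000)
```

A "Clones" variant would add near-duplicates of one response to the response pool before sampling, which is the perturbation the paper uses to stress-test the two estimators.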
Dataset Splits: No. "Each dataset contains 1000 data points. Full prompts and sample responses for all interactions with gpt-4o-mini are given below. ... Training on dataset Original leads to romance being the topic with the highest reward. However, training on dataset Clones leads to art being the topic with the highest reward."
Hardware Specification: No. The paper mentions using OpenAI's gpt-4o-mini model for generating responses and simulating annotators, and OpenAI's text-embedding-3-small model for extracting embeddings. It also states, "We approximate both the standard MLE algorithm and the weighted MLE algorithm using neural networks." However, no specific details about the hardware (e.g., GPU models, CPU types, memory) used for training these neural networks or running the simulations are provided.
Software Dependencies: No. "To generate the context vectors, we first use OpenAI's text-embedding-3-small model to extract embedding vectors from the textual descriptions of Paris in each dataset. For each dataset, we then conduct principal component analysis (PCA) on the associated embedding vectors using the PCA class from sklearn.decomposition... implemented the training using PyTorch."
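The embedding-then-PCA pipeline quoted above can be sketched with scikit-learn's PCA class, which the paper names. Random vectors stand in for the API output (text-embedding-3-small returns 1536-dimensional embeddings by default), and the target dimension of 10 is an assumption; the excerpt does not state the number of components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for text-embedding-3-small outputs (1536-dim by default);
# random vectors here since we are not calling the OpenAI API.
embeddings = rng.normal(size=(100, 1536))

# Reduce the embeddings to low-dimensional context vectors, as the paper
# does. n_components=10 is an illustrative choice, not the paper's value.
pca = PCA(n_components=10)
contexts = pca.fit_transform(embeddings)
```

These context vectors are what the reward networks described in the experiment setup take as input.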
Experiment Setup: Yes. "Each neural network takes as input a context vector and outputs a reward value. ... The neural network we used had 2 layers, and each hidden layer had size 32 (and output size 1). For training we used the Adam optimizer with a learning rate of 10^-4 and a batch size of 512 and implemented the training using PyTorch. We trained the neural networks for 500 steps and then averaged the results over 20 runs to form the graphs included in this paper."
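The reported setup (2-layer network, hidden size 32, scalar output, Adam at learning rate 10^-4, batch size 512, 500 steps, PyTorch) can be sketched as below. The ReLU activation, the input dimension, the synthetic data, and the Bradley-Terry logistic loss are assumptions; the excerpt specifies the architecture and optimizer hyperparameters but not the loss or activation.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """2-layer reward model: hidden size 32, scalar output, per the paper.
    ReLU is an assumed activation; the excerpt does not name one."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

dim = 10  # context-vector dimension (assumed; not stated in the excerpt)
model = RewardNet(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr 10^-4, as reported

# One synthetic batch of 512 (winner, loser) context-vector pairs.
winners = torch.randn(512, dim)
losers = torch.randn(512, dim)

for _ in range(500):  # 500 training steps, as reported
    # Assumed Bradley-Terry loss: maximize log P(winner beats loser).
    loss = -torch.nn.functional.logsigmoid(model(winners) - model(losers)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The weighted MLE variant would scale each pair's loss term by a per-datapoint weight before averaging; the paper reports averaging results over 20 such runs.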