Inverse Constitutional AI: Compressing Preferences into Principles
Authors: Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 3. Providing experimental results and case studies. We test our approach experimentally on four datasets: (a) we first provide a proof-of-concept on synthetic data with known underlying principles; (b) we then demonstrate applicability to human-annotated data on the Alpaca Eval dataset (Dubois et al., 2023); (c) we showcase applicability to interpreting individual user preferences via Chatbot Arena Conversations data (Zheng et al., 2023); (d) we investigate the use-case of bias detection on different datasets; and finally (e) we demonstrate our method's ability to help interpret differing group preferences on PRISM data (Kirk et al., 2024). We demonstrate the highly sample-efficient generation of personalised constitutions with human-readable and editable principles. We release our code at https://github.com/rdnfn/icai. |
| Researcher Affiliation | Academia | Arduin Findeis, University of Cambridge, Cambridge, UK; Timo Kaufmann, LMU Munich, MCML Munich, Munich, Germany; Eyke Hüllermeier, LMU Munich, MCML Munich; DFKI, Kaiserslautern, Germany; Samuel Albanie, London, UK; Robert Mullins, University of Cambridge, Cambridge, UK |
| Pseudocode | No | We propose a first Inverse Constitutional AI (ICAI) algorithm, outlined in Figure 2, consisting of five main steps: principle generation, principle clustering, principle subsampling, principle testing, and principle filtering. In the following, we describe each step in detail. |
| Open Source Code | Yes | We release the source code for our algorithm and experiments at https://github.com/rdnfn/icai. |
| Open Datasets | Yes | We test our approach experimentally on four datasets: (1) synthetic data to demonstrate the basic functionality of our algorithm, (2) human-annotated Alpaca Eval data to demonstrate the applicability of our algorithm to real-world data, (3) Chatbot Arena data to illustrate the application of our algorithm to infer individual user preferences, and (4) PRISM data to showcase interpreting group preferences with our algorithm. Full dataset details are available in Appendix A. (...) Alpaca Eval is a dataset of 648 human-annotated preferences (...). It is licensed under CC-BY-NC-4.0 and can be accessed at https://huggingface.co/datasets/tatsu-lab/alpaca_eval. Chatbot Arena Conversations is a dataset of 33,000 preferences (...). It is licensed under CC-BY-NC-4.0 and can be accessed at https://huggingface.co/datasets/lmsys/chatbot_arena_conversations. PRISM is a dataset of 8,011 human-annotated preferences (...). It is licensed under CC-BY-4.0 and can be accessed at https://huggingface.co/datasets/HannahRoseKirk/prism-alignment. Anthropic HH-RLHF is a collection of human-annotated preference datasets by Bai et al. (2022a) (...). The data is available under MIT license at https://github.com/anthropics/hh-rlhf. |
| Dataset Splits | Yes | For each seed, we randomly select mutually exclusive training and test subsets with 65 annotated pairs each. Constitutions are generated on the training subset and results reported on the (unseen) test subset. (...) We repeat the experiment on the Alpaca Eval unaligned dataset with the full 648 preference pairs in the original dataset, using 324 samples each for training and testing. (...) Due to limited samples, there is no training-test split. (...) We randomly sample two training sets of 100 data points each from separate helpful and harmless datasets in Anthropic HH-RLHF.9 (...) We similarly sample two separate test sets of 1,000 data points from each dataset. |
| Hardware Specification | No | We primarily use two models from OpenAI: GPT-3.5-Turbo and GPT-4o. (...) All experiments were run using models via API access from OpenAI and Anthropic. |
| Software Dependencies | No | We primarily use two models from OpenAI: GPT-3.5-Turbo and GPT-4o. (...) text-embedding-ada-002 embedding model for clustering steps in the algorithm (across all experiments). Detailed model descriptions of these OpenAI models are available at https://platform.openai.com/docs/models/. Certain experiments use additional models, these are described in the relevant experiments discussions. (...) All experiments were run using models via API access from OpenAI and Anthropic. |
| Experiment Setup | Yes | For each seed, we randomly select mutually exclusive training and test subsets with 65 annotated pairs each. Constitutions are generated on the training subset and results reported on the (unseen) test subset. (...) We primarily use two models from OpenAI: GPT-3.5-Turbo and GPT-4o. (...) Full dataset details are available in Appendix A. Example constitutions in all figures were chosen for illustrative purposes. We provide more constitutions, experiment details (including numerical results), and model details in Appendices D, F and H, respectively. (...) We use annotators from the Alpaca Eval framework (...) To evaluate constitution effectiveness, we create custom prompts (...) Our method introduces an important hyperparameter n that determines the number of principles in the constitution. (...) Thus, we use n = 5 in our experiments. (...) Fine-tuning is performed for up to five additional epochs and a batch size of 1, with validation accuracy used to select the best model. |