Shh, don't say that! Domain Certification in LLMs
Authors: Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Philip Torr, Adel Bibi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior. [...] 3 EXPERIMENTS We empirically test our method proposed in Section 2.2 across 3 domains: Shakespeare, Computer Science News, and Medical QA. After describing the experimental setup in Section 3.1, we examine the rejection behavior of our method by examining the log L(y|x)/G(y) ratio and associated certificates under a finite set of ground-truth test samples from T and F in Section 3.2. In Section 3.3, we repeat this analysis by applying our Algorithm 1. Finally, we demonstrate how to evaluate a certified model on standardized benchmarks in Section 3.4. |
| Researcher Affiliation | Academia | 1University of Oxford 2Vienna University of Technology |
| Pseudocode | Yes | Algorithm 1 VALID. Require: LLM L, guide model G, hyperparameters k and T, prompt x. For t ∈ {1, ..., T}: sample y ~ L(·\|x); set N_y ← length(y); if log(L(y\|x)/G(y)) ≤ k·N_y, return y. Otherwise return: Abstained. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository. It mentions using pre-trained models and toolkits, but not their own implementation's code. |
| Open Datasets | Yes | We fine-tune a Gemma-2-2b (Team et al., 2024) as model L and train a GPT-2 architecture (33.7M parameters, Radford et al. (2019)) from scratch for G on Tiny Shakespeare (TS) (Karpathy, 2015). We use TS's test split as in-domain dataset, DT, and following previous literature (Zhang et al., 2024) compose DF of IMDB (Maas et al., 2011), RTE (Wang et al., 2019) and SST2 (Minaee et al., 2024), adding an old Bible dataset (Reis, 2019) as it is linguistically close to Tiny Shakespeare. [...] 20NG dataset (Lang, 1995). [...] PubMedQA (Jin et al., 2019). [...] Stanford Question Answering Dataset (SQuAD; excluding medical categories; Rajpurkar et al. (2016)) |
| Dataset Splits | Yes | We use TS's test split as in-domain dataset, DT, and following previous literature (Zhang et al., 2024) compose DF of IMDB (Maas et al., 2011), RTE (Wang et al., 2019) and SST2 (Minaee et al., 2024), adding an old Bible dataset (Reis, 2019) as it is linguistically close to Tiny Shakespeare. At testing, we consider 256-token long sequences and use the first 128 tokens as prompt. [...] We use computer science articles from 20NG's test split as target domain DT and the remaining categories as DF together with the OOD dataset used for Shakespeare. [...] We use the PubMedQA test set as in-domain dataset DT [...] For the CharTask dataset described in Appendix D.1, we create two distinct datasets with non-overlapping splits for training, validation, and testing. The in-domain dataset consists of 1M training samples. The generalist dataset DT+F = CharTask (All, Int + Char) contains all possible tasks with sequences consisting of integers and characters. We use 1M training sequences per task, and hence 4M sequences in total. The validation and test sets are 64 sequences and 4096 sequences, respectively. |
| Hardware Specification | Yes | On 8 H100, the total training takes about 2 hours. |
| Software Dependencies | Yes | To do so, we utilise the scikit-learn (v1.5.1) (Pedregosa et al., 2011) options to remove headers, footers and quotes. |
| Experiment Setup | Yes | We train L with AdamW (weight decay 0.01) for 1536 steps with a cosine learning rate schedule with 64 warmup steps, a maximum learning rate of 5e-5, scheduled for 32 epochs. We train with a 256-token context window using next-token prediction. |
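The pseudocode row above describes VALID as rejection sampling on a per-token log-likelihood ratio between the main model L and the guide model G. A minimal sketch of that loop, with the model interfaces (`sample_fn`, `logp_L`, `logp_G`) as hypothetical stand-ins for real LLM calls:

```python
def valid_decode(sample_fn, logp_L, logp_G, k, T, prompt):
    """Sketch of Algorithm 1 (VALID) as quoted above.

    sample_fn(prompt) -> candidate token sequence y, drawn from L(.|x)
    logp_L(y, prompt) -> log L(y|x), log-likelihood under the main model
    logp_G(y)         -> log G(y), log-likelihood under the guide model

    A candidate is returned when its total log-likelihood ratio stays
    below k per token; otherwise we resample, up to T attempts.
    """
    for _ in range(T):
        y = sample_fn(prompt)
        n_y = len(y)
        # Accept if log(L(y|x)/G(y)) <= k * N_y, as in the algorithm.
        if logp_L(y, prompt) - logp_G(y) <= k * n_y:
            return y
    return None  # abstain after T rejected samples


# Toy usage with fixed log-probabilities (not real models):
accepted = valid_decode(
    sample_fn=lambda x: [1, 2, 3],
    logp_L=lambda y, x: -3.0,   # log ratio = -1.0 <= k * 3 -> accept
    logp_G=lambda y: -2.0,
    k=1.0, T=5, prompt=[],
)
```

The abstention branch is what makes the certificate meaningful: out-of-domain completions tend to have a large L-to-G likelihood ratio, so they exhaust the T attempts and the model refuses rather than answer.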
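The experiment-setup row specifies a cosine learning-rate schedule with 64 warmup steps, a 5e-5 peak, and 1536 total steps. A dependency-free sketch of one common reading of that schedule (linear warmup to the peak, then cosine decay to zero; the exact warmup and floor behavior in the paper may differ):

```python
import math

def lr_at_step(step, max_lr=5e-5, warmup_steps=64, total_steps=1536, min_lr=0.0):
    """Learning rate at a given optimizer step, per the quoted setup.

    Linear warmup over the first `warmup_steps`, then cosine decay
    from `max_lr` down to `min_lr` at `total_steps`.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from max_lr/warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these defaults the rate peaks at 5e-5 at the end of warmup and reaches zero at step 1536; in practice the same shape is usually obtained from a framework scheduler rather than hand-rolled.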