Shh, don't say that! Domain Certification in LLMs
Authors: Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Philip Torr, Adel Bibi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior. [...] 3 EXPERIMENTS We empirically test our method proposed in Section 2.2 across 3 domains: Shakespeare, Computer Science News, and Medical QA. After describing the experimental setup in Section 3.1, we examine the rejection behavior of our method by examining the log L(y|x)/G(y) ratio and associated certificates under a finite set of ground-truth test samples from T and F in Section 3.2. In Section 3.3, we repeat this analysis by applying our Algorithm 1. Finally, we demonstrate how to evaluate a certified model on standardized benchmarks in Section 3.4. |
| Researcher Affiliation | Academia | 1University of Oxford 2Vienna University of Technology |
| Pseudocode | Yes | Algorithm 1 VALID. Require: LLM L, guide model G, hyperparameters k and T, prompt x. For t ∈ {1, ..., T}: sample y ~ L(·\|x); set N_y ← length(y); if log(L(y\|x)/G(y)) ≤ k·N_y, return y. Otherwise return: Abstained. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository. It mentions using pre-trained models and toolkits, but not their own implementation's code. |
| Open Datasets | Yes | We fine-tune a Gemma-2-2b (Team et al., 2024) as model L and train a GPT-2 architecture (33.7M parameters, Radford et al. (2019)) from scratch for G on Tiny Shakespeare (TS) (Karpathy, 2015). We use TS's test split as in-domain dataset, DT, and following previous literature (Zhang et al., 2024) compose DF of IMDB (Maas et al., 2011), RTE (Wang et al., 2019) and SST2 (Minaee et al., 2024), adding an old Bible dataset (Reis, 2019) as it is linguistically close to Tiny Shakespeare. [...] 20NG dataset (Lang, 1995). [...] PubMedQA (Jin et al., 2019). [...] Stanford Question Answering Dataset (SQuAD; excluding medical categories; Rajpurkar et al. (2016)) |
| Dataset Splits | Yes | We use TS's test split as in-domain dataset, DT, and following previous literature (Zhang et al., 2024) compose DF of IMDB (Maas et al., 2011), RTE (Wang et al., 2019) and SST2 (Minaee et al., 2024), adding an old Bible dataset (Reis, 2019) as it is linguistically close to Tiny Shakespeare. At testing, we consider 256-token long sequences and use the first 128 tokens as prompt. [...] We use computer science articles from 20NG's test split as target domain DT and the remaining categories as DF together with the OOD dataset used for Shakespeare. [...] We use the PubMedQA test set as in-domain dataset DT [...] For the CharTask dataset described in Appendix D.1, we create two distinct datasets with non-overlapping splits for training, validation, and testing. The in-domain dataset consists of 1M training samples. The generalist dataset DT+F = CharTask (All, Int + Char) contains all possible tasks with sequences consisting of integers and characters. We use 1M training sequences per task, and hence 4M sequences in total. The validation and test sets are 64 sequences and 4096 sequences, respectively. |
| Hardware Specification | Yes | On 8 H100, the total training takes about 2 hours. |
| Software Dependencies | Yes | To do so, we utilise the scikit-learn (v1.5.1) (Pedregosa et al., 2011) options to remove headers, footers and quotes. |
| Experiment Setup | Yes | We train L with AdamW (weight decay 0.01) for 1536 steps with a cosine learning rate schedule with 64 warmup steps, a maximum learning rate of 5e-5, scheduled for 32 epochs. We train with a 256-token context window using next-token prediction. |
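The pseudocode row above describes VALID as rejection sampling on a per-token log-likelihood ratio between the main model L and the guide model G. A minimal sketch of that loop, with the model interfaces (`sample_fn`, `logp_L`, `logp_G`) as hypothetical stand-ins for real LLM calls:

```python
def valid_decode(sample_fn, logp_L, logp_G, k, T, prompt):
    """Sketch of Algorithm 1 (VALID) as quoted above.

    sample_fn(prompt) -> candidate token sequence y, drawn from L(.|x)
    logp_L(y, prompt) -> log L(y|x), log-likelihood under the main model
    logp_G(y)         -> log G(y), log-likelihood under the guide model

    A candidate is returned when its total log-likelihood ratio stays
    below k per token; otherwise we resample, up to T attempts.
    """
    for _ in range(T):
        y = sample_fn(prompt)
        n_y = len(y)
        # Accept if log(L(y|x)/G(y)) <= k * N_y, as in the algorithm.
        if logp_L(y, prompt) - logp_G(y) <= k * n_y:
            return y
    return None  # abstain after T rejected samples


# Toy usage with fixed log-probabilities (not real models):
accepted = valid_decode(
    sample_fn=lambda x: [1, 2, 3],
    logp_L=lambda y, x: -3.0,   # log ratio = -1.0 <= k * 3 -> accept
    logp_G=lambda y: -2.0,
    k=1.0, T=5, prompt=[],
)
```

The abstention branch is what makes the certificate meaningful: out-of-domain completions tend to have a large L-to-G likelihood ratio, so they exhaust the T attempts and the model refuses rather than answer.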
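The experiment-setup row specifies a cosine learning-rate schedule with 64 warmup steps, a 5e-5 peak, and 1536 total steps. A dependency-free sketch of one common reading of that schedule (linear warmup to the peak, then cosine decay to zero; the exact warmup and floor behavior in the paper may differ):

```python
import math

def lr_at_step(step, max_lr=5e-5, warmup_steps=64, total_steps=1536, min_lr=0.0):
    """Learning rate at a given optimizer step, per the quoted setup.

    Linear warmup over the first `warmup_steps`, then cosine decay
    from `max_lr` down to `min_lr` at `total_steps`.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from max_lr/warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these defaults the rate peaks at 5e-5 at the end of warmup and reaches zero at step 1536; in practice the same shape is usually obtained from a framework scheduler rather than hand-rolled.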