What's New in My Data? Novelty Exploration via Contrastive Generation

Authors: Masaru Isonuma, Ivan Titov

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we assess the effectiveness of CGE in detecting novel domains within fine-tuning datasets. We begin by evaluating the contrastive score's ability to distinguish between novel and in-distribution examples, comparing its performance against existing novelty detection methods. This preliminary analysis assumes access to the fine-tuning dataset. Then we proceed to the main experiment, where we operate under the assumption of no access to the fine-tuning dataset. Here, we demonstrate that CGE can identify novel domains through generation."
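The contrastive score referenced above can be sketched as the gap between the fine-tuned and pretrained models' log-likelihoods: text the fine-tuned model finds far more likely than the pretrained model did is a candidate novel-domain example. The pure-Python sketch below illustrates this idea with toy per-token log-probabilities; the function name and all numbers are illustrative assumptions, not the authors' implementation.

```python
def contrastive_score(tokens, logp_ft, logp_pre):
    """Length-normalized difference between fine-tuned and pretrained
    token log-likelihoods. Large positive values flag content the
    fine-tuned model assigns much higher probability, i.e. candidate
    novel-domain examples."""
    total = sum(logp_ft[t] - logp_pre[t] for t in tokens)
    return total / len(tokens)

# Toy per-token log-probabilities (illustrative numbers only).
logp_pre = {"the": -1.0, "cat": -3.0, "println": -9.0, "fn": -8.5}
logp_ft  = {"the": -1.1, "cat": -3.1, "println": -4.0, "fn": -3.5}

in_dist = ["the", "cat"]      # well covered by pretraining
novel   = ["fn", "println"]   # e.g. source code, new at fine-tuning time

print(contrastive_score(in_dist, logp_ft, logp_pre))  # ≈ -0.1
print(contrastive_score(novel, logp_ft, logp_pre))    # 5.0
```

Under this toy setup the novel (code-like) tokens score far higher than the in-distribution ones, which is the separation the preliminary analysis measures.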
Researcher Affiliation | Academia | Masaru Isonuma (1,2,3), Ivan Titov (1,4); 1 University of Edinburgh, 2 University of Tokyo, 3 National Institute of Informatics, 4 University of Amsterdam
Pseudocode | No | The paper describes the method using mathematical equations (Equations 1 to 7) and textual explanations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at: https://github.com/misonuma/cge
Open Datasets | Yes | "We used OpenLLaMA-3B, an open reproduction of LLaMA (Touvron et al., 2023). OpenLLaMA uses exactly the same decoder-only architecture, preprocessing steps, and hyperparameters as the original LLaMA, while being pretrained on 1T tokens from the publicly available RedPajama dataset (Computer, 2023). ... We used Falcon-RW-1B, a decoder-only model pre-trained on the RefinedWeb dataset (Penedo et al., 2023). ... For non-English text, we used Wikipedia articles in 10 languages ... Regarding source code, we used the GitHub Code dataset ... As for toxic text, we used ToxiGen (Hartvigsen et al., 2022)"
Dataset Splits | No | The paper describes the composition of the fine-tuning datasets (e.g., '1% of the dataset' for non-English languages and '10 examples' for toxic text) and evaluates generated texts using detection and coverage rates. However, it does not specify standard training, validation, or test splits as percentages or explicit counts for the experimental setup.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud-computing instance specifications used for running the experiments.
Software Dependencies | No | The paper mentions the Adam optimizer and techniques such as LoRA, but does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation.
Experiment Setup | Yes | "We fine-tuned OpenLLaMA for three epochs by Adam (Kingma, 2014) with a learning rate of 5e-5, β1 = 0.9, β2 = 0.999 and a batch size of four on each dataset. ... Specifically, we used a plausibility constraint with α = 0.01 and beam sampling with a beam size of 4. ... The intermediate representation dimension is set to r = 8 with a scaling factor of α = 16, and the model is fine-tuned for three epochs with a learning rate of 5e-4."
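The reported hyperparameters can be collected into a small runnable sketch. The dictionary names and the `plausible_tokens` helper below are illustrative assumptions, not the authors' code; only the numeric values come from the paper. The helper shows one common reading of a plausibility constraint: keep only candidate tokens whose probability is at least α times the most likely token's probability.

```python
# Hyperparameters as reported in the paper (values from the text;
# the dictionary layout is illustrative only).
FINETUNE_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "betas": (0.9, 0.999),
    "batch_size": 4,
    "epochs": 3,
}

DECODING_CONFIG = {
    "plausibility_alpha": 0.01,  # α for the plausibility constraint
    "beam_size": 4,              # beam sampling width
}

LORA_CONFIG = {
    "r": 8,               # intermediate representation dimension
    "lora_alpha": 16,     # scaling factor
    "learning_rate": 5e-4,
    "epochs": 3,
}

def plausible_tokens(probs, alpha=0.01):
    """Keep tokens whose probability is at least alpha times the max
    probability (one common form of a plausibility constraint; whether
    this matches the paper's exact definition is an assumption here)."""
    threshold = alpha * max(probs.values())
    return {t: p for t, p in probs.items() if p >= threshold}

# Usage: with α = 0.01, very low-probability tokens are pruned before
# beam sampling continues.
candidates = {"a": 0.5, "b": 0.004, "c": 0.006}
print(plausible_tokens(candidates, DECODING_CONFIG["plausibility_alpha"]))
```

With the toy distribution above, the threshold is 0.01 × 0.5 = 0.005, so token "b" is pruned while "a" and "c" survive.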