Controllable Generation via Locally Constrained Resampling

Authors: Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our approach on several tasks, including LLM detoxification and solving Sudoku puzzles. We show that by disallowing a list of toxic expressions our approach is able to steer the model's outputs away from toxic generations, outperforming similar approaches to detoxification. We conclude by showing that our approach achieves perfect accuracy on Sudoku compared to < 50% for GPT-4o and Gemini 1.5.
Researcher Affiliation Academia Kareem Ahmed, Kai-Wei Chang & Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles. EMAIL
Pseudocode Yes Algorithm 1: Compute py(y | α); Algorithm 2: Locally Constrained Resampling
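The paper's two algorithms are not reproduced in this report. As a rough illustration of the general constrained-resampling pattern they follow (sample candidates from the model, zero out the probability mass of candidates that violate the constraint α, renormalize, and resample), here is a minimal plain-Python sketch; the function name, the word-list constraint, and the toy weights are illustrative assumptions, not the authors' implementation:

```python
import random

def locally_constrained_resample(candidates, weights, satisfies, k=1, seed=0):
    """Illustrative sketch: resample candidate sequences so that only
    those satisfying a constraint retain probability mass.

    candidates : list of sampled sequences (e.g. token lists)
    weights    : unnormalized model probabilities, one per candidate
    satisfies  : predicate standing in for the constraint alpha
    """
    # Zero out the mass of constraint-violating candidates.
    masked = [w if satisfies(c) else 0.0 for c, w in zip(candidates, weights)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("no candidate satisfies the constraint")
    # Renormalize over the surviving candidates, then resample.
    probs = [w / total for w in masked]
    rng = random.Random(seed)
    return rng.choices(candidates, weights=probs, k=k)

# Toy detoxification-style constraint: disallow a word list.
banned = {"toxic"}
ok = lambda seq: not (banned & set(seq))
cands = [["a", "toxic", "word"], ["a", "clean", "word"]]
picked = locally_constrained_resample(cands, [0.7, 0.3], ok, k=1)
# Only the constraint-satisfying candidate can ever be returned.
```

The paper's method additionally conditions on the model's own sample and uses a tractable pseudo-likelihood approximation; this sketch shows only the mask-renormalize-resample skeleton.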
Open Source Code No The implementation of our approach will be made publicly available. The code to process the list of words, the code to create the constraint, as well as the constraint itself, will be made publicly available upon paper acceptance.
Open Datasets Yes We start by evaluating on Warcraft shortest-path finding, where we are given an image of a Warcraft tilemap, and are tasked with autoregressively generating one of the potentially many minimum-cost paths between two end points conditioned on the map, where the cost is determined by the underlying cost of the tiles spanned by the path. We use the dataset provided by Wang et al. (2019), consisting of 10K Sudoku puzzles, split into 9K training examples, and 1K test samples, all puzzles having 10 missing entries. Similar to previous work (Gehman et al., 2020; Wang et al., 2022), we evaluate on the REALTOXICITYPROMPTS, a dataset of almost 100k prompts ranging from nontoxic, assigned a toxicity score of 0, to very toxic, assigned a toxicity score of 1.
Dataset Splits Yes We use the dataset provided by Wang et al. (2019), consisting of 10K Sudoku puzzles, split into 9K training examples and 1K test samples, all puzzles having 10 missing entries. Our final results are reported on a random subset of the RealToxicityPrompts dataset of size 10k, averaged over 5 different runs using 5 different seeds.
Hardware Specification Yes The experiments were run on a server with an AMD EPYC 7313P 16-Core Processor @ 3.7GHz, 3 NVIDIA RTX A6000, and 252 GB RAM. The experiments were run on a server with an AMD EPYC 7313P 16-Core Processor @ 3.7GHz, 2 NVIDIA RTX A6000, and 252 GB RAM.
Software Dependencies No Our LLM detoxification experiments utilized both GPUs using the Hugging Face Accelerate (Gugger et al., 2022) library. Our full algorithm is shown in Algorithm 2, and follows PyTorch syntax (Paszke et al., 2019).
Experiment Setup Yes We use a CNN-LSTM model where, presented with an image of a terrain map, we use a ResNet-18 (He et al., 2016) to obtain a 128-dimensional image embedding, which is then passed on to an LSTM with a single layer and a hidden dimension of 512 that at every time step predicts the next edge in the path conditioned on the image embedding and previous edges. We use a batch size of 10 during generation, and only sample every sentence 5 times. The model's sentence y was generated using nucleus sampling with p = 0.9 and a temperature of 1. We experimented with tempering the contextualized pseudo-likelihood distribution on a random set of 1000 prompts using τ = {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. Our final results are reported on a random subset of the RealToxicityPrompts dataset of size 10k, averaged over 5 different runs using 5 different seeds. For this task only, our implementation makes use of top-k to construct the pseudo-likelihood distribution (lines 7-12 in Algorithm 1) due to the lack of computational resources. Generations from all methods were limited to a maximum of 20 new tokens.
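The setup quotes nucleus sampling with p = 0.9. For readers unfamiliar with the term, a minimal plain-Python sketch of the top-p filtering step follows (the example distribution and function name are illustrative, not the authors' code; nucleus sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, then renormalizes before sampling):

```python
def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of
    highest-probability tokens whose cumulative mass reaches p,
    then renormalize over the kept tokens."""
    # Token indices sorted by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:  # nucleus reached; prune the rest
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy 4-token distribution with the paper's p = 0.9:
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
# tokens 0, 1, 2 survive (cumulative 0.95 >= 0.9); token 3 is pruned
```

In practice the same behavior comes from a generation call with `top_p=0.9`, `temperature=1.0`, and `max_new_tokens=20`, matching the quoted settings.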