Controlled LLM Decoding via Discrete Auto-regressive Biasing

Authors: Patrick Pynadath, Ruqi Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the advantages of our controlled decoding method on sentiment control, language detoxification, and keyword-guided generation. We evaluate DAB on three distinct controlled-generation tasks: sentiment-guided generation, language detoxification, and keyword-guided generation."
Researcher Affiliation | Academia | "Patrick Pynadath, Ruqi Zhang, Department of Computer Science, Purdue University, West Lafayette, Indiana, 47906, USA. EMAIL"
Pseudocode | Yes | "We include the pseudo-code for our algorithm in Algorithm 1; information on hyper-parameter settings in Appendix C.1 and further details on each experiment in Appendix D.2, D.3, D.4. Additionally, we include the code-base used to produce our results at the following repository: https://github.com/patrickpynadath1/dab." (Referring to Algorithm 1, Discrete Autoregressive Biasing, in Appendix B)
Open Source Code | Yes | "We make our code available at the following url: https://github.com/patrickpynadath1/dab." (Abstract) "Additionally, we include the code-base used to produce our results at the following repository: https://github.com/patrickpynadath1/dab." (Reproducibility section)
Open Datasets | Yes | "Additionally we confirm that our experiments use only public datasets." (Reproducibility) "We use 1,000 prompts sampled from the Real Toxicity Prompts dataset and generate continuations of length 20 tokens (Gehman et al., 2020; Kumar et al., 2022; Liu et al., 2023a). We use a RoBERTa fine-tuned on the Jigsaw toxic comment dataset, following Kumar et al. (2022); Liu et al. (2023a). The internal model is a RoBERTa with GPT2-Large embeddings fine-tuned on the Yelp polarity dataset. We train this model following the codebase of Liu et al. (2023a). We train the steering matrix using the SST2 dataset, as done in Han et al. (2024)."
Dataset Splits | No | The paper mentions using "1,000 prompts sampled from the Real Toxicity Prompts" and generating sequences of specific lengths (12, 20, 50). It also refers to datasets used for fine-tuning models (Yelp polarity, Jigsaw toxic comment, SST2). However, it does not explicitly provide the training/validation/test splits (e.g., percentages or exact counts for each split) for these datasets, which would be necessary to reproduce the data partitioning.
Hardware Specification | No | The paper mentions evaluating efficiency by timing operations on a "GPU" (Table 3), but it does not specify any particular GPU model (e.g., NVIDIA A100, RTX 3090) or other hardware details such as CPU, RAM, or server configuration used to run the experiments.
Software Dependencies | No | The paper mentions several software components, including the "fine-tuned RoBERTa model from Morris et al. (2020)", "GPT2-XL", the "Hugging Face evaluate package", and the "Auto Grad profiler within Pytorch". However, it does not provide specific version numbers for these components (e.g., PyTorch 1.x, Hugging Face Transformers 4.x, Python 3.x), which are essential for reproducibility.
Experiment Setup | Yes | "Here we include additional details on the experiment setup. We provide the hyper-parameter settings for our algorithm for each experiment in Table 4." (Appendix D) Table 4 reports, for each task (Sentiment, Detoxify, Topic), specific values of the hyper-parameters: Proposal Temp, Top-k, Bias Weight Value, and Number of Sample Steps.