Collab: Controlled Decoding using Mixture of Agents for LLM Alignment
Authors: Souradip Chakraborty, Sujay Bhatt, Udari Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, Sumitra Ganesh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present a comprehensive empirical analysis of our proposed framework, tested across various open-source datasets and state-of-the-art models (Lambert et al., 2024). Our findings demonstrate Collab's effectiveness in aligning language model outputs with specific target rewards. For implementation, we set the number of tokens sampled (top-p) p = 10 and the decoding alignment parameter α = 1. Reproducibility is ensured through the use of publicly available resources. |
| Researcher Affiliation | Collaboration | 1JPMorgan AI Research 2University of Maryland, College Park 3Princeton University |
| Pseudocode | Yes | Algorithm 1 Mixture of Agents based Controlled Decoding for LLM Alignment |
| Open Source Code | No | Reproducibility is ensured through the use of publicly available resources. This statement refers to resources used for experiments, not the authors' own code. |
| Open Datasets | Yes | 1. Evaluation-1 to Evaluation-4 (Task-I): For this task, we utilize the Berkeley Nectar dataset (Zhu et al., 2023) to test the agent's capacity for multi-turn dialogues and question answering. 2. Evaluation-5 to Evaluation-7 (Task-II): We employ the HH-RLHF dataset (Bai et al., 2022) to assess the agent's helpfulness and ethical alignment in response generation. |
| Dataset Splits | No | For evaluation, we compare the performance of the response generated by the language model corresponding to each prompt in the test dataset. Following (Khanov et al., 2024; Chakraborty et al., 2024b), we limit the maximum length of the prompt and generated continuation to 128 and 2048 tokens, respectively. The paper mentions using a "test dataset" but does not specify the explicit splits (e.g., percentages, counts) for the datasets used. |
| Hardware Specification | Yes | We run all experiments with Python 3.7.4 and PyTorch 1.9.0. For all experimentation, we use two Nvidia RTX A6000 GPUs. |
| Software Dependencies | Yes | We run all experiments with Python 3.7.4 and PyTorch 1.9.0. |
| Experiment Setup | Yes | For implementation, we set the number of tokens sampled (top-p) p = 10 and the decoding alignment parameter α = 1. |
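The decoding setup quoted above (sampling the top p = 10 candidate tokens, which in practice is a top-k-style cutoff, with alignment parameter α = 1) can be illustrated with a minimal toy sketch of mixture-of-agents controlled decoding. This is a hypothetical illustration, not the paper's Algorithm 1: the agents, the reward function, and the `collab_decode_step` helper are all invented for exposition, and the paper's actual selection criterion (a Q-function-based score) is simplified here to logprob + α · reward.

```python
import math

def collab_decode_step(agents, reward_fn, context, p=10, alpha=1.0):
    """Pick the next (agent, token) pair via a simplified Collab-style rule.

    Each agent maps a context string to next-token log-probabilities.
    We take each agent's top-p candidate tokens and choose the pair
    maximizing logprob + alpha * reward of the extended context.
    """
    best, best_score = None, -math.inf
    for name, agent in agents.items():
        logprobs = agent(context)  # {token: logprob}
        top = sorted(logprobs, key=logprobs.get, reverse=True)[:p]
        for tok in top:
            score = logprobs[tok] + alpha * reward_fn(context + tok)
            if score > best_score:
                best, best_score = (name, tok), score
    return best

# Toy agents with fixed next-token distributions (illustration only).
agents = {
    "helpful": lambda ctx: {"yes": math.log(0.6), "no": math.log(0.4)},
    "harmless": lambda ctx: {"no": math.log(0.7), "yes": math.log(0.3)},
}
# Toy reward: prefer continuations ending in "yes".
reward_fn = lambda text: 1.0 if text.endswith("yes") else 0.0

print(collab_decode_step(agents, reward_fn, "Q: proceed? A: ", alpha=1.0))
```

With α = 0 the harmless agent's "no" (log 0.7 ≈ -0.36) would win; with α = 1 the reward bonus flips the choice to the helpful agent's "yes", showing how α trades off base-model likelihood against the target reward.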