Tokenized Bandit for LLM Decoding and Alignment
Authors: Suho Shin, Chenghao Yang, Haifeng Xu, Mohammadtaghi Hajiaghayi
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We finally provide our experimental results. 6.1. Validating DDMC The first is on validation of DDMC assumption. Further experimental results are presented in Appendix D. 6.2. Performance of EOFUL We numerically validate the performance of EOFUL using synthetic data under the LLM alignment scenario presented in Section 5.1. |
| Researcher Affiliation | Academia | 1University of Maryland 2University of Chicago. Correspondence to: Suho Shin <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (EOFUL). Input: token set V, linear contextual bandit algorithm ALG. Initialize y ← ∅, θ̂₁ ← 0, C₁ = {θ̂₁}. For t = 1, 2, …, T: user arrives and submits query x_t; for k = 1, 2, …, L−1: compute τ = argmax_{τ∈V, θ∈C_t} ⟨θ, e(x_t, y:τ)⟩ and set y ← y:τ; if τ = EOS then break. Submit y and observe reward r_t; compute θ̂_{t+1} by (3.1) and C_{t+1} by (3.2). |
| Open Source Code | No | The paper does not provide any specific link to source code developed by the authors for the methodology described in the paper, nor does it explicitly state that the code will be made available. |
| Open Datasets | Yes | For the token sequences, to validate our DDMC in real-world datasets, we use Truthful QA dataset (Lin et al., 2022) and HH-RLHF dataset (Bai et al., 2022), |
| Dataset Splits | No | The paper mentions grouping data by common suffix length for DDMC validation and using synthetic data for EOFUL performance validation, but does not provide specific train/test/validation splits in percentages or sample counts for any experiment. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used (e.g., GPU models, CPU types, memory) for running the experiments. |
| Software Dependencies | No | The paper mentions obtaining embeddings from 'Llama3-8B-Instruct model', but does not list specific software libraries or tools with version numbers used for implementing the algorithms or running experiments. |
| Experiment Setup | Yes | In particular, we set the length of each sentence to be L = 30, and truncate to the top-15 tokens for every algorithm for efficiency. Further, we set γ = 0.8, θ = (0.5, 0.5, …, 0.5). Details on the query generation can be found in Appendix D. |
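The Algorithm 1 row above describes EOFUL as optimistic per-token selection inside a linear contextual bandit, with a confidence-set update after each full sentence is rewarded. A minimal sketch of that loop is below, assuming a LinUCB-style ellipsoid (ridge estimate plus a `beta`-scaled exploration bonus) in place of the paper's exact confidence set (3.1)–(3.2); the `embed` and `reward_fn` callables, the synthetic Gaussian contexts, and all parameter names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def eoful(embed, reward_fn, vocab_size, dim, horizon=50, max_len=30,
          lam=1.0, beta=1.0, eos=0, seed=None):
    """Hypothetical sketch of EOFUL: pick each token optimistically under a
    linear reward model, then update the estimate once per finished sentence.

    embed(x, y)     -> feature vector in R^dim for query x and token list y
    reward_fn(x, y) -> scalar reward observed for the completed sentence
    """
    rng = np.random.default_rng(seed)
    V = lam * np.eye(dim)        # regularized Gram matrix of sentence features
    b = np.zeros(dim)            # reward-weighted feature sum
    theta_hat = np.zeros(dim)    # ridge estimate of the reward parameter
    rewards = []
    for _ in range(horizon):
        x = rng.standard_normal(dim)        # synthetic user query context
        V_inv = np.linalg.inv(V)
        y = []
        for _ in range(max_len - 1):        # k = 1, ..., L-1
            best_tok, best_ucb = None, -np.inf
            for tok in range(vocab_size):
                phi = embed(x, y + [tok])
                # optimistic index: <theta_hat, phi> + beta * ||phi||_{V^-1}
                ucb = theta_hat @ phi + beta * np.sqrt(phi @ V_inv @ phi)
                if ucb > best_ucb:
                    best_tok, best_ucb = tok, ucb
            y.append(best_tok)
            if best_tok == eos:             # end-of-sentence token chosen
                break
        phi = embed(x, y)
        r = reward_fn(x, y)                 # submit y, observe reward r_t
        rewards.append(r)
        V += np.outer(phi, phi)             # rank-one Gram update
        b += r * phi
        theta_hat = np.linalg.solve(V, b)   # ridge regression re-estimate
    return theta_hat, rewards
```

A usage example with a random token-embedding table and a noiseless linear reward (θ* = (0.5, …, 0.5), matching the setup row above) would build `embed` as a length-normalized sum of context and token embeddings and run `eoful` for a few rounds; the returned `theta_hat` is the algorithm's running estimate of θ*.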