Tokenized Bandit for LLM Decoding and Alignment
Authors: Suho Shin, Chenghao Yang, Haifeng Xu, Mohammadtaghi Hajiaghayi
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We finally provide our experimental results. 6.1. Validating DDMC The first is on validation of DDMC assumption. Further experimental results are presented in Appendix D. 6.2. Performance of EOFUL We numerically validate the performance of EOFUL using synthetic data under the LLM alignment scenario presented in Section 5.1. |
| Researcher Affiliation | Academia | 1University of Maryland 2University of Chicago. Correspondence to: Suho Shin <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (EOFUL). Input: token set V, linear contextual bandit algorithm ALG. Initialize y ← ∅, θ̂₁ ← 0, C₁ = {θ̂₁}. For t = 1, 2, …, T: user arrives and submits query x_t; for k = 1, 2, …, L−1: compute τ = argmax_{τ∈V, θ∈C_t} ⟨θ, e(x_t, y:τ)⟩ and set y ← y:τ; if τ = EOS then break. Submit y and observe reward r_t; compute θ̂_{t+1} by (3.1) and C_{t+1} by (3.2). |
| Open Source Code | No | The paper does not provide any specific link to source code developed by the authors for the methodology described in the paper, nor does it explicitly state that the code will be made available. |
| Open Datasets | Yes | For the token sequences, to validate our DDMC in real-world datasets, we use Truthful QA dataset (Lin et al., 2022) and HH-RLHF dataset (Bai et al., 2022), |
| Dataset Splits | No | The paper mentions grouping data by common suffix length for DDMC validation and using synthetic data for EOFUL performance validation, but does not provide specific train/test/validation splits in percentages or sample counts for any experiment. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used (e.g., GPU models, CPU types, memory) for running the experiments. |
| Software Dependencies | No | The paper mentions obtaining embeddings from 'Llama3-8B-Instruct model', but does not list specific software libraries or tools with version numbers used for implementing the algorithms or running experiments. |
| Experiment Setup | Yes | In particular, we set the length of each sentence to be L = 30, and truncate to the top-15 tokens for every algorithm for efficiency. Further, we set γ = 0.8, θ = (0.5, 0.5, …, 0.5). Details on the query generation can be found in Appendix D. |
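The Algorithm 1 row above describes EOFUL as optimistic per-token selection inside a linear contextual bandit, with a confidence-set update after each full sentence is rewarded. A minimal sketch of that loop is below, assuming a LinUCB-style ellipsoid (ridge estimate plus a `beta`-scaled exploration bonus) in place of the paper's exact confidence set (3.1)–(3.2); the `embed` and `reward_fn` callables, the synthetic Gaussian contexts, and all parameter names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def eoful(embed, reward_fn, vocab_size, dim, horizon=50, max_len=30,
          lam=1.0, beta=1.0, eos=0, seed=None):
    """Hypothetical sketch of EOFUL: pick each token optimistically under a
    linear reward model, then update the estimate once per finished sentence.

    embed(x, y)     -> feature vector in R^dim for query x and token list y
    reward_fn(x, y) -> scalar reward observed for the completed sentence
    """
    rng = np.random.default_rng(seed)
    V = lam * np.eye(dim)        # regularized Gram matrix of sentence features
    b = np.zeros(dim)            # reward-weighted feature sum
    theta_hat = np.zeros(dim)    # ridge estimate of the reward parameter
    rewards = []
    for _ in range(horizon):
        x = rng.standard_normal(dim)        # synthetic user query context
        V_inv = np.linalg.inv(V)
        y = []
        for _ in range(max_len - 1):        # k = 1, ..., L-1
            best_tok, best_ucb = None, -np.inf
            for tok in range(vocab_size):
                phi = embed(x, y + [tok])
                # optimistic index: <theta_hat, phi> + beta * ||phi||_{V^-1}
                ucb = theta_hat @ phi + beta * np.sqrt(phi @ V_inv @ phi)
                if ucb > best_ucb:
                    best_tok, best_ucb = tok, ucb
            y.append(best_tok)
            if best_tok == eos:             # end-of-sentence token chosen
                break
        phi = embed(x, y)
        r = reward_fn(x, y)                 # submit y, observe reward r_t
        rewards.append(r)
        V += np.outer(phi, phi)             # rank-one Gram update
        b += r * phi
        theta_hat = np.linalg.solve(V, b)   # ridge regression re-estimate
    return theta_hat, rewards
```

A usage example with a random token-embedding table and a noiseless linear reward (θ* = (0.5, …, 0.5), matching the setup row above) would build `embed` as a length-normalized sum of context and token embeddings and run `eoful` for a few rounds; the returned `theta_hat` is the algorithm's running estimate of θ*.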