Reinforcement Learning with Segment Feedback

Authors: Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical and experimental results show that: under binary feedback, increasing the number of segments m decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing m does not reduce the regret significantly. ... We also present experiments to validate our theoretical results. ... 5. Experiments: Below we present experiments for RL with segment feedback to validate our theoretical results.
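The contrast above rests on how per-segment feedback is generated from an episode. A hypothetical sketch of the two feedback models follows; the function name, the sigmoid link for the binary case, and the equal-length segmentation are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def segment_feedback(step_rewards, m, mode="sum", rng=None):
    """Aggregate the per-step rewards of one episode into m segment observations.

    mode="sum":    the learner sees each segment's total reward.
    mode="binary": one Bernoulli bit per segment, with success probability a
                   sigmoid of the segment's total reward (our assumption).
    """
    rng = rng or np.random.default_rng()
    H = len(step_rewards)
    assert H % m == 0, "assume m divides the horizon H"
    # Split the H-step episode into m contiguous, equal-length segments.
    segments = np.asarray(step_rewards, dtype=float).reshape(m, H // m)
    seg_sums = segments.sum(axis=1)
    if mode == "sum":
        return seg_sums
    p = 1.0 / (1.0 + np.exp(-seg_sums))
    return rng.binomial(1, p)

# With m = H the learner gets per-step feedback; with m = 1 a single
# observation covers the whole episode (the usual trajectory feedback).
rewards = np.full(100, 0.5)              # H = 100, constant reward 0.5
print(segment_feedback(rewards, m=4))    # four segment sums of 12.5 each
```

Under this sketch, the paper's finding says the binary model benefits sharply from finer segmentation (larger m), while the sum model barely does.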
Researcher Affiliation | Collaboration | 1University of Illinois at Urbana-Champaign, 2Stanford University, 3NVIDIA Research, 4Technion. Correspondence to: Yihan Du <EMAIL>, R. Srikant <EMAIL>.
Pseudocode | Yes | Algorithm 1: SegBiTS ... Algorithm 2: E-LinUCB ... Algorithm 3: SegBiTS-Tran ... Algorithm 4: LinUCB-Tran
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide any links to a code repository in the main text, acknowledgements, or supplementary materials.
Open Datasets | No | The paper describes the construction of custom MDP instances for its experiments, rather than using or providing access to pre-existing public datasets. For example: "For the binary segment feedback setting, we consider an MDP as in Figure 2(a): There are 9 states and 5 actions. For any a ∈ A, we have r(s0, a) = 0, r(si, a) = rmax for any i ∈ {1, 3, 5, 7} (called good states), and r(si, a) = -rmax for any i ∈ {2, 4, 6, 8} (called bad states)."
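The quoted reward structure is simple enough to reconstruct directly. A minimal sketch of the reward table follows, assuming states s0..s8 and action-independent rewards as quoted; the negative sign on the even-indexed "bad" states is our reading of the good/bad contrast, and transitions are omitted.

```python
import numpy as np

def build_reward_table(n_states=9, n_actions=5, rmax=0.5):
    """Reward table r(s, a) for the quoted custom MDP:
    r(s0, a) = 0; +rmax at odd-indexed (good) states; -rmax at
    even-indexed (bad) states, for every action a."""
    r = np.zeros((n_states, n_actions))
    for s in range(1, n_states):
        r[s, :] = rmax if s % 2 == 1 else -rmax
    return r

R = build_reward_table()
print(R[:, 0])  # column for action 0: 0, then alternating +0.5 / -0.5
```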
Dataset Splits | No | The paper describes experiments conducted on custom-designed Markov Decision Processes (MDPs). These environments are defined by states, actions, rewards, and transitions, and the experiments involve simulating agent interactions within them. The concept of splitting a fixed dataset into training, validation, and test sets is not applicable here, as data is generated through interaction.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or cloud computing specifications.
Software Dependencies | No | The paper does not mention any specific software dependencies, libraries, or their version numbers used for implementing the algorithms or running the experiments.
Experiment Setup | Yes | In both settings, we set rmax = 0.5, δ = 0.005, H = 100 and m ∈ {1, 2, 4, 5, 10, 20, 25, 50, 100}. For each algorithm, we perform 20 independent runs, and plot the average cumulative regret up to episode K across runs with a 95% confidence interval.
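The reported aggregation (20 independent runs, average cumulative regret per episode, 95% confidence interval) can be sketched as below. The normal-approximation interval (mean ± 1.96 × standard error) is our assumption about how the band is computed; the per-run regret curves themselves would come from running the algorithms.

```python
import numpy as np

def summarize_runs(regret_per_run):
    """regret_per_run: array of shape (n_runs, K) holding each run's
    cumulative regret per episode. Returns (mean, lower, upper) curves
    for a 95% normal-approximation confidence band."""
    regret = np.asarray(regret_per_run, dtype=float)
    n_runs = regret.shape[0]
    mean = regret.mean(axis=0)
    sem = regret.std(axis=0, ddof=1) / np.sqrt(n_runs)  # standard error
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width

# Illustrative stand-in data: 20 runs over K = 1000 episodes, where each
# run's cumulative regret is a running sum of random per-episode regrets.
rng = np.random.default_rng(0)
runs = np.cumsum(rng.random((20, 1000)), axis=1)
mean, lo, hi = summarize_runs(runs)
```

With 20 runs the band's half-width shrinks as the across-run spread does, which is what the paper's regret plots visualize.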