Reinforcement Learning with Segment Feedback
Authors: Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical and experimental results show that: under binary feedback, increasing the number of segments m decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing m does not reduce the regret significantly. ... We also present experiments to validate our theoretical results. ... 5. Experiments: Below we present experiments for RL with segment feedback to validate our theoretical results. |
| Researcher Affiliation | Collaboration | 1University of Illinois at Urbana-Champaign 2Stanford University 3NVIDIA Research 4Technion. Correspondence to: Yihan Du <EMAIL>, R. Srikant <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SegBiTS ... Algorithm 2 E-LinUCB ... Algorithm 3 SegBiTS-Tran ... Algorithm 4 LinUCB-Tran |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide any links to a code repository in the main text, acknowledgements, or supplementary materials. |
| Open Datasets | No | The paper describes the construction of custom MDP instances for its experiments, rather than using or providing access to pre-existing public datasets. For example: "For the binary segment feedback setting, we consider an MDP as in Figure 2(a): There are 9 states and 5 actions. For any a ∈ A, we have r(s0, a) = 0, r(si, a) = rmax for any i ∈ {1, 3, 5, 7} (called good states), and r(si, a) = −rmax for any i ∈ {2, 4, 6, 8} (called bad states)." |
| Dataset Splits | No | The paper describes experiments conducted on custom-designed Markov Decision Processes (MDPs). These environments are defined by states, actions, rewards, and transitions, and the experiments involve simulating agent interactions within them. The concept of splitting a fixed dataset into training, validation, and test sets is not applicable here as data is generated through interaction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper does not mention any specific software dependencies, libraries, or their version numbers used for implementing the algorithms or running the experiments. |
| Experiment Setup | Yes | In both settings, we set rmax = 0.5, δ = 0.005, H = 100 and m ∈ {1, 2, 4, 5, 10, 20, 25, 50, 100}. For each algorithm, we perform 20 independent runs, and plot the average cumulative regret up to episode K across runs with a 95% confidence interval. |
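To make the quoted environment concrete, the following is a minimal sketch (not the authors' code) of the custom MDP described above and of how an episode of length H is cut into m segments with one binary signal per segment. The reward table and the constants rmax = 0.5, H = 100, and a segment count m from the quoted set are taken from the report; the uniform-random trajectory and the sigmoid link generating binary segment feedback are assumptions for illustration only.

```python
import numpy as np

rmax, H, m = 0.5, 100, 10          # constants quoted in the experiment setup
rng = np.random.default_rng(0)

# Reward table: 9 states (s0..s8) x 5 actions; r(s0, a) = 0,
# r(si, a) = +rmax for good states i in {1, 3, 5, 7},
# r(si, a) = -rmax for bad states i in {2, 4, 6, 8}.
r = np.zeros((9, 5))
r[[1, 3, 5, 7], :] = rmax          # good states
r[[2, 4, 6, 8], :] = -rmax         # bad states

# A toy trajectory of H (state, action) pairs from a uniform-random policy
# (the transition structure of Figure 2(a) is not reproduced here).
states = rng.integers(0, 9, size=H)
actions = rng.integers(0, 5, size=H)
rewards = r[states, actions]

# Split the episode into m equal segments and emit one binary observation per
# segment; a Bernoulli draw through a sigmoid of the segment reward sum is a
# common binary-feedback model and is assumed here, not taken from the paper.
seg_sums = rewards.reshape(m, H // m).sum(axis=1)
feedback = (rng.random(m) < 1.0 / (1.0 + np.exp(-seg_sums))).astype(int)
print(feedback)
```

Note that the quoted set of m values all divide H = 100 evenly, which is what makes the equal-length `reshape(m, H // m)` split well defined.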