Provably Sample-Efficient Model-Free Algorithm for MDPs with Peak Constraints
Authors: Qinbo Bai, Vaneet Aggarwal, Ather Gattami
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the proposed algorithm on a communication channel, where the transmitter is powered by renewable energy. Such a model has been studied widely in communication systems... In Fig. 2, we plot the mean and variance for the sum of the transmission rate V_1^{π_k} and the constraint violations defined in Eq. (11), and compare the learning speed for different choices of ξ = 0.1, 0.01, 0.001. |
| Researcher Affiliation | Collaboration | Qinbo Bai, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA; Vaneet Aggarwal, School of IE and ECE, Purdue University, West Lafayette, IN 47907, USA; Ather Gattami, AI Sweden, Stockholm, Sweden |
| Pseudocode | Yes | Algorithm 1 Constrained Q-Learning Algorithm. 1: Initialize Q_h(s,a) ← ηH, W_h(s,a) ← ηH, N_h(s,a) ← 0, µ_h(s,a) ← 0, σ_h(s,a) ← 0 and β_0(s,a,h) ← 0 for all (s,a,h) ∈ S × A × [H]; initialize parameters ξ and η. 2: for episode k = 1, ..., K do 3: Observe s_1. 4: for step h = 1, ..., H do 5: Take action a_h ← arg max_{a'} Q_h(s_h, a') and observe s_{h+1}. 6: t = N_h(s_h, a_h) ← N_h(s_h, a_h) + 1. 7: µ_h(s_h, a_h) ← µ_h(s_h, a_h) + W_{h+1}(s_{h+1}). 8: σ_h(s_h, a_h) ← σ_h(s_h, a_h) + (W_{h+1}(s_{h+1}))². 9: β_t(s_h, a_h, h) ← min{c1(√(1/t)(σ_h(s_h, a_h) − (µ_h(s_h, a_h))²/t + ηH)ℓ + η … 10: b_t ← (β_t(s_h, a_h, h) − (1 − α_t)β_{t−1}(s_h, a_h, h)) / (2α_t). 11: Q_h(s_h, a_h) ← (1 − α_t)Q_h(s_h, a_h) + α_t[R_h(s_h, a_h) + W_{h+1}(s_{h+1}) + b_t] (R_h as defined in Eq. (12)). 12: W_h(s_h) ← min{ηH, max_{a' ∈ A} Q_h(s_h, a')}. 13: end for 14: end for |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | In this section, we evaluate the proposed algorithm on a communication channel... We set the distribution of Eh as truncated Gaussian with mean µ and standard deviation σ, where the truncation levels are 0 and E, and we let it be independent across episodes. The problem is discretized to integers in order to apply the proposed algorithm. According to the selection of the parameters in Wang et al. (2014), we set the horizon H = 20 timeslots, battery capacity B = 20, power constraint P = 8, maximal harvest energy E = 20, mean and standard deviation µ = 10, σ = 5, respectively. In the simulation, 1000 trajectories are generated by the above MDP. ... To test our algorithm and compare with the offline DT algorithm, we manually design two setups with 5 jobs and 9 jobs, respectively. With 5 jobs, we consider two scenarios, where the processing time of the jobs are deterministic or stochastic. For 9 jobs, we consider deterministic processing times. We label the three examples here as Example 1: 5 jobs, deterministic processing time, Example 2: 9 jobs, deterministic processing time, and Example 3: 5 jobs, stochastic processing time. The details for these examples are listed in Tables 1, 2 and 3, respectively. |
| Dataset Splits | No | For the Energy Harvesting Communication System, the paper mentions that '1000 trajectories are generated by the above MDP.' For the Single machine scheduling problem, the authors state, 'we manually design two setups with 5 jobs and 9 jobs'. No explicit dataset splits (e.g., train/test/validation percentages or counts) are mentioned for either of the evaluation scenarios. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or computational resources used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or library versions used for implementation. |
| Experiment Setup | Yes | According to the selection of the parameters in Wang et al. (2014), we set the horizon H = 20 timeslots, battery capacity B = 20, power constraint P = 8, maximal harvest energy E = 20, mean and standard deviation µ = 10, σ = 5, respectively. In the simulation, 1000 trajectories are generated by the above MDP. In Fig. 2, we plot the mean and variance for the sum of the transmission rate V_1^{π_k} and the constraint violations defined in Eq. (11), and compare the learning speed for different choices of ξ = 0.1, 0.01, 0.001. The Slater parameter γ is chosen to be ξ/2. Note that in each episode k, we evaluate the policy π̄_k, which is the policy averaged over π_1, π_2, ..., π_k as defined in Theorem 10. It can be seen that the sum of transmission rates converges to the optimal and the constraint violation converges to 0 as the number of episodes increases. Moreover, we find that a larger γ gives a higher convergence speed, although the difference in convergence speed across the values of γ is quite small. To compare against the proposed algorithm, we consider three baseline algorithms: the greedy policy, the balanced policy, and the optimal non-causal algorithm. ... In Fig. 3, we set the mean of the harvested energy to 8, 9, 10, 11, 12 and compare the performance of the different algorithms. In order to illustrate that the policy converges to the optimal, we choose K = 50000 for the proposed algorithm. P is set to 15 in this comparison. ... To test our algorithm and compare with the offline DT algorithm, we manually design two setups with 5 jobs and 9 jobs, respectively. ... The details for these examples are listed in Tables 1, 2 and 3, respectively. |
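To make the quoted Algorithm 1 concrete, here is a minimal tabular sketch of its update loop. This is not the paper's implementation: the random toy MDP, the combined reward `R`, the step size `alpha = (H+1)/(H+t)`, and the simplified bonus `c1*sqrt(H/t)` (standing in for the variance-based β_t term, which is garbled in the extraction) are all assumptions for illustration.

```python
import numpy as np

# Hedged sketch of the Constrained Q-Learning loop (Algorithm 1) on a toy MDP.
# Environment, reward R_h, step size, and bonus constant c1 are assumed, not
# taken from the paper.

S, A, H, K = 5, 3, 10, 200     # states, actions, horizon, episodes (toy sizes)
eta, c1 = 1.0, 0.5             # scaling parameter eta and bonus constant (assumed)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # random transition kernel (toy)
R = rng.uniform(0, 1, size=(H, S, A))           # stand-in for combined reward R_h

Q = np.full((H + 1, S, A), eta * H)  # optimistic initialization Q_h <- eta*H
W = np.full((H + 1, S), eta * H)     # value estimates W_h <- eta*H
N = np.zeros((H, S, A), dtype=int)   # visit counts N_h(s, a)

for k in range(K):
    s = 0                                    # observe s_1
    for h in range(H):
        a = int(np.argmax(Q[h, s]))          # greedy action w.r.t. Q_h
        s_next = rng.choice(S, p=P[h, s, a])
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)            # standard episodic Q-learning rate (assumed)
        bonus = c1 * np.sqrt(H / t)          # simplified UCB-style exploration bonus
        target = R[h, s, a] + W[h + 1, s_next] + bonus
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
        W[h, s] = min(eta * H, Q[h, s].max())   # clip value at eta*H as in step 12
        s = s_next
```

The clipping `min(ηH, ·)` in step 12 is what keeps the optimistic value estimates bounded; the bonus term is the part the paper refines with the empirical variance statistics µ_h and σ_h.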
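The quoted energy-harvesting setup (H = 20, B = 20, P = 8, E = 20, µ = 10, σ = 5, truncated-Gaussian harvest, integer discretization) can be sketched as follows. The rejection-sampling routine and the zero-transmission battery trace are illustrative assumptions; the paper does not specify how it samples or couples the battery to the policy.

```python
import numpy as np

# Sketch of the energy-harvesting simulation parameters quoted above
# (Wang et al., 2014 settings). The harvest E_h is Gaussian truncated to
# [0, E]; rejection sampling is one simple way to draw it exactly.

H, B, P, E = 20, 20, 8, 20     # horizon, battery capacity, power constraint, max harvest
mu, sigma = 10.0, 5.0          # mean / std of the pre-truncation Gaussian

def sample_harvest(rng, size):
    """Draw harvested energy from a Gaussian truncated to [0, E], then
    round to integers (the paper discretizes the problem)."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draws = rng.normal(mu, sigma, size=size)
        ok = draws[(draws >= 0) & (draws <= E)]
        take = min(size - filled, ok.size)
        out[filled:filled + take] = ok[:take]
        filled += take
    return np.rint(out).astype(int)

rng = np.random.default_rng(0)
harvest = sample_harvest(rng, H)   # one episode's worth of harvested energy

# Battery trace with no transmission (illustrative only): the battery
# accumulates harvested energy but can never exceed its capacity B.
battery, trace = 0, []
for e in harvest:
    battery = min(battery + int(e), B)
    trace.append(battery)
```

Note that rounding a clipped sample is not the same as a truncated Gaussian; the rejection loop above keeps the conditional distribution correct before discretization.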