Provably Sample-Efficient Model-Free Algorithm for MDPs with Peak Constraints
Authors: Qinbo Bai, Vaneet Aggarwal, Ather Gattami
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the proposed algorithm on a communication channel, where the transmitter is powered by renewable energy. Such a model has been studied widely in communication systems... In Fig. 2, we plot the mean and variance for the sum of the transmission rate V_1^{π_k} and the constraint violations defined in Eq. (11), and compare the learning speed for different choices of ξ = 0.1, 0.01, 0.001. |
| Researcher Affiliation | Collaboration | Qinbo Bai, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, USA; Vaneet Aggarwal, School of IE and ECE, Purdue University, West Lafayette, IN 47907, USA; Ather Gattami, AI Sweden, Stockholm, Sweden |
| Pseudocode | Yes | Algorithm 1 Constrained Q-Learning Algorithm. 1: Initialize Q_h(s,a) ← ηH, W_h(s,a) ← ηH, N_h(s,a) ← 0, µ_h(s,a) ← 0, σ_h(s,a) ← 0 and β_0(s,a,h) ← 0 for all (s,a,h) ∈ S × A × [H]; initialize parameters ξ and η. 2: for episode k = 1, ..., K do 3: Observe s_1. 4: for step h = 1, ..., H do 5: Take action a_h ← arg max_{a'} Q_h(s_h, a') and observe s_{h+1}. 6: t = N_h(s_h, a_h) ← N_h(s_h, a_h) + 1. 7: µ_h(s_h, a_h) ← µ_h(s_h, a_h) + W_{h+1}(s_{h+1}). 8: σ_h(s_h, a_h) ← σ_h(s_h, a_h) + (W_{h+1}(s_{h+1}))². 9: β_t(s_h, a_h, h) ← min{c1(√(1/t)(σ_h(s_h, a_h) − (µ_h(s_h, a_h))²/t + ηH)ℓ + η … 10: b_t ← (β_t(s_h, a_h, h) − (1 − α_t)β_{t−1}(s_h, a_h, h)) / (2α_t). 11: Q_h(s_h, a_h) ← (1 − α_t)Q_h(s_h, a_h) + α_t[R_h(s_h, a_h) + W_{h+1}(s_{h+1}) + b_t] (R_h as defined in Eq. (12)). 12: W_h(s_h) ← min{ηH, max_{a' ∈ A} Q_h(s_h, a')}. 13: end for 14: end for |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | In this section, we evaluate the proposed algorithm on a communication channel... We set the distribution of Eh as truncated Gaussian with mean µ and standard deviation σ, where the truncation levels are 0 and E, and we let it be independent across episodes. The problem is discretized to integers in order to apply the proposed algorithm. According to the selection of the parameters in Wang et al. (2014), we set the horizon H = 20 timeslots, battery capacity B = 20, power constraint P = 8, maximal harvest energy E = 20, mean and standard deviation µ = 10, σ = 5, respectively. In the simulation, 1000 trajectories are generated by the above MDP. ... To test our algorithm and compare with the offline DT algorithm, we manually design two setups with 5 jobs and 9 jobs, respectively. With 5 jobs, we consider two scenarios, where the processing time of the jobs are deterministic or stochastic. For 9 jobs, we consider deterministic processing times. We label the three examples here as Example 1: 5 jobs, deterministic processing time, Example 2: 9 jobs, deterministic processing time, and Example 3: 5 jobs, stochastic processing time. The details for these examples are listed in Tables 1, 2 and 3, respectively. |
| Dataset Splits | No | For the Energy Harvesting Communication System, the paper mentions that '1000 trajectories are generated by the above MDP.' For the Single machine scheduling problem, the authors state, 'we manually design two setups with 5 jobs and 9 jobs'. No explicit dataset splits (e.g., train/test/validation percentages or counts) are mentioned for either of the evaluation scenarios. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or computational resources used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or library versions used for implementation. |
| Experiment Setup | Yes | According to the selection of the parameters in Wang et al. (2014), we set the horizon H = 20 timeslots, battery capacity B = 20, power constraint P = 8, maximal harvest energy E = 20, mean and standard deviation µ = 10, σ = 5, respectively. In the simulation, 1000 trajectories are generated by the above MDP. In Fig. 2, we plot the mean and variance for the sum of the transmission rate V_1^{π_k} and the constraint violations defined in Eq. (11), and compare the learning speed for different choices of ξ = 0.1, 0.01, 0.001. The Slater parameter γ is chosen to be ξ/2. Note that in each episode k, we evaluate the policy π̄_k, which is the policy averaged over π_1, π_2, ..., π_k as defined in Theorem 10. It can be seen that the sum of transmission rates converges to the optimal and the constraint violation converges to 0 as the number of episodes increases. Moreover, we find that a larger γ gives a higher convergence speed, although the difference in convergence speed across the values of γ is quite small. To compare against the proposed algorithm, we consider three baseline algorithms: the greedy policy, the balanced policy, and the optimal non-causal algorithm. ... In Fig. 3, we set the mean of the harvested energy to 8, 9, 10, 11, 12 and compare the performance of the different algorithms. In order to illustrate that the policy converges to the optimal, we choose K = 50000 for the proposed algorithm. P is set to 15 in this comparison. ... To test our algorithm and compare with the offline DT algorithm, we manually design two setups with 5 jobs and 9 jobs, respectively. ... The details for these examples are listed in Tables 1, 2 and 3, respectively. |
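To make the quoted Algorithm 1 concrete, here is a minimal tabular sketch of its update loop. This is not the paper's implementation: the random toy MDP, the combined reward `R`, the step size `alpha = (H+1)/(H+t)`, and the simplified bonus `c1*sqrt(H/t)` (standing in for the variance-based β_t term, which is garbled in the extraction) are all assumptions for illustration.

```python
import numpy as np

# Hedged sketch of the Constrained Q-Learning loop (Algorithm 1) on a toy MDP.
# Environment, reward R_h, step size, and bonus constant c1 are assumed, not
# taken from the paper.

S, A, H, K = 5, 3, 10, 200     # states, actions, horizon, episodes (toy sizes)
eta, c1 = 1.0, 0.5             # scaling parameter eta and bonus constant (assumed)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # random transition kernel (toy)
R = rng.uniform(0, 1, size=(H, S, A))           # stand-in for combined reward R_h

Q = np.full((H + 1, S, A), eta * H)  # optimistic initialization Q_h <- eta*H
W = np.full((H + 1, S), eta * H)     # value estimates W_h <- eta*H
N = np.zeros((H, S, A), dtype=int)   # visit counts N_h(s, a)

for k in range(K):
    s = 0                                    # observe s_1
    for h in range(H):
        a = int(np.argmax(Q[h, s]))          # greedy action w.r.t. Q_h
        s_next = rng.choice(S, p=P[h, s, a])
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)            # standard episodic Q-learning rate (assumed)
        bonus = c1 * np.sqrt(H / t)          # simplified UCB-style exploration bonus
        target = R[h, s, a] + W[h + 1, s_next] + bonus
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
        W[h, s] = min(eta * H, Q[h, s].max())   # clip value at eta*H as in step 12
        s = s_next
```

The clipping `min(ηH, ·)` in step 12 is what keeps the optimistic value estimates bounded; the bonus term is the part the paper refines with the empirical variance statistics µ_h and σ_h.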
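The quoted energy-harvesting setup (H = 20, B = 20, P = 8, E = 20, µ = 10, σ = 5, truncated-Gaussian harvest, integer discretization) can be sketched as follows. The rejection-sampling routine and the zero-transmission battery trace are illustrative assumptions; the paper does not specify how it samples or couples the battery to the policy.

```python
import numpy as np

# Sketch of the energy-harvesting simulation parameters quoted above
# (Wang et al., 2014 settings). The harvest E_h is Gaussian truncated to
# [0, E]; rejection sampling is one simple way to draw it exactly.

H, B, P, E = 20, 20, 8, 20     # horizon, battery capacity, power constraint, max harvest
mu, sigma = 10.0, 5.0          # mean / std of the pre-truncation Gaussian

def sample_harvest(rng, size):
    """Draw harvested energy from a Gaussian truncated to [0, E], then
    round to integers (the paper discretizes the problem)."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draws = rng.normal(mu, sigma, size=size)
        ok = draws[(draws >= 0) & (draws <= E)]
        take = min(size - filled, ok.size)
        out[filled:filled + take] = ok[:take]
        filled += take
    return np.rint(out).astype(int)

rng = np.random.default_rng(0)
harvest = sample_harvest(rng, H)   # one episode's worth of harvested energy

# Battery trace with no transmission (illustrative only): the battery
# accumulates harvested energy but can never exceed its capacity B.
battery, trace = 0, []
for e in harvest:
    battery = min(battery + int(e), B)
    trace.append(battery)
```

Note that rounding a clipped sample is not the same as a truncated Gaussian; the rejection loop above keeps the conditional distribution correct before discretization.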