Importance Sampling Techniques for Policy Optimization

Authors: Alberto Maria Metelli, Matteo Papini, Nico Montali, Marcello Restelli

JMLR 2020

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — "The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods." (Abstract); "In this section, we present the experimental evaluation of POIS in its different flavors (parameter-based, action-based, action-based per-decision)." (Section 7)
Researcher Affiliation: Academia — Alberto Maria Metelli (EMAIL), Matteo Papini (EMAIL), Nico Montali (EMAIL), Marcello Restelli (EMAIL); Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
Pseudocode: Yes — Algorithm 1: Parameter-based POIS (page 12); Algorithm 2: Action-based POIS (page 14); Algorithm 3: Parabolic Line Search (Appendix G.1)
Open Source Code: Yes — "The implementation of POIS can be found at https://github.com/T3p/baselines."
Open Datasets: Yes — "The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods." (Abstract; keywords: Reinforcement Learning, Policy Optimization, Importance Sampling, Per-Decision Importance Sampling, Multiple Importance Sampling). Also: "on classical control tasks (Duan et al., 2016; Todorov et al., 2012)." (Section 1)
Dataset Splits: Yes — "At each online iteration h = 1, 2, ..., M_online, we sample N_J parameters {θ_i^h}_{i=1}^{N_J} independently from ν_{ρ_0^h}. For each of the θ_i^h, we collect a single trajectory τ_i^h by running policy π_{θ_i^h} in the environment and we observe its return R(τ_i^h)." (Section 5.1). Also Appendix H.1: "Episodes per iteration: 100"; Appendix H.2: "Timesteps per iteration: 50000"
Hardware Specification: Yes — "We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c, Titan Xp and Tesla V100 used for this research."
Software Dependencies: No — The paper mentions environments such as MuJoCo (Todorov et al., 2012) and the continuous control benchmarks of Duan et al. (2016), but does not specify any software libraries or frameworks with version numbers used for the implementation.
Experiment Setup: Yes — "The hyperparameters of the individual algorithms are reported in Table 4." (Section 7.1). Appendix H.1 and H.2 detail numerous hyperparameters for both linear and deep neural policies. Table 4 caption: "Hyperparameter value of the individual algorithms employed in the experiments shown in Figure 4." (page 20)
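The data-collection procedure quoted in the Dataset Splits row (sample N_J policy parameters from a hyperpolicy, run one trajectory each, then reuse the returns via importance sampling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 1-D Gaussian hyperpolicy, the synthetic return function collect_return, and the shifted candidate mean mu_new are all hypothetical stand-ins for the real control tasks and POIS update.

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_return(theta):
    # Hypothetical return surrogate; the paper observes R(tau) from
    # a real trajectory in a continuous control environment.
    return -(theta - 1.0) ** 2

def gaussian_logpdf(x, mean, std):
    # Log-density of a 1-D Gaussian, used for importance weights.
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

# Behavioral hyperpolicy nu_{rho_0}: draw N_J parameters, one return each.
mu0, sigma = 0.0, 1.0
N_J = 1000
thetas = rng.normal(mu0, sigma, size=N_J)
returns = collect_return(thetas)

# Importance-sampling estimate of the expected return under a candidate
# hyperpolicy (mean shifted to mu_new), reusing the collected samples.
mu_new = 0.5
log_w = gaussian_logpdf(thetas, mu_new, sigma) - gaussian_logpdf(thetas, mu0, sigma)
w = np.exp(log_w)
is_estimate = np.mean(w * returns)

# Effective sample size, a standard diagnostic for how reliable the
# importance-sampling estimate is as the candidate drifts from the
# behavioral distribution.
ess = w.sum() ** 2 / (w ** 2).sum()
print(is_estimate, ess)
```

Under the shifted candidate, the estimate approaches the true expected return E[-(θ-1)^2] = -1.25 for θ ~ N(0.5, 1), while the effective sample size drops below N_J, reflecting the variance penalty that POIS-style objectives trade off against the estimated return.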