Importance Sampling Techniques for Policy Optimization

Authors: Alberto Maria Metelli, Matteo Papini, Nico Montali, Marcello Restelli

JMLR 2020

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — "The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods." (Abstract); "In this section, we present the experimental evaluation of POIS in its different flavors (parameter-based, action-based, action-based per-decision)." (Section 7)
Researcher Affiliation: Academia — Alberto Maria Metelli (EMAIL), Matteo Papini (EMAIL), Nico Montali (EMAIL), Marcello Restelli (EMAIL); Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB)
Pseudocode: Yes — Algorithm 1: Parameter-based POIS (page 12); Algorithm 2: Action-based POIS (page 14); Algorithm 3: Parabolic Line Search (Appendix G.1)
Open Source Code: Yes — "The implementation of POIS can be found at https://github.com/T3p/baselines."
Open Datasets: Yes — "The resulting algorithms are finally evaluated on a set of continuous control tasks, using both linear and deep policies, and compared with modern policy optimization methods." (Abstract; keywords: Reinforcement Learning, Policy Optimization, Importance Sampling, Per-Decision Importance Sampling, Multiple Importance Sampling). Also: "on classical control tasks (Duan et al., 2016; Todorov et al., 2012)." (Section 1)
Dataset Splits: Yes — "At each online iteration h = 1, 2, ..., M_online, we sample N_J parameters {θ_i^h}_{i=1}^{N_J} independently from ν_{ρ_0^h}. For each of the θ_i^h, we collect a single trajectory τ_i^h by running policy π_{θ_i^h} in the environment and we observe its return R(τ_i^h)." (Section 5.1). Also Appendix H.1: "Episodes per iteration: 100"; Appendix H.2: "Timesteps per iteration: 50000"
Hardware Specification: Yes — "We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c, Titan Xp and Tesla V100 used for this research."
Software Dependencies: No — The paper mentions environments such as MuJoCo (Todorov et al., 2012) and the continuous control benchmarks of Duan et al. (2016), but does not specify any software libraries or frameworks with version numbers used for the implementation.
Experiment Setup: Yes — "The hyperparameters of the individual algorithms are reported in Table 4." (Section 7.1). Appendix H.1 and H.2 detail numerous hyperparameters for both linear and deep neural policies. Table 4 caption: "Hyperparameter value of the individual algorithms employed in the experiments shown in Figure 4." (page 20)
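The data-collection procedure quoted in the Dataset Splits row (sample N_J policy parameters from a hyperpolicy, run one trajectory each, then reuse the returns via importance sampling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 1-D Gaussian hyperpolicy, the synthetic return function collect_return, and the shifted candidate mean mu_new are all hypothetical stand-ins for the real control tasks and POIS update.

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_return(theta):
    # Hypothetical return surrogate; the paper observes R(tau) from
    # a real trajectory in a continuous control environment.
    return -(theta - 1.0) ** 2

def gaussian_logpdf(x, mean, std):
    # Log-density of a 1-D Gaussian, used for importance weights.
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

# Behavioral hyperpolicy nu_{rho_0}: draw N_J parameters, one return each.
mu0, sigma = 0.0, 1.0
N_J = 1000
thetas = rng.normal(mu0, sigma, size=N_J)
returns = collect_return(thetas)

# Importance-sampling estimate of the expected return under a candidate
# hyperpolicy (mean shifted to mu_new), reusing the collected samples.
mu_new = 0.5
log_w = gaussian_logpdf(thetas, mu_new, sigma) - gaussian_logpdf(thetas, mu0, sigma)
w = np.exp(log_w)
is_estimate = np.mean(w * returns)

# Effective sample size, a standard diagnostic for how reliable the
# importance-sampling estimate is as the candidate drifts from the
# behavioral distribution.
ess = w.sum() ** 2 / (w ** 2).sum()
print(is_estimate, ess)
```

Under the shifted candidate, the estimate approaches the true expected return E[-(θ-1)^2] = -1.25 for θ ~ N(0.5, 1), while the effective sample size drops below N_J, reflecting the variance penalty that POIS-style objectives trade off against the estimated return.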