Multi-objective Reinforcement Learning through Continuous Pareto Manifold Approximation

Authors: Simone Parisi, Matteo Pirotta, Marcello Restelli

JAIR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, the properties of the proposed approach are empirically evaluated on two problems, a linear-quadratic Gaussian regulator and a water reservoir control task.
Researcher Affiliation | Academia | Simone Parisi (EMAIL), Technische Universität Darmstadt, Hochschulstr. 10, 64289 Darmstadt, Germany; Matteo Pirotta (EMAIL) and Marcello Restelli (EMAIL), Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy
Pseudocode | Yes | Algorithm 1: Pareto Manifold Gradient Algorithm
Open Source Code | Yes | Footnote 6: Source code available at https://github.com/sparisi/mips.
Open Datasets | No | The paper describes two problems: a linear-quadratic Gaussian regulator (LQG) and a water reservoir control task. Both are mathematical models or simulated environments rather than pre-existing public datasets with specific access information. The LQG problem is defined by its dynamics and reward functions, and the water reservoir task by its state-transition function, reward definitions, and simulation parameters. No external, publicly available datasets are mentioned with concrete access details (links, DOIs, or specific citations to data repositories).
Dataset Splits | No | The paper describes experiments in simulated environments (the LQG and water reservoir control tasks). It mentions that 'all policies are evaluated over 1,000 episodes of 100 steps, while the learning phase requires a different number of episodes over 30 steps'. This refers to the duration and number of simulation runs for learning and evaluation, not to splitting a pre-existing dataset into explicit training, validation, and test subsets. There are no dataset split percentages, sample counts for splits, or citations to predefined splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU models, GPU models, or memory specifications. It only describes the algorithms and their empirical evaluation.
Software Dependencies | No | The paper mentions various algorithms and methods (e.g., 'weighted sum Stochastic Dynamic Programming', 'Multi-objective FQI', 'Relative Entropy Policy Search', 'SMS-EMOA', 'Radial Algorithm', 'Pareto Following Algorithm') but does not specify the software implementations or libraries used, nor their version numbers. There are no details like 'Python 3.8' or 'PyTorch 1.9' that would allow for replication of the software environment.
Experiment Setup | Yes | The parameters used for all the experiments are the following: γ = 0.9, ξ = 0.1, and initial state s0 = [10, 10]^T and s0 = [10, 10, 10]^T for the 2- and 3-objective cases, respectively. The following sections compare the performance of the proposed metrics under several settings. We make use of tables to summarize the results at the end of each set of experiments. In this work we consider three objectives: flooding along the lake shores, irrigation supply and hydro-power supply. The immediate rewards are defined by... We used four centers c_i uniformly placed in the interval [−20, 190] and widths w_i of 60, for a total of six policy parameters. According to the results presented in Section 6.1.3, the integral estimate in PMGA is performed using a Monte Carlo algorithm fed with only one random point. For each instance of the variable t, 50 trajectories of 30 steps are used to estimate the gradient and the Hessian of the policy. Regarding the learning rate, the adaptive one described in Equation (8) was used with ε = 2.
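For concreteness, the LQG setup quoted above (discount γ = 0.9, initial state s0 = [10, 10]^T in the two-objective case) can be sketched as a minimal multi-objective transition function. This is an illustrative sketch only: the identity matrices used for the dynamics and cost terms are placeholders, not the paper's actual parameterization.

```python
import numpy as np

def lqg_step(s, a, A, B, Q_list, R_list):
    """One transition of a multi-objective LQG: linear dynamics plus
    one quadratic cost per objective (negated to form rewards)."""
    s_next = A @ s + B @ a
    rewards = np.array([-(s @ Q @ s) - (a @ R @ a)
                        for Q, R in zip(Q_list, R_list)])
    return s_next, rewards

# Placeholder matrices (NOT the paper's values): identity dynamics and costs.
I = np.eye(2)
s = np.array([10.0, 10.0])        # initial state s0 = [10, 10]^T
a = np.array([-0.5, -0.5])        # an arbitrary action for illustration
s_next, r = lqg_step(s, a, I, I, [I, I], [I, I])
```

With identity matrices the per-objective reward is simply −‖s‖² − ‖a‖², which makes the trade-off structure of the two quadratic objectives easy to inspect.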
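The radial-basis policy features for the water reservoir task (four Gaussian centers uniformly placed in [−20, 190], each with width 60) can likewise be sketched as below; the function and variable names are ours, and the paper's exact feature definition may differ in normalization.

```python
import numpy as np

# Four RBF centers uniformly placed in [-20, 190] (spacing 70),
# each with width 60, as quoted from the experiment setup.
centers = np.linspace(-20.0, 190.0, 4)   # [-20, 50, 120, 190]
width = 60.0

def rbf_features(water_level):
    """Gaussian RBF activations for a scalar water-level state."""
    return np.exp(-((water_level - centers) / width) ** 2)

phi = rbf_features(100.0)  # strongest activation at the center 120
```

With one weight per center and per action dimension this yields the small number of policy parameters the quote refers to.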