Multi-objective Reinforcement Learning through Continuous Pareto Manifold Approximation

Authors: Simone Parisi, Matteo Pirotta, Marcello Restelli

JAIR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, the properties of the proposed approach are empirically evaluated on two problems, a linear-quadratic Gaussian regulator and a water reservoir control task.
Researcher Affiliation | Academia | Simone Parisi (EMAIL), Technische Universität Darmstadt, Hochschulstr. 10, 64289 Darmstadt, Germany; Matteo Pirotta (EMAIL) and Marcello Restelli (EMAIL), Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy
Pseudocode | Yes | Algorithm 1: Pareto Manifold Gradient Algorithm
Open Source Code | Yes | Footnote 6: Source code available at https://github.com/sparisi/mips.
Open Datasets | No | The paper describes two problems: a linear-quadratic Gaussian regulator (LQG) and a water reservoir control task. Both are mathematical models or simulated environments rather than pre-existing public datasets with specific access information. The LQG problem is defined by its dynamics and reward functions, and the water reservoir task by its state-transition function, reward definitions, and simulation parameters. No external, publicly available datasets are mentioned with concrete access details (links, DOIs, or specific citations to data repositories).
Dataset Splits | No | The paper describes experiments in simulated environments (the LQG and water reservoir control tasks). It mentions that 'all policies are evaluated over 1,000 episodes of 100 steps, while the learning phase requires a different number of episodes over 30 steps'. This refers to the duration and number of simulation runs for learning and evaluation, not to splitting a pre-existing dataset into explicit training, validation, and test subsets. There are no dataset split percentages, sample counts for splits, or citations to predefined splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU models, GPU models, or memory specifications. It only describes the algorithms and their empirical evaluation.
Software Dependencies | No | The paper mentions various algorithms and methods (e.g., 'weighted sum Stochastic Dynamic Programming', 'Multi-objective FQI', 'Relative Entropy Policy Search', 'SMS-EMOA', 'Radial Algorithm', 'Pareto Following Algorithm') but does not specify the software implementations or libraries used, nor their version numbers. There are no details like 'Python 3.8' or 'PyTorch 1.9' that would allow for replication of the software environment.
Experiment Setup | Yes | The parameters used for all the experiments are the following: γ = 0.9, ξ = 0.1, and initial state s0 = [10, 10]^T and s0 = [10, 10, 10]^T for the 2- and 3-objective cases, respectively. The following sections compare the performance of the proposed metrics under several settings. We make use of tables to summarize the results at the end of each set of experiments. In this work we consider three objectives: flooding along the lake shores, irrigation supply and hydro-power supply. The immediate rewards are defined by... We used four centers c_i uniformly placed in the interval [−20, 190] and widths w_i of 60, for a total of six policy parameters. According to the results presented in Section 6.1.3, the integral estimate in PMGA is performed using a Monte Carlo algorithm fed with only one random point. For each instance of the variable t, 50 trajectories of 30 steps are used to estimate the gradient and the Hessian of the policy. Regarding the learning rate, the adaptive one described in Equation (8) was used with ε = 2.
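For concreteness, the LQG setup quoted above (discount γ = 0.9, initial state s0 = [10, 10]^T in the two-objective case) can be sketched as a minimal multi-objective transition function. This is an illustrative sketch only: the identity matrices used for the dynamics and cost terms are placeholders, not the paper's actual parameterization.

```python
import numpy as np

def lqg_step(s, a, A, B, Q_list, R_list):
    """One transition of a multi-objective LQG: linear dynamics plus
    one quadratic cost per objective (negated to form rewards)."""
    s_next = A @ s + B @ a
    rewards = np.array([-(s @ Q @ s) - (a @ R @ a)
                        for Q, R in zip(Q_list, R_list)])
    return s_next, rewards

# Placeholder matrices (NOT the paper's values): identity dynamics and costs.
I = np.eye(2)
s = np.array([10.0, 10.0])        # initial state s0 = [10, 10]^T
a = np.array([-0.5, -0.5])        # an arbitrary action for illustration
s_next, r = lqg_step(s, a, I, I, [I, I], [I, I])
```

With identity matrices the per-objective reward is simply −‖s‖² − ‖a‖², which makes the trade-off structure of the two quadratic objectives easy to inspect.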
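The radial-basis policy features for the water reservoir task (four Gaussian centers uniformly placed in [−20, 190], each with width 60) can likewise be sketched as below; the function and variable names are ours, and the paper's exact feature definition may differ in normalization.

```python
import numpy as np

# Four RBF centers uniformly placed in [-20, 190] (spacing 70),
# each with width 60, as quoted from the experiment setup.
centers = np.linspace(-20.0, 190.0, 4)   # [-20, 50, 120, 190]
width = 60.0

def rbf_features(water_level):
    """Gaussian RBF activations for a scalar water-level state."""
    return np.exp(-((water_level - centers) / width) ** 2)

phi = rbf_features(100.0)  # strongest activation at the center 120
```

With one weight per center and per action dimension this yields the small number of policy parameters the quote refers to.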