Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality

Authors: François G. Ged, Maria Han Veiga

JMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental As a proof of concept, we evaluate MPG numerically on standard test benchmarks. We successfully train agents on simple standard tasks without relying on RL tricks, and confirm our theoretical findings (see Section 4).
Researcher Affiliation Academia François G. Ged (1,2), EMAIL — (1) Chair of Statistical Field Theory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; (2) Dynamical Systems in Biomathematics, University of Vienna, Vienna, Austria. Maria Han Veiga, EMAIL — Department of Mathematics, The Ohio State University, Columbus, USA. All authors are affiliated with universities: École Polytechnique Fédérale de Lausanne, University of Vienna, and The Ohio State University.
Pseudocode Yes Algorithm 1: MPG implementation for an N-horizon task
Open Source Code No The paper does not contain an explicit statement about releasing source code or provide a link to a code repository for the methodology described. It refers to numerical experiments but does not offer access to their implementation.
Open Datasets Yes Then, we study two benchmarks from OpenAI: the Frozen Lake game and the Cart Pole.
Dataset Splits No The paper mentions "Number of episodes: 1000." for training the models on the Frozen Lake and Cart Pole tasks. However, it does not specify explicit training/testing/validation splits for the data, which is typical for simulated reinforcement learning environments where data is generated dynamically.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies No The paper mentions using a "deep neural network" and "ReLU activation function" and describes the model architecture. However, it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers.
Experiment Setup Yes Algorithm 1:
Input: initial temperature τ_0, initial learning rate η_0, final temperature τ_T, final learning rate η_T
τ ← τ_0; η ← η_0
for t = 1, ..., episodes do
    generate a trajectory from the policies {π^n_t, π^{n−1}_t, ..., π^1_t}: {(s_i, s_{i+1}, a_i, r_i)}_{i=0}^{n−1}
    for i = 1, ..., n do
        C_i = Σ_{ℓ=n−i}^{n−1} (r_ℓ − τ log π^{(n−ℓ)}_t)
        θ^{(i)}_{t+1} = θ^{(i)}_t + η C_i ∇ log π^{(i)}_t(a_{n−i} | s_{n−i})
    end for
    decay τ, η using d_τ = τ_T^{1/episodes} and d_η = η_T^{1/episodes}
end for
Furthermore, the paper provides tables detailing the "Hyper-parameters for Frozen Lake" and "Hyper-parameters for balancing cart pole task", including initial and terminal learning rates and temperatures.
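The training loop described above can be sketched in plain NumPy. Everything below is an illustrative assumption, not the paper's implementation: the toy chain environment, the tabular softmax policy heads, the horizon, and the hyper-parameter values are made up for the sketch. Only the overall shape follows Algorithm 1 — one policy head per remaining-horizon index, an entropy-regularized return-to-go C_i per head, and geometric decay of the temperature τ and learning rate η from their initial to their terminal values.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4                       # horizon (illustrative)
n_states, n_actions = 5, 2  # toy chain: move right to reach a reward
episodes = 1000

tau0, tauT = 1.0, 0.01      # initial / final temperature (assumed values)
eta0, etaT = 0.5, 0.05      # initial / final learning rate (assumed values)

# One tabular softmax head theta[i] per remaining-horizon index i
theta = np.zeros((n, n_states, n_actions))

def policy(i, s):
    """Softmax policy of head i at state s."""
    logits = theta[i, s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(s, a):
    """Toy dynamics: action 1 moves right; reward 1 at the last state."""
    s2 = min(s + a, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

tau, eta = tau0, eta0
# Geometric decay reaching tauT / etaT after `episodes` steps
d_tau = (tauT / tau0) ** (1.0 / episodes)
d_eta = (etaT / eta0) ** (1.0 / episodes)

for t in range(episodes):
    # Roll out one trajectory with the nested policies pi^n, ..., pi^1
    s, traj = 0, []
    for i in range(n):
        p = policy(n - 1 - i, s)            # head with n - i steps to go
        a = rng.choice(n_actions, p=p)
        s2, r = step(s, a)
        traj.append((s, a, r, np.log(p[a])))
        s = s2
    # One update per head: entropy-regularized return-to-go C_i
    for i in range(1, n + 1):
        C = sum(r - tau * logp for (_, _, r, logp) in traj[n - i:])
        s_i, a_i, _, _ = traj[n - i]
        p = policy(i - 1, s_i)
        grad = -p
        grad[a_i] += 1.0                    # grad of log-softmax w.r.t. logits
        theta[i - 1, s_i] += eta * C * grad
    tau *= d_tau
    eta *= d_eta
```

The per-head update mirrors θ^{(i)}_{t+1} = θ^{(i)}_t + η C_i ∇ log π^{(i)}_t(a_{n−i} | s_{n−i}), with the gradient taken exactly for the tabular softmax instead of by backpropagation through a neural network.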