Feudal Graph Reinforcement Learning
Authors: Tommaso Marzi, Arshjot Singh Khehra, Andrea Cini, Cesare Alippi
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed framework on a graph clustering problem and MuJoCo locomotion tasks; simulation results show that FGRL compares favorably against relevant baselines. Furthermore, an in-depth analysis of the command propagation mechanism provides evidence that the introduced message-passing scheme favors learning hierarchical decision-making policies. In Fig. 3 we show the success rate of each agent in clustering the graph and the median of the Normalized Mutual Information (NMI) score computed across different runs. We report the results for the 4 agents in Fig. 5. |
| Researcher Affiliation | Academia | Tommaso Marzi (Università della Svizzera italiana, IDSIA); Arshjot Khehra (Università della Svizzera italiana); Andrea Cini (Università della Svizzera italiana, IDSIA); Cesare Alippi (Università della Svizzera italiana, IDSIA; Politecnico di Milano) |
| Pseudocode | No | The paper only describes the methodology using text, equations (Eq. 2, 3, 4, 5), and a diagram (Fig. 2), without any explicit 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Code to reproduce experiments is available online at https://github.com/tommasomarzi/fgrl. |
| Open Datasets | Yes | We validate our framework on two scenarios, namely a synthetic graph clustering problem inspired by Bianchi et al. (2020) and continuous control environments from the standard MuJoCo locomotion tasks (Todorov et al., 2012), where we follow Huang et al. (2020). |
| Dataset Splits | No | The paper describes a synthetic graph clustering problem where graphs are generated with varying parameters (β, Nβ) and continuous control environments from MuJoCo locomotion tasks (Todorov et al., 2012) where agents interact with a simulator. It does not provide explicit training/test/validation dataset splits, as is common for simulated or procedurally generated environments. |
| Hardware Specification | Yes | Experiments were run on a workstation equipped with AMD EPYC 7513 CPUs. |
| Software Dependencies | No | The paper mentions that the code was developed relying on open-source libraries and publicly available code of previous works, and states the use of Adam optimizer and PPO, but does not provide specific version numbers for key software components or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 3 provides detailed hyperparameters including Population size, Initial step size, Dimension of state representation, Dimension of hidden layer, Activation function, Aggregation functions, Maximal hierarchy height, and Message-passing rounds. Appendix D.4 further specifies PPO hyperparameters such as learning rate (3e-6), hidden layers ([64, 64] with tanh), discount factor (0.99), clipping value (0.2), policy update epochs (10), batch size (64), updating horizon (2048), and action standard deviation decay schedule. |
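As an illustration, the PPO hyperparameters reported in Appendix D.4 can be gathered into a single configuration object. This is a sketch only: the key names below are our own, not identifiers from the authors' released code at https://github.com/tommasomarzi/fgrl.

```python
# Hedged sketch of the PPO setup reported in Appendix D.4 of the paper.
# Key names are illustrative assumptions, not the authors' actual config schema.
ppo_config = {
    "learning_rate": 3e-6,        # Adam optimizer
    "hidden_layers": [64, 64],    # two hidden layers with tanh activations
    "activation": "tanh",
    "discount_factor": 0.99,      # gamma
    "clip_range": 0.2,            # PPO clipping value
    "update_epochs": 10,          # policy update epochs per iteration
    "batch_size": 64,
    "horizon": 2048,              # environment steps collected per update
}

# Sanity-check a few of the reported values.
print(ppo_config["learning_rate"], ppo_config["discount_factor"])
```

Such a dictionary can be passed to most PPO implementations after mapping the keys to the library's own argument names.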