Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning
Authors: Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Cheems Wang, Chang Liu, Xiangyang Ji
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results witness that LaRe (i) achieves superior temporal credit assignment to SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground truth rewards for certain tasks. We evaluate LaRe on two widely used benchmarks in both single-agent and multi-agent settings: the MuJoCo locomotion benchmark (Todorov, Erez, and Tassa 2012) and the Multi-Agent Particle Environment (MPE) (Lowe et al. 2017). Additionally, we perform ablation studies and further analyses to validate LaRe's components and assess its properties. |
| Researcher Affiliation | Academia | Tsinghua University EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: LaRe. Input: LLM M, task information task, role instruction role, number of candidate responses n, pre-collected random state-action pairs S, max episodes N_max. Output: policy network πθ, reward decoder model fψ. |
| Open Source Code | Yes | Our code is available at https://github.com/thu-rllab/LaRe |
| Open Datasets | Yes | We evaluate LaRe on two widely used benchmarks in both single-agent and multi-agent settings: the MuJoCo locomotion benchmark (Todorov, Erez, and Tassa 2012) and the Multi-Agent Particle Environment (MPE) (Lowe et al. 2017). Additionally, we perform ablation studies and further analyses to validate LaRe's components and assess its properties. Moreover, we evaluate LaRe in more complex scenarios from SMAC (Samvelyan et al. 2019) and a newly designed task, Triangle Area, in Appendix D and E. |
| Dataset Splits | No | The paper mentions using specific environments and benchmarks (MuJoCo, MPE, SMAC) and runs each algorithm on |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | The paper mentions using "GPT-4o from the OpenAI API" but does not list specific version numbers for other key software components like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers. |
| Experiment Setup | No | The paper states that "Further details and results are available in the Appendix" regarding experimental setups and baselines, implying that specific hyperparameters, training configurations, or system-level settings are not detailed in the main text. |
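Based solely on the inputs and outputs quoted in the Pseudocode row above, the overall shape of Algorithm 1 can be sketched as follows. This is a minimal, hypothetical skeleton: the stub classes, the first-candidate selection rule, and all update bodies are placeholders introduced here for illustration, not the authors' implementation.

```python
import random

class StubPolicy:
    """Placeholder for the policy network pi_theta; LaRe uses a neural policy."""
    def rollout(self, length=5):
        # Return an episode of random (state, action) pairs.
        return [(random.random(), random.random()) for _ in range(length)]
    def update(self, episode, rewards):
        pass  # a policy-gradient step would go here

class StubDecoder:
    """Placeholder for the reward decoder f_psi mapping latent reward -> scalar."""
    def __call__(self, z):
        return sum(z) if isinstance(z, (list, tuple)) else float(z)
    def update(self, latents, episode):
        pass  # regression toward the episodic return would go here

def lare_train(llm, task, role, n, random_pairs, n_max, policy, decoder):
    """Hypothetical skeleton of Algorithm 1 (LaRe), reconstructed only from
    the quoted inputs/outputs; the real selection and update rules differ."""
    # 1) Query LLM M (with task information and role instruction)
    #    for n candidate latent-reward functions.
    candidates = [llm(task, role) for _ in range(n)]

    # 2) Self-verify on the pre-collected random state-action pairs S:
    #    keep the first candidate that executes without error (assumption).
    def runs(fn):
        try:
            for s, a in random_pairs:
                fn(s, a)
            return True
        except Exception:
            return False
    latent_reward_fn = next(c for c in candidates if runs(c))

    # 3) Episodic training loop for at most N_max episodes: compute latent
    #    rewards, decode them, and update decoder and policy.
    for _ in range(n_max):
        episode = policy.rollout()
        latents = [latent_reward_fn(s, a) for s, a in episode]
        rewards = [decoder(z) for z in latents]
        decoder.update(latents, episode)
        policy.update(episode, rewards)
    return policy, decoder
```

A trivial invocation with the stubs (e.g. `lare_train(lambda t, r: (lambda s, a: (s, a)), "task", "role", 3, [(0.1, 0.2)], 2, StubPolicy(), StubDecoder())`) exercises the loop end to end, which is all this sketch is meant to convey about the algorithm's structure.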