Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning

Authors: Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Cheems Wang, Chang Liu, Xiangyang Ji

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results witness that LaRe (i) achieves superior temporal credit assignment to SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground-truth rewards for certain tasks. We evaluate LaRe on two widely used benchmarks in both single-agent and multi-agent settings: the MuJoCo locomotion benchmark (Todorov, Erez, and Tassa 2012) and the Multi-Agent Particle Environment (MPE) (Lowe et al. 2017). Additionally, we perform ablation studies and further analyses to validate LaRe's components and assess its properties.
Researcher Affiliation | Academia | Tsinghua University
Pseudocode | Yes | Algorithm 1: LaRe. Input: LLM M, task information task, role instruction role, number of candidate responses n, pre-collected random state-action pairs s, max episodes N_max. Output: policy network πθ, reward decoder model fψ.
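The algorithm's signature can be sketched in Python as follows. This is a minimal illustrative sketch, not the paper's implementation: `llm`, `rollout`, and `verify_pairs` are hypothetical stand-ins, and the trained reward decoder fψ and the policy update for πθ are elided (a single verified candidate is used directly as the dense reward).

```python
def la_re_sketch(llm, task, role, n_candidates, verify_pairs,
                 rollout, max_episodes):
    """Hedged sketch of Algorithm 1 (LaRe): LLM-proposed latent rewards
    with self-verification, then episodic rollouts under dense rewards."""
    # 1. Prompt the LLM n times for candidate latent-reward functions.
    candidates = [llm(task, role) for _ in range(n_candidates)]

    # 2. Self-verification: keep only candidates that execute without
    #    error on pre-collected random state-action pairs.
    valid = []
    for fn in candidates:
        try:
            for s, a in verify_pairs:
                float(fn(s, a))
            valid.append(fn)
        except Exception:
            continue
    if not valid:
        raise RuntimeError("no candidate passed self-verification")

    # 3. The paper trains a reward decoder f_psi from the verified
    #    candidates; as a simplification we use the first verified
    #    candidate as the per-step dense reward.
    reward_fn = valid[0]
    episode_returns = []
    for _ in range(max_episodes):
        total = sum(reward_fn(s, a) for s, a in rollout())
        episode_returns.append(total)
        # policy update for pi_theta with dense latent rewards goes here
    return reward_fn, episode_returns
```

A stubbed `llm` returning a fixed reward function and a fixed `rollout` suffice to exercise the control flow end to end.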
Open Source Code | Yes | Our code is available at https://github.com/thu-rllab/LaRe
Open Datasets | Yes | We evaluate LaRe on two widely used benchmarks in both single-agent and multi-agent settings: the MuJoCo locomotion benchmark (Todorov, Erez, and Tassa 2012) and the Multi-Agent Particle Environment (MPE) (Lowe et al. 2017). Additionally, we perform ablation studies and further analyses to validate LaRe's components and assess its properties. Moreover, we evaluate LaRe in more complex scenarios from SMAC (Samvelyan et al. 2019) and a newly designed task, Triangle Area, in Appendix D and E.
Dataset Splits | No | The paper mentions using specific environments and benchmarks (MuJoCo, MPE, SMAC) and runs each algorithm over repeated trials, but it does not report conventional train/validation/test dataset splits.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions using "GPT-4o from OpenAI API" but does not list specific version numbers for other key software components such as programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers.
Experiment Setup | No | The paper states that "Further details and results are available in the Appendix" regarding experimental setups and baselines, implying that specific hyperparameters, training configurations, or system-level settings are not detailed in the main text.