Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data
Authors: Shilong Deng, Zetao Zheng, Hongcai He, Paul Weng, Jie Shao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on four challenging MuJoCo tasks with sparse rewards. Following EMRLD (Rengarajan et al. 2022a), the agent gets a reward only after it has moved a certain number of units along the correct direction, making the rewards sparse. We take three popular off-policy RL algorithms as our vanilla algorithms, which are DDPG (Lillicrap et al. 2016), TD3 (Fujimoto, van Hoof, and Meger 2018), and SAC (Haarnoja et al. 2018). |
| Researcher Affiliation | Academia | (1) University of Electronic Science and Technology of China, Chengdu, China; (2) Sichuan Artificial Intelligence Research Institute, Yibin, China; (3) Data Science Research Center, Duke Kunshan University, Kunshan, China |
| Pseudocode | Yes | Algorithm 1: RL+GILD<br>Input: actor ϕ, critic θ, GILD ω, demonstration data D_dem, and empty replay buffer D<br>1: while not converging do<br>2: Collect data from the environment and store it in D;<br>3: // meta-training<br>4: Sample (s, a, r, s′) from D, and (s_d, a_d) from D_dem;<br>5: Update critic θ via Eq. (5);<br>6: Pseudo-update actor ϕ̂ with RL+IL via Eq. (6);<br>7: Update actor ϕ with RL+GILD via Eq. (7);<br>8: // meta-optimization<br>9: Update GILD ω via Eq. (11);<br>10: end while |
| Open Source Code | Yes | Our code is available at https://github.com/slDeng1003/GILD. |
| Open Datasets | Yes | Benchmarks and vanilla RL algorithms. We conduct experiments on four challenging MuJoCo tasks with sparse rewards. |
| Dataset Splits | No | The paper describes training steps and evaluation frequency (e.g., "train off-policy algorithms for 1 million steps with sparse rewards and evaluate them every 5000 steps"), and mentions "demonstration data Ddem" and a "replay buffer D" for storing collected transitions. However, it does not provide specific percentages, sample counts, or explicit methods for dividing any dataset (e.g., demonstration data) into training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. It only discusses computational cost and run time without specifying the underlying hardware. |
| Software Dependencies | No | The paper mentions using open-source implementations of OurDDPG, TD3, and SAC, and provides links, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or CUDA versions) within the text. |
| Experiment Setup | Yes | Implementation details. To ensure a fair and identical experimental evaluation across algorithms, we train the (RL+IL and RL+GILD) variants using the same hyperparameters as their vanilla algorithms and introduce no domain-specific parameters. We train off-policy algorithms for 1 million steps with sparse rewards and evaluate them every 5000 steps with dense rewards. On-policy algorithms are trained with more steps (e.g., 30 million) to ensure convergence. Results are averaged over five random seeds, and the standard deviation is shown with a shaded region or error bar. ... we assign the hyperparameters as w_rl = β / ((1/N) Σ_{s,a} \|Q_θ(s, a)\|) and w_il = 1 for the off-policy RL+IL baselines in our experiment, with β = 2.5 as provided by the authors. ... GILD converges exceptionally quickly (within 1% of total steps) under the supervision of the meta-loss. The meta-loss drops rapidly to around zero after 1000 steps, verifying that GILD has distilled most of the knowledge in the demonstrations after processing each demonstration sample (t_ws · B) / N ≈ 640 times, where t_ws = 10000 is the number of warm-start steps, B = 256 is the batch size, and N ≈ 4000 is the number of demonstration samples. |
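The two quantities quoted in the Experiment Setup row are simple arithmetic and can be checked directly. The sketch below is a minimal, hypothetical helper pair (names `rl_weight` and `warm_start_passes` are ours, not the paper's): the first computes the TD3+BC-style RL-loss weight w_rl = β / mean(|Q(s, a)|) with β = 2.5, and the second reproduces the (t_ws · B) / N ≈ 640 passes-per-sample figure.

```python
import numpy as np

def rl_weight(q_values, beta=2.5):
    """Hypothetical helper: w_rl = beta / mean(|Q(s, a)|) over a batch,
    the normalization the setup describes for off-policy RL+IL baselines."""
    return beta / np.mean(np.abs(q_values))

def warm_start_passes(t_ws=10_000, batch_size=256, n_demos=4_000):
    """Average number of times each demonstration sample is processed
    during warm-start: (t_ws * B) / N with the values quoted in the paper."""
    return t_ws * batch_size / n_demos

# A batch whose |Q| values average 1.25 gives w_rl = 2.5 / 1.25 = 2.0.
print(rl_weight(np.array([1.0, -1.5, 1.25, -1.25])))  # → 2.0
print(warm_start_passes())  # → 640.0
```

Note that w_rl is recomputed per batch, so it adapts as the critic's Q-value scale changes during training.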
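The control flow of Algorithm 1 in the Pseudocode row can be sketched as a plain Python loop. This is a structural sketch only: every callback (`update_critic`, `pseudo_update_actor`, `update_actor`, `update_gild`, `collect`, `converged`) is a hypothetical stand-in for the paper's Eqs. (5)-(7) and (11), not the authors' implementation.

```python
import random

def train_rl_gild(update_critic, pseudo_update_actor, update_actor,
                  update_gild, collect, converged, replay, demos,
                  batch_size=256):
    """Structural sketch of Algorithm 1 (RL+GILD); callbacks are stand-ins."""
    while not converged():
        replay.extend(collect())  # collect data and store it in D
        batch = random.sample(replay, min(batch_size, len(replay)))
        demo_batch = random.sample(demos, min(batch_size, len(demos)))
        # meta-training
        update_critic(batch)                                # Eq. (5)
        actor_hat = pseudo_update_actor(batch, demo_batch)  # Eq. (6), RL+IL
        update_actor(batch)                                 # Eq. (7), RL+GILD
        # meta-optimization
        update_gild(actor_hat, batch, demo_batch)           # Eq. (11)

# Smoke run with counting stubs: "converge" after three iterations.
calls = {"critic": 0, "gild": 0}
steps = iter(range(3))
train_rl_gild(
    update_critic=lambda b: calls.__setitem__("critic", calls["critic"] + 1),
    pseudo_update_actor=lambda b, d: "actor_hat",
    update_actor=lambda b: None,
    update_gild=lambda a, b, d: calls.__setitem__("gild", calls["gild"] + 1),
    collect=lambda: [object() for _ in range(4)],
    converged=lambda: next(steps, None) is None,
    replay=[], demos=[object() for _ in range(8)],
)
print(calls)  # → {'critic': 3, 'gild': 3}
```

The key structural point the sketch preserves is that the GILD update (Eq. 11) receives the pseudo-updated actor from the RL+IL step, which is what makes the outer update a meta-optimization rather than a second ordinary gradient step.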