Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data
Authors: Shilong Deng, Zetao Zheng, Hongcai He, Paul Weng, Jie Shao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on four challenging MuJoCo tasks with sparse rewards. Following EMRLD (Rengarajan et al. 2022a), the agent gets a reward only after it has moved a certain number of units along the correct direction, making the rewards sparse. We take three popular off-policy RL algorithms as our vanilla algorithms, which are DDPG (Lillicrap et al. 2016), TD3 (Fujimoto, van Hoof, and Meger 2018), and SAC (Haarnoja et al. 2018). |
| Researcher Affiliation | Academia | (1) University of Electronic Science and Technology of China, Chengdu, China; (2) Sichuan Artificial Intelligence Research Institute, Yibin, China; (3) Data Science Research Center, Duke Kunshan University, Kunshan, China |
| Pseudocode | Yes | Algorithm 1: RL+GILD<br>Input: actor ϕ, critic θ, GILD ω, demonstration data D_dem, and empty replay buffer D<br>1: while not converging do<br>2: Collect data from the environment and store it in D;<br>3: // meta-training<br>4: Sample (s, a, r, s′) from D, and (s_d, a_d) from D_dem;<br>5: Update critic θ via Eq. (5);<br>6: Pseudo-update actor ϕ̂ with RL+IL via Eq. (6);<br>7: Update actor ϕ with RL+GILD via Eq. (7);<br>8: // meta-optimization<br>9: Update GILD ω via Eq. (11);<br>10: end while |
| Open Source Code | Yes | Our code is available at https://github.com/slDeng1003/GILD. |
| Open Datasets | Yes | Benchmarks and vanilla RL algorithms. We conduct experiments on four challenging MuJoCo tasks with sparse rewards. |
| Dataset Splits | No | The paper describes training steps and evaluation frequency (e.g., "train off-policy algorithms for 1 million steps with sparse rewards and evaluate them every 5000 steps"), and mentions "demonstration data Ddem" and a "replay buffer D" for storing collected transitions. However, it does not provide specific percentages, sample counts, or explicit methods for dividing any dataset (e.g., demonstration data) into training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models. It only discusses computational cost and run time without specifying the underlying hardware. |
| Software Dependencies | No | The paper mentions using open-source implementations of OurDDPG, TD3, and SAC, and provides links, but it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, or CUDA versions) within the text. |
| Experiment Setup | Yes | Implementation details. To ensure a fair and identical experimental evaluation across algorithms, we train the (RL+IL and RL+GILD) variants using the same hyperparameters as their vanilla algorithms and introduce no domain-specific parameters. We train off-policy algorithms for 1 million steps with sparse rewards and evaluate them every 5000 steps with dense rewards. On-policy algorithms are trained with more steps (e.g., 30 million) to ensure convergence. Results are averaged over five random seeds, and the standard deviation is shown with a shaded region or error bar. ... we assign the hyperparameters as w_rl = β / ((1/N) Σ_{s,a} \|Q_θ(s, a)\|) and w_il = 1 for the off-policy RL+IL baselines in our experiment, with β = 2.5 as provided by the authors. ... GILD converges exceptionally quickly (within 1% of total steps) under the supervision of the meta-loss. The meta-loss drops rapidly to around zero after 1000 steps, verifying that GILD has distilled most of the knowledge in the demonstrations after processing each demonstration sample (t_ws · B) / N ≈ 640 times, where t_ws = 10000 is the number of warm-start steps, B = 256 is the batch size, and N ≈ 4000 is the number of demonstration samples. |
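The two quantities quoted in the Experiment Setup row are simple arithmetic and can be checked directly. The sketch below is a minimal, hypothetical helper pair (names `rl_weight` and `warm_start_passes` are ours, not the paper's): the first computes the TD3+BC-style RL-loss weight w_rl = β / mean(|Q(s, a)|) with β = 2.5, and the second reproduces the (t_ws · B) / N ≈ 640 passes-per-sample figure.

```python
import numpy as np

def rl_weight(q_values, beta=2.5):
    """Hypothetical helper: w_rl = beta / mean(|Q(s, a)|) over a batch,
    the normalization the setup describes for off-policy RL+IL baselines."""
    return beta / np.mean(np.abs(q_values))

def warm_start_passes(t_ws=10_000, batch_size=256, n_demos=4_000):
    """Average number of times each demonstration sample is processed
    during warm-start: (t_ws * B) / N with the values quoted in the paper."""
    return t_ws * batch_size / n_demos

# A batch whose |Q| values average 1.25 gives w_rl = 2.5 / 1.25 = 2.0.
print(rl_weight(np.array([1.0, -1.5, 1.25, -1.25])))  # → 2.0
print(warm_start_passes())  # → 640.0
```

Note that w_rl is recomputed per batch, so it adapts as the critic's Q-value scale changes during training.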
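The control flow of Algorithm 1 in the Pseudocode row can be sketched as a plain Python loop. This is a structural sketch only: every callback (`update_critic`, `pseudo_update_actor`, `update_actor`, `update_gild`, `collect`, `converged`) is a hypothetical stand-in for the paper's Eqs. (5)-(7) and (11), not the authors' implementation.

```python
import random

def train_rl_gild(update_critic, pseudo_update_actor, update_actor,
                  update_gild, collect, converged, replay, demos,
                  batch_size=256):
    """Structural sketch of Algorithm 1 (RL+GILD); callbacks are stand-ins."""
    while not converged():
        replay.extend(collect())  # collect data and store it in D
        batch = random.sample(replay, min(batch_size, len(replay)))
        demo_batch = random.sample(demos, min(batch_size, len(demos)))
        # meta-training
        update_critic(batch)                                # Eq. (5)
        actor_hat = pseudo_update_actor(batch, demo_batch)  # Eq. (6), RL+IL
        update_actor(batch)                                 # Eq. (7), RL+GILD
        # meta-optimization
        update_gild(actor_hat, batch, demo_batch)           # Eq. (11)

# Smoke run with counting stubs: "converge" after three iterations.
calls = {"critic": 0, "gild": 0}
steps = iter(range(3))
train_rl_gild(
    update_critic=lambda b: calls.__setitem__("critic", calls["critic"] + 1),
    pseudo_update_actor=lambda b, d: "actor_hat",
    update_actor=lambda b: None,
    update_gild=lambda a, b, d: calls.__setitem__("gild", calls["gild"] + 1),
    collect=lambda: [object() for _ in range(4)],
    converged=lambda: next(steps, None) is None,
    replay=[], demos=[object() for _ in range(8)],
)
print(calls)  # → {'critic': 3, 'gild': 3}
```

The key structural point the sketch preserves is that the GILD update (Eq. 11) receives the pseudo-updated actor from the RL+IL step, which is what makes the outer update a meta-optimization rather than a second ordinary gradient step.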