Behavior-agnostic Task Inference for Robust Offline In-context Reinforcement Learning
Authors: Long Ma, Fangwei Zhong, Yizhou Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on MuJoCo environments demonstrate that BATI effectively interprets out-of-distribution contexts and outperforms other methods, even in the presence of significant environmental noise. We conduct extensive experiments in several MuJoCo environments to evaluate the effectiveness of BATI. |
| Researcher Affiliation | Academia | 1Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China 2Beijing Institute for General Artificial Intelligence, Beijing, China 3School of Artificial Intelligence, Beijing Normal University, Beijing, China 4School of Computer Science, Peking University, Beijing, China 5Institute for Artificial Intelligence, Peking University, Beijing, China 6State Key Laboratory of General Artificial Intelligence, Beijing, China. Correspondence to: Fangwei Zhong <EMAIL>. |
| Pseudocode | Yes | See Alg. 1 for the pseudocode of the full training pipeline. See Alg. 2 for the online evaluation procedure. |
| Open Source Code | No | Project page: https://sites.google.com/view/bati-icrl. The paper provides a project page URL, but it does not explicitly state that this page hosts the source code for the methodology described in the paper. |
| Open Datasets | Yes | We choose five representative robot locomotion environments based on the MuJoCo simulator (Todorov et al., 2012) with varying properties and levels of difficulty. |
| Dataset Splits | Yes | In each evaluation environment, we randomly sample 20 tasks as p_train(M) and another 20 tasks as p_test(M) according to its task parameterization... We split the 40 evaluation tasks in AntDir into different training and testing splits and reran the experiments. As shown in Tab. 5, across all splits, BATI achieves the best performance uniformly and is highly stable. |
| Hardware Specification | No | The paper mentions using the "MuJoCo simulator" for environments but does not specify any hardware details like GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | We implement BATI and the baselines on the same codebase with IQL (Kostrikov et al., 2022) as the base offline RL algorithm. Building on the codebase of the official implementation of UNICORN (Li et al., 2024), we implement BATI and all our baselines. The paper mentions various software components and frameworks but does not provide specific version numbers for any of them (e.g., IQL, UNICORN, CSRO, BRAC). |
| Experiment Setup | Yes | Table 3. Hyperparameters used in each of our evaluation environments. Parameter Name: Learning Rate, Batch Size, Task Contrastive Batch Size, IQL τ, IQL β, IQL Exp. Adv. Clip, # Gradient Steps, Episode Length, Dataset Size, Task Latent Dim, BATI # Latent Samples N, UNICORN Weight, CSRO CLUB Weight, CLUB Encoder Hidden Dims, Encoder Hidden Dims, Decoder Hidden Dims, RL Hidden Dims. |
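The 20/20 train/test task split quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the `split_tasks` helper and the direction-based task parameterization (in the style of AntDir) are assumptions made for the example.

```python
import random

def split_tasks(task_params, n_train=20, n_test=20, seed=0):
    """Randomly partition task parameterizations into disjoint
    train/test sets, mirroring the paper's 20-task / 20-task split."""
    rng = random.Random(seed)
    tasks = list(task_params)
    assert len(tasks) >= n_train + n_test, "not enough tasks to split"
    rng.shuffle(tasks)
    # First n_train tasks form p_train(M), the next n_test form p_test(M).
    return tasks[:n_train], tasks[n_train:n_train + n_test]

# Hypothetical example: 40 tasks parameterized by a goal direction (radians).
directions = [i * 0.157 for i in range(40)]
p_train, p_test = split_tasks(directions)
```

Re-running with different `seed` values reproduces the kind of alternative splits the paper uses for its stability check in Tab. 5.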