Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
Authors: Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai Yuan
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the accuracy and effectiveness of the learned video-language rewards on simulated robotic manipulation tasks from the Meta-World benchmark (Yu et al., 2019). |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Aalto University 2Department of Computer Science, Toronto Metropolitan University 3Department of Computer Science, University of Manchester 4Intel Corporation |
| Pseudocode | No | The paper describes the methodology in prose, but there are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | Yes | Source code and supplementary videos are available on the project website: https://sites.google.com/view/video-language-critic. |
| Open Datasets | Yes | We evaluate the accuracy and effectiveness of the learned video-language rewards on simulated robotic manipulation tasks from the Meta-World benchmark (Yu et al., 2019)... We further report comparisons to prior work... using Open X-Embodiment (Open X-Embodiment Collaboration et al., 2023)... We perform hyperparameter selection, ablation studies, and finalize all training details on data from VLMbench (Zheng et al., 2022)... Pretraining datasets... Howto100M (Miech et al., 2019)... EPICKITCHENS (Damen et al., 2018)... Something Something v2 (Goyal et al., 2017)... Ego4D (Grauman et al., 2022). |
| Dataset Splits | Yes | For this purpose, we split Meta-World into 40 training and 10 test tasks (every 5th task alphabetically). This leaves roughly 1600 successful and 1300 unsuccessful videos as training data; we refer to this subset as MW40. |
| Hardware Specification | Yes | Reward training on Meta-World videos took 2 hours for MW50 on a single NVIDIA A100 GPU, and 1 hour for MW40 on a GeForce RTX 3090 GPU... Running inference on the VLC architecture to predict the reward for one time step (with batch size 1) takes up an additional 800 MB of GPU RAM, and 11 ms on an H100 GPU or 29 ms on a V100. |
| Software Dependencies | No | For RL training experiments, we adapt the SAC implementation of CleanRL (Huang et al., 2022)... We apply the standard normalization logic from Gymnasium (Towers et al., 2023)... (No specific software versions are given for these tools, only citations to the papers describing them.) |
| Experiment Setup | Yes | We subsample the videos to 12 time steps... We set α, the ranking loss weight, to 33... Policy evaluation is done every 20,000 timesteps for 50 episodes. Both the actor and critic networks contain three hidden layers of size 400... We experimentally set the relative weights of the VLC and sparse reward components to 1 and 50, respectively... |
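The train/test split quoted in the Dataset Splits row (40 training and 10 test tasks, every 5th task alphabetically) can be reproduced with a short sketch. The task names below are stand-ins, not the actual Meta-World task list, and taking every 5th element starting from index 0 is an assumption about the exact convention.

```python
# Sketch of the MW40 split: hold out every 5th task (alphabetically) for testing.
# Stand-in names — Meta-World defines the real 50 task identifiers.
tasks = sorted(f"task-{i:02d}" for i in range(50))

test_tasks = tasks[::5]                                        # 10 held-out tasks
train_tasks = [t for t in tasks if t not in set(test_tasks)]   # remaining 40 tasks

assert len(train_tasks) == 40 and len(test_tasks) == 10
```

With 50 tasks, slicing with a stride of 5 yields exactly the 40/10 split the report describes.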
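The Experiment Setup row states that videos are subsampled to 12 time steps. A minimal sketch of such subsampling, assuming uniform spacing over the video (the paper may use a different sampling scheme):

```python
def subsample(frames, n=12):
    """Pick n frames at (approximately) uniform spacing, keeping first and last.

    Uniform spacing is an assumption; this is not the authors' implementation.
    """
    if len(frames) <= n:
        return list(frames)
    # Map n indices evenly onto the range [0, len(frames) - 1].
    idx = [round(i * (len(frames) - 1) / (n - 1)) for i in range(n)]
    return [frames[i] for i in idx]
```

For a 100-frame video this selects 12 frames including the first (index 0) and the last (index 99).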
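The setup row also quotes relative weights of 1 and 50 for the VLC and sparse reward components. A hedged sketch of combining the two as a weighted sum, under the assumption that the sparse component is a binary success signal; the function name and signature are illustrative, not the authors' API:

```python
# Weights quoted in the paper's experiment setup.
VLC_WEIGHT = 1.0
SPARSE_WEIGHT = 50.0

def combined_reward(vlc_reward: float, task_success: bool) -> float:
    """Weighted sum of the dense VLC reward and a sparse success bonus.

    Assumes a simple linear combination; the exact composition used in
    training is not detailed in the quoted excerpt.
    """
    return VLC_WEIGHT * vlc_reward + SPARSE_WEIGHT * float(task_success)
```

Under these weights, a successful step dominates the dense signal (e.g. `combined_reward(0.2, True)` is 50.2), which matches the intent of heavily rewarding task completion.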