Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
Authors: Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai Yuan
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the accuracy and effectiveness of the learned video-language rewards on simulated robotic manipulation tasks from the Meta-World benchmark (Yu et al., 2019). |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, Aalto University 2Department of Computer Science, Toronto Metropolitan University 3Department of Computer Science, University of Manchester 4Intel Corporation |
| Pseudocode | No | The paper describes the methodology in prose, but there are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code. |
| Open Source Code | Yes | Source code and supplementary videos are available on the project website: https://sites.google.com/view/video-language-critic. |
| Open Datasets | Yes | We evaluate the accuracy and effectiveness of the learned video-language rewards on simulated robotic manipulation tasks from the Meta-World benchmark (Yu et al., 2019)... We further report comparisons to prior work... using Open X-Embodiment (Open X-Embodiment Collaboration et al., 2023)... We perform hyperparameter selection, ablation studies, and finalize all training details on data from VLMbench (Zheng et al., 2022)... Pretraining datasets... Howto100M (Miech et al., 2019)... EPICKITCHENS (Damen et al., 2018)... Something Something v2 (Goyal et al., 2017)... Ego4D (Grauman et al., 2022). |
| Dataset Splits | Yes | For this purpose, we split Meta-World into 40 training and 10 test tasks (every 5th task alphabetically). This leaves roughly 1600 successful and 1300 unsuccessful videos as training data; we refer to this subset as MW40. |
| Hardware Specification | Yes | Reward training on Meta-World videos took 2 hours for MW50 on a single NVIDIA A100 GPU, and 1 hour for MW40 on a GeForce RTX 3090 GPU... Running inference on the VLC architecture to predict the reward for one time step (with batch size 1) takes up an additional 800 MB of GPU RAM, and 11 ms on an H100 GPU or 29 ms on a V100. |
| Software Dependencies | No | For RL training experiments, we adapt the SAC implementation of CleanRL (Huang et al., 2022)... We apply the standard normalization logic from Gymnasium (Towers et al., 2023)... (No specific software versions are given for these tools, only citations to the papers describing them.) |
| Experiment Setup | Yes | We subsample the videos to 12 time steps... We set α, the ranking loss weight, to 33... Policy evaluation is done every 20,000 timesteps for 50 episodes. Both the actor and critic networks contain three hidden layers of size 400... We experimentally set the relative weights of the VLC and sparse reward components to 1 and 50, respectively... |
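The train/test split quoted in the Dataset Splits row (40 training and 10 test tasks, every 5th task alphabetically) can be reproduced with a short sketch. The task names below are stand-ins, not the actual Meta-World task list, and taking every 5th element starting from index 0 is an assumption about the exact convention.

```python
# Sketch of the MW40 split: hold out every 5th task (alphabetically) for testing.
# Stand-in names — Meta-World defines the real 50 task identifiers.
tasks = sorted(f"task-{i:02d}" for i in range(50))

test_tasks = tasks[::5]                                        # 10 held-out tasks
train_tasks = [t for t in tasks if t not in set(test_tasks)]   # remaining 40 tasks

assert len(train_tasks) == 40 and len(test_tasks) == 10
```

With 50 tasks, slicing with a stride of 5 yields exactly the 40/10 split the report describes.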
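The Experiment Setup row states that videos are subsampled to 12 time steps. A minimal sketch of such subsampling, assuming uniform spacing over the video (the paper may use a different sampling scheme):

```python
def subsample(frames, n=12):
    """Pick n frames at (approximately) uniform spacing, keeping first and last.

    Uniform spacing is an assumption; this is not the authors' implementation.
    """
    if len(frames) <= n:
        return list(frames)
    # Map n indices evenly onto the range [0, len(frames) - 1].
    idx = [round(i * (len(frames) - 1) / (n - 1)) for i in range(n)]
    return [frames[i] for i in idx]
```

For a 100-frame video this selects 12 frames including the first (index 0) and the last (index 99).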
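The setup row also quotes relative weights of 1 and 50 for the VLC and sparse reward components. A hedged sketch of combining the two as a weighted sum, under the assumption that the sparse component is a binary success signal; the function name and signature are illustrative, not the authors' API:

```python
# Weights quoted in the paper's experiment setup.
VLC_WEIGHT = 1.0
SPARSE_WEIGHT = 50.0

def combined_reward(vlc_reward: float, task_success: bool) -> float:
    """Weighted sum of the dense VLC reward and a sparse success bonus.

    Assumes a simple linear combination; the exact composition used in
    training is not detailed in the quoted excerpt.
    """
    return VLC_WEIGHT * vlc_reward + SPARSE_WEIGHT * float(task_success)
```

Under these weights, a successful step dominates the dense signal (e.g. `combined_reward(0.2, True)` is 50.2), which matches the intent of heavily rewarding task completion.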