A Case for Validation Buffer in Pessimistic Actor-Critic
Authors: Michał Nauman, Mateusz Ostaszewski, Marek Cygan
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the proposed approach on a variety of locomotion and manipulation tasks and report improvements in sample efficiency and performance. |
| Researcher Affiliation | Collaboration | Michał Nauman¹, Mateusz Ostaszewski², Marek Cygan¹,³ (¹University of Warsaw, ²Warsaw University of Technology, ³Nomagic) |
| Pseudocode | Yes | We summarize the VPL approach in Figure 2 and share pseudo-code in Section B.1, where we colour changes w.r.t. regular SAC. |
| Open Source Code | No | The paper uses the Jax RL codebase [Kostrikov, 2021] but does not state that its own implementation code is open-source or provide a link. |
| Open Datasets | Yes | We evaluate VPL against existing pessimism adjustment methods on DeepMind Control [Tassa et al., 2018] and Meta-World [Yu et al., 2020]. |
| Dataset Splits | Yes | We observe that the regret associated with maintaining a validation buffer, and thus not utilizing it for actor-critic updates, diminishes over the course of training. Specifically, VPL reaches parity with SR-SAC in performance for all validation proportions except at 1/2. ...using varying ratios of validation to training samples, specifically at proportions of 1/128, 1/32, 1/2. |
| Hardware Specification | No | The paper mentions using 'Polish high-performance computing infrastructure PLGrid (ACK Cyfronet AGH)' but does not provide specific hardware details like GPU or CPU models. |
| Software Dependencies | No | The paper states 'Our experiments are based on the Jax RL codebase [Kostrikov, 2021]' but does not provide specific version numbers for Jax RL or any other key software dependencies. |
| Experiment Setup | Yes | We align the common hyperparameters with those recommended for Scaled-By-Resetting SAC (SR-SAC) as per [D'Oro et al., 2022]. This includes using the same network architectures and a two-critic ensemble in accordance with established practices. ... We evaluate agents after 500k environment steps for learning rates of [5e-5, 5e-4, 5e-3, 5e-2]. |
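The dataset-splits row describes holding out a fraction of replay samples (e.g. 1/128, 1/32, or 1/2) in a validation buffer that is never used for actor-critic updates. A minimal sketch of such a split is below; the class and method names are illustrative assumptions, not the paper's implementation.

```python
import random


class SplitReplayBuffer:
    """Illustrative sketch: route each incoming transition either to a
    validation buffer (held out from actor-critic updates) or to the
    training buffer, according to a fixed validation ratio."""

    def __init__(self, validation_ratio=1 / 32, seed=0):
        self.train = []
        self.validation = []
        self.validation_ratio = validation_ratio
        self.rng = random.Random(seed)

    def add(self, transition):
        # Each transition lands in the validation buffer with
        # probability `validation_ratio`, otherwise in training.
        if self.rng.random() < self.validation_ratio:
            self.validation.append(transition)
        else:
            self.train.append(transition)

    def sample_train(self, batch_size):
        # Only the training buffer is sampled for gradient updates;
        # the validation buffer stays untouched by learning.
        return self.rng.sample(self.train, min(batch_size, len(self.train)))
```

With a 1/32 ratio, roughly 3% of collected experience is reserved for validation, which matches the spirit of the regret analysis quoted above: a smaller held-out fraction costs less training data.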