Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning
Authors: Haque Ishfaq, Guangyuan Wang, Sami Islam, Doina Precup
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of an LMC-based Thompson sampling in continuous control tasks with continuous action spaces. We present empirical evaluations of LSAC on the MuJoCo benchmark (Todorov et al., 2012; Brockman et al., 2016) and the DeepMind Control Suite (DMC) (Tassa et al., 2018), showing that LSAC is able to outperform or match several strong baselines, including DSAC-T (Duan et al., 2023), the current state-of-the-art model-free off-policy RL algorithm. Figure 1: Training curves for six MuJoCo continuous control tasks over 1e6 time steps. Results are averaged over a window size of 11 epochs and across 10 seeds. Solid lines represent the median performance, and the shaded regions correspond to 90% confidence interval. |
| Researcher Affiliation | Academia | Haque Ishfaq, Guangyuan Wang*, Sami Nur Islam, Doina Precup, Mila, McGill University EMAIL |
| Pseudocode | Yes | Algorithm 1: Langevin Soft Actor-Critic (LSAC) Algorithm 2: Distributional Adaptive Langevin Monte Carlo |
| Open Source Code | Yes | Our code is available at https://github.com/hmishfaq/LSAC. |
| Open Datasets | Yes | We present empirical evaluations of LSAC on the MuJoCo benchmark (Todorov et al., 2012; Brockman et al., 2016) and the DeepMind Control Suite (DMC) (Tassa et al., 2018). To further evaluate the exploration ability of LSAC, we test our method on two types of maze environments, a custom version of PointMaze Medium-v3 and AntMaze-v4 from de Lazcano et al. (2024), which are implemented based on the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | Yes | We first train the agent for 500k environment steps, and then use its oracle to complete 200 evaluation episodes. During the critic updates, for each ψ(i) ∈ Ψ_Q, where 1 ≤ i ≤ n, the sampled replay buffer data is mixed with a synthetic batch B_Mi with a ratio of 0.5. |
| Hardware Specification | Yes | To facilitate fair wall-clock time comparison, all algorithms are trained on the same hardware (i.e., a single NVIDIA Quadro RTX 8000 GPU machine). |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. It mentions using 'Adam optimizer' and referring to the 'SynthER implementation' but without specific versioning. |
| Experiment Setup | Yes | After an initial warm-up stage of 1e5 steps, we gradually anneal LMC step size η_Q from the initial 1e-3 down to 1e-4. For computing the adaptive drift bias ζ_ψ, we use fixed values of α1 = 0.9, α2 = 0.999 in Equation 9, and λ = 1e-8 without tuning them. To prevent gradient explosion during training, we clip the sum of the gradient and the adaptive bias term using clip_c(∇_ψ L_Q(ψ) + a·ζ_ψ) with a constant c = 0.7. Table 4: Common hyperparameters used across all 6 MuJoCo and 12 DMC tasks for LSAC and baselines. (Includes details like Num. hidden layers, Num. hidden nodes, Batch size, Replay buffer size, Discount for reward, Target smoothing factor, Optimizer, Adaptive bias, Inverse temperature, Actor learning rate, Number of critics, Critic learning rate, Actor Critic grad norm, Replay memory size, Entropy coefficient, Expected entropy, Diffusion training frequency). |
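
The Experiment Setup excerpt (step-size annealing after a 1e5-step warm-up, an adaptive drift bias ζ_ψ with α1 = 0.9, α2 = 0.999, λ = 1e-8, and clipping of the combined drift with c = 0.7) can be sketched as a single adaptive Langevin update in NumPy. This is an illustration, not the authors' implementation: the exact form of Equation 9 is not quoted here, so the Adam-style moment estimates, the linear annealing schedule, and the noise scaling by an inverse temperature `beta` are assumptions.

```python
import numpy as np

def anneal_eta(t, warmup=int(1e5), total=int(1e6), eta0=1e-3, eta1=1e-4):
    """Hold eta at eta0 during warm-up, then anneal linearly to eta1.

    The linear schedule is an assumption; the paper only states the
    warm-up length and the initial/final step sizes.
    """
    if t < warmup:
        return eta0
    frac = min(1.0, (t - warmup) / (total - warmup))
    return eta0 + frac * (eta1 - eta0)

def adaptive_langevin_step(psi, grad, m, v, *, eta=1e-3, beta=1e4, a=1.0,
                           c=0.7, alpha1=0.9, alpha2=0.999, lam=1e-8,
                           rng=None):
    """One adaptive LMC update on flattened critic parameters psi.

    Assumed form: Adam-style moments give the drift bias zeta_psi; the
    combined drift grad + a * zeta is norm-clipped at c; Gaussian noise
    is scaled by sqrt(2 * eta / beta) as in standard Langevin dynamics.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = alpha1 * m + (1 - alpha1) * grad          # first-moment estimate
    v = alpha2 * v + (1 - alpha2) * grad ** 2     # second-moment estimate
    zeta = m / (np.sqrt(v) + lam)                 # adaptive drift bias
    drift = grad + a * zeta
    norm = np.linalg.norm(drift)
    if norm > c:                                  # clip_c(...) with c = 0.7
        drift = drift * (c / norm)
    noise = rng.normal(size=psi.shape) * np.sqrt(2.0 * eta / beta)
    return psi - eta * drift + noise, m, v
```

In practice the step size for each update would come from the schedule, e.g. `eta=anneal_eta(t)` inside the training loop, and one such chain would be run per critic ψ(i).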