Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning
Authors: Haque Ishfaq, Guangyuan Wang, Sami Islam, Doina Precup
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of an LMC-based Thompson sampling in continuous control tasks with continuous action spaces. We present empirical evaluations of LSAC on the MuJoCo benchmark (Todorov et al., 2012; Brockman et al., 2016) and the DeepMind Control Suite (DMC) (Tassa et al., 2018), showing that LSAC is able to outperform or match several strong baselines, including DSAC-T (Duan et al., 2023), the current state-of-the-art model-free off-policy RL algorithm. Figure 1: Training curves for six MuJoCo continuous control tasks over 1e6 time steps. Results are averaged over a window size of 11 epochs and across 10 seeds. Solid lines represent the median performance, and the shaded regions correspond to 90% confidence interval. |
| Researcher Affiliation | Academia | Haque Ishfaq, Guangyuan Wang*, Sami Nur Islam, Doina Precup, Mila, McGill University EMAIL |
| Pseudocode | Yes | Algorithm 1: Langevin Soft Actor-Critic (LSAC) Algorithm 2: Distributional Adaptive Langevin Monte Carlo |
| Open Source Code | Yes | Our code is available at https://github.com/hmishfaq/LSAC. |
| Open Datasets | Yes | We present empirical evaluations of LSAC on the MuJoCo benchmark (Todorov et al., 2012; Brockman et al., 2016) and the DeepMind Control Suite (DMC) (Tassa et al., 2018). To further evaluate the exploration ability of LSAC, we test our method on two types of maze environments, a custom version of PointMaze Medium-v3 and AntMaze-v4 from de Lazcano et al. (2024), which are implemented based on the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | Yes | We first train the agent for 500k environment steps, and then use its oracle to complete 200 evaluation episodes. During the critic updates, for each ψ(i) ∈ Ψ_Q, where 1 ≤ i ≤ n, the sampled replay buffer data is mixed with a synthetic batch B_Mi with a ratio of 0.5. |
| Hardware Specification | Yes | To facilitate fair wall-clock time comparison, all algorithms are trained on the same hardware (i.e., a single NVIDIA Quadro RTX 8000 GPU machine). |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. It mentions using 'Adam optimizer' and referring to the 'SynthER implementation' but without specific versioning. |
| Experiment Setup | Yes | After an initial warm-up stage of 1e5 steps, we gradually anneal LMC step size η_Q from the initial 1e-3 down to 1e-4. For computing the adaptive drift bias ζ_ψ, we use fixed values of α1 = 0.9, α2 = 0.999 in Equation 9, and λ = 1e-8 without tuning them. To prevent gradient explosion during training, we clip the sum of the gradient and the adaptive bias term using clip_c(∇_ψ L_Q(ψ) + a·ζ_ψ) with a constant c = 0.7. Table 4: Common hyperparameters used across all 6 MuJoCo and 12 DMC tasks for LSAC and baselines. (Includes details like Num. hidden layers, Num. hidden nodes, Batch size, Replay buffer size, Discount for reward, Target smoothing factor, Optimizer, Adaptive bias, Inverse temperature, Actor learning rate, Number of critics, Critic learning rate, Actor Critic grad norm, Replay memory size, Entropy coefficient, Expected entropy, Diffusion training frequency). |
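
The Experiment Setup excerpt (step-size annealing after a 1e5-step warm-up, an adaptive drift bias ζ_ψ with α1 = 0.9, α2 = 0.999, λ = 1e-8, and clipping of the combined drift with c = 0.7) can be sketched as a single adaptive Langevin update in NumPy. This is an illustration, not the authors' implementation: the exact form of Equation 9 is not quoted here, so the Adam-style moment estimates, the linear annealing schedule, and the noise scaling by an inverse temperature `beta` are assumptions.

```python
import numpy as np

def anneal_eta(t, warmup=int(1e5), total=int(1e6), eta0=1e-3, eta1=1e-4):
    """Hold eta at eta0 during warm-up, then anneal linearly to eta1.

    The linear schedule is an assumption; the paper only states the
    warm-up length and the initial/final step sizes.
    """
    if t < warmup:
        return eta0
    frac = min(1.0, (t - warmup) / (total - warmup))
    return eta0 + frac * (eta1 - eta0)

def adaptive_langevin_step(psi, grad, m, v, *, eta=1e-3, beta=1e4, a=1.0,
                           c=0.7, alpha1=0.9, alpha2=0.999, lam=1e-8,
                           rng=None):
    """One adaptive LMC update on flattened critic parameters psi.

    Assumed form: Adam-style moments give the drift bias zeta_psi; the
    combined drift grad + a * zeta is norm-clipped at c; Gaussian noise
    is scaled by sqrt(2 * eta / beta) as in standard Langevin dynamics.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = alpha1 * m + (1 - alpha1) * grad          # first-moment estimate
    v = alpha2 * v + (1 - alpha2) * grad ** 2     # second-moment estimate
    zeta = m / (np.sqrt(v) + lam)                 # adaptive drift bias
    drift = grad + a * zeta
    norm = np.linalg.norm(drift)
    if norm > c:                                  # clip_c(...) with c = 0.7
        drift = drift * (c / norm)
    noise = rng.normal(size=psi.shape) * np.sqrt(2.0 * eta / beta)
    return psi - eta * drift + noise, m, v
```

In practice the step size for each update would come from the schedule, e.g. `eta=anneal_eta(t)` inside the training loop, and one such chain would be run per critic ψ(i).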