Where to Intervene: Action Selection in Deep Reinforcement Learning

Authors: Wenbo Zhang, Hengrui Cai

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments validate the established theoretical guarantees, demonstrating that our method surpasses various alternative techniques in terms of both performance in variable selection and overall achieved rewards. We conduct experiments on standard locomotion tasks in MuJoCo (Todorov et al., 2012) and treatment allocation tasks calibrated from electronic health records (EHR), the MIMIC-III dataset (Johnson et al., 2016).
Researcher Affiliation | Academia | Wenbo Zhang, Department of Statistics, University of California, Irvine; Hengrui Cai, Department of Statistics, University of California, Irvine
Pseudocode | Yes | Algorithm 1: Action-Selected Exploration in Reinforcement Learning; Algorithm 2: Knockoff-Sampling Variable Selection
Open Source Code | No | The paper states: "We adopt the implementation from OpenAI Spinning Up Framework (Achiam, 2018)." This refers to a third-party framework the authors used, not an explicit release of their own code. No repository link or statement about their code being open-source is provided.
Open Datasets | Yes | We conduct experiments on standard locomotion tasks in MuJoCo (Todorov et al., 2012) and treatment allocation tasks calibrated from electronic health records (EHR), the MIMIC-III dataset (Johnson et al., 2016). MIMIC-III is a freely accessible critical care database.
Dataset Splits | Yes | For each setting, we run experiments over 2 × 10^5 and 10^6 steps for SAC and PPO, respectively, averaged over 10 training runs. For each evaluation point, we run 10 test trajectories and average their reward as the average return. We utilize the first 4000 samples for variable selection and then use the selection results to build a hard mask for actions in the deep RL models. Algorithm 2: Split D into non-overlapping sets {D_k}_{k=1}^K.
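The data-splitting step quoted from Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming a random partition into equal-sized folds; the paper's exact splitting scheme may differ.

```python
import numpy as np

def split_nonoverlapping(D, K, seed=0):
    """Partition the samples of D into K non-overlapping subsets {D_k},
    mirroring the data-splitting step of Algorithm 2 (random equal-sized
    partition assumed here for illustration)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))          # shuffle sample indices
    parts = np.array_split(idx, K)         # K disjoint index blocks
    return [D[np.sort(p)] for p in parts]  # materialize each fold

# e.g. the first 4000 samples reserved for variable selection
D = np.arange(4000)
folds = split_nonoverlapping(D, K=5)
assert sum(len(f) for f in folds) == len(D)        # folds cover D
assert len(np.unique(np.concatenate(folds))) == len(D)  # and are disjoint
```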
Hardware Specification | Yes | All experiments are conducted on a server with 4 NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions adopting the OpenAI Spinning Up Framework (Achiam, 2018) but does not provide specific version numbers for this framework or any other software libraries (e.g., Python, PyTorch, TensorFlow) used in the implementation.
Experiment Setup | Yes | Tables B.1 and B.2 summarize the hyperparameters used. For PPO: optimizer Adam, policy and value learning rates 3.0 × 10^-4 and 1.0 × 10^-3, discount 0.99, 2 hidden layers with [64, 32] units, samples per minibatch 256/100, steps per rollout 1000/100, ReLU non-linearity. For SAC: optimizer Adam, learning rate 3.0 × 10^-4, discount 0.9, replay buffer size 1 × 10^6, 2 hidden layers with [256, 256] units, samples per minibatch 256, ReLU non-linearity, entropy coefficient 0.2, warm-up steps 1.0 × 10^4. The FDR rate is set to α = 0.1 and the voting ratio to Γ = 0.5 in all settings.
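The reported hyperparameters can be transcribed into plain config dictionaries for reference. This is a sketch only: the key names are illustrative and not taken from the paper's code, and where the report lists two values (e.g. 256/100) the first is used with the alternative noted in a comment.

```python
# Illustrative transcription of the reported PPO/SAC hyperparameters;
# key names are hypothetical, not the authors' implementation.
PPO_CONFIG = {
    "optimizer": "Adam",
    "lr_policy": 3.0e-4,
    "lr_value": 1.0e-3,
    "discount": 0.99,
    "hidden_sizes": [64, 32],     # 2 hidden layers
    "minibatch_size": 256,        # 100 in the alternative setting
    "steps_per_rollout": 1000,    # 100 in the alternative setting
    "activation": "ReLU",
}
SAC_CONFIG = {
    "optimizer": "Adam",
    "lr": 3.0e-4,
    "discount": 0.9,
    "replay_buffer_size": int(1e6),
    "hidden_sizes": [256, 256],   # 2 hidden layers
    "minibatch_size": 256,
    "activation": "ReLU",
    "entropy_coef": 0.2,
    "warmup_steps": int(1e4),
}
SELECTION_CONFIG = {
    "fdr_alpha": 0.1,     # FDR rate α
    "voting_ratio": 0.5,  # voting ratio Γ
}
```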