Where to Intervene: Action Selection in Deep Reinforcement Learning

Authors: Wenbo Zhang, Hengrui Cai

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments validate the established theoretical guarantees, demonstrating that our method surpasses various alternative techniques in terms of both performance in variable selection and overall achieved rewards. We conduct experiments on standard locomotion tasks in MuJoCo (Todorov et al., 2012) and treatment allocation tasks calibrated from electronic health records (EHR), the MIMIC-III dataset (Johnson et al., 2016).
Researcher Affiliation | Academia | Wenbo Zhang, Department of Statistics, University of California, Irvine; Hengrui Cai, Department of Statistics, University of California, Irvine
Pseudocode | Yes | Algorithm 1: Action-Selected Exploration in Reinforcement Learning; Algorithm 2: Knockoff-Sampling Variable Selection
Open Source Code | No | The paper states: "We adopt the implementation from OpenAI Spinning Up Framework (Achiam, 2018)." This refers to a third-party framework the authors used, not an explicit release of their own code. No repository link or statement about their code being open-source is provided.
Open Datasets | Yes | We conduct experiments on standard locomotion tasks in MuJoCo (Todorov et al., 2012) and treatment allocation tasks calibrated from electronic health records (EHR), the MIMIC-III dataset (Johnson et al., 2016). MIMIC-III is a freely accessible critical care database.
Dataset Splits | Yes | For each setting, we run experiments over 2 × 10^5 and 10^6 steps for SAC and PPO, respectively, averaged over 10 training runs. For each evaluation point, we run 10 test trajectories and average their reward as the average return. We utilize the first 4000 samples for variable selection and then use the selection results to build a hard mask for actions in the deep RL models. Algorithm 2: Split D into non-overlapping sets {D_k}_{k=1}^K.
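The data-splitting step quoted from Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming a random partition into equal-sized folds; the paper's exact splitting scheme may differ.

```python
import numpy as np

def split_nonoverlapping(D, K, seed=0):
    """Partition the samples of D into K non-overlapping subsets {D_k},
    mirroring the data-splitting step of Algorithm 2 (random equal-sized
    partition assumed here for illustration)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(D))          # shuffle sample indices
    parts = np.array_split(idx, K)         # K disjoint index blocks
    return [D[np.sort(p)] for p in parts]  # materialize each fold

# e.g. the first 4000 samples reserved for variable selection
D = np.arange(4000)
folds = split_nonoverlapping(D, K=5)
assert sum(len(f) for f in folds) == len(D)        # folds cover D
assert len(np.unique(np.concatenate(folds))) == len(D)  # and are disjoint
```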
Hardware Specification | Yes | All experiments are conducted on a server with 4 NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions adopting the OpenAI Spinning Up Framework (Achiam, 2018) but does not provide specific version numbers for this framework or any other software libraries (e.g., Python, PyTorch, TensorFlow) used in the implementation.
Experiment Setup | Yes | Tables B.1 and B.2 summarize the hyperparameters used. For PPO: optimizer Adam, policy and value learning rates 3.0 × 10^-4 and 1.0 × 10^-3, discount 0.99, 2 hidden layers with [64, 32] units, samples per minibatch 256/100, steps per rollout 1000/100, ReLU non-linearity. For SAC: optimizer Adam, learning rate 3.0 × 10^-4, discount 0.9, replay buffer size 1 × 10^6, 2 hidden layers with [256, 256] units, samples per minibatch 256, ReLU non-linearity, entropy coefficient 0.2, warm-up steps 1.0 × 10^4. The FDR rate is set to α = 0.1 and the voting ratio to Γ = 0.5 in all settings.
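The reported hyperparameters can be transcribed into plain config dictionaries for reference. This is a sketch only: the key names are illustrative and not taken from the paper's code, and where the report lists two values (e.g. 256/100) the first is used with the alternative noted in a comment.

```python
# Illustrative transcription of the reported PPO/SAC hyperparameters;
# key names are hypothetical, not the authors' implementation.
PPO_CONFIG = {
    "optimizer": "Adam",
    "lr_policy": 3.0e-4,
    "lr_value": 1.0e-3,
    "discount": 0.99,
    "hidden_sizes": [64, 32],     # 2 hidden layers
    "minibatch_size": 256,        # 100 in the alternative setting
    "steps_per_rollout": 1000,    # 100 in the alternative setting
    "activation": "ReLU",
}
SAC_CONFIG = {
    "optimizer": "Adam",
    "lr": 3.0e-4,
    "discount": 0.9,
    "replay_buffer_size": int(1e6),
    "hidden_sizes": [256, 256],   # 2 hidden layers
    "minibatch_size": 256,
    "activation": "ReLU",
    "entropy_coef": 0.2,
    "warmup_steps": int(1e4),
}
SELECTION_CONFIG = {
    "fdr_alpha": 0.1,     # FDR rate α
    "voting_ratio": 0.5,  # voting ratio Γ
}
```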