Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Shallow Planning Under Partial Observability

Authors: Randy Lefebvre, Audrey Durand

AAAI 2025 | Venue PDF | LLM Run Details

Variable: Research Type
Result: Experimental
LLM Response: We now conduct experiments to highlight the relationships between the planning horizon, the partial observability, and the structural parameters of the underlying MDP. ... We explore the interaction of shallow planning and partial observability in the Cartpole-v1 environment (Towers et al. 2024). ... Fig. 4 displays the average reward obtained by running the 10 models associated with each (γ, σ) configuration on the 100 environment seeds.

Variable: Researcher Affiliation
Result: Academia
LLM Response: Randy Lefebvre¹, Audrey Durand¹,²; ¹Université Laval; ²Canada CIFAR AI Chair. EMAIL, EMAIL

Variable: Pseudocode
Result: No
LLM Response: The paper describes theoretical bounds and then presents numerical experiments and their results. It does not contain any clearly labeled pseudocode or algorithm blocks.

Variable: Open Source Code
Result: Yes
LLM Response: Finally, we provide the open-source code for all our experiments to ensure reproducibility and offer a framework that practitioners can modify to better understand the impact of partial observability on their specific applications. (https://github.com/GRAAL-Research/shallow-planning-partial-observability)

Variable: Open Datasets
Result: Yes
LLM Response: We explore the interaction of shallow planning and partial observability in the Cartpole-v1 environment (Towers et al. 2024).

Variable: Dataset Splits
Result: No
LLM Response: The paper describes generating 'Random MDPs' for one set of experiments and using the 'Cartpole-v1 environment' for another. For Cartpole, it states 'We train 10 agents for each combination of (γ, σ), resulting in 150 models, and evaluate each of these models on 100 unseen seeds.' This describes evaluation seeds, but not traditional training/validation/test splits of a dataset.

Variable: Hardware Specification
Result: Yes
LLM Response: Experiments are conducted on an AMD Ryzen 5 3600 CPU and a GTX 1660 Ti GPU.

Variable: Software Dependencies
Result: No
LLM Response: The paper mentions using PPO (Schulman et al. 2017) and refers to recommended hyperparameters from Raffin et al. (2021), which is associated with Stable-Baselines3, but specific software versions are not provided in the text.

Variable: Experiment Setup
Result: Yes
LLM Response: We consider the widely used PPO (Schulman et al. 2017) agent policy with the recommended hyperparameters for this task (Raffin et al. 2021). We then consider different discount factors γ ∈ {0, 0.24, 0.49, 0.73, 0.98}, with the largest γ = 0.98 from the baseline (Raffin et al. 2021). For the partially observable component, we simulate noisy sensors, which are common in real life. These are simulated by injecting noise into the state. The noise is sampled from a multivariate normal distribution N(0, Iσ²), parameterized by a diagonal covariance matrix with value σ² on the diagonal. We consider σ ∈ {0, 0.1, 1}.
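The noise-injection scheme quoted in the Experiment Setup row can be sketched as follows. This is an illustrative sketch, not the authors' released code: the function name `add_sensor_noise` and the 4-dimensional example state are assumptions. Sampling from N(0, Iσ²) with a diagonal covariance Iσ² is equivalent to adding independent Gaussian noise of standard deviation σ to each state dimension.

```python
import numpy as np

def add_sensor_noise(state, sigma, rng):
    """Simulate a noisy sensor: perturb the observed state with noise drawn
    from a multivariate normal N(0, I*sigma^2), i.e. independent Gaussian
    noise with standard deviation sigma on each dimension."""
    state = np.asarray(state, dtype=float)
    return state + rng.normal(loc=0.0, scale=sigma, size=state.shape)

# Example: a CartPole-like 4-dimensional state observed with sigma = 0.1.
rng = np.random.default_rng(0)
noisy = add_sensor_noise([0.0, 0.0, 0.0, 0.0], sigma=0.1, rng=rng)
```

With σ = 0 this reduces to the fully observable case (the state is returned unchanged), matching the σ ∈ {0, 0.1, 1} sweep described above.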