Reward Distance Comparisons Under Transition Sparsity

Authors: Clement Nyanhongo, Bruno Miranda Henrique, Eugene Santos Jr.

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide theoretical justification for SRRD's robustness and conduct experiments to demonstrate its practical efficacy across multiple domains. ... Empirical results highlight SRRD's superior performance, as evidenced by its ability to find higher similarity between rewards generated from the same agents and higher variation between rewards from different agents."
Researcher Affiliation | Academia | Clement Nyanhongo (EMAIL), Thayer School of Engineering, Dartmouth College; Bruno Miranda Henrique (EMAIL), Thayer School of Engineering, Dartmouth College; Eugene Santos Jr. (EMAIL), Thayer School of Engineering, Dartmouth College
Pseudocode | Yes | C.1 Experiment 1: Transition Sparsity Pseudocode; Algorithm 1: Analyzing the effect of limited sampling on reward distance
Open Source Code | No | The paper does not provide access to source code for the methodology it describes. It only mentions third-party code used for the MaxEnt and AIRL implementations: "Maxent and AIRL implementations adapted from: https://github.com/HumanCompatibleAI/imitation (Gleave et al., 2022)".
Open Datasets | Yes | "Robomimic, an open-source dataset of robotics manipulation tasks incorporating both human and simulated demonstrations (Mandlekar et al., 2021); Montezuma's Revenge, an Atari benchmark dataset with human demonstrations for the Montezuma's Revenge game (Kurin et al., 2017); StarCraft II, a simulation of combat scenarios where a controlled multiagent team aims to defeat a default AI enemy team (Vinyals et al., 2019); ... and MIMIC-IV, a real-world de-identified electronic health dataset for patients admitted to an emergency or intensive care unit at Beth Israel Deaconess Medical Center in Boston, MA (Johnson et al., 2023)."
Dataset Splits | Yes | "We select a training-to-test set ratio of 70:30, and repeat this experiment 200 times."
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types and speeds, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | Yes | "In this experiment, we train a k-nearest neighbors (k-NN) classifier to classify unlabeled agent trajectories by indirectly using computed rewards, to identify the agents that produced these trajectories. ... grid-search is used to identify candidate values for k and γ, and two-fold cross-validation (using Rtrain) is used to optimize hyper-parameters based on accuracy. ... We select a training-to-test set ratio of 70:30, and repeat this experiment 200 times."

Table 9: Reward Learning Parameters Across Domains

AIRL: Trajectories/run: 5; RL Algorithm: PPO; Discount (γ): 0.9; Reward Network MLP Hidden Size: [256, 128]; Learning Rate: 10^-4; Time Steps: 10^5; Generator Batch Size: 2048; Discriminator Batch Size: 256
MAXENT: Trajectories/run: 5; RL Algorithm: PPO; Discount (γ): 0.9; Reward Network MLP Hidden Size: [256, 128]; Learning Rate: 10^-4
PTIRL: Target Trajectories/run: 5; Non-Target Trajectories/run: 10; Max Reward Cap: +100; Min Reward Cap: -100; LP Solver: CPLEX
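The classification protocol quoted above (k-NN over reward-derived features, grid-search tuned by two-fold cross-validation, a 70:30 split repeated many times) can be sketched roughly as follows. This is a hypothetical illustration with synthetic data, not the authors' code: the feature construction (one discounted return per trajectory), the two-agent setup, the fixed γ = 0.9, and all names here are assumptions for demonstration only.

```python
# Hedged sketch of the experiment setup: classify which agent produced
# each trajectory using only a reward-derived feature. Synthetic data;
# the paper also grid-searches gamma, which is held fixed here.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def discounted_return(rewards, gamma):
    """Discounted sum of a per-step reward sequence (one scalar feature)."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Synthetic stand-in: 2 agents, 100 trajectories each, 20 steps per
# trajectory. Agent identity shifts the mean step reward, so the
# discounted return is informative about which agent acted.
step_rewards = np.concatenate([
    rng.normal(loc=mean, scale=1.0, size=(100, 20)) for mean in (0.0, 0.5)
])
labels = np.repeat([0, 1], 100)

accs = []
for trial in range(20):  # the paper repeats 200 times; fewer here for speed
    X = np.array([[discounted_return(r, 0.9)] for r in step_rewards])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.30, stratify=labels, random_state=trial)
    # Grid-search over k with two-fold cross-validation, scored by accuracy.
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": [1, 3, 5, 7]},
                          cv=2, scoring="accuracy")
    search.fit(X_tr, y_tr)
    accs.append(search.score(X_te, y_te))

print(f"mean test accuracy over trials: {np.mean(accs):.2f}")
```

Because the two synthetic agents differ in mean step reward, the classifier should separate them well above chance; with real learned rewards, the same pipeline would instead take the rewards produced by each IRL method as input.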