Reward Distance Comparisons Under Transition Sparsity
Authors: Clement Nyanhongo, Bruno Miranda Henrique, Eugene Santos
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical justification for SRRD's robustness and conduct experiments to demonstrate its practical efficacy across multiple domains. ... Empirical results highlight SRRD's superior performance, as evidenced by its ability to find higher similarity between rewards generated from the same agents and higher variation between rewards from different agents. |
| Researcher Affiliation | Academia | Clement Nyanhongo (EMAIL), Thayer School of Engineering, Dartmouth College; Bruno Miranda Henrique (EMAIL), Thayer School of Engineering, Dartmouth College; Eugene Santos Jr. (EMAIL), Thayer School of Engineering, Dartmouth College |
| Pseudocode | Yes | C.1 Experiment 1: Transition Sparsity Pseudocode Algorithm 1 Analyzing the effect of limited sampling on reward distance |
| Open Source Code | No | The paper does not provide concrete access to the source code for the methodology described in this paper. It only mentions third-party code used for Maxent and AIRL implementations: "Maxent and AIRL implementations adapted from: https://github.com/HumanCompatibleAI/imitation (Gleave et al., 2022)". |
| Open Datasets | Yes | Robomimic, an open-source dataset of robotics manipulation tasks incorporating both human and simulated demonstrations (Mandlekar et al., 2021); Montezuma's Revenge, an Atari benchmark dataset with human demonstrations for the Montezuma's Revenge game (Kurin et al., 2017); StarCraft II, a simulation of combat scenarios where a controlled multiagent team aims to defeat a default AI enemy team (Vinyals et al., 2019); ... and MIMIC-IV, a real-world de-identified electronic health dataset for patients admitted at an emergency or intensive care unit at Beth Israel Deaconess Medical Center in Boston, MA (Johnson et al., 2023). |
| Dataset Splits | Yes | We select a training to test set ratio of 70 : 30, and repeat this experiment 200 times. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In this experiment, we train a k-nearest neighbors (k-NN) classifier to classify unlabeled agent trajectories by indirectly using computed rewards, to identify the agents that produced these trajectories. ... grid-search is used to identify candidate values for k and γ, and two-fold cross-validation (using Rtrain) is used to optimize hyper-parameters based on accuracy. ... We select a training to test set ratio of 70 : 30, and repeat this experiment 200 times. ... Table 9: Reward Learning Parameters Across Domains. AIRL: Trajectories/run: 5; RL Algorithm: PPO; Discount (γ): 0.9; Reward Network MLP Hidden Size: [256, 128]; Learning Rate: 10^-4; Time Steps: 10^5; Generator Batch Size: 2048; Discriminator Batch Size: 256. MAXENT: Trajectories/run: 5; RL Algorithm: PPO; Discount (γ): 0.9; Reward Network MLP Hidden Size: [256, 128]; Learning Rate: 10^-4. PTIRL: Target Trajectories/run: 5; Non-Target Trajectories/run: 10; Max Reward Cap: +100; Min Reward Cap: -100; LP Solver: CPLEX. |
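The experiment-setup row above describes a repeated 70:30 split with two-fold cross-validation on the training portion to select k. A minimal sketch of that selection loop is below; the synthetic two-agent "reward feature" vectors, the candidate k values, and the 20-trial repeat count are placeholders (the paper uses 200 repetitions and also grid-searches γ, which is omitted here):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    # Classify each test point by majority vote among its k nearest training points.
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def run_trial(X, y, rng, ks=(1, 3, 5)):
    # 70:30 train/test split, as in the paper.
    n = len(X)
    idx = rng.permutation(n)
    split = int(0.7 * n)
    tr, te = idx[:split], idx[split:]
    # Two-fold cross-validation on the training set to pick k.
    half = split // 2
    folds = [(tr[:half], tr[half:]), (tr[half:], tr[:half])]
    best_k, best_acc = ks[0], -1.0
    for k in ks:
        acc = np.mean([
            np.mean(knn_predict(X[a], y[a], X[b], k) == y[b])
            for a, b in folds
        ])
        if acc > best_acc:
            best_k, best_acc = k, acc
    # Evaluate the selected k on the held-out 30%.
    return np.mean(knn_predict(X[tr], y[tr], X[te], best_k) == y[te])

rng = np.random.default_rng(0)
# Placeholder feature vectors for trajectories from two agents.
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
# Repeat the split-tune-evaluate cycle (200 times in the paper).
accs = [run_trial(X, y, rng) for _ in range(20)]
print(round(float(np.mean(accs)), 3))
```

In the paper the feature vectors would be derived from the learned rewards rather than sampled from Gaussians; the split/tune/evaluate structure is the part this sketch illustrates.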