Successor Feature Representations
Authors: Chris Reinke, Xavier Alameda-Pineda
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally compare SFRQL in tasks with linear and general reward functions, and for tasks with discrete and continuous features to standard Q-learning and the classical SF framework, demonstrating the interest and advantage of SFRQL. We evaluated SFRQL in two environments. The first has discrete features. ... The second environment, the racer environment, evaluates the agents in tasks with continuous features. |
| Researcher Affiliation | Academia | Chris Reinke EMAIL Robot Learn INRIA Grenoble, LJK, UGA; Xavier Alameda-Pineda EMAIL Robot Learn INRIA Grenoble, LJK, UGA |
| Pseudocode | Yes | Algorithm 1: Q-learning (QL); Algorithm 2: Classical SF Q-learning (SF) (Barreto et al., 2017); Algorithm 3: Model-free SFRQL (SFR); Algorithm 4: One Step SF-Model SFRQL (MB SFR); Algorithm 5: Model-free SFRQL for Continuous Features (CSFR) |
| Open Source Code | Yes | Source code at https://gitlab.inria.fr/robotlearn/sfr_learning |
| Open Datasets | No | The environment consists of 4 connected rooms (Fig. 1, a). The agent starts an episode in position S and has to learn to reach goal position G. During an episode, the agent collects objects to gain further rewards. ... We further evaluated the agents in a 2D environment with continuous features (Fig. 2, a). |
| Dataset Splits | No | Each task was executed for 20,000 steps, and the average performance over 10 runs per algorithm was measured. Each agent was evaluated on 40 tasks. The agents experienced the tasks sequentially, each for 1000 episodes (200,000 steps per task). |
| Hardware Specification | Yes | Experiments were conducted on a cluster with a variety of node types (Xeon SKL Gold 6130 at 2.10 GHz, Xeon SKL Gold 5218 at 2.30 GHz, Xeon SKL Gold 6126 at 2.60 GHz, Xeon SKL Gold 6244 at 3.60 GHz, each with 192 GB RAM, no GPU). |
| Software Dependencies | No | We used PyTorch for the computation of gradients and its stochastic gradient descent (SGD) procedure for updating the parameters. The parameters H and w_{i=1,...,20} were optimized with Adam (learning rate of 0.003). |
| Experiment Setup | Yes | Each task was executed for 20,000 steps, and the average performance over 10 runs per algorithm was measured. ... The probability for random actions of the ϵ-greedy action selection was set to ϵ = 0.15 and the discount rate to γ = 0.95. The initial weights θ for the function approximators were randomly sampled from a normal distribution, θ_init ∼ N(µ = 0, σ = 0.01). ... A grid search over the learning rates of all algorithms was performed. Each learning rate was evaluated for three different settings, which are listed in Table 1. |
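The reported setup (ϵ-greedy selection with ϵ = 0.15, discount γ = 0.95, weights drawn from N(0, 0.01)) can be illustrated with a minimal tabular Q-learning loop. This is a hedged sketch, not the paper's code: the 5-state chain environment, the learning rate of 0.1, and the episode counts are illustrative assumptions; only the ϵ, γ, and initialization values come from the reported configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP (an assumption for illustration): 5 states, actions
# 0 = move left, 1 = move right; reward 1 for being in the right-most state.
N_STATES, N_ACTIONS = 5, 2
EPSILON, GAMMA = 0.15, 0.95   # values reported in the experiment setup
ALPHA = 0.1                   # learning rate: illustrative assumption

# Q-table initialised from N(mu=0, sigma=0.01), mirroring the reported init.
Q = rng.normal(loc=0.0, scale=0.01, size=(N_STATES, N_ACTIONS))

def step(s, a):
    """Toy dynamics: deterministic moves, reward at the right-most state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == N_STATES - 1)

for episode in range(200):
    s = 0
    for t in range(50):
        # epsilon-greedy action selection with epsilon = 0.15
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # standard Q-learning update with discount gamma = 0.95
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s2]) - Q[s, a])
        s = s2

greedy_policy = np.argmax(Q, axis=1)
```

After training, the greedy policy should select "move right" in every state, since the value function increases monotonically toward the rewarding state.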