Lenient Learning in Independent-Learner Stochastic Cooperative Games
Authors: Ermo Wei, Sean Luke
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We discuss the existing literature, then compare LMRL2 against other algorithms drawn from the literature which can be used for games of this kind: traditional (Distributed) Q-learning, Hysteretic Q-learning, WoLF-PHC, SOoN, and (for repeated games only) FMQ. The results show that LMRL2 is very effective in both of our measures (complete and correct policies), and is found in the top rank more often than any other technique. We tested against twelve games, either from the literature or of our own devising. This collection was meant to test a diverse array of situations... Table 4 summarizes the rank order among the methods and statistically significant differences. Table 5 shows the actual results. |
| Researcher Affiliation | Academia | Ermo Wei (EMAIL), Sean Luke (EMAIL), Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, VA 22030, USA |
| Pseudocode | Yes | 4. The LMRL2 Algorithm LMRL2 then iterates n times through the following four steps. First, it computes a mean temperature T. Second, using this mean temperature it selects an action to perform. Third, it performs the action and receives a reward resulting from the joint actions of all agents (all agents perform this step simultaneously and synchronously). Fourth, it updates the Q and T tables. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code for LMRL2 or a link to a code repository. It focuses on describing the algorithm and its experimental results. |
| Open Datasets | Yes | We tested with four repeated-game test problems from the literature. The widely used Climb and Penalty games (Claus and Boutilier, 1998) are designed to test some degree of relative overgeneralization and miscoordination. We also included versions of the Climb game with partially stochastic and fully stochastic rewards, here designated Climb-PS and Climb-FS respectively (Kapetanakis and Kudenko, 2002). We also tested against several stochastic games. The Boutilier game (Boutilier, 1999) was a stochastic game with deterministic transitions which distributed a miscoordination situation among several stages. The Common Interest game (Vrancx et al., 2008) is also recurrent, but with stochastic transitions and some miscoordination. |
| Dataset Splits | No | The paper describes experiments conducted on various 'games' or 'test problems' where agents learn policies. It specifies running '10,000 iterations up front' and '1000 independent runs' for statistical comparison, but it does not describe traditional dataset splits (e.g., training, validation, test sets) for static data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper describes the LMRL2 algorithm and compares it to other multi-agent reinforcement learning techniques. However, it does not specify any software libraries, frameworks, or programming language versions used for its implementation. |
| Experiment Setup | Yes | LMRL2 for repeated games relies on the following parameters. Except for θ and ω, all of them will be fixed to the defaults shown, and will not be modified: α = 0.1 (learning rate); γ = 0.9 (discount for infinite horizon); δ = 0.995 (temperature decay coefficient); Max Temp = 50.0 (maximum temperature); Min Temp = 2.0 (minimum temperature); ω > 0 (action selection moderation factor, by default 1.0); θ > 0 (lenience moderation factor, by default 1.0). Table 2: Default Parameter Settings for each technique. Table 3: Tuned Parameter Settings. |
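The four-step loop quoted under "Pseudocode" can be sketched for the repeated-game (single-state) case. This is a minimal, hedged illustration only: the parameter defaults (α, γ, δ, Max/Min Temp, ω, θ) are taken from the paper's settings above, but the function names are hypothetical and the exact lenience and action-selection formulas should be checked against the paper's own pseudocode; a standard Boltzmann selection and an exponential lenience function are assumed here.

```python
import math
import random

# Default parameter settings quoted from the paper's table of defaults.
ALPHA = 0.1       # learning rate
GAMMA = 0.9       # discount (only relevant in the stochastic-game variant)
DELTA = 0.995     # temperature decay coefficient
MAX_TEMP = 50.0   # maximum temperature
MIN_TEMP = 2.0    # minimum temperature
OMEGA = 1.0       # action selection moderation factor (default)
THETA = 1.0       # lenience moderation factor (default)

def boltzmann_action(Q, mean_temp, rng):
    # Steps 1-2: use the mean temperature to pick an action via a
    # Boltzmann (softmax) distribution moderated by omega (assumed form).
    prefs = [math.exp(q / (OMEGA * mean_temp)) for q in Q]
    total = sum(prefs)
    r = rng.random() * total
    for action, p in enumerate(prefs):
        r -= p
        if r <= 0:
            return action
    return len(Q) - 1

def lenient_update(Q, T, action, reward, rng):
    # Step 4: lenient update. A reward below Q[action] is ignored with a
    # probability that grows with that action's temperature, so early
    # (hot) learning forgives a partner's exploratory mistakes.
    lenience = 1.0 - math.exp(-THETA * T[action])
    if reward >= Q[action] or rng.random() >= lenience:
        Q[action] += ALPHA * (reward - Q[action])
    # Cool the temperature for the action just taken.
    T[action] = max(MIN_TEMP, DELTA * T[action])
```

In a repeated game there is no next state, so the update omits the γ-discounted bootstrap term; the stochastic-game variant would fold it back in.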
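The Climb and Penalty games cited under "Open Datasets" are 3×3 cooperative matrix games in which both agents receive the same reward. The matrices below follow the form commonly reproduced from Claus and Boutilier (1998); verify the exact payoffs (and the penalty value k used in the experiments) against the paper's own tables.

```python
# Climb game: the optimal joint action (0, 0) pays 11, but unilateral
# deviations around it are heavily punished (-30), which induces the
# relative-overgeneralization trap toward the safe action 2.
CLIMB = [
    [ 11, -30,   0],
    [-30,   7,   6],
    [  0,   0,   5],
]

def penalty_game(k=-10):
    # Penalty game, parameterized by the miscoordination penalty k <= 0
    # between the two optimal joint actions (0, 0) and (2, 2).
    return [
        [10, 0, k],
        [0, 2, 0],
        [k, 0, 10],
    ]

def joint_reward(matrix, a1, a2):
    # Both independent learners observe this same scalar reward.
    return matrix[a1][a2]
```

The stochastic variants Climb-PS and Climb-FS keep this structure but draw some or all of the payoffs from distributions with these means.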