Lenient Learning in Independent-Learner Stochastic Cooperative Games

Authors: Ermo Wei, Sean Luke

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We discuss the existing literature, then compare LMRL2 against other algorithms drawn from the literature that can be used for games of this kind: traditional (Distributed) Q-learning, Hysteretic Q-learning, WoLF-PHC, SOoN, and (for repeated games only) FMQ. We tested against twelve games, either from the literature or of our own devising; this collection was meant to test a diverse array of situations... The results show that LMRL2 is very effective in both of our measures (complete and correct policies), and is found in the top rank more often than any other technique. Table 4 summarizes the rank order among the methods and the statistically significant differences. Table 5 shows the actual results.
Researcher Affiliation | Academia | Ermo Wei (EMAIL), Sean Luke (EMAIL), Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, VA 22030, USA
Pseudocode | Yes | Section 4 (The LMRL2 Algorithm): LMRL2 then iterates n times through the following four steps. First, it computes a mean temperature T. Second, using this mean temperature, it selects an action to perform. Third, it performs the action and receives a reward resulting from the joint actions of all agents (all agents perform this step simultaneously and synchronously). Fourth, it updates the Q and T tables.
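The four-step loop quoted above can be sketched for a single-state repeated game. This is an illustrative sketch, not the authors' implementation: all class and method names are hypothetical, and the lenience rule used here (ignore a Q-lowering reward with probability exp(-θ/T), so hot actions are forgiven more often) is a simplified stand-in for LMRL2's actual update.

```python
import math
import random

class LenientLearner:
    """Hypothetical sketch of the four-step LMRL2-style loop
    for a single-state repeated game. Not the paper's exact update."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, delta=0.995,
                 max_temp=50.0, min_temp=2.0, omega=1.0, theta=1.0):
        self.alpha, self.gamma, self.delta = alpha, gamma, delta
        self.min_temp, self.omega, self.theta = min_temp, omega, theta
        self.q = [0.0] * n_actions        # Q table
        self.t = [max_temp] * n_actions   # per-action temperature table

    def select_action(self):
        # Step 1: compute a mean temperature over this state's actions.
        mean_t = sum(self.t) / len(self.t)
        # Step 2: Boltzmann selection, moderated by omega.
        prefs = [math.exp(q / (self.omega * mean_t)) for q in self.q]
        total = sum(prefs)
        r, acc = random.random() * total, 0.0
        for a, p in enumerate(prefs):
            acc += p
            if r <= acc:
                return a
        return len(prefs) - 1

    def update(self, action, reward):
        # Step 4: lenient update. While the chosen action's temperature
        # is high, a reward that would lower Q is usually ignored
        # (simplified rule: ignore with probability exp(-theta / T)).
        lenience = math.exp(-self.theta / self.t[action])
        if reward >= self.q[action] or random.random() > lenience:
            self.q[action] += self.alpha * (reward - self.q[action])
        # Cool the chosen action's temperature toward the minimum.
        self.t[action] = max(self.min_temp, self.t[action] * self.delta)
```

Step 3 (performing the joint action and producing the shared reward) belongs to the environment: two such learners each call `select_action()`, the game maps the joint action to a reward, and each learner calls `update()` with it.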
Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code for LMRL2 or a link to a code repository. It focuses on describing the algorithm and its experimental results.
Open Datasets | Yes | We tested with four repeated-game test problems from the literature. The widely used Climb and Penalty games (Claus and Boutilier, 1998) are designed to test some degree of relative overgeneralization and miscoordination. We also included versions of the Climb game with partially stochastic and fully stochastic rewards, here designated Climb-PS and Climb-FS respectively (Kapetanakis and Kudenko, 2002). We also tested against several stochastic games. The Boutilier game (Boutilier, 1999) is a stochastic game with deterministic transitions that distributes a miscoordination situation among several stages. The Common Interest game (Vrancx et al., 2008) is also recurrent, but with stochastic transitions and some miscoordination.
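For concreteness, the two best-known repeated games named above have the following shared-payoff matrices, as commonly given in the literature (treat the exact entries as a sketch and consult the cited papers for the canonical versions; the helper names are illustrative):

```python
# Climb game (Claus and Boutilier, 1998), as commonly given in the
# literature. Rows index agent 1's action, columns agent 2's action;
# both agents receive the same reward in these cooperative games.
CLIMB = [
    [ 11, -30,   0],   # joint optimum at (0, 0), but miscoordination
    [-30,   7,   6],   # around it is punished severely (-30)
    [  0,   0,   5],
]

def penalty(k):
    """Penalty game with penalty k <= 0 for miscoordinating on the
    two optimal corners (0, 2) and (2, 0)."""
    return [
        [10, 0,  k],
        [ 0, 2,  0],
        [ k, 0, 10],
    ]

def play(payoffs, a1, a2):
    # Step 3 of a learning loop: joint action -> shared reward.
    return payoffs[a1][a2]
```

Climb illustrates relative overgeneralization: averaged over a uniformly random partner, action 0 yields (11 - 30 + 0)/3 while action 2 yields (0 + 0 + 5)/3, so non-lenient independent learners are dragged toward the suboptimal action 2.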
Dataset Splits | No | The paper describes experiments conducted on various 'games' or 'test problems' where agents learn policies. It specifies running '10,000 iterations up front' and '1000 independent runs' for statistical comparison, but it does not describe traditional dataset splits (e.g., training, validation, test sets) for static data.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper describes the LMRL2 algorithm and compares it to other multi-agent reinforcement learning techniques, but it does not specify any software libraries, frameworks, or programming language versions used for its implementation.
Experiment Setup | Yes | LMRL2 for repeated games relies on the following parameters. Except for θ and ω, all of them will be fixed to the defaults shown, and will not be modified: α = 0.1 (learning rate); γ = 0.9 (discount for infinite horizon); δ = 0.995 (temperature decay coefficient); Max Temp = 50.0 (maximum temperature); Min Temp = 2.0 (minimum temperature); ω > 0 (action selection moderation factor, by default 1.0); θ > 0 (lenience moderation factor, by default 1.0). Table 2: Default Parameter Settings for each technique. Table 3: Tuned Parameter Settings.
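Assuming the temperature decays geometrically by δ on each update (a natural reading of "temperature decay coefficient"; the helper names below are illustrative, not from the paper), the defaults above imply a concrete annealing schedule and a point at which the temperature floor is reached:

```python
import math

# Defaults from the parameter table above.
MAX_TEMP, MIN_TEMP, DELTA = 50.0, 2.0, 0.995

def temperature(n):
    """Temperature after n geometric decay steps, floored at MIN_TEMP:
    T_n = max(MIN_TEMP, MAX_TEMP * DELTA**n)."""
    return max(MIN_TEMP, MAX_TEMP * DELTA ** n)

# Smallest n with MAX_TEMP * DELTA**n <= MIN_TEMP, i.e. the number of
# updates before an action's temperature hits the floor.
n_floor = math.ceil(math.log(MIN_TEMP / MAX_TEMP) / math.log(DELTA))
```

Under this reading, an action's temperature stays above the 2.0 floor for roughly 640 updates, which bounds how long lenience and exploration remain in effect for that action.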