Follow the Leader If You Can, Hedge If You Must

Authors: Steven de Rooij, Tim van Erven, Peter D. Grünwald, Wouter M. Koolen

JMLR 2014

Reproducibility Assessment (Variable, Result, LLM Response)
Research Type: Experimental. We performed four experiments on artificial data, designed to clarify how the learning rate determines performance in a variety of Hedge algorithms. These experiments are designed to illustrate as clearly as possible the intricacies involved in the central question of this paper: whether to use a high learning rate (by following the leader) or to play it safe by using a smaller learning rate instead. Rather than mimic real-world data, on which high learning rates often seem to work well (Devaine et al., 2013), we vary the main factor that appears to drive the best choice of learning rate: the difference in cumulative loss between the experts. We have kept the experiments as simple as possible: the data are deterministic, and involve two experts. In each case, the data consist of one initial hand-crafted loss vector ℓ1, followed by a sequence of loss vectors ℓ2, . . ., ℓT, which are either (0, 1) or (1, 0). For each experiment ξ ∈ {1, 2, 3, 4}, we want the cumulative loss difference Lt,1 − Lt,2 between the experts to follow a target fξ(t), which will be a continuous, nondecreasing function of t. As the losses are binary, we cannot make Lt,1 − Lt,2 exactly equal to the target fξ(t), but after the initial loss ℓ1, we choose every subsequent loss vector such that it brings Lt,1 − Lt,2 as close as possible to fξ(t). All functions fξ change slowly enough that |Lt,1 − Lt,2 − fξ(t)| ≤ 1 for all t. For each experiment, we let the number of trials be T = 1000, and we first plot the regret R(η) of the Hedge algorithm as a function of the fixed learning rate η. We subsequently plot the regret R_t^alg as a function of t = 1, . . ., T, for each of the following algorithms alg:
1. Follow-the-Leader (Hedge with learning rate ∞)
2. Hedge with fixed learning rate η = 1
3. Hedge with the learning rate that optimizes the worst-case bound (7), which equals η = √(8 ln(K)/(S²T)) ≈ 0.0745; we will call this algorithm safe Hedge for brevity.
4. AdaHedge
5. FlipFlop, with parameters ϕ = 2.37 and α = 1.243 as in Corollary 16
6. Variation MW by Hazan and Kale (2008), using the fixed learning rate that optimises the bound provided in their Theorem 4
7. NormalHedge, described by Chaudhuri et al. (2009)
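As a concrete illustration of this setup, here is a minimal Python sketch, not the authors' code, that greedily generates a two-expert loss sequence tracking a target f(t), and computes the regret of Hedge with a fixed learning rate. The simple (0, 0) first round is my simplification standing in for the paper's hand-crafted initial loss vector ℓ1.

```python
import math

def make_losses(f, T):
    """Two-expert deterministic losses whose cumulative difference
    L_{t,1} - L_{t,2} greedily tracks the target f(t).
    The (0, 0) first round stands in for the hand-crafted l_1."""
    losses, diff = [(0.0, 0.0)], 0.0
    for t in range(2, T + 1):
        # Each loss vector is (0, 1) or (1, 0); pick whichever brings
        # the cumulative difference closest to the target f(t).
        if abs(diff + 1 - f(t)) < abs(diff - 1 - f(t)):
            losses.append((1.0, 0.0)); diff += 1
        else:
            losses.append((0.0, 1.0)); diff -= 1
    return losses

def hedge_regret(losses, eta):
    """Regret of Hedge with fixed learning rate eta."""
    K = len(losses[0])
    L = [0.0] * K          # cumulative expert losses
    H = 0.0                # cumulative Hedge loss
    for l in losses:
        w = [math.exp(-eta * Li) for Li in L]
        s = sum(w)
        H += sum(wi / s * li for wi, li in zip(w, l))
        L = [Li + li for Li, li in zip(L, l)]
    return H - min(L)
```

For any target changing by less than one unit per trial, the greedy choice keeps the cumulative difference within 1 of f(t), matching the tracking property quoted above.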
Researcher Affiliation: Academia.
Steven de Rooij (EMAIL), VU University and University of Amsterdam, Science Park 904, P.O. Box 94323, 1090 GH Amsterdam, the Netherlands.
Tim van Erven (EMAIL), Département de Mathématiques, Université Paris-Sud, 91405 Orsay Cedex, France.
Peter D. Grünwald (EMAIL) and Wouter M. Koolen (EMAIL), Leiden University (Grünwald) and Centrum Wiskunde & Informatica (Grünwald and Koolen), Science Park 123, P.O. Box 94079, 1090 GB Amsterdam, the Netherlands.
Pseudocode: Yes. Figure 1: numerically robust Matlab implementation of AdaHedge; Figure 2: FlipFlop, with new ingredients in boldface.
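The paper's Figure 1 gives a numerically robust Matlab implementation; as a rough illustration of the same idea, here is a hedged Python sketch of the AdaHedge update, maintaining a cumulative mixability gap Δ and setting the learning rate to η = ln(K)/Δ. This is not the authors' code, and the handling of the η = ∞ first round is my own simplification.

```python
import math

def adahedge(loss_rows):
    """Sketch of AdaHedge: returns (cumulative algorithm loss,
    cumulative loss of the best expert)."""
    K = len(loss_rows[0])
    L = [0.0] * K       # cumulative expert losses
    Delta = 0.0         # cumulative mixability gap
    H = 0.0             # cumulative AdaHedge loss
    for l in loss_rows:
        eta = math.inf if Delta == 0 else math.log(K) / Delta
        m0 = min(L)
        if math.isinf(eta):
            # Follow-the-Leader limit: uniform weights on current leaders.
            leaders = [i for i in range(K) if L[i] == m0]
            w = [1.0 / len(leaders) if i in leaders else 0.0
                 for i in range(K)]
            mix = min(l[i] for i in leaders)
        else:
            # Shift by min(L) before exponentiating, for numerical stability.
            u = [math.exp(-eta * (Li - m0)) for Li in L]
            s = sum(u)
            w = [ui / s for ui in u]
            mix = -math.log(sum(wi * math.exp(-eta * li)
                                for wi, li in zip(w, l))) / eta
        h = sum(wi * li for wi, li in zip(w, l))  # Hedge (dot) loss
        Delta += max(0.0, h - mix)                # gap is nonnegative
        L = [Li + li for Li, li in zip(L, l)]
        H += h
    return H, min(L)
```

The regret of the algorithm is the difference between the two returned values.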
Open Source Code: No. The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets: No. We performed four experiments on artificial data, designed to clarify how the learning rate determines performance in a variety of Hedge algorithms. [...] We have kept the experiments as simple as possible: the data are deterministic, and involve two experts.
Dataset Splits: No. The experiments use "artificial data" generated deterministically from the specified target functions fξ(t). The paper does not define training, validation, or test splits for these data, as it concerns an online learning setting in which data are processed sequentially.
Hardware Specification: No. The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only mentions experiments were performed on "artificial data" and does not specify the computing environment.
Software Dependencies: No. While Figures 1 and 2 present code in Matlab syntax, implying the use of Matlab, the paper does not specify a Matlab version number or any other software dependencies.
Experiment Setup: Yes. FlipFlop, with parameters ϕ = 2.37 and α = 1.243 as in Corollary 16. [...] For each experiment, we let the number of trials be T = 1000, and we first plot the regret R(η) of the Hedge algorithm as a function of the fixed learning rate η. We subsequently plot the regret R_t^alg as a function of t = 1, . . ., T, for each of the following algorithms alg: 1. Follow-the-Leader (Hedge with learning rate ∞) 2. Hedge with fixed learning rate η = 1 3. Hedge with the learning rate that optimizes the worst-case bound (7), which equals η = √(8 ln(K)/(S²T)) ≈ 0.0745; we will call this algorithm safe Hedge for brevity.
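The quoted value for the safe Hedge learning rate can be checked by direct arithmetic, assuming K = 2 experts, T = 1000 trials, and a loss range of S = 1 (the value S = 1 is my assumption here, consistent with the binary losses used):

```python
import math

K, S, T = 2, 1.0, 1000          # two experts, unit loss range, 1000 trials
eta = math.sqrt(8 * math.log(K) / (S**2 * T))
print(round(eta, 4))            # 0.0745
```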