Harmless Overfitting: Using Denoising Autoencoders in Estimation of Distribution Algorithms
Authors: Malte Probst, Franz Rothlauf
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper suggests using Denoising Autoencoders (DAEs) as generative models within EDAs (DAE-EDA). The resulting DAE-EDA is able to model complex probability distributions. Furthermore, overfitting is less harmful, since DAEs overfit by learning the identity function. This overfitting behavior introduces unbiased random noise into the samples, which is no major problem for the EDA but just leads to higher population diversity. As a result, DAE-EDA runs for more generations before convergence and searches promising parts of the solution space more thoroughly. We study the performance of DAE-EDA on several combinatorial single-objective optimization problems. In comparison to the Bayesian Optimization Algorithm, DAE-EDA requires a similar number of evaluations of the objective function but is much faster and can be parallelized efficiently, making it the preferred choice especially for large and difficult optimization problems. Section 3 presents test problems, the experimental setup, and results. Furthermore, it provides an in-depth analysis of the overfitting behavior of a DAE and its influence on the EDA. |
| Researcher Affiliation | Collaboration | Malte Probst EMAIL Honda Research Institute EU, Carl-Legien-Str. 30, 63073 Offenbach, Germany; Franz Rothlauf EMAIL Information Systems and Business Administration, Johannes Gutenberg-Universität Mainz, Jakob-Welder-Weg 9, 55128 Mainz, Germany |
| Pseudocode | Yes | Algorithm 1 (Estimation of Distribution Algorithm): 1: Initialize distribution D_0, t = 0; 2: while not converged do; 3: P_candidates ← sample new candidate solutions from D_t; 4: P_{t+1} ← select high-quality solutions from P_candidates based on their objective value; 5: D_{t+1} ← update D_t using P_{t+1}; 6: t ← t + 1; 7: end while. Algorithm 2 (training an AE; when training a DAE, replace x_i with q(x_i) in line 5): 1: Initialize θ = {W, b_h, b_z} randomly; 2: Set 0 < α < 1, e.g., α = 0.1; 3: while not converged do; 4: for each example i in the training set do; 5: h = c(x_i; θ); 6: z = f(h; θ); 7: θ := θ − α ∂Err_θ(x_i, z)/∂θ; 8: end for; 9: end while. Algorithm 3 (sampling a DAE): 1: Given the trained DAE's parameters θ = {W, b_c, b_f}, its reconstruction function f(c(x̂)), and the corruption process q(x); 2: Initialize x ∈ [0, 1]^n randomly; 3: for a fixed number s of sampling steps do; 4: x̂ = q(x); 5: z = f(c(x̂)); 6: x := z; 7: end for; 8: Use x as a sample from the DAE. |
| Open Source Code | Yes | The complete source code including configuration files containing all parameter settings for the experiments is available at https://github.com/wohnjayne/eda-suite. |
| Open Datasets | Yes | We evaluate DAE-EDA on four standard benchmark problems using binary decision variables. ...NK landscapes (Kauffman and Weinberger, 1989) are defined by the two parameters n and k ∈ {0, ..., n−1} as well as n components f_i(x), i ∈ {1, ..., n}. For our experiments, we use instances of NK landscapes with known optima provided by Pelikan (2008). [Bibliography entry]: Martin Pelikan. Analysis of estimation of distribution algorithms and genetic algorithms on NK landscapes. In C. Ryan and M. Keijzer, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2008), volume 10, pages 1033–1040, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-130-9. doi: 10.1145/1389095.1389287. URL http://medal-lab.org/files/nk-instances.tar.gz. |
| Dataset Splits | No | The paper does not provide traditional dataset splits (e.g., train/test/validation percentages or counts). It describes an iterative optimization process (EDA) where candidate solutions are sampled and a subset of high-quality solutions is selected in each generation. The test problems themselves are combinatorial optimization benchmarks, not datasets requiring fixed splits for supervised learning. |
| Hardware Specification | Yes | All algorithms were implemented in Matlab/Octave and executed using Octave V3.2.4 on a single core of an AMD Opteron 6272 processor with 2,100 MHz. |
| Software Dependencies | Yes | All algorithms were implemented in Matlab/Octave and executed using Octave V3.2.4 on a single core of an AMD Opteron 6272 processor with 2,100 MHz. |
| Experiment Setup | Yes | In a set of preliminary experiments, we coarsely tested various parameter settings, without extensive fine tuning to specific problem types. Throughout our experiments, we used the following parameters for DAE-EDA: since we want to make as few assumptions about the problem structure as possible, we set the number m of hidden neurons equal to the problem size n. The corruption process q(x) randomly selects c = 10% of the inputs. Subsequently, each of the selected bits is randomly set to 0 or 1 (salt+pepper noise). We draw the initial weights W from a normal distribution with zero mean and variance 0.005, as W ~ N(0, 0.005). During training, the learning rates are set to α = 0.1 and α = 0.5 for weights and biases, respectively. We use a batch size of b = 100, and a cross-entropy loss. To improve generalization, we use weight decay (L2 regularization) with a weight cost of λ = 0.001. Furthermore, we use a momentum of β = 0.3 (Qian, 1999). When sampling new candidate solutions from the DAE, a proper choice of the number of sampling steps s is important. Preliminary experiments revealed that s should be relatively small. We chose s = 10... We run the learning process for a minimum of 50 training epochs. After the initial 50 epochs, we continue training, unless the reconstruction error e_t rises. We measure the decrease γ of the reconstruction error in the last 33% of all epochs as γ = (e_{0.67t} − e_t)/(e_0 − e_t). We use γ to automatically check for convergence of the training. We stop training as soon as γ < 0.05... |
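The sampling procedure extracted as Algorithm 3 can be sketched in Python. This is a minimal illustration with hypothetical, untrained DAE parameters (the paper's implementation is in Matlab/Octave, and it does not specify tied weights; both are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical DAE parameters theta = {W, b_c, b_f} for n = 8 inputs and
# m = 8 hidden units, drawn as W ~ N(0, 0.005) following the setup quote.
n, m = 8, 8
W = rng.normal(0.0, np.sqrt(0.005), size=(n, m))
b_c = np.zeros(m)   # encoder bias
b_f = np.zeros(n)   # decoder bias (tied weights W / W.T assumed)

def corrupt(x, c=0.10):
    """Salt+pepper noise q(x): set a fraction c of inputs to random 0/1."""
    x_hat = x.copy()
    idx = rng.random(x.size) < c
    x_hat[idx] = rng.integers(0, 2, size=int(idx.sum())).astype(float)
    return x_hat

def sample_dae(s=10):
    """Algorithm 3: start from random x and alternate corruption/reconstruction."""
    x = rng.random(n)                    # x initialized in [0, 1]^n
    for _ in range(s):
        x_hat = corrupt(x)               # x^ = q(x)
        h = sigmoid(x_hat @ W + b_c)     # h = c(x^)
        x = sigmoid(h @ W.T + b_f)       # z = f(h); x := z
    return x

x = sample_dae()
```

Each sampled `x` is one candidate solution; in the EDA loop of Algorithm 1 this would be repeated to fill `P_candidates`.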
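For the NK-landscape benchmark quoted under "Open Datasets", a toy evaluator can be sketched as follows. This is an illustrative random generator under the standard NK definition (each component f_i depends on bit x_i and its k circular right neighbors), not Pelikan's published benchmark instances:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy NK landscape: n bits, each component f_i depends on x_i and its k
# (circular) right neighbors; component values drawn uniformly at random.
n, k = 10, 2
tables = rng.random((n, 2 ** (k + 1)))   # one lookup table per component f_i

def nk_fitness(x):
    """Average the n component values f_i(x_i, neighbors) for bit string x."""
    total = 0.0
    for i in range(n):
        bits = [int(x[(i + j) % n]) for j in range(k + 1)]
        idx = int("".join(map(str, bits)), 2)   # index into f_i's lookup table
        total += tables[i, idx]
    return total / n

x = rng.integers(0, 2, size=n)
f = nk_fitness(x)
```

With k = 0 the landscape decomposes into independent bits; larger k introduces the epistatic interactions that make these instances hard for EDAs.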
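The convergence rule quoted in the experiment setup (train at least 50 epochs, then stop once γ = (e_{0.67t} − e_t)/(e_0 − e_t) < 0.05) can be expressed directly; the function name and the handling of a flat error curve are assumptions:

```python
def training_converged(errors, threshold=0.05):
    """Given reconstruction errors e_0 .. e_t (one per epoch), check whether
    gamma = (e_{0.67 t} - e_t) / (e_0 - e_t) < threshold, i.e. the last 33%
    of epochs contributed less than 5% of the total error reduction."""
    t = len(errors) - 1
    if t < 50:                      # minimum of 50 training epochs
        return False
    e0, et = errors[0], errors[-1]
    if e0 == et:                    # no reduction at all: treat as converged
        return True
    e_two_thirds = errors[int(round(0.67 * t))]
    gamma = (e_two_thirds - et) / (e0 - et)
    return gamma < threshold
```

A hyperbolically decaying error curve (most progress early) triggers the stop, while a linearly decaying one keeps training, matching the intent of the rule.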