Multiple Descent in the Multiple Random Feature Model
Authors: Xuran Meng, Jianfeng Yao, Yuan Cao
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then provide a thorough experimental study to verify our theory. Finally, we extend our study to the multiple random feature model (MRFM), and show that MRFMs ensembling K types of random features may exhibit (K + 1)-fold descent. Our analysis points out that risk curves with a specific number of descents generally exist in learning multi-component prediction models. Figure 3: Examples of double and triple descent. (a) gives the excess risk of a random feature model with ReLU activation function; (b) shows the excess risk of a double random feature model with ReLU and sigmoid activation functions; (c) shows the excess risk of a double random feature model with ELU and ReLU activation functions. The x-axis is the model complexity (number of parameters/sample size) and the y-axis is the excess risk. The curves give our theoretical predictions, and the dots are our numerical results. |
| Researcher Affiliation | Academia | Xuran Meng EMAIL Department of Statistics and Actuarial Science The University of Hong Kong Jianfeng Yao EMAIL School of Data Science The Chinese University of Hong Kong (Shenzhen) Yuan Cao EMAIL Department of Statistics and Actuarial Science The University of Hong Kong |
| Pseudocode | No | The paper describes methods and derivations mathematically and textually, but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Meng, Xuran, Jianfeng Yao, and Yuan Cao. Online supplementary material to Multiple Descent in the Multiple Random Feature Model. URL https://github.com/XuranMeng/Multipledescent/blob/main/onlinesupplementary.pdf. |
| Open Datasets | No | The distribution of the data pair (x, y) is given as follows: 1. The input vector x follows the uniform distribution on the sphere √d·S^{d−1} of radius √d. 2. The output is y = ⟨β_{1,d}, x⟩ + F0 + ε, where β_{1,d} ∈ R^d, F0 ∈ R, and ε is a noise independent of x. We assume that E(ε) = 0, E(ε²) = τ², and E(ε⁴) < +∞. The parameters of the data generation model are β_d = [F0, β_{1,d}ᵀ]ᵀ, and we hereafter denote by D(β_d) the probability distribution of the pair (x, y). This data generation model is standard in recent literature on double descent. Similar settings have been studied in a number of recent works (Hamsici and Martinez, 2007; Marinucci and Peccati, 2011; Di Marzio et al., 2014; Mei and Montanari, 2022). |
| Dataset Splits | Yes | Given a training data set S = {(x_i, y_i)}_{i=1}^n consisting of n independent samples from the data generation model in Definition 2.1... Training data {(x_i, y_i)}_{i=1}^n are generated independently following Definition 2.1 with τ = 0.1: each x_i is uniformly generated from the sphere √d·S^{d−1}, and the corresponding response is given as y_i = ⟨β_1, x_i⟩ + F0 + ε_i, where β_1 is a randomly chosen unit vector; F0 = 0.2, λ = 10^{−5}; training sample size n = 1000, data dimension d = 300, and N1 = N2 varying from 0 to 1.6n. As we gradually increase the dimensions of random features N1 = N2 from 0 to 1.6n, the model complexity parameter c(d) = (N1 + N2)/n varies from 0 to 3.2. The empirical and finite-horizon values for the limiting excess risk R(λ, ψ, µ, F1, τ) in Theorem 3.6 are obtained on a test data set of size 700 and averaged over 30 independent replications. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9, etc.), or specific solvers with versions. |
| Experiment Setup | Yes | Training data {(x_i, y_i)}_{i=1}^n are generated independently following Definition 2.1 with τ = 0.1: each x_i is uniformly generated from the sphere √d·S^{d−1}, and the corresponding response is given as y_i = ⟨β_1, x_i⟩ + F0 + ε_i, where β_1 is a randomly chosen unit vector; F0 = 0.2, λ = 10^{−5}; training sample size n = 1000, data dimension d = 300, and N1 = N2 varying from 0 to 1.6n. The experiment setups are the same as the experiments in Section 4.2, except that here we use different pairs of activation functions. For two activation functions σ1, σ2, we gradually decrease the scale of σ2 by using activation pairs (σ1(x), c0·σ2(x)) with a smaller and smaller factor c0. The experimental setting is similar to the previous experiments reported in Section 4. We set d = 300, n = 1000, and λ = 10^{−4}. In simulation, the training data {(x_i, y_i)}_{i=1}^n are generated independently according to Definition 2.1: each x_i is uniformly generated from the sphere √d·S^{d−1}, and the corresponding response is given as y_i = ⟨β_1, x_i⟩ + F0 + ε_i, where β_1 is a randomly chosen unit vector, F0 = 0.2 and τ = 0.1. We consider two MRFMs with K = 3 and K = 4, respectively. For the case K = 3, we consider three activation functions σ1(x) = ReLU(9x), σ2(x) = ReLU(x) and σ3(x) = ReLU(0.1x), and set the ratios between the dimensions of random features as N1 = N2 = N3/3. For the case K = 4, we use four activation functions σ1(x) = ReLU(80x), σ2(x) = ReLU(9x), σ3(x) = ReLU(x) and σ4(x) = ReLU(0.1x), and keep the ratios N1 = N2 = N3 = N4/3. |
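The data generation model quoted above (Definition 2.1) is concrete enough to sketch in code. The following is a minimal sketch, not the authors' released code: the function name `sample_data` is ours, and Gaussian noise is used as one admissible choice of ε satisfying the stated moment conditions E(ε) = 0, E(ε²) = τ², E(ε⁴) < +∞.

```python
import numpy as np

def sample_data(n, d, beta1, F0=0.2, tau=0.1, rng=None):
    """Sample n pairs (x, y): x uniform on the sphere of radius sqrt(d),
    y = <beta1, x> + F0 + eps with eps ~ N(0, tau^2)."""
    rng = np.random.default_rng(rng)
    g = rng.standard_normal((n, d))
    # normalizing a Gaussian vector gives a uniform direction; rescale to radius sqrt(d)
    x = np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)
    eps = tau * rng.standard_normal(n)
    y = x @ beta1 + F0 + eps
    return x, y

rng = np.random.default_rng(0)
d = 300
beta1 = rng.standard_normal(d)
beta1 /= np.linalg.norm(beta1)  # "randomly chosen unit vector" as in the setup
x, y = sample_data(1000, d, beta1, F0=0.2, tau=0.1, rng=rng)
```

With this scaling, E[x xᵀ] = I_d, so the signal ⟨β_1, x⟩ has unit variance regardless of d.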
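The double random feature experiment (panel (b) of Figure 3: ReLU plus sigmoid features, ridge-regularized least squares) can likewise be sketched. This is an illustrative reconstruction under our own assumptions, not the paper's implementation: the rows of the random weight matrices are drawn uniformly on the unit sphere, and the ridge normalization (Zᵀ Z + n λ I)⁻¹ Zᵀ y is one common convention that the excerpt does not pin down.

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, N1, N2, lam, F0, tau = 1000, 300, 400, 400, 1e-5, 0.2, 0.1

# data per Definition 2.1 (Gaussian noise as one admissible choice)
beta1 = rng.standard_normal(d); beta1 /= np.linalg.norm(beta1)
def sample(m):
    g = rng.standard_normal((m, d))
    x = np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)
    return x, x @ beta1 + F0 + tau * rng.standard_normal(m)
x_tr, y_tr = sample(n)
x_te, y_te = sample(700)  # test set of size 700, as in the quoted setup

# two families of random features; each weight row is uniform on the unit sphere
W1 = rng.standard_normal((N1, d)); W1 /= np.linalg.norm(W1, axis=1, keepdims=True)
W2 = rng.standard_normal((N2, d)); W2 /= np.linalg.norm(W2, axis=1, keepdims=True)
def feats(x):
    return np.hstack([relu(x @ W1.T), sigmoid(x @ W2.T)])

Z_tr, Z_te = feats(x_tr), feats(x_te)
N = N1 + N2
# ridge solution; the n*lam scaling of the penalty is our assumption
a = np.linalg.solve(Z_tr.T @ Z_tr + n * lam * np.eye(N), Z_tr.T @ y_tr)
test_mse = np.mean((Z_te @ a - y_te) ** 2)
```

Sweeping N1 = N2 from 0 to 1.6n and plotting the test risk against c(d) = (N1 + N2)/n would trace out the descent curve studied in the paper.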