Adaptive kernel predictors from feature-learning infinite limits of neural networks

Authors: Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide comparisons of our adaptive kernel machines to trained networks and lazy NTK and NNGPK predictors for MLPs and CNNs. We demonstrate that our adaptive kernels are descriptive of feature-learning neural network training on a variety of metrics, including test loss, intermediate feature kernels, and preactivation densities. In addition, they outperform lazy kernel predictors on benchmark datasets. [...] In Fig. 3(a) we compare test losses of lazy vs feature learning kernels for a two-layer MLP trained on a P subset of two classes of CIFAR10 in a regression task.
Researcher Affiliation | Academia | ¹John A. Paulson School of Engineering and Applied Sciences, Harvard University; ²Center for Brain Sciences; ³Kempner Institute. Correspondence to: Clarissa Lauditi <EMAIL>, Blake Bordelon <EMAIL>, Cengiz Pehlevan <EMAIL>.
Pseudocode | Yes | Algorithm 1: aNBK Regression Predictor [...] Algorithm 2: aNTK Regression Predictor
Open Source Code | No | The paper mentions using JAX (Bradbury et al., 2018) and Flax's Linen module in JAX in the experimental details, but it does not contain an explicit statement by the authors about releasing their own source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We compare our kernel predictors to kernels derived from the lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets. [...] In Fig. 3(a) we compare test losses of lazy vs feature learning kernels for a two-layer MLP trained on a P subset of two classes of CIFAR10 in a regression task. [...] In Fig. 4 we show simulations of a Bayesian L = 5 Tanh MLP and compare to our infinite-width predictors. [...] P = 50 patterns of MNIST with y = {±1}^P labels.
Dataset Splits | No | The paper mentions using subsets of datasets (e.g., 'P subset of two classes of CIFAR10', 'P = 300 data of two classes of CIFAR10', 'P = 50 patterns of MNIST', 'P = 1000 data of CIFAR10'). While 'test loss' is mentioned, implying a test set, the paper does not specify the exact split percentages or sample counts for training, validation, and test sets, or how these subsets were partitioned from the original datasets.
Hardware Specification | No | The paper mentions details about the neural network architecture such as 'N = 5000 network' or 'N = 1028 width networks', which refer to the width of the neural network, not the hardware used for computation. No specific hardware details like GPU/CPU models, processors, or cloud computing resources are provided.
Software Dependencies | No | The paper states, 'We use automatic differentiation (via JAX) to compute gradients...' and 'We implement our CNN using Flax's Linen module in JAX.' While JAX and Flax's Linen module are mentioned as software used, specific version numbers for these dependencies are not provided, which is necessary for reproducibility.
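Since the paper names JAX and Flax but omits their versions, a reader attempting a reproduction would need to record their own environment. The sketch below is a hypothetical reproducibility step, not something the authors describe; it only queries installed package metadata.

```python
# Hypothetical helper: record installed versions of the libraries the
# paper mentions (JAX, Flax) plus Optax, which optax.noisy_sgd implies.
import importlib.metadata as md

def library_versions(packages=("jax", "flax", "optax")):
    """Map each package name to its installed version string, or
    'not installed' when the package is absent from the environment."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

print(library_versions())
```

Pinning these strings (e.g., into a lock file committed alongside experiment configs) would resolve the gap this row identifies.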
Experiment Setup | Yes | For Langevin training, the optimizer we use, optax.noisy_sgd, is designed to inject Gaussian white noise into the gradient updates at each iteration. This noise is drawn from a zero-mean Gaussian distribution whose variance is controlled by both the learning rate η and the inverse temperature parameter β⁻¹ as in Equation (3). This white noise plays a critical role in approximating the posterior distribution (Equation (9)) over the network parameters. [...] For both Langevin and gradient descent dynamics, we use a weight decay contribution proportional to λ as in (3) (in all our experiments we use λ = 1 for each layer, except when we compare with CNN test loss, where λ = 1×10⁻²). For Langevin, we average the T = 20000 steps' fluctuations every 1000 steps after t > 5000 epochs. We use a learning rate η = 5×10⁻⁴ and an inverse temperature β = 50. For gradient descent, we train until convergence for T = 20000 epochs and we use a learning rate η = 1×10⁻³. Both experiments are performed by varying the sample size P and the feature learning strength γ0. [...] The CNN consists of a single convolutional layer with a kernel size of 8×8 and stride equal to the kernel size. This choice effectively splits each 32×32 input image (with 3 color channels) into non-overlapping patches. We set the number of filters to N = 1024, and the convolution weights are initialized using a normal distribution with unit variance. [...] A ReLU activation is applied afterwards. [...] The key hyperparameters for training the CNN are chosen as follows: learning rate η = 1×10⁻³, regularization λ = 1×10⁻², and the experiments are performed by varying the number of training samples P and the feature learning strength γ0.