Adaptive kernel predictors from feature-learning infinite limits of neural networks

Authors: Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide comparisons of our adaptive kernel machines to trained networks and lazy NTK and NNGPK predictors for MLPs and CNNs. We demonstrate that our adaptive kernels are descriptive of feature-learning neural network training on a variety of metrics, including test loss, intermediate feature kernels, and preactivation densities. In addition, they outperform lazy kernel predictors on benchmark datasets. [...] In Fig. 3(a) we compare test losses of lazy vs feature learning kernels for a two-layer MLP trained on a P subset of two classes of CIFAR10 in a regression task.
Researcher Affiliation | Academia | ¹John A. Paulson School of Engineering and Applied Sciences, Harvard University; ²Center for Brain Sciences; ³Kempner Institute. Correspondence to: Clarissa Lauditi <EMAIL>, Blake Bordelon <EMAIL>, Cengiz Pehlevan <EMAIL>.
Pseudocode | Yes | Algorithm 1: aNBK Regression Predictor [...] Algorithm 2: aNTK Regression Predictor
Open Source Code | No | The paper mentions using JAX (Bradbury et al., 2018) and Flax's Linen module in JAX in the experimental details, but it does not contain an explicit statement by the authors about releasing their own source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We compare our kernel predictors to kernels derived from the lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets. [...] In Fig. 3(a) we compare test losses of lazy vs feature learning kernels for a two-layer MLP trained on a P subset of two classes of CIFAR10 in a regression task. [...] In Fig. 4 we show simulations of a Bayesian L = 5 Tanh MLP and compare to our infinite-width predictors. [...] P = 50 patterns of MNIST with y = {±1}^P labels.
Dataset Splits | No | The paper mentions using subsets of datasets (e.g., 'P subset of two classes of CIFAR10', 'P = 300 data of two classes of CIFAR10', 'P = 50 patterns of MNIST', 'P = 1000 data of CIFAR10'). While 'test loss' is mentioned, implying a test set, the paper does not specify the exact split percentages or sample counts for training, validation, and test sets, or how these subsets were partitioned from the original datasets.
Hardware Specification | No | The paper mentions details about the neural network architecture such as 'N = 5000 network' or 'N = 1028 width networks', which refer to the width of the neural network, not the hardware used for computation. No specific hardware details like GPU/CPU models, processors, or cloud computing resources are provided.
Software Dependencies | No | The paper states, 'We use automatic differentiation (via JAX) to compute gradients...' and 'We implement our CNN using Flax's Linen module in JAX.' While JAX and Flax's Linen module are mentioned as software used, specific version numbers for these dependencies are not provided, which is necessary for reproducibility.
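Since the paper names JAX and Flax but omits their versions, a reader attempting a reproduction would need to record their own environment. The sketch below is a hypothetical reproducibility step, not something the authors describe; it only queries installed package metadata.

```python
# Hypothetical helper: record installed versions of the libraries the
# paper mentions (JAX, Flax) plus Optax, which optax.noisy_sgd implies.
import importlib.metadata as md

def library_versions(packages=("jax", "flax", "optax")):
    """Map each package name to its installed version string, or
    'not installed' when the package is absent from the environment."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

print(library_versions())
```

Pinning these strings (e.g., into a lock file committed alongside experiment configs) would resolve the gap this row identifies.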
Experiment Setup | Yes | For Langevin training, the optimizer we use, optax.noisy_sgd, is designed to inject Gaussian white noise into the gradient updates at each iteration. This noise is drawn from a zero-mean Gaussian distribution whose variance is controlled by both the learning rate η and the inverse temperature parameter β⁻¹ as in Equation (3). This white noise plays a critical role in approximating the posterior distribution (Equation (9)) over the network parameters. [...] For both Langevin and gradient descent dynamics, we use a weight decay contribution proportional to λ as in (3) (in all our experiments we use λ = 1 for each layer, except when we compare with CNN test loss, where λ = 1×10⁻²). For Langevin, we average the T = 20000 steps' fluctuations every 1000 steps after t > 5000 epochs. We use a learning rate η = 5×10⁻⁴ and an inverse temperature β = 50. For gradient descent, we train until convergence for T = 20000 epochs and we use a learning rate η = 1×10⁻³. Both experiments are performed by varying the sample size P and the feature learning strength γ0. [...] The CNN consists of a single convolutional layer with a kernel size of 8×8 and stride equal to the kernel size. This choice effectively splits each 32×32 input image (with 3 color channels) into non-overlapping patches. We set the number of filters to N = 1024, and the convolution weights are initialized using a normal distribution with unit variance. [...] A ReLU activation is applied afterwards. [...] The key hyperparameters for training the CNN are chosen as follows: learning rate η = 1×10⁻³, regularization λ = 1×10⁻², and the experiments are performed by varying the number of training samples P and the feature learning strength γ0.