Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry

Authors: Chi-Ning Chou, Hang Le, Yichen Wang, SueYeon Chung

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show, in both theoretical and empirical settings, that as networks learn features, task-relevant manifolds untangle, with changes in manifold geometry revealing distinct learning stages and strategies beyond the lazy-rich dichotomy. This framework provides novel insights into feature learning across neuroscience and machine learning, shedding light on structural inductive biases in neural circuits and the mechanisms underlying out-of-distribution generalization."
Researcher Affiliation | Collaboration | "¹Center for Computational Neuroscience, Flatiron Institute, New York, NY, USA; ²University of California, Los Angeles (UCLA), Los Angeles, CA, USA; ³Center for Neural Science, New York University, New York, NY, USA. Correspondence to: Chi-Ning Chou <EMAIL>, Hang Le <EMAIL>, SueYeon Chung <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Estimate simulated manifold capacity... Algorithm 2: Estimate manifold capacity and effective geometric measures"
Open Source Code | Yes | "All code required to reproduce the figures presented is available under an MIT License at https://github.com/chungneuroai-lab/feature-learning-geometry"
Open Datasets | Yes | "Specifically, we considered VGG-11 (Simonyan & Zisserman, 2015) and ResNet-18 (He et al., 2016) and datasets CIFAR-10 (Krizhevsky & Hinton, 2009), CIFAR-100 (Krizhevsky & Hinton, 2009), CIFAR-10C (Hendrycks & Dietterich, 2018)."
Dataset Splits | Yes | "The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. ... The CIFAR-100 dataset (Krizhevsky & Hinton, 2009) is similar to CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class."
Hardware Specification | No | "All experiments were performed using the Flatiron Institute's high-performance computing cluster."
Software Dependencies | No | "Optimizer: We use Stochastic Gradient Descent with momentum (implemented as torch.optim.SGD(momentum=0.9)) to train the models. ... The error bar indicates the bootstrapped 95% confidence interval calculated using seaborn.lineplot(errorbar=('ci', 95))."
Experiment Setup | Yes | "Optimizer: We use Stochastic Gradient Descent with momentum (implemented as torch.optim.SGD(momentum=0.9)) to train the models. Data augmentation: We apply the following data augmentation during training: RandomCrop(32, padding=4), RandomHorizontalFlip. Learning rate and learning schedule: We follow the practice in (Chizat et al., 2019) and set initial learning rate η0 = 1.0 for VGG-11 and η0 = 0.2 for ResNet-18. The learning rate schedule is defined as ηt = η0 / (1 + (1/3)t)."
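The dataset splits quoted above (balanced per-class splits, 50000 train / 10000 test for both CIFAR variants) can be sanity-checked with a short sketch; the dictionary layout and helper name here are illustrative, not taken from the paper's code:

```python
# Per-class split sizes as quoted from the CIFAR dataset descriptions.
CIFAR10 = {"classes": 10, "train_per_class": 5000, "test_per_class": 1000}
CIFAR100 = {"classes": 100, "train_per_class": 500, "test_per_class": 100}

def totals(cfg):
    """Return (train_total, test_total) implied by a balanced per-class split."""
    return (cfg["classes"] * cfg["train_per_class"],
            cfg["classes"] * cfg["test_per_class"])

# Both datasets share the same overall 50000/10000 train/test split.
assert totals(CIFAR10) == totals(CIFAR100) == (50000, 10000)
```

In torchvision these splits are selected with the `train=True` / `train=False` flag of `torchvision.datasets.CIFAR10` and `torchvision.datasets.CIFAR100`.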
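The bootstrapped 95% confidence interval that seaborn.lineplot(errorbar=('ci', 95)) draws can be sketched as a percentile bootstrap in plain Python; the helper name, resample count, and example data below are illustrative, not the paper's code:

```python
import random
import statistics

def bootstrap_ci(data, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)
    # Resample with replacement, compute the mean of each resample, and sort.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data))) for _ in range(n_boot)
    )
    alpha = (100 - ci) / 2  # e.g. 2.5 for a 95% interval
    lo = means[int(alpha / 100 * n_boot)]
    hi = means[min(int((100 - alpha) / 100 * n_boot), n_boot - 1)]
    return lo, hi

lo, hi = bootstrap_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

Seaborn performs the same kind of resampling internally (over the values at each x position) before drawing the shaded error band.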