Features are fate: a theory of transfer learning in high-dimensional regression

Authors: Javan Tahir, Surya Ganguli, Grant M. Rotskoff

ICML 2025

Reproducibility variables, each with the assessed result and the supporting LLM response:
Research Type: Experimental
"We adopt a feature-centric viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks."
Researcher Affiliation: Academia
1) Department of Applied Physics, Stanford University, Stanford, CA, USA; 2) Department of Chemistry, Stanford University, Stanford, CA, USA. Correspondence to: Javan Tahir <EMAIL>.
Pseudocode: No
The paper describes its methods and proofs in mathematical notation and prose but contains no explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code: Yes
"All code to reproduce the results in this paper can be found at https://github.com/javantahir/features_are_fate."
Open Datasets: No
The paper uses synthetic data rather than a public dataset: "Assumption 3.1. Assume that the input data x ∈ R^d is normally distributed and that each dataset D consists of n independent samples. ... we choose x ~ N(0, I_d)..."
Dataset Splits: No
The paper trains on synthetic samples rather than a fixed finite dataset with explicit splits: "We pretrain a linear network (7) with L = 2 and d = 500 to produce labels from the linear source function y = β_s^T x + ϵ using the population loss (2). We then retrain the final layer weights on a sample of n = γd points (x_i, y_i = β_t^T x_i + ϵ_i)..." Generalization error is evaluated with respect to the population distribution, so no training/validation/test splits are reported.
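The quoted setup can be sketched in code. This is a minimal illustration, not the authors' implementation: the variable names, the noise scale, and the random stand-in for the pretrained first-layer weights are all assumptions; only the data model x ~ N(0, I_d), y = β_t^T x + ϵ and the "retrain the final layer" step come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 50                 # input dimension (the paper uses d = 500)
gamma = 0.5            # samples-per-dimension ratio, so n = gamma * d
n = int(gamma * d)
sigma = 0.1            # label-noise scale (illustrative choice)

# Source and target regression vectors; their overlap governs transferability.
beta_s = rng.standard_normal(d) / np.sqrt(d)
beta_t = rng.standard_normal(d) / np.sqrt(d)

# Target dataset: x ~ N(0, I_d), y = beta_t^T x + eps.
X = rng.standard_normal((n, d))
y = X @ beta_t + sigma * rng.standard_normal(n)

# "Linear transfer": freeze a first-layer map W1 (here a random stand-in
# for pretrained weights) and refit only the final layer by least squares.
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
features = X @ W1.T
w_out, *_ = np.linalg.lstsq(features, y, rcond=None)

train_mse = np.mean((features @ w_out - y) ** 2)
```

Because n < d here, the least-squares refit interpolates the training set; the interesting quantity in the paper is the population (generalization) error as a function of γ and the source-target overlap.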
Hardware Specification: No
The paper does not specify the hardware used to run the experiments (e.g., CPU or GPU models, or cloud computing resources).
Software Dependencies: No
The paper does not list specific software packages with version numbers (e.g., Python 3.x, PyTorch 1.x) that would be needed for exact reproduction.
Experiment Setup: Yes
"For the experiments in deep linear models, we train a two-layer linear network with dimension d = 500. We initialize the weight matrices with random normal weights and scale parameter α = 10^-5. To approximate gradient flow, we use full-batch gradient descent with small learning rate η = 10^-3. We train each model for 10^5 steps or until the training loss reaches 10^-6. ... For the experiments in shallow ReLU networks, we use the parameters d = 100, m = 1000, m = 100. We initialize the weight matrices randomly on the sphere and the output weights are initialized at 10^-7. We approximate gradient flow with full-batch gradient descent and learning rate 0.01m and train for 10^5 iterations or until the loss reaches 10^-6."
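The deep-linear training loop described above can be sketched as follows. This is a reduced-scale sketch under stated assumptions, not the paper's code: the dimension is shrunk from d = 500 so it runs quickly, and the teacher vector and dataset size are illustrative; the small initialization scale α, learning rate η, step budget, and stopping loss are taken from the quoted setup.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 20          # reduced from the paper's d = 500 so the sketch runs fast
n = 100         # number of training samples (illustrative)
alpha = 1e-5    # small initialization scale, as in the quoted setup
eta = 1e-3      # small learning rate approximating gradient flow

# Synthetic linear teacher: y = beta^T x with x ~ N(0, I_d).
beta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ beta

# Two-layer linear network f(x) = w2^T W1 x, initialized near zero.
W1 = alpha * rng.standard_normal((d, d))
w2 = alpha * rng.standard_normal(d)

loss = np.inf
for step in range(100_000):
    resid = X @ W1.T @ w2 - y        # residuals, shape (n,)
    loss = 0.5 * np.mean(resid ** 2)
    if loss < 1e-6:                  # stopping criterion from the setup
        break
    g = (resid / n) @ X              # (1/n) X^T resid, shape (d,)
    gW1 = np.outer(w2, g)            # dL/dW1 = w2 g^T
    gw2 = W1 @ g                     # dL/dw2 = W1 g
    W1 -= eta * gW1
    w2 -= eta * gw2
```

Both gradients are computed before either update so the step matches simultaneous full-batch gradient descent; with the tiny initialization, the loss plateaus near the saddle at the origin before the signal direction escapes, which is the feature-learning regime the paper analyzes.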