Features are fate: a theory of transfer learning in high-dimensional regression
Authors: Javan Tahir, Surya Ganguli, Grant M. Rotskoff
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We adopt a feature-centric viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks. |
| Researcher Affiliation | Academia | 1Department of Applied Physics, Stanford University, Stanford CA, USA 2Department of Chemistry, Stanford University, Stanford CA, USA. Correspondence to: Javan Tahir <EMAIL>. |
| Pseudocode | No | The paper describes methods and proofs using mathematical notation and prose, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | All code to reproduce the results in this paper can be found at https://github.com/javantahir/features_are_fate |
| Open Datasets | No | Assumption 3.1. Assume that the input data x ∈ R^d is normally distributed and that each dataset D consists of n independent samples. ... we choose x ∼ N(0, I_d)... |
| Dataset Splits | No | The paper describes generating synthetic data for training, stating, 'We pretrain a linear network (7) with L = 2 and d = 500 to produce labels from linear source function y = β_s^T x + ϵ using the population loss (2). We then retrain the final layer weights on a sample of n = γd points (x_i, y_i = β_t^T x_i + ϵ_i)...' and evaluates generalization error with respect to the population distribution. It does not provide explicit training/validation/test splits from a fixed finite dataset, but rather trains on n samples and assesses performance on the overall population. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., CPU, GPU models, or cloud computing resources). |
| Software Dependencies | No | The paper does not provide specific software names along with their version numbers (e.g., Python 3.x, PyTorch 1.x) that would be needed for reproducibility. |
| Experiment Setup | Yes | For the experiments in deep linear models, we train a two-layer linear network with dimension d = 500. We initialize the weight matrices with random normal weights and scale parameter α = 10^-5. To approximate gradient flow, we use full-batch gradient descent with small learning rate η = 10^-3. We train each model for 10^5 steps or until the training loss reaches 10^-6. ... For the experiments in shallow ReLU networks, we use the parameters d = 100, m = 1000, m = 100. We initialize the weight matrices randomly on the sphere and the output weights are initialized at 10^-7. We approximate gradient flow with full-batch gradient descent and learning rate 0.01m and train for 10^5 iterations or until the loss reaches 10^-6. |
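The deep linear recipe quoted in the Experiment Setup row (two-layer linear network, small random-normal initialization with scale α = 10^-5, full-batch gradient descent with η = 10^-3, stopping at 10^5 steps or training loss below 10^-6) can be sketched as follows. This is a minimal illustration, not the authors' code (their implementation is at the linked GitHub repository): the dimension is reduced from d = 500 for speed, the labels here are noiseless, and the synthetic-data details are assumptions consistent with the paper's linear target y = β^T x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions reduced from the paper's d = 500 for a quick illustration.
d, n = 50, 200
alpha = 1e-5          # small-initialization scale (paper: 10^-5)
eta = 1e-3            # learning rate approximating gradient flow (paper: 10^-3)
max_steps, tol = 10**5, 1e-6

# Synthetic linear target: x ~ N(0, I_d), y = beta^T x.
# (Noiseless here for simplicity; the paper adds label noise eps.)
beta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ beta

# Two-layer linear network f(x) = W2 W1 x, initialized near zero.
W1 = alpha * rng.standard_normal((d, d))
W2 = alpha * rng.standard_normal((1, d))

for step in range(max_steps):
    pred = (W2 @ (W1 @ X.T)).ravel()   # forward pass, shape (n,)
    resid = pred - y
    loss = 0.5 * np.mean(resid**2)
    if loss < tol:                     # early stop at the paper's tolerance
        break
    # Full-batch gradients of the mean-squared loss.
    g = (resid[None, :] @ X) / n       # dL/d(W2 W1), shape (1, d)
    grad_W2 = g @ W1.T
    grad_W1 = W2.T @ g
    W1 -= eta * grad_W1
    W2 -= eta * grad_W2
```

With the near-zero initialization, the loss sits on a plateau before the product W2 W1 grows toward β^T, which is why the setup budgets as many as 10^5 full-batch steps.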