MLPs Learn In-Context on Regression and Classification Tasks

Authors: William Tong, Cengiz Pehlevan

ICLR 2025

Reproducibility assessment — each entry lists the variable, the result, and the supporting LLM response:
Research Type: Experimental
"We begin by exploring MLPs' behavior in a controlled ICL format, where their specific capacities and weaknesses can be precisely characterized. Specifically, we examine two tasks: in-context regression and in-context classification. Figure 1c plots the MSE achieved by different architectures as a function of total compute."
Researcher Affiliation: Academia
"William L. Tong & Cengiz Pehlevan, School of Engineering and Applied Sciences, Center for Brain Sciences, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA 02138. {wtong@g,cpehlevan@seas}.harvard.edu"
Pseudocode: No
The paper describes the model architectures (MLP, MLP-Mixer, Transformer, RB MLP) using mathematical equations and descriptive text in Appendix C, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. For example, for MLPs it defines h_1(x) = ϕ(W_1 x + b_1), h_2(x) = ϕ(W_2 h_1(x) + b_2), ..., h_ℓ(x) = ϕ(W_ℓ h_{ℓ-1}(x) + b_ℓ), and f_MLP(x) = W_out h_ℓ(x) + b_out.
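The layer recursion quoted above maps directly onto code. A minimal NumPy sketch of the forward pass (the dimensions, random initialization, and function names here are illustrative placeholders, not the paper's actual Flax implementation):

```python
import numpy as np

def relu(x):
    # Pointwise ReLU, phi(x) = max(x, 0)
    return np.maximum(x, 0.0)

def mlp_forward(x, weights, biases, w_out, b_out):
    """Compute h_l(x) = phi(W_l h_{l-1}(x) + b_l) layer by layer,
    then the readout f_MLP(x) = W_out h_l(x) + b_out."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return w_out @ h + b_out

# Illustrative shapes: input dim 8, two hidden layers of shared width H = 16.
rng = np.random.default_rng(0)
H, d = 16, 8
weights = [rng.normal(size=(H, d)), rng.normal(size=(H, H))]
biases = [np.zeros(H), np.zeros(H)]
w_out, b_out = rng.normal(size=(1, H)), np.zeros(1)

y = mlp_forward(rng.normal(size=d), weights, biases, w_out, b_out)
print(y.shape)  # (1,)
```

Note that all hidden layers share the same width H, matching the setup described in the Experiment Setup entry below.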
Open Source Code: Yes
"For the most precise information on our setup, please refer to our GitHub code repository: https://github.com/wtong98/mlp-icl"
Open Datasets: No
"We focus on controlled tasks commonly studied in the ICL literature... These tasks are necessarily synthetic approximations of natural language ICL prompting, but allow us to disambiguate a model's capacity for in-context learning from its ability to attain natural language fluency. Inputs are sampled as x ∼ N(0, I) and weights are sampled as β ∼ N(0, I/n)."
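The quoted sampling scheme for in-context regression is easy to mirror in code. A hedged sketch (the input dimension n, context length k, and function name are placeholders; the paper's generator lives in its repository):

```python
import numpy as np

def sample_regression_prompt(n=8, k=16, rng=None):
    """Sample one synthetic in-context regression prompt:
    inputs x_i ~ N(0, I_n), weights beta ~ N(0, I/n), targets y_i = beta . x_i."""
    rng = rng if rng is not None else np.random.default_rng()
    beta = rng.normal(scale=np.sqrt(1.0 / n), size=n)  # beta ~ N(0, I/n)
    xs = rng.normal(size=(k, n))                       # each x ~ N(0, I)
    ys = xs @ beta                                     # noiseless linear targets
    return xs, ys, beta

xs, ys, beta = sample_regression_prompt(rng=np.random.default_rng(0))
print(xs.shape, ys.shape)  # (16, 8) (16,)
```

The I/n scaling on β keeps the variance of each target y_i of order one regardless of the input dimension.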
Dataset Splits: No
"All training examples are presented online with batch size 128. During testing, we probe the model's performance both on the training distribution, where the weights are restricted to a finite pool β ∼ U{β_i}_{i=1}^{k}, and an unrestricted distribution, where the weights are drawn freely, β ∼ N(0, I/n)." The paper describes an online training regime with synthetic data generation rather than predefined train/test/validation splits over fixed datasets.
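The two evaluation regimes quoted here (a finite weight pool versus freely drawn weights) can be sketched as a single sampler; the pool size, dimension, and function name below are illustrative assumptions:

```python
import numpy as np

def draw_beta(pool=None, n=8, rng=None):
    """Draw task weights beta.
    With a finite pool {beta_i}: sample uniformly from it (restricted,
    training-like distribution). Without one: draw a fresh beta ~ N(0, I/n)
    (unrestricted test distribution)."""
    rng = rng if rng is not None else np.random.default_rng()
    if pool is not None:
        return pool[rng.integers(len(pool))]   # beta ~ U{beta_i}
    return rng.normal(scale=np.sqrt(1.0 / n), size=n)  # beta ~ N(0, I/n)

rng = np.random.default_rng(0)
n, k = 8, 4                                            # illustrative sizes
pool = rng.normal(scale=np.sqrt(1.0 / n), size=(k, n)) # fixed finite pool

beta_restricted = draw_beta(pool=pool, rng=rng)
beta_unrestricted = draw_beta(n=n, rng=rng)
```

Comparing a model on both distributions separates memorization of the finite pool from genuine in-context generalization to unseen weight vectors.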
Hardware Specification: Yes
"The per-experiment GPU time on an A100 to generate the above figures are estimated at..." (quote truncated in the extracted text)
Software Dependencies: No
"All models are implemented and trained using the Jax (Bradbury et al., 2018) family of libraries, particularly Flax (Heek et al., 2023). Plots are created using Seaborn (Waskom, 2021) and Pandas (pandas development team, 2020)." The paper lists software libraries but does not provide specific version numbers for them.
Experiment Setup: Yes
"For all tasks, we use ReLU activation functions applied pointwise: ϕ(x) = max(x, 0). Widths of all hidden layers are fixed to the same value H. As with all models, all training examples are presented online with batch size 128. Training uses AdamW (Loshchilov and Hutter, 2017) with learning rate α = 1 × 10^-4 and weight decay λ = 1 × 10^-4. The hyperparameters used to train MLPs on each task are presented in Table 1."
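For reference, the decoupled weight-decay update that AdamW applies can be written out in a few lines. This is a generic textbook sketch using the quoted hyperparameters (α = 1e-4, λ = 1e-4), not the paper's actual Jax/Optax training loop:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-4, wd=1e-4,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update: Adam moment estimates plus decoupled weight decay.
    Unlike L2-regularized Adam, the decay term wd * theta is applied
    directly to the parameters, outside the adaptive rescaling."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta = np.ones(3)
m = v = np.zeros(3)
grad = np.array([0.1, -0.2, 0.3])
theta, m, v = adamw_step(theta, grad, m, v, t=1)
```

A positive gradient component nudges its parameter down and a negative one nudges it up, with the weight-decay term pulling all parameters slightly toward zero.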