In-Context Deep Learning via Transformer Models

Authors: Weimin Wu, Maojiang Su, Jerry Yao-Chieh Hu, Zhao Song, Han Liu

ICML 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training. ... In this section, we conduct experiments to verify the capability of ICL to learn feed-forward neural networks, and give details in Appendix F.
Researcher Affiliation | Academia | (1) Center for Foundation Models and Generative AI, Northwestern University, USA; (2) Department of Computer Science, Northwestern University, USA; (3) University of California, Berkeley, USA; (4) Department of Statistics and Data Science, Northwestern University, USA.
Pseudocode | No | The paper describes methods and processes algorithmically but does not present any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Our code is based on the PyTorch implementation of in-context learning for the transformer (Garg et al., 2022) at https://github.com/dtsip/in-context-learning. This refers to a third-party implementation, not the authors' own source code for the methodology described.
Open Datasets | No | We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. ... Specifically, we sample the input of the feed-forward network x ∈ R^d from the Gaussian mixture distribution w1·N(−2, I_d) + w2·N(2, I_d), where w1, w2 ∈ R and d = 20. We consider the network f : R^d → R as a 3-, 4-, or 6-layer NN. We generate the true output by y = f(x). The datasets are synthetic and generated by the authors, with no public access information provided for them.
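The data-generation recipe quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the hidden widths (64), the ReLU activation, and the 1/sqrt(fan-in) weight scale are all assumptions, since the excerpt only fixes d = 20, the mixture means ±2, and the layer counts.

```python
import numpy as np

def sample_inputs(n, d=20, w1=1.0, w2=0.0, seed=0):
    """Sample x from the mixture w1*N(-2, I_d) + w2*N(2, I_d).

    Assumes w1 + w2 = 1 so the weights act as component probabilities.
    """
    rng = np.random.default_rng(seed)
    pick_first = rng.random(n) < w1            # which mixture component each sample uses
    means = np.where(pick_first[:, None], -2.0, 2.0)
    return means + rng.standard_normal((n, d))

def random_relu_mlp(layer_dims, seed=0):
    """Return a random feed-forward f: R^d -> R; layer_dims like [20, 64, 64, 1]."""
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((a, b)) / np.sqrt(a)
          for a, b in zip(layer_dims[:-1], layer_dims[1:])]
    def f(x):
        h = x
        for W in Ws[:-1]:
            h = np.maximum(h @ W, 0.0)         # ReLU hidden layers (assumed activation)
        return (h @ Ws[-1]).squeeze(-1)        # scalar output y = f(x)
    return f

x = sample_inputs(6400)                 # pretraining setting: w1 = 1, so all x ~ N(-2, I_d)
f = random_relu_mlp([20, 64, 64, 1])    # a 3-layer network with hypothetical widths
y = f(x)
```

Swapping `layer_dims` for a longer list (e.g. five or seven entries) gives the 4- and 6-layer variants.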
Dataset Splits | Yes | For the pretraining data, we use 50 in-context examples and sample them from N(−2, I_d). For the testing data, we use 75 in-context examples... The batch size is 64, and the number of batches is 100, i.e., 6400 samples in total. ... We assess performance using the mean R-squared value over all 6400 samples.
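The R-squared metric used for the 6400 test samples can be sketched as below; whether the paper pools all samples into one R-squared or averages per-prompt values is not stated in the excerpt, so the pooled convention here is an assumption.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```

Perfect predictions give 1.0, and always predicting the mean of `y_true` gives 0.0, which makes the scale easy to read across the 3-, 4-, and 6-layer settings.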
Hardware Specification | Yes | We conduct all experiments using 1 NVIDIA A100 GPU with 80GB of memory.
Software Dependencies | No | Our code is based on the PyTorch implementation of in-context learning for the transformer (Garg et al., 2022)... No specific versions of PyTorch or other software dependencies are mentioned.
Experiment Setup | Yes | Both models comprise 12 transformer blocks, each with 8 attention heads, and share the same hidden and MLP dimensions of 256. ... In our setting, we sample the pretraining data from N(−2, I_d), i.e., w1 = 1 and w2 = 0. Following the pre-training method in (Garg et al., 2022), we use a batch size of 64. To construct each sample in a batch... The pretraining process iterates for 500k steps. ... We use the MSE loss between the prediction and the true value of o_i. ... train the network with MSE loss for 100 epochs.
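The hyperparameters quoted above can be collected into a configuration sketch. The field names follow common transformer-config conventions and are assumptions, not the authors' exact code; only the values (12 blocks, 8 heads, 256-dim hidden and MLP, batch 64, 500k steps, MSE loss) come from the paper.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    n_layer: int = 12       # 12 transformer blocks
    n_head: int = 8         # 8 attention heads per block
    n_embd: int = 256       # hidden dimension
    mlp_dim: int = 256      # MLP dimension, equal to the hidden dimension

@dataclass
class TrainConfig:
    batch_size: int = 64
    train_steps: int = 500_000   # pretraining iterates for 500k steps
    loss: str = "mse"            # MSE between prediction and the true o_i

model_cfg = TransformerConfig()
train_cfg = TrainConfig()
```

A config like this would be passed to a GPT-2-style backbone as in the Garg et al. codebase the authors build on.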