TopoLM: brain-like spatio-functional organization in a topographic language model

Authors: Neil Rathi, Johannes Mehrer, Badr AlKhamissi, Taha Binhuraib, Nicholas Blauch, Martin Schrimpf

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comparing TopoLM with a non-topographic baseline model (i.e., one trained without the spatial loss) on a series of benchmarks, we show that while TopoLM achieves slightly lower scores on some behavioral tasks (BLiMP), its performance on other downstream tasks (GLUE) and on brain-alignment benchmarks (using the Brain-Score platform) is on par with the non-topographic control.
Researcher Affiliation | Academia | EPFL; Stanford University; Georgia Institute of Technology; Harvard University
Pseudocode | No | The paper describes the model design and training process in text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/epflneuroailab/topolm.
Open Datasets | Yes | We train our models on a randomly sampled 10B-token subset of the FineWeb-Edu dataset. The final score is the Pearson correlation between actual and predicted brain activity, averaged over 10 cross-validation folds and over four brain-recording datasets (Blank et al., 2014; Fedorenko et al., 2016; Pereira et al., 2018; Tuckute et al., 2024; Appendix D). GLUE (Wang et al., 2018) is a multi-task benchmark for downstream performance on tasks like entailment and sentiment analysis. Using fMRI data from Hauptman et al. (2024) (Appendix B), we find verb- and noun-selective clusters in the left hemisphere... Available on OpenICPSR at https://doi.org/10.3886/E198163V3. Moseley & Pulvermüller (2014) focus on how cortical noun-verb selectivity relates to semantics.
Dataset Splits | Yes | The final score is the Pearson correlation between actual and predicted brain activity, averaged over 10 cross-validation folds and over four brain-recording datasets (Blank et al., 2014; Fedorenko et al., 2016; Pereira et al., 2018; Tuckute et al., 2024; Appendix D).
Hardware Specification | Yes | Models trained for 5 days on 4× NVIDIA 80GB A100s.
Software Dependencies | No | The paper describes the model architecture (Transformer, GPT-2-small style) and training optimizer (AdamW) but does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | In the below experiments, we utilize an adapted GPT-2-small-style architecture (Radford et al., 2019). We use hidden dimension 784 such that we can evenly embed units in a 28 × 28 grid. The model has 12 Transformer blocks, each with 16 attention heads and a GELU activation function. We train our models on a randomly sampled 10B-token subset of the FineWeb-Edu dataset. The task loss is cross-entropy on next-word prediction. We use batch size 48 and block size 1024. For the spatial loss, we set αk = 2.5 across all layers and operationalize the inverse distance vector d with the ℓ1 norm. For each batch, we average the spatial loss across 5 randomly selected neighborhoods, each of ℓ1 radius 5. After hyperparameter tuning, we optimize using AdamW with β1 = 0.9, β2 = 0.95, and learning rate 6 × 10⁻⁴, scheduled with warmup and cosine decay. We use weight decay 0.1, gradient clipping at 1.0, and do not use dropout. We trained both models with early stopping after three consecutive increases in validation loss. We set unit distance 1.0 mm and FWHM 2.0 mm.
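The spatial loss quoted in the Experiment Setup row can be sketched in code. The following is an illustrative reconstruction from the quoted hyperparameters only (αk = 2.5, ℓ1 neighborhoods of radius 5, inverse ℓ1 distances on a 28 × 28 grid), not the authors' implementation: the function names and the exact loss form used here — α/2 · (1 − r), where r is the Pearson correlation between pairwise unit-activation correlations and inverse pairwise distances — are assumptions; consult the released code at https://github.com/epflneuroailab/topolm for the actual definition.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    x, y = x - x.mean(), y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

def spatial_loss(acts, grid, center, radius=5, alpha=2.5):
    """Illustrative spatial loss for a single neighborhood.

    acts   : (batch, n_units) unit activations
    grid   : (n_units, 2) integer (row, col) positions on the unit grid
    center : index of the unit at the neighborhood center

    Selects units within l1 `radius` of `center`, then scores how well
    pairwise activation correlations track inverse pairwise l1 distances
    (assumed form: alpha / 2 * (1 - r)).
    """
    # units inside the l1 ball around the chosen center
    dist_to_center = np.abs(grid - grid[center]).sum(axis=1)
    idx = np.where(dist_to_center <= radius)[0]
    corrs, inv_dists = [], []
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            # correlation of the two units' activations across the batch
            corrs.append(pearson(acts[:, i], acts[:, j]))
            # inverse l1 distance between the two grid positions
            inv_dists.append(1.0 / np.abs(grid[i] - grid[j]).sum())
    return alpha / 2 * (1 - pearson(np.array(corrs), np.array(inv_dists)))

# Toy usage: a 28 x 28 grid matches the paper's 784-unit hidden dimension.
rng = np.random.default_rng(0)
grid = np.array([(r, c) for r in range(28) for c in range(28)])
acts = rng.standard_normal((48, 784))  # batch size 48, as in the quoted setup
loss = spatial_loss(acts, grid, center=14 * 28 + 14)
```

Per the quoted setup, such a term would be averaged over 5 randomly selected neighborhoods per batch and added to the next-word-prediction cross-entropy.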