TopoLM: brain-like spatio-functional organization in a topographic language model

Authors: Neil Rathi, Johannes Mehrer, Badr AlKhamissi, Taha Binhuraib, Nicholas Blauch, Martin Schrimpf

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comparing TopoLM with a non-topographic baseline model (i.e., one trained without the spatial loss) on a series of benchmarks, we show that while TopoLM achieves slightly lower scores on some behavioral tasks (BLiMP), its performance on other downstream tasks (GLUE) and on brain-alignment benchmarks (using the Brain-Score platform) is on par with the non-topographic control.
Researcher Affiliation | Academia | EPFL; Stanford University; Georgia Institute of Technology; Harvard University
Pseudocode | No | The paper describes the model design and training process in text and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/epflneuroailab/topolm.
Open Datasets | Yes | We train our models on a randomly sampled 10B-token subset of the FineWeb-Edu dataset. The final score is the Pearson correlation between actual and predicted brain activity, averaged over 10 cross-validation folds and over four brain-recording datasets (Blank et al., 2014; Fedorenko et al., 2016; Pereira et al., 2018; Tuckute et al., 2024; Appendix D). GLUE (Wang et al., 2018) is a multi-task benchmark for downstream performance on tasks like entailment and sentiment analysis. Using fMRI data from Hauptman et al. (2024) (Appendix B), we find verb- and noun-selective clusters in the left hemisphere... Available on OpenICPSR at https://doi.org/10.3886/E198163V3. Moseley & Pulvermüller (2014) focus on how cortical noun-verb selectivity relates to semantics.
Dataset Splits | Yes | The final score is the Pearson correlation between actual and predicted brain activity, averaged over 10 cross-validation folds and over four brain-recording datasets (Blank et al., 2014; Fedorenko et al., 2016; Pereira et al., 2018; Tuckute et al., 2024; Appendix D).
Hardware Specification | Yes | Models trained for 5 days on 4× NVIDIA 80GB A100s.
Software Dependencies | No | The paper describes the model architecture (Transformer, GPT-2-small style) and training optimizer (AdamW) but does not provide specific version numbers for software libraries or dependencies (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | In the below experiments, we utilize an adapted GPT-2-small-style architecture (Radford et al., 2019). We use hidden dimension 784 such that we can evenly embed units in a 28 × 28 grid. The model has 12 Transformer blocks, each with 16 attention heads and a GELU activation function. We train our models on a randomly sampled 10B-token subset of the FineWeb-Edu dataset. The task loss is cross-entropy on next-word prediction. We use batch size 48 and block size 1024. For the spatial loss, we set αk = 2.5 across all layers and operationalize the inverse distance vector d with the ℓ1 norm. For each batch, we average the spatial loss across 5 randomly selected neighborhoods, each of ℓ1 radius 5. After hyperparameter tuning, we optimize using AdamW with β1 = 0.9, β2 = 0.95, and learning rate 6 × 10⁻⁴, scheduled with warmup and cosine decay. We use weight decay 0.1, gradient clipping at 1.0, and do not use dropout. We trained both models with early stopping after three consecutive increases in validation loss. We set unit distance 1.0 mm and FWHM 2.0 mm.
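The spatial loss quoted in the Experiment Setup row can be sketched in code. The following is an illustrative reconstruction from the quoted hyperparameters only (αk = 2.5, ℓ1 neighborhoods of radius 5, inverse ℓ1 distances on a 28 × 28 grid), not the authors' implementation: the function names and the exact loss form used here — α/2 · (1 − r), where r is the Pearson correlation between pairwise unit-activation correlations and inverse pairwise distances — are assumptions; consult the released code at https://github.com/epflneuroailab/topolm for the actual definition.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    x, y = x - x.mean(), y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

def spatial_loss(acts, grid, center, radius=5, alpha=2.5):
    """Illustrative spatial loss for a single neighborhood.

    acts   : (batch, n_units) unit activations
    grid   : (n_units, 2) integer (row, col) positions on the unit grid
    center : index of the unit at the neighborhood center

    Selects units within l1 `radius` of `center`, then scores how well
    pairwise activation correlations track inverse pairwise l1 distances
    (assumed form: alpha / 2 * (1 - r)).
    """
    # units inside the l1 ball around the chosen center
    dist_to_center = np.abs(grid - grid[center]).sum(axis=1)
    idx = np.where(dist_to_center <= radius)[0]
    corrs, inv_dists = [], []
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            # correlation of the two units' activations across the batch
            corrs.append(pearson(acts[:, i], acts[:, j]))
            # inverse l1 distance between the two grid positions
            inv_dists.append(1.0 / np.abs(grid[i] - grid[j]).sum())
    return alpha / 2 * (1 - pearson(np.array(corrs), np.array(inv_dists)))

# Toy usage: a 28 x 28 grid matches the paper's 784-unit hidden dimension.
rng = np.random.default_rng(0)
grid = np.array([(r, c) for r in range(28) for c in range(28)])
acts = rng.standard_normal((48, 784))  # batch size 48, as in the quoted setup
loss = spatial_loss(acts, grid, center=14 * 28 + 14)
```

Per the quoted setup, such a term would be averaged over 5 randomly selected neighborhoods per batch and added to the next-word-prediction cross-entropy.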