Population Transformer: Learning Population-level Representations of Neural Activity
Authors: Geeling Chau, Christopher Wang, Sabera Talukder, Vighnesh Subramaniam, Saraswati Soedarmadji, Yisong Yue, Boris Katz, Andrei Barbu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that our pretrained PopT outperforms commonly used aggregation approaches (Ghosal & Abbasi-Asl, 2021), and is competitive with end-to-end trained methods (Zhang et al., 2024; Yang et al., 2024; You et al., 2019). Moreover, we find that these benefits hold even for subjects not seen during pretraining, indicating its usefulness for new subject decoding. We also show that the pretrained PopT weights themselves reveal interpretable patterns for neuroscientific study. Finally, we demonstrate that our proposed framework is agnostic to the underlying temporal encoder, further allowing it to adapt to other neural recording modalities. Tables 1, 2 and Figures 2, 3, 4, 5, 6, 7 are provided to support these claims. |
| Researcher Affiliation | Academia | 1 California Institute of Technology EMAIL; 2 MIT CSAIL, CBMM EMAIL |
| Pseudocode | Yes | Algorithm 1 Connectivity measurement between channels i and j |
| Open Source Code | Yes | We release our code as well as a pretrained PopT to enable off-the-shelf improvements in multi-channel intracranial data decoding and interpretability. Code is available at https://github.com/czlwang/PopulationTransformer. |
| Open Datasets | Yes | iEEG: We use the publicly available subject data from Wang et al. (2024). EEG: We use the Temple University Hospital EEG and Abnormal datasets, TUEG and TUAB (Obeid & Picone, 2016), for pretraining and task data respectively. |
| Dataset Splits | Yes | We pretrain for 500,000 steps, and record the validation performance every 1,000 steps. Downstream evaluation takes place on the weights with the best validation performance. We use the intermediate representation at the [CLS] token dh = 512 and put a linear layer that outputs to dout = 1 for fine-tuning on downstream tasks. These parameters for pretraining were the same for any PopT that needed to be pretrained (across temporal embeddings, hold-one-out subject, ablation studies). ... For all downstream decoding, we use a fixed train/val/test split of 0.8, 0.1, 0.1 of the data. |
| Hardware Specification | Yes | To run all our experiments (data processing, pretraining, evaluations, interpretability), one only needs 1 NVIDIA Titan RTX (24GB GPU RAM). Pretraining PopT takes 2 days on 1 GPU. Our downstream evaluations take a few minutes to run each. For the purposes of data processing and gathering all the results in the paper, we parallelized the experiments on 8 GPUs. ... Table 5: PopT 1 NVIDIA TITAN RTX (24GB) |
| Software Dependencies | No | The paper mentions several libraries and optimizers by reference (e.g., LAMB optimizer (You et al., 2019), AdamW optimizer (Loshchilov & Hutter, 2017), MNE-Python (Gramfort et al., 2013), scikit-learn (Pedregosa et al., 2011), Nilearn (Nilearn contributors)), and a scheduler by a GitHub user (ildoonet, 2024), but it does not specify concrete version numbers for these software components as used in their experimental setup. |
| Experiment Setup | Yes | The core Population Transformer consists of a transformer encoder stack with 6 layers, 8 heads. All layers in the encoder stack are set with the following parameters: dh = 512, H = 8, and pdropout = 0.1. We pretrain the PopT model with the LAMB optimizer (You et al., 2019) (lr = 5e-4), with a batch size of nbatch = 256, and train/val/test split of 0.89, 0.01, 0.10 of the data. We pretrain for 500,000 steps, and record the validation performance every 1,000 steps. ... For PopT models, we train with these parameters: AdamW optimizer (Loshchilov & Hutter, 2017), lr = 5e-4 where transformer weights are scaled down by a factor of 10 (lr_t = 5e-5), nbatch = 128, a RampUp scheduler (ildoonet, 2024) with warmup 0.025 and StepLR gamma 0.95, reducing 100 times within the 2000 total steps that we train for. |
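The reported experiment setup (6-layer, 8-head encoder with dh = 512 and dropout 0.1; AdamW fine-tuning with transformer weights at one tenth of the base learning rate; StepLR with gamma 0.95) can be sketched in PyTorch. This is a hedged reconstruction, not the authors' released code: the class name `PopTSketch`, the feed-forward width (left at PyTorch's default), and the `step_size=20` (inferred from "reducing 100 times within the 2000 total steps") are assumptions.

```python
# Sketch of the PopT fine-tuning configuration as described in the report.
# NOT the authors' implementation; hyperparameters follow the quoted setup,
# everything else (e.g. feed-forward width) is an assumption.
import torch
import torch.nn as nn

D_H, N_HEADS, N_LAYERS, P_DROP = 512, 8, 6, 0.1  # dh, H, layers, pdropout

class PopTSketch(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=D_H, nhead=N_HEADS, dropout=P_DROP, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        # Linear head on the [CLS] representation: dh = 512 -> dout = 1
        self.head = nn.Linear(D_H, 1)

    def forward(self, x):  # x: (batch, 1 + n_channels, 512); index 0 = [CLS]
        h = self.encoder(x)
        return self.head(h[:, 0])

model = PopTSketch()
# Two-tier fine-tuning learning rates: base lr = 5e-4,
# transformer weights scaled down 10x (lr_t = 5e-5).
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 5e-5},
    {"params": model.head.parameters(), "lr": 5e-4},
])
# StepLR with gamma = 0.95; step_size=20 assumed so the rate decays
# 100 times over the 2000 total fine-tuning steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.95)
```

The per-parameter-group dictionaries passed to `AdamW` are the standard PyTorch mechanism for assigning different learning rates to the pretrained encoder and the freshly initialized linear head.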