Human-Aligned Image Models Improve Visual Decoding from the Brain

Authors: Nona Rajabi, Antonio H. Ribeiro, Miguel Vasco, Farzaneh Taleb, Mårten Björkman, Danica Kragic

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical results support this hypothesis, demonstrating that this simple modification improves image retrieval accuracy by up to 21% compared to state-of-the-art methods. Comprehensive experiments confirm consistent performance improvements across diverse EEG architectures, image encoders, alignment methods, participants, and brain imaging modalities.
Researcher Affiliation Academia Division of Robotics, Perception, and Learning, KTH Royal Institute of Technology, Stockholm, Sweden; Department of Information Technology, Uppsala University, Uppsala, Sweden.
Pseudocode No The paper describes its methodology using text and mathematical equations in Section 2, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes All codes are available at https://github.com/NonaRjb/AlignVis.git
Open Datasets Yes We used the THINGS-EEG2 dataset (Gifford et al., 2022) to train and evaluate our framework. For the results in Section 5.4, we used preprocessed MEG data from Hebart et al. (2023)... we extended our experiments to the NSD dataset (Allen et al., 2022).
Dataset Splits Yes The training set includes 1,654 unique concepts, each with 10 images shown in random order and repeated 4 times, totaling 1,654 × 10 × 4 samples per participant. The test set contains 200 distinct concepts, each with 1 image shown 80 times, yielding 200 × 1 × 80 samples per participant. [...] Models were trained with a 90%/10% split and evaluated on the test set. [...] The training set contains 1,854 × 12 × 1 samples, while the test set includes 200 × 1 × 12 samples per participant. [...] This resulted in a dataset comprising 24,980 training samples and 2,770 test samples. For the test set, we averaged the brain responses across the three repetitions of each image, reducing the test set to 982 unique samples, while the training set remained unaveraged.
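The per-participant sample counts implied by the quoted split descriptions can be tallied directly; this is a minimal sketch (not from the paper's code), and all numbers come from the text above:

```python
# Tally the per-participant split sizes described in the quoted text.
def n_samples(concepts, images_per_concept, repetitions):
    """Total trials per participant for one split."""
    return concepts * images_per_concept * repetitions

eeg_train = n_samples(1654, 10, 4)   # THINGS-EEG2 training split
eeg_test = n_samples(200, 1, 80)     # THINGS-EEG2 test split
meg_train = n_samples(1854, 12, 1)   # MEG training split
meg_test = n_samples(200, 1, 12)     # MEG test split

print(eeg_train, eeg_test, meg_train, meg_test)  # 66160 16000 22248 2400
```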
Hardware Specification No The computations and data handling were enabled mainly by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre and partly by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725. This describes computing resources but does not provide specific hardware details such as GPU/CPU models or memory.
Software Dependencies No We obtained human-aligned image encoders directly from the DreamSim, gLocal, and Harmonization repositories provided by the authors, ensuring they function exactly as reported in their respective papers without any retraining. For the original unaligned encoders, we used publicly available pretrained models using the Hugging Face transformers (Table 4) or timm (Table 5) libraries. The paper lists software libraries and frameworks but does not specify their version numbers for reproducibility.
Experiment Setup Yes For per-participant EEG experiments, NICE encoders were trained for up to 50 epochs with a batch size of 128, a learning rate of 0.0002, and a temperature of 0.04. The same hyperparameters were used for ATM-S, except for the number of epochs, which was set to 80. EEGNet and EEGConformer were both trained for 200 epochs. EEGConformer used a learning rate of 0.0002, a batch size of 128, and a temperature of 0.07, while EEGNet used 0.01, 512, and 0.1, respectively. For cross-participant training, NICE was trained for up to 150 epochs with a batch size of 512 and a learning rate of 0.0001. For MEG experiments, we used a learning rate of 0.00005, a batch size of 256, and a temperature of 0.1, training for up to 50 epochs. Training was halted in all models if validation loss did not improve for 25 consecutive epochs. We trained the MLP fMRI encoder with residual connections proposed by Scotti et al. (2023) for 50 epochs with a learning rate of 0.0001 and a batch size of 128.
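The schedule above (fixed hyperparameters plus a 25-epoch early-stopping patience on validation loss) can be sketched as follows. The model, data loaders, and loss are placeholders (assumptions); the temperature would live inside the contrastive `loss_fn`, which is left abstract here. The defaults mirror the per-participant NICE setting quoted above:

```python
import torch

def train_encoder(model, train_loader, val_loader, loss_fn,
                  max_epochs=50, lr=2e-4, patience=25):
    """Hypothetical training loop: hyperparameter defaults follow the quoted
    per-participant NICE setting; everything else is a placeholder."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for eeg, img_emb in train_loader:
            opt.zero_grad()
            loss_fn(model(eeg), img_emb).backward()
            opt.step()
        # Validation pass to drive early stopping.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(e), i).item()
                      for e, i in val_loader) / len(val_loader)
        if val < best_val:
            best_val, epochs_without_improvement = val, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # halt: no validation improvement for `patience` epochs
    return model
```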