Vector-ICL: In-context Learning with Continuous Vector Representations
Authors: Yufan Zhuang, Chandan Singh, Liyuan Liu, Jingbo Shang, Jianfeng Gao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments across various tasks and modalities, including text reconstruction, numerical function regression, text classification, summarization, molecule captioning, time-series classification, graph classification, and fMRI decoding, Vector-ICL often surpasses both few-shot ICL and domain-specific models or tuning. We further conduct analyses and case studies, indicating the potential of LLMs to process vector representations beyond traditional token-based paradigms. |
| Researcher Affiliation | Collaboration | Yufan Zhuang (UC San Diego), Chandan Singh (Microsoft Research), Liyuan Liu (Microsoft Research), Jingbo Shang (UC San Diego), and Jianfeng Gao (Microsoft Research) |
| Pseudocode | No | The paper describes methodologies and processes with figures (e.g., Figure 1 and Figure 2 illustrating Vector-ICL and pretraining/finetuning projectors) but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/EvanZhuang/vector-icl. |
| Open Datasets | Yes | To pretrain our text projectors, we leverage the WikiText-103 (Merity et al., 2016) dataset... We use two datasets for the text reconstruction task, the Parallel Sentences Talks English subset (Tiedemann, 2012) and Quora (Thakur et al., 2021)... We use five datasets for the text classification task. For binary classification, we include IMDB (Maas et al., 2011), Rotten Tomatoes (Pang & Lee, 2005), and the Stanford Sentiment Treebank (SST2) (Socher et al., 2013). For multi-class classification, we use the Emotion dataset (Saravia et al., 2018) and the Financial Phrasebank (Malo et al., 2014)... We use two datasets for the summarization task, XSum (Narayan et al., 2018) and the English subset of XLSum (Hasan et al., 2021). We use the Language + Molecules-24 (LPM24) dataset for the molecule captioning task... We analyze data from LeBel et al. (2022) and Tang et al. (2023), which consists of fMRI responses... We use two datasets for the time-series classification task, FordA and FordB, both part of the UCR Time Series Classification Archive (Dau et al., 2019)... We use the ogbg-molhiv (Hu et al., 2020) dataset for the graph classification task. |
| Dataset Splits | Yes | The data was separated into training and test sets by holding out the same three podcast stories from the three human subjects. We use the same pretraining methodology as text to pretrain on the brain fMRI data. As the data comes in as segments of text and the recorded fMRI, we randomly sample 20% of the segments to be in fMRI form and projected into box tokens, and we impose next-token generation loss on the remaining 80%. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Table 3: Hyperparameters for V-ICL training — Learning Rate: 1e-3; Learning Rate Schedule: Cosine Annealing; Optimizer: AdamW (β1 = 0.9, β2 = 0.999); Training dtype: bf16; Batch Size: 128; Generation Temperature: 0.2. Early stopping with a patience of 500 steps is used during finetuning, as projectors converge quickly due to their small parameter sizes. |
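The training schedule reported in Table 3 can be sketched as follows. A minimal sketch, not the authors' code: the peak learning rate (1e-3), cosine annealing, and the 500-step early-stopping patience come from the paper, while the total step count (`total_steps=10_000`) and the helper names are illustrative assumptions.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate: decays from lr_max at step 0
    to lr_min at the final step (lr_max=1e-3 per Table 3)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

class EarlyStopper:
    """Stop after `patience` consecutive steps without loss improvement
    (patience=500 per the paper's finetuning setup)."""
    def __init__(self, patience=500):
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0

    def update(self, loss):
        # Returns True once `patience` non-improving steps have accumulated.
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience

# The schedule starts at the peak LR and decays toward lr_min.
print(cosine_annealing_lr(0, 10_000))       # 0.001
print(cosine_annealing_lr(10_000, 10_000))  # 0.0
```

Because the projectors are small relative to the frozen LLM, they converge within a few thousand steps, which is why the aggressive 1e-3 learning rate combined with early stopping is workable here.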