Rapid Selection and Ordering of In-Context Demonstrations via Prompt Embedding Clustering
Authors: Kha Pham, Hung Le, Man Ngo, Truyen Tran
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate the prompt embedding space... We provide extensive analyses to confirm the clustering property. In particular, we visualize prompt embeddings in 2D spaces using UMAP, run K-Means clustering on high-dimensional embedding spaces, and quantify the importance of input tokens by their partial derivative norms. Experimental results consistently support the existence of clusters... We apply Cluster-based Search in two selection scenarios... In both cases, our proposed method achieves competitive accuracies compared to exhaustive search while being significantly faster, saving 92% to nearly 100% of execution time. |
| Researcher Affiliation | Academia | ¹ Applied Artificial Intelligence Institute, Deakin University; ² Faculty of Data Science in Business, Ho Chi Minh University of Banking, Vietnam |
| Pseudocode | Yes | Algorithm 1 Entropy-Based Selecting Criterion. Input: set of prompt candidates P. Initialize c_Best = −∞, p_Best = None. For each p in P: compute logits ℓ₁; compute confidence score c(ℓ₁); if c(ℓ₁) > c_Best then c_Best = c(ℓ₁) and p_Best = p. Output: p_Best |
| Open Source Code | No | The paper does not provide an explicit statement or a link indicating that the authors have released the source code for their methodology. |
| Open Datasets | Yes | For text classification, we consider tasks of sentiment classification and language identification. We use data from the SST-2 dataset (Socher et al., 2013) for sentiment classification... The dataset for language identification is taken from Hugging Face (Hugging Face, 2021)... For the common-sense reasoning task, we leverage question-answer pairs from the CommonsenseQA dataset (Talmor et al., 2019)... For the mathematical arithmetic task, we use questions and answers from the AddSub dataset (Hosseini et al., 2014)... we train decoder-only Transformers from scratch on the WikiText-2 dataset (Merity et al., 2017). |
| Dataset Splits | No | The paper mentions generating prompts with k demonstrations from a k_total pool and using '1,000 tuples of (E, q)' or '100 randomized prompts' for experiments. For Transformers trained from scratch, it states training on the 'WikiText-2 dataset with SGD optimizer with learning rate 5e-1 in 100 epochs.' However, it does not explicitly specify traditional training/validation/test splits for any of the datasets used for either the pre-trained LLMs or the custom-trained Transformers. |
| Hardware Specification | No | The paper mentions various LLMs used (GPT-2, GPT-Neo, Llama-v1/v2, MPT, Phi-2, Qwen-2.5) and describes their architecture (e.g., 12 self-attention layers, token embedding size 768 for custom-trained Transformers), but it does not specify the particular GPU models, CPU types, or other hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions using UMAP, K-Means clustering, t-SNE for visualizations and analysis, and SGD optimizer for training. However, it does not provide specific version numbers for any of these software components, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Specifically, we train from scratch four Transformers: one with full components, one without positional encoding, one without causal attention mask, and one without both. Other Transformer components are always included. Each Transformer has 12 self-attention layers, each with 12 attention heads; the token embedding size is 768, and the hidden size in MLP layers is 2048. We train the Transformers with different types of positional encodings, namely sinusoidal, rotary, and trainable positional encoding, on the WikiText-2 dataset using the SGD optimizer with learning rate 5e-1 for 100 epochs. For fair comparisons, the training task for all Transformers is next-token prediction. |
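The entropy-based selecting criterion quoted in the Pseudocode row can be sketched in plain Python. This is a minimal sketch, not the authors' released code: `get_first_token_logits` is a hypothetical stand-in for a model call returning the logits ℓ₁ for a candidate prompt, and negative Shannon entropy is used as one plausible confidence score c(ℓ₁) (lower entropy ⇒ higher confidence).

```python
import math

def entropy(logits):
    """Shannon entropy of the softmax distribution over the given logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # shift by max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_prompt(candidates, get_first_token_logits):
    """Algorithm 1 sketch: return the candidate prompt whose first-token
    prediction the model is most confident about (lowest entropy)."""
    best_score, best_prompt = -math.inf, None
    for p in candidates:
        logits = get_first_token_logits(p)  # logits l_1 for prompt p
        confidence = -entropy(logits)       # low entropy => high confidence
        if confidence > best_score:
            best_score, best_prompt = confidence, p
    return best_prompt

# Usage with toy logits: a peaked distribution beats a uniform one.
logits_by_prompt = {"a": [10.0, 0.0, 0.0], "b": [1.0, 1.0, 1.0]}
best = select_prompt(["a", "b"], lambda p: logits_by_prompt[p])
```

Note the initialization to −∞: since the criterion maximizes confidence, starting the running best at +∞ would prevent any candidate from ever being selected.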
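The clustering analysis cited under Research Type (K-Means on high-dimensional prompt embeddings) can be illustrated with a minimal Lloyd's-iteration k-means. This is a sketch under stated assumptions, not the paper's pipeline: the synthetic `embeddings` array stands in for real prompt embeddings, `k=2` is an illustrative choice, and the deterministic farthest-point initialization is our simplification, not necessarily what the authors used.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic greedy init: start at X[0], then repeatedly add the
    point farthest from the current centroid set."""
    idx = [0]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=-1).min(axis=1)
        idx.append(int(d.argmax()))
    return X[idx].copy()

def kmeans(X, k, iters=100):
    """Minimal Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_point_init(X, k)
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new = np.vstack([X[labels == j].mean(axis=0) if (labels == j).any()
                         else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Synthetic stand-in for high-dimensional prompt embeddings:
# two well-separated Gaussian blobs in 8 dimensions.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0.0, 0.1, size=(20, 8)),
                        rng.normal(5.0, 0.1, size=(20, 8))])
centroids, labels = kmeans(embeddings, k=2)
```

On data with a genuine cluster structure like this, the assignment recovers the two blobs exactly; the paper's analysis runs the same kind of procedure on actual prompt embeddings to test for such structure.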