Iterative Vectors: In-Context Gradient Steering without Backpropagation
Authors: Yiting Liu, Zhi-Hong Deng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate IVs across various tasks using four popular models and observe significant improvements. Our findings suggest that in-context activation steering is a promising direction, opening new avenues for future research. |
| Researcher Affiliation | Academia | State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University. Correspondence to: Zhi-Hong Deng <EMAIL>. |
| Pseudocode | Yes | The pseudocode for the extraction and evaluation process is available in Appendix B. To facilitate understanding, Appendix C includes an example of the processes described. Algorithm 1 Extraction of Iterative Vectors Algorithm 2 Evaluation Algorithm 3 Episodic Functions |
| Open Source Code | Yes | Our code is available on GitHub. |
| Open Datasets | Yes | Details of all the datasets used in this paper can be found in Appendix E, while additional results with the other two metrics are provided in Appendix F. E. Datasets A full list of all datasets utilized in this research, along with their corresponding access labels, is detailed in Table 5. The datasets are obtained from Hugging Face (Lhoest et al., 2021). |
| Dataset Splits | Yes | For a given split of an n-way k-shot classification task T = {T_train, T_val, T_test}, which comprises textual query-answer pairs (x, y), an ICL episode is sampled. We evaluate over 200 episodes for both extraction (T_train) and hyperparameter search (T_val). |
| Hardware Specification | Yes | All experiments can be performed on a single Nvidia RTX A6000 GPU unless stated otherwise. Conducted on 3 Nvidia RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions various language models (GPT-J-6B, Llama 2, Llama 3.1) and the Hugging Face platform for datasets but does not provide specific version numbers for software libraries or dependencies used in their implementation. |
| Experiment Setup | Yes | For the hyperparameters of IVs, we use a fixed iterative batch size of b = 10 and explore the extraction strength and inference strength α1, α2 ∈ {0.1, 0.3, 0.5, 0.7, 0.9} across all tasks. Regarding the extraction shot k, we test k ∈ {1, 2, 3, 4} for both TVs and IVs. All experiments were conducted using a predetermined random seed (42) to mitigate selection bias. To ensure a robust representation of result distributions, the tests are averaged over a substantial number of episodes, namely 10,000. We reuse hyperparameters obtained from prior searches in the main experiment (k = 4, b = 10 fixed, α1 = 0.3, α2 = 0.5). |
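
The experiment-setup row describes an exhaustive grid search over the extraction shot k and the two steering strengths α1, α2, scored on validation episodes. A minimal sketch of that search loop is below; `eval_on_val` is a hypothetical placeholder for the paper's actual validation-accuracy measurement, and only the grid values come from the source.

```python
import itertools
import random

def eval_on_val(k, alpha1, alpha2, seed=42):
    """Hypothetical stand-in for the paper's validation metric.

    In the real pipeline this would extract Iterative Vectors with
    shot k and strength alpha1, steer inference with alpha2, and
    average accuracy over the T_val episodes. Here it just returns a
    deterministic pseudo-random score so the loop is runnable.
    """
    rng = random.Random((k, alpha1, alpha2, seed))
    return rng.random()

# Grid reported in the paper: k in {1..4}, alpha1/alpha2 in {0.1..0.9}.
shots = [1, 2, 3, 4]
strengths = [0.1, 0.3, 0.5, 0.7, 0.9]

# Pick the configuration with the best validation score.
best = max(
    itertools.product(shots, strengths, strengths),
    key=lambda cfg: eval_on_val(*cfg),
)
k_best, a1_best, a2_best = best
print(f"best config: k={k_best}, alpha1={a1_best}, alpha2={a2_best}")
```

With the paper's 4 × 5 × 5 grid this is 100 evaluations per task, which matches the reported practice of searching once over 200 validation episodes and then reusing the found hyperparameters (k = 4, α1 = 0.3, α2 = 0.5) in the main 10,000-episode runs.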