NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval

Authors: Sepanta Zeighami, Zac Wellmer, Aditya Parameswaran

ICLR 2025

Reproducibility assessment. Each entry below gives the variable assessed, the result, and the supporting LLM response excerpted from the paper.
Variable: Research Type
Result: Experimental
LLM Response: "We present a thorough theoretical and experimental analysis of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pretraining. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often achieves NDCG@10 at least 10% higher than existing fine-tuning methods."
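The metric cited above, NDCG@10, can be sketched in a few lines of Python. This is the standard textbook formulation with graded (here binary) relevance labels, not code from the paper:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A ranking that places the single relevant document first scores 1.0;
# pushing it to rank 2 discounts the gain by 1/log2(3).
print(ndcg_at_k([1, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 1, 0, 0]))  # ~0.631
```

Libraries such as scikit-learn (`sklearn.metrics.ndcg_score`) provide a vectorized equivalent.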
Variable: Researcher Affiliation
Result: Collaboration
LLM Response: "Sepanta Zeighami, UC Berkeley, EMAIL; Zac Wellmer, EMAIL; Aditya Parameswaran, UC Berkeley, EMAIL"
Variable: Pseudocode
Result: Yes
LLM Response: "Algorithm 1: NUDGE-M algorithm"
Variable: Open Source Code
Result: Yes
LLM Response: "End-to-end source code is available at https://github.com/szeighami/nudge; it downloads the datasets from publicly available sources, runs our methods and the baselines, and reproduces our results."
Variable: Open Datasets
Result: Yes
LLM Response: "For text retrieval we use seven standard datasets: SciFact (Wadden et al., 2020), FEVER (Fever, 2024), and ArguAna (Arguana, 2024) (we use their BEIR (Thakur et al., 2021) versions); TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019) (we use their KILT (Petroni et al., 2021) versions); and NF-Corpus (Boteva et al., 2016). (...) For image retrieval, we use the COCO (Lin et al., 2014) (2014 release) and Flickr (Young et al., 2014) datasets."
Variable: Dataset Splits
Result: Yes
LLM Response: "For all text and image datasets, we use a 0.7-0.1-0.2 train/validation/test split, but cap the validation and test sets at 10,000 queries each if there are more."
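The split rule quoted above can be sketched as follows. The shuffling and seeding details are assumptions for illustration; the paper only states the 0.7-0.1-0.2 ratios and the 10,000-query cap:

```python
import random

def split_queries(queries, seed=0, max_eval=10_000):
    """0.7/0.1/0.2 train/validation/test split over queries, capping the
    validation and test sets at max_eval queries each.
    Shuffle-with-seed is an assumption, not the paper's actual code."""
    qs = list(queries)
    random.Random(seed).shuffle(qs)
    n_train = int(0.7 * len(qs))
    n_val = int(0.1 * len(qs))
    train = qs[:n_train]
    val = qs[n_train:n_train + n_val][:max_eval]
    test = qs[n_train + n_val:][:max_eval]
    return train, val, test

train, val, test = split_queries(range(100))
print(len(train), len(val), len(test))  # 70 10 20
```

For a dataset with 200,000 queries, the raw validation and test portions (20,000 and 40,000) would both be truncated to 10,000.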
Variable: Hardware Specification
Result: Yes
LLM Response: "Table 8 shows the total fine-tuning time to run BGE-S on our text datasets (i.e., the time to obtain the associated results in Table 3) using an Nvidia A100 GPU as well as 32-core Intel Broadwell CPUs."
Variable: Software Dependencies
Result: No
LLM Response: "The codebase contains a Dockerfile and instructions to run the code in a Docker container, ensuring the code is easily executable and runs with the same configuration to reproduce our results exactly. The code contains all data processing steps (as described in Sec. 4), hyperparameter and other experimental settings (as described in Sec. 4 and detailed in Appx. E.1), and implementations of all the methods and baselines." No specific software versions for libraries or programming languages are provided in the paper text; only a Docker container is used for reproducibility.
Variable: Experiment Setup
Result: Yes
LLM Response: "For both Adaptor and PTFT, we performed hyperparameter tuning to determine the learning rate (and use of a scheduler), batch size (although for PTFT it is bottlenecked by GPU memory size), number of training steps, model architecture and initialization (for Adaptor, we tried linear up to 8-layer MLPs), which layers to train (for PTFT, we tried training the full model or only the last layer), and the choice and parameters of the loss function. We only performed hyperparameter tuning for BGE-S and used the resulting hyperparameters for the other models. (...) We use cosine similarity as the retrieval distance metric for No Fine-Tuning, Adaptor, and PTFT. (...) We use early stopping if validation accuracy drops by more than 5% compared with the maximum it has achieved."
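The early-stopping rule quoted above can be sketched as follows, interpreting the 5% drop as relative to the best validation accuracy seen so far (the paper does not specify relative vs. absolute; the class and method names here are illustrative, not the paper's):

```python
class DropEarlyStopper:
    """Signal a stop once validation accuracy falls more than `tolerance`
    (relative) below the best value observed so far."""

    def __init__(self, tolerance=0.05):
        self.tolerance = tolerance
        self.best = float("-inf")

    def should_stop(self, val_accuracy):
        # Track the running maximum, then test the drop against it.
        self.best = max(self.best, val_accuracy)
        return val_accuracy < self.best * (1 - self.tolerance)

stopper = DropEarlyStopper()
print(stopper.should_stop(0.80))  # False: new best
print(stopper.should_stop(0.78))  # False: within 5% of 0.80
print(stopper.should_stop(0.70))  # True: more than 5% below 0.80
```

A training loop would call `should_stop` after each validation pass and break out of training when it returns `True`.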