Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval-Augmented Generation
Authors: Tobias Leemann, Periklis Petridis, Giuseppe Vietri, Dionysis Manousakas, Aaron Roth, Sergul Aydore
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run experiments with realistic datasets and baseline models to confirm the efficacy of Auto-GDA. Our results show that Auto-GDA is highly effective, improving the ROC-AUC scores of all tested models on all datasets. We present the main results obtained with Auto-GDA in Table 1. |
| Researcher Affiliation | Collaboration | Tobias Leemann University of Tübingen EMAIL; Periklis Petridis MIT EMAIL; Giuseppe Vietri AWS AI Labs EMAIL; Dionysis Manousakas AWS AI Labs EMAIL; Aaron Roth AWS AI Labs EMAIL; Sergül Aydöre AWS AI Labs EMAIL |
| Pseudocode | Yes | Algorithm 1 Automatic Generative Domain Adaptation (Auto-GDA) |
| Open Source Code | Yes | Code is available at https://github.com/amazon-science/Auto-GDA-Efficient-Grounding-Verification-in-RAG |
| Open Datasets | Yes | We evaluate our approach on three datasets for document-grounded summarization and question answering (QA). We select datasets that include documents, realistic LLM-generated long-form answers, and human labels that can be used for testing. The SummEdits dataset (Laban et al., 2023) contains GPT-3.5-generated and manual summaries of documents from different domains, e.g., judicial, sales emails, podcasts. We further use both the summary and the QA portion of the RAGTruth dataset (Niu et al., 2024). The RAGTruth dataset contains summaries and answers to questions created by LLMs (GPT-3.5/4, Mistral, Llama2). Finally, we use the LFQA-Verification dataset (Chen et al., 2023a), which retrieved documents for questions from the Explain Like I'm Five (ELI5) dataset and generated corresponding long-form answers with GPT-3.5 and Alpaca. Details and links to the datasets can be found in Appendix C.2. (Table 4: Dataset Train Val Test Link: ragtruth-Summary... https://github.com/ParticleMedia/RAGTruth; summedits... https://huggingface.co/datasets/Salesforce/summedits; lfqa-verification... https://github.com/timchen0618/LFQA-Verification/) |
| Dataset Splits | Yes | We either use the available train/test splits (RAGTruth) or create splits, making sure that summaries/answers derived from the same evidence appear only in the train split or only in the test split. The validation split is derived from the train split. The sizes and source links of the resulting datasets are provided in Table 4. Table 4: Dataset Train Val Test Link (e.g., ragtruth-Summary 2578 125 636) |
| Hardware Specification | Yes | Our experiments (including runtime) were run on a system with 16-core Intel(R) Xeon(R) CPU E5-2686 processors (2.30GHz) and a single Nvidia Tesla V100 GPU with 32GB of RAM. |
| Software Dependencies | No | The paper mentions using Hugging Face checkpoints for DeBERTa V2, BART-large, and FLAN-T5, Optuna for hyperparameter optimization, a sentence-t5-base model for embeddings, spaCy with the en_core_web_sm tokenizer, and a T5-based paraphrasing model. However, specific version numbers for these software components are not explicitly provided. |
| Experiment Setup | Yes | Finetuning: 1 epoch, learning rate 10⁻⁵ for DeBERTa and BART, 2·10⁻⁴ for FLAN-T5, batch size 2. We employ Optuna as a principled way of choosing the remaining hyperparameters λ_u, λ_d, and the teacher model used to estimate entailment probabilities for augmentations in Eqn. 1. We perform 50 trials per dataset and use the ROC-AUC score of a fine-tuned DeBERTa V2 model on the small validation dataset as the selection objective. In case limited budget for hyperparameter tuning is available, we recommend setting λ_u = λ_d ∈ [20, 50], which led to stable performance. Auto-GDA is run for two iterations on RAGTruth and one iteration on the other datasets, generating synthetic datasets between 1.3× and 2× the original dataset size. |
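
The split strategy quoted in the Dataset Splits row (all summaries/answers derived from the same evidence must land entirely in train or entirely in test) is a group-aware split. A minimal sketch in plain Python, assuming hypothetical record and field names (`records`, `evidence`) not taken from the paper's code:

```python
import random

def group_split(records, group_key, test_frac=0.2, seed=0):
    """Split records so that all items sharing a group key (e.g., the same
    evidence document) end up entirely in train or entirely in test."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test

# Toy example: two answers share evidence "doc1"; they must stay together.
data = [
    {"evidence": "doc1", "answer": "a"},
    {"evidence": "doc1", "answer": "b"},
    {"evidence": "doc2", "answer": "c"},
    {"evidence": "doc3", "answer": "d"},
]
train, test = group_split(data, "evidence", test_frac=0.34)
```

Splitting over unique evidence groups rather than individual records is what prevents near-duplicate answers to the same document from leaking across the train/test boundary.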
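
The hyperparameter selection described in the Experiment Setup row (50 Optuna trials per dataset, maximizing the validation ROC-AUC of a fine-tuned DeBERTa V2 over λ_u, λ_d and the teacher choice) follows a generic search-loop shape. Below is a self-contained stand-in using random search instead of Optuna, with a toy objective in place of the real fine-tune-and-evaluate step; ranges, helper names, and the objective are illustrative assumptions, except the reported stable default λ_u = λ_d ∈ [20, 50]:

```python
import random

def select_hyperparameters(objective, n_trials=50, seed=0):
    """Random-search stand-in for the paper's Optuna study: sample
    lambda_u, lambda_d, and a teacher model per trial, keep the best score.
    In the paper, objective = val ROC-AUC of a fine-tuned DeBERTa V2."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {
            "lambda_u": rng.uniform(1.0, 100.0),   # illustrative range
            "lambda_d": rng.uniform(1.0, 100.0),
            "teacher": rng.choice(["bart-nli", "flan-t5"]),  # hypothetical names
        }
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy objective peaking at lambda_u = lambda_d = 35, the middle of the
# paper's reported stable range [20, 50]; many trials used for the toy.
def toy_objective(p):
    return -abs(p["lambda_u"] - 35) - abs(p["lambda_d"] - 35)

params, score = select_hyperparameters(toy_objective, n_trials=500)
```

With the real objective, each trial would regenerate the synthetic data under the sampled λ_u, λ_d and teacher, fine-tune the verifier, and score it on the small validation split, which is why a modest trial budget (50) is used.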