Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval-Augmented Generation
Authors: Tobias Leemann, Periklis Petridis, Giuseppe Vietri, Dionysis Manousakas, Aaron Roth, Sergul Aydore
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run experiments with realistic datasets and baseline models to confirm the efficacy of Auto-GDA. Our results show that Auto-GDA is highly effective, improving the ROC-AUC scores of all tested models on all datasets. We present the main results obtained with Auto-GDA in Table 1. |
| Researcher Affiliation | Collaboration | Tobias Leemann University of Tübingen EMAIL; Periklis Petridis MIT EMAIL; Giuseppe Vietri AWS AI Labs EMAIL; Dionysis Manousakas AWS AI Labs EMAIL; Aaron Roth AWS AI Labs EMAIL; Sergül Aydöre AWS AI Labs EMAIL |
| Pseudocode | Yes | Algorithm 1 Automatic Generative Domain Adaptation (Auto-GDA) |
| Open Source Code | Yes | Code is available at https://github.com/amazon-science/Auto-GDA-Efficient-Grounding-Verification-in-RAG |
| Open Datasets | Yes | We evaluate our approach on three datasets for document-grounded summarization and question answering (QA). We select datasets that include documents, realistic LLM-generated long-form answers, and human labels that can be used for testing. The SummEdits dataset (Laban et al., 2023) contains GPT-3.5-generated and manual summaries of documents from different domains, e.g., judicial, sales emails, podcasts. We further use both the summary and the QA portion of the RAGTruth dataset (Niu et al., 2024). The RAGTruth dataset contains summaries and answers to questions created by LLMs (GPT-3.5/4, Mistral, Llama2). Finally, we use the LFQA-Verification dataset (Chen et al., 2023a), which retrieved documents for questions from the Explain Like I'm Five (ELI5) dataset and generated corresponding long-form answers with GPT-3.5 and Alpaca. Details and links to the datasets can be found in Appendix C.2. (Table 4: Dataset Train Val Test Link: ragtruth-Summary... https://github.com/ParticleMedia/RAGTruth; summedits... https://huggingface.co/datasets/Salesforce/summedits; lfqa-verification... https://github.com/timchen0618/LFQA-Verification/) |
| Dataset Splits | Yes | We either use the available train/test splits (RAGTruth) or create splits, making sure that summaries/answers derived from the same evidence appear only in the train split or only in the test split. The validation split is derived from the train split. The sizes and source links of the resulting datasets are provided in Table 4. Table 4: Dataset Train Val Test Link (e.g., ragtruth-Summary 2578 125 636) |
| Hardware Specification | Yes | Our experiments (including runtime) were run on a system with 16-core Intel(R) Xeon(R) CPU E5-2686 processors (2.30GHz) and a single Nvidia Tesla V100 GPU with 32GB of RAM. |
| Software Dependencies | No | The paper mentions using Hugging Face checkpoints for DeBERTa V2, BART-large, and FLAN-T5, Optuna for hyperparameter optimization, a sentence-t5-base model for embeddings, spaCy with the en_core_web_sm tokenizer, and a T5-based paraphrasing model. However, specific version numbers for these software components are not explicitly provided. |
| Experiment Setup | Yes | Finetuning: 1 epoch, learning rate 10⁻⁵ for DeBERTa and BART, 2·10⁻⁴ for FLAN-T5, batch size 2. We employ Optuna as a principled way of choosing the remaining hyperparameters λ_u, λ_d, and the teacher model used to estimate entailment probabilities for augmentations in Eqn. 1. We perform 50 trials per dataset and use the ROC-AUC score of a fine-tuned DeBERTa V2 model on the small validation dataset as the selection objective. In case limited budget for hyperparameter tuning is available, we recommend setting λ_u = λ_d ∈ [20, 50], which led to stable performance. Auto-GDA is run for two iterations on RAGTruth and one iteration on the other datasets, generating synthetic datasets between 1.3× and 2× the original dataset size. |
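
The split strategy quoted in the Dataset Splits row (all summaries/answers derived from the same evidence must land entirely in train or entirely in test) is a group-aware split. A minimal sketch in plain Python, assuming hypothetical record and field names (`records`, `evidence`) not taken from the paper's code:

```python
import random

def group_split(records, group_key, test_frac=0.2, seed=0):
    """Split records so that all items sharing a group key (e.g., the same
    evidence document) end up entirely in train or entirely in test."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test

# Toy example: two answers share evidence "doc1"; they must stay together.
data = [
    {"evidence": "doc1", "answer": "a"},
    {"evidence": "doc1", "answer": "b"},
    {"evidence": "doc2", "answer": "c"},
    {"evidence": "doc3", "answer": "d"},
]
train, test = group_split(data, "evidence", test_frac=0.34)
```

Splitting over unique evidence groups rather than individual records is what prevents near-duplicate answers to the same document from leaking across the train/test boundary.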
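
The hyperparameter selection described in the Experiment Setup row (50 Optuna trials per dataset, maximizing the validation ROC-AUC of a fine-tuned DeBERTa V2 over λ_u, λ_d and the teacher choice) follows a generic search-loop shape. Below is a self-contained stand-in using random search instead of Optuna, with a toy objective in place of the real fine-tune-and-evaluate step; ranges, helper names, and the objective are illustrative assumptions, except the reported stable default λ_u = λ_d ∈ [20, 50]:

```python
import random

def select_hyperparameters(objective, n_trials=50, seed=0):
    """Random-search stand-in for the paper's Optuna study: sample
    lambda_u, lambda_d, and a teacher model per trial, keep the best score.
    In the paper, objective = val ROC-AUC of a fine-tuned DeBERTa V2."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {
            "lambda_u": rng.uniform(1.0, 100.0),   # illustrative range
            "lambda_d": rng.uniform(1.0, 100.0),
            "teacher": rng.choice(["bart-nli", "flan-t5"]),  # hypothetical names
        }
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy objective peaking at lambda_u = lambda_d = 35, the middle of the
# paper's reported stable range [20, 50]; many trials used for the toy.
def toy_objective(p):
    return -abs(p["lambda_u"] - 35) - abs(p["lambda_d"] - 35)

params, score = select_hyperparameters(toy_objective, n_trials=500)
```

With the real objective, each trial would regenerate the synthetic data under the sampled λ_u, λ_d and teacher, fine-tune the verifier, and score it on the small validation split, which is why a modest trial budget (50) is used.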