SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Authors: Yung-Sung Chuang, Benjamin Cohen-Wang, Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James R. Glass, Shang-Wen Li, Wen-Tau Yih
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite. |
| Researcher Affiliation | Collaboration | 1Massachusetts Institute of Technology, Cambridge, MA 02139, USA 2Meta FAIR, USA. Correspondence to: Yung-Sung Chuang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SelfCite Best-of-N Sampling for Citations |
| Open Source Code | Yes | The source code is available at https://github.com/facebookresearch/SelfCite. |
| Open Datasets | Yes | We evaluate our approach on LongBench-Cite (Zhang et al., 2024), a comprehensive benchmark specifically designed for long-context QA with citations (LQAC). ... The benchmark contains five datasets, including single-doc QA MultiFieldQA-en/zh (Bai et al., 2023), multi-doc QA HotpotQA (Yang et al., 2018) and DuReader (He et al., 2018), one summarization dataset GovReport (Huang et al., 2021), and LongBench-Chat (Bai et al., 2024) which covers diverse real-world queries with long contexts such as document QA, summarization, and coding. |
| Dataset Splits | Yes | For preference optimization with SimPO (Section 2.4), we use 2K document-question pairs from LongCite-45k (Zhang et al., 2024) as the training set... We randomly sample 2K document and question pairs from the LongCite-45k data, generate the best-of-N responses with our Algorithm 1 to obtain the preference data, and train for one epoch. We sample another 100 examples as a development set to pick the best learning rate from {1e-7, 3e-7, 5e-7, 7e-7}. |
| Hardware Specification | Yes | We run all the finetuning experiments on 8 A100 GPUs with 80 GB memory on a single node. |
| Software Dependencies | No | The paper mentions software like Huggingface Transformers (Wolf et al., 2020), Liger-Kernel (Hsu et al., 2024), NLTK (Bird, 2006) and the NLI model google/t5_xxl_true_nli_mixture. However, specific version numbers for these software components are not explicitly provided within the text. |
| Experiment Setup | Yes | We keep other hyperparameters the same as the original SimPO (Meng et al., 2024)... The responses are generated via top-p sampling (Holtzman et al., 2020) with p=0.7 and temperature=0.95. We set p=0.9 and temperature=1.2 when doing best-of-N sampling for the citation strings to increase the diversity. We set N=10 in all the experiments... The batch size is set to 1 per GPU due to the long context examples. We set our max context length to 25600 to prevent OOM. |
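The experiment-setup row names the key sampling hyperparameters and the best-of-N selection of Algorithm 1. A minimal sketch of that selection step is below; the sampler and reward function here are toy stand-ins (the paper's actual reward comes from its own context-ablation scoring, not shown), and the configuration names are my own labels for the values quoted in the table.

```python
import random

# Sampling settings quoted in the table (variable names are assumptions):
RESPONSE_SAMPLING = dict(top_p=0.7, temperature=0.95)  # response generation
CITATION_SAMPLING = dict(top_p=0.9, temperature=1.2)   # citation candidates
N_CANDIDATES = 10                                      # best-of-N
MAX_CONTEXT_LEN = 25600                                # cap to prevent OOM

def best_of_n(sample_fn, reward_fn, n=N_CANDIDATES):
    """Selection step of a best-of-N scheme (cf. Algorithm 1):
    draw n candidate citation sets and keep the highest-reward one."""
    return max((sample_fn() for _ in range(n)), key=reward_fn)

# Toy demonstration: candidates are random subsets of 5 sentence indices;
# the stand-in reward prefers short citations that include sentence 2.
random.seed(0)
sample_fn = lambda: sorted(random.sample(range(5), k=random.randint(1, 3)))
reward_fn = lambda cite: (2 in cite) - 0.1 * len(cite)
best = best_of_n(sample_fn, reward_fn)
```

The two sampling configurations differ on purpose: the higher p and temperature for citation candidates increase diversity across the N draws, which is what makes best-of-N selection worthwhile.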