SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

Authors: Yung-Sung Chuang, Benjamin Cohen-Wang, Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James R. Glass, Shang-Wen Li, Wen-Tau Yih

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of SelfCite is demonstrated by increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite.
Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology, Cambridge, MA 02139, USA; 2 Meta FAIR, USA. Correspondence to: Yung-Sung Chuang <EMAIL>.
Pseudocode | Yes | Algorithm 1: SelfCite Best-of-N Sampling for Citations
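The row above refers to the paper's Algorithm 1, which samples N candidate citation sets and keeps the one with the highest self-supervised reward computed by context ablation. The sketch below is an illustration only, not the authors' implementation: the `log_prob` callable, the list-of-sentences context format, and the necessity/sufficiency reward split are assumptions made here for clarity.

```python
def selfcite_reward(log_prob, response, context, cited):
    """Toy context-ablation reward: a citation set is good if removing it
    hurts the response probability (necessity) and keeping only it
    preserves the response probability (sufficiency)."""
    # Necessity: probability drop when the cited sentences are ablated.
    ablated = [s for s in context if s not in cited]
    necessity = log_prob(response, context) - log_prob(response, ablated)
    # Sufficiency: probability held when only the cited sentences remain.
    kept = [s for s in context if s in cited]
    sufficiency = log_prob(response, kept) - log_prob(response, context)
    return necessity + sufficiency

def best_of_n_citations(log_prob, response, context, candidates):
    # Score every sampled citation set and keep the highest-reward one.
    return max(candidates,
               key=lambda c: selfcite_reward(log_prob, response, context, c))
```

In practice `log_prob` would be a forward pass of the LLM over the response conditioned on the (ablated) context; here it is left abstract so the selection logic stands alone.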
Open Source Code | Yes | The source code is available at https://github.com/facebookresearch/SelfCite.
Open Datasets | Yes | We evaluate our approach on LongBench-Cite (Zhang et al., 2024), a comprehensive benchmark specifically designed for long-context QA with citations (LQAC). ... The benchmark contains five datasets, including single-doc QA MultiFieldQA-en/zh (Bai et al., 2023), multi-doc QA HotpotQA (Yang et al., 2018) and DuReader (He et al., 2018), one summarization dataset GovReport (Huang et al., 2021), and LongBench-Chat (Bai et al., 2024), which covers diverse real-world queries with long contexts such as document QA, summarization, and coding.
Dataset Splits | Yes | For preference optimization with SimPO (Section 2.4), we use 2K document-question pairs from LongCite-45k (Zhang et al., 2024) as the training set... We randomly sample 2K document and question pairs from the LongCite-45k data, generate the best-of-N responses with our Algorithm 1 to obtain the preference data, and train for one epoch. We sample another 100 examples as a development set to pick the best learning rate from {1e-7, 3e-7, 5e-7, 7e-7}.
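The learning-rate selection described above is a standard grid search over a held-out development set. A minimal sketch, assuming a `train_and_eval` callable (hypothetical here) that runs one SimPO epoch at the given learning rate and returns the dev-set citation score:

```python
def pick_learning_rate(train_and_eval, grid=(1e-7, 3e-7, 5e-7, 7e-7)):
    """Train once per candidate learning rate and keep the one with
    the best development-set score (the paper uses 100 held-out examples)."""
    scores = {lr: train_and_eval(lr) for lr in grid}
    return max(scores, key=scores.get)
```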
Hardware Specification | Yes | We run all the fine-tuning experiments on 8 A100 GPUs with 80 GB of memory on a single node.
Software Dependencies | No | The paper mentions software like Huggingface Transformers (Wolf et al., 2020), Liger-Kernel (Hsu et al., 2024), NLTK (Bird, 2006), and the NLI model google/t5_xxl_true_nli_mixture. However, specific version numbers for these software components are not explicitly provided within the text.
Experiment Setup | Yes | We keep other hyperparameters the same as the original SimPO (Meng et al., 2024)... The responses are generated via top-p sampling (Holtzman et al., 2020) with p=0.7 and temperature=0.95. We set p=0.9 and temperature=1.2 when doing best-of-N sampling for the citation strings to increase the diversity. We set N=10 in all the experiments... The batch size is set to 1 per GPU due to the long context examples. We set our max context length to 25600 to prevent OOM.
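The setup quoted above relies on top-p (nucleus) sampling with two configurations: a conservative one for responses (p=0.7, temperature=0.95) and a more diverse one for citation candidates (p=0.9, temperature=1.2). As a toy reimplementation of the nucleus cutoff itself (not the Transformers internals), the filter keeps the smallest set of tokens whose cumulative probability reaches p after temperature scaling:

```python
import math

def top_p_filter(logits, p=0.7, temperature=0.95):
    """Return a renormalized distribution over the nucleus: the smallest
    set of highest-probability tokens whose cumulative mass >= p."""
    # Temperature-scaled softmax (shifted by the max logit for stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Accumulate tokens in descending probability order until mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

A higher temperature flattens the distribution and a higher p widens the nucleus, which is why the citation-sampling configuration (p=0.9, temperature=1.2) yields more diverse candidates for best-of-N selection.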