SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

Authors: Yung-Sung Chuang, Benjamin Cohen-Wang, Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James R. Glass, Shang-Wen Li, Wen-Tau Yih

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of SelfCite is demonstrated by increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite.
Researcher Affiliation | Collaboration | 1 Massachusetts Institute of Technology, Cambridge, MA 02139, USA; 2 Meta FAIR, USA. Correspondence to: Yung-Sung Chuang <EMAIL>.
Pseudocode | Yes | Algorithm 1: SelfCite Best-of-N Sampling for Citations
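The row above refers to the paper's Algorithm 1, which samples N candidate citation sets and keeps the one with the highest self-supervised reward computed by context ablation. The sketch below is an illustration only, not the authors' implementation: the `log_prob` callable, the list-of-sentences context format, and the necessity/sufficiency reward split are assumptions made here for clarity.

```python
def selfcite_reward(log_prob, response, context, cited):
    """Toy context-ablation reward: a citation set is good if removing it
    hurts the response probability (necessity) and keeping only it
    preserves the response probability (sufficiency)."""
    # Necessity: probability drop when the cited sentences are ablated.
    ablated = [s for s in context if s not in cited]
    necessity = log_prob(response, context) - log_prob(response, ablated)
    # Sufficiency: probability held when only the cited sentences remain.
    kept = [s for s in context if s in cited]
    sufficiency = log_prob(response, kept) - log_prob(response, context)
    return necessity + sufficiency

def best_of_n_citations(log_prob, response, context, candidates):
    # Score every sampled citation set and keep the highest-reward one.
    return max(candidates,
               key=lambda c: selfcite_reward(log_prob, response, context, c))
```

In practice `log_prob` would be a forward pass of the LLM over the response conditioned on the (ablated) context; here it is left abstract so the selection logic stands alone.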
Open Source Code | Yes | The source code is available at https://github.com/facebookresearch/SelfCite.
Open Datasets | Yes | We evaluate our approach on LongBench-Cite (Zhang et al., 2024), a comprehensive benchmark specifically designed for long-context QA with citations (LQAC). ... The benchmark contains five datasets, including single-doc QA MultiFieldQA-en/zh (Bai et al., 2023), multi-doc QA HotpotQA (Yang et al., 2018) and DuReader (He et al., 2018), one summarization dataset GovReport (Huang et al., 2021), and LongBench-Chat (Bai et al., 2024), which covers diverse real-world queries with long contexts such as document QA, summarization, and coding.
Dataset Splits | Yes | For preference optimization with SimPO (Section 2.4), we use 2K document-question pairs from LongCite-45k (Zhang et al., 2024) as the training set... We randomly sample 2K document and question pairs from the LongCite-45k data, generate the best-of-N responses with our Algorithm 1 to obtain the preference data, and train for one epoch. We sample another 100 examples as a development set to pick the best learning rate from {1e-7, 3e-7, 5e-7, 7e-7}.
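The learning-rate selection described above is a standard grid search over a held-out development set. A minimal sketch, assuming a `train_and_eval` callable (hypothetical here) that runs one SimPO epoch at the given learning rate and returns the dev-set citation score:

```python
def pick_learning_rate(train_and_eval, grid=(1e-7, 3e-7, 5e-7, 7e-7)):
    """Train once per candidate learning rate and keep the one with
    the best development-set score (the paper uses 100 held-out examples)."""
    scores = {lr: train_and_eval(lr) for lr in grid}
    return max(scores, key=scores.get)
```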
Hardware Specification | Yes | We run all the fine-tuning experiments on 8 A100 GPUs with 80 GB of memory on a single node.
Software Dependencies | No | The paper mentions software like Huggingface Transformers (Wolf et al., 2020), Liger-Kernel (Hsu et al., 2024), NLTK (Bird, 2006), and the NLI model google/t5_xxl_true_nli_mixture. However, specific version numbers for these software components are not explicitly provided within the text.
Experiment Setup | Yes | We keep other hyperparameters the same as the original SimPO (Meng et al., 2024)... The responses are generated via top-p sampling (Holtzman et al., 2020) with p=0.7 and temperature=0.95. We set p=0.9 and temperature=1.2 when doing best-of-N sampling for the citation strings to increase the diversity. We set N=10 in all the experiments... The batch size is set to 1 per GPU due to the long context examples. We set our max context length to 25600 to prevent OOM.
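The setup quoted above relies on top-p (nucleus) sampling with two configurations: a conservative one for responses (p=0.7, temperature=0.95) and a more diverse one for citation candidates (p=0.9, temperature=1.2). As a toy reimplementation of the nucleus cutoff itself (not the Transformers internals), the filter keeps the smallest set of tokens whose cumulative probability reaches p after temperature scaling:

```python
import math

def top_p_filter(logits, p=0.7, temperature=0.95):
    """Return a renormalized distribution over the nucleus: the smallest
    set of highest-probability tokens whose cumulative mass >= p."""
    # Temperature-scaled softmax (shifted by the max logit for stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Accumulate tokens in descending probability order until mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

A higher temperature flattens the distribution and a higher p widens the nucleus, which is why the citation-sampling configuration (p=0.9, temperature=1.2) yields more diverse candidates for best-of-N selection.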