LitLLMs, LLMs for Literature Review: Are we there yet?

Authors: Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H. Laradji, Krishnamurthy Dj Dvijotham, Jason Stanley, Laurent Charlin, Christopher Pal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into the smaller components of retrieval and planning. In particular, we find that combining keyword-based and document-embedding-based search improves retrieval precision and recall by 10% and 30%, respectively, compared to using either method in isolation.
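The hybrid-retrieval claim above can be illustrated with a minimal sketch. The paper IDs, relevance labels, and helper names below are hypothetical stand-ins, not data from the paper; the point is only that the union of two complementary candidate lists can recover relevant papers either method alone misses.

```python
def combine_results(keyword_hits, embedding_hits):
    """Union of two ranked candidate lists, keyword hits first, deduplicated."""
    seen, merged = set(), []
    for paper_id in keyword_hits + embedding_hits:
        if paper_id not in seen:
            seen.add(paper_id)
            merged.append(paper_id)
    return merged

def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall over retrieved paper IDs."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example (hypothetical IDs): each method alone misses a relevant paper.
keyword_hits = ["p1", "p2", "p5"]
embedding_hits = ["p3", "p1", "p6"]
relevant = {"p1", "p2", "p3"}

combined = combine_results(keyword_hits, embedding_hits)
print(precision_recall(combined, relevant))  # -> (0.6, 1.0)
```

Here keyword search alone gives recall 2/3 and embedding search alone 2/3, while their union reaches recall 1.0, mirroring the direction (though not the magnitude) of the reported gains.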
Researcher Affiliation | Collaboration | Shubham Agarwal*: ServiceNow Research, Mila - Quebec AI Institute, HEC Montréal; Gaurav Sahu*: ServiceNow Research, University of Waterloo; Abhay Puri*: ServiceNow Research; Issam H. Laradji: ServiceNow Research, University of British Columbia; Krishnamurthy (Dj) Dvijotham: ServiceNow Research; Jason Stanley: ServiceNow Research; Laurent Charlin: Mila - Quebec AI Institute, HEC Montréal, Canada CIFAR AI Chair; Christopher Pal: ServiceNow Research, Polytechnique Montréal, Mila - Quebec AI Institute, Canada CIFAR AI Chair
Pseudocode | Yes | Algorithm 1: Retrieval algorithm
Require: input abstract a
1: keywords = LLMKeywords(a)  // Generate keywords from the abstract using an LLM
2: candidate_papers = SearchEngine(keywords)  // Query a search engine to retrieve candidates
3: reranked_papers = LLMRerank(candidate_papers, a)  // LLM-based reranking of candidates
4: return reranked_papers
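Algorithm 1 can be sketched in Python. The LLM and search-engine calls below are stand-in stubs (the paper uses real LLM prompts and an academic search API), so every function body and the toy corpus here are hypothetical; only the three-step structure mirrors the pseudocode.

```python
def llm_keywords(abstract):
    # Stand-in for LLMKeywords: take the three longest distinct words.
    words = {w.strip(".,").lower() for w in abstract.split()}
    return sorted(words, key=len, reverse=True)[:3]

def search_engine(keywords):
    # Stand-in for SearchEngine: substring-match keywords against a toy corpus.
    corpus = {
        "p1": "large language models for literature review generation",
        "p2": "convolutional networks for image classification",
        "p3": "retrieval augmented generation with language models",
    }
    return [pid for pid, text in corpus.items()
            if any(k in text for k in keywords)]

def llm_rerank(candidates, abstract):
    # Stand-in for LLMRerank: a real system would prompt an LLM with the abstract.
    return sorted(candidates)

def retrieve(abstract):
    keywords = llm_keywords(abstract)        # Step 1: generate keywords
    candidates = search_engine(keywords)     # Step 2: query search engine
    return llm_rerank(candidates, abstract)  # Step 3: rerank candidates

print(retrieve("literature review generation with large language models"))
# -> ['p1', 'p3']
```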
Open Source Code | Yes | We release both our datasets and our code to the community. Our project page, including a demonstration system and toolkit, can be accessed at https://litllm.github.io. Code is available at https://github.com/LitLLM/litllms-for-literature-review-tmlr
Open Datasets | Yes | We release both our datasets and our code to the community. We create two datasets of papers posted on arXiv in August and December 2023, respectively, starting with 1,000 papers from each month, and use the arXiv wrapper in Python to build these Rolling Eval datasets. We also use the Multi-XScience corpus (Lu et al., 2020) for our experiments.
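The rolling-evaluation construction can be sketched as a month filter over paper metadata. In practice the paper queries arXiv via its Python wrapper; the records below are hypothetical stand-ins for the real API results.

```python
from datetime import date

def rolling_eval_subset(papers, year, month, limit=1000):
    """Keep up to `limit` papers posted in the given year and month."""
    selected = [p for p in papers
                if p["posted"].year == year and p["posted"].month == month]
    return selected[:limit]

# Toy metadata standing in for arXiv API results.
papers = [
    {"id": "2308.00001", "posted": date(2023, 8, 3)},
    {"id": "2312.00002", "posted": date(2023, 12, 14)},
    {"id": "2308.00003", "posted": date(2023, 8, 21)},
]

rolling_eval_aug = rolling_eval_subset(papers, 2023, 8)
print([p["id"] for p in rolling_eval_aug])  # -> ['2308.00001', '2308.00003']
```

The same call with `month=12` would yield the December subset; the `limit` parameter caps each split at the paper's 1,000 examples per month.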
Dataset Splits | No | The paper creates new datasets (Rolling Eval-Aug and Rolling Eval-Dec) and uses the Multi-XScience dataset. It mentions using these for "extensive retrieval and literature review generation experiments" and discusses "test set contamination in zero-shot evaluations". While it describes the creation of a "test corpus" and a "subset of 1,000 examples (Rolling Eval-Aug)", it does not provide specific percentages or sample counts for training, validation, or test splits within its experimental framework.
Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing instance types.
Software Dependencies | No | The paper mentions using Hugging Face Transformers, PyTorch, Hugging Face's evaluate library, spaCy (with the en_core_web_sm model), Anyscale endpoints, and the OpenAI API, but does not provide specific version numbers for any of these software components or libraries.
Experiment Setup | Yes | We find a slight improvement when fine-tuning the Llama 2 7B model for 30k steps with a learning rate of 5e-6 over the 0-shot model (see Table 8), but it quickly overfits as we increase the learning rate or the number of steps.
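The reported setup can be summarized as a minimal configuration sketch. Only the learning rate (5e-6) and step count (30k) come from the paper; the checkpoint name, batch size, and the toy guard function are assumed placeholders.

```python
# Hypothetical fine-tuning configuration; only learning_rate and max_steps
# are reported in the paper, the rest are illustrative placeholders.
finetune_config = {
    "model_name": "meta-llama/Llama-2-7b-hf",  # assumed checkpoint identifier
    "learning_rate": 5e-6,   # reported LR; larger values overfit quickly
    "max_steps": 30_000,     # reported number of steps
    "per_device_train_batch_size": 4,  # placeholder value
}

def overfit_prone(config):
    """Toy guard mirroring the observation that a larger LR or more steps overfits."""
    return config["learning_rate"] > 5e-6 or config["max_steps"] > 30_000

print(overfit_prone(finetune_config))  # -> False for the reported setting
```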