LitLLMs, LLMs for Literature Review: Are we there yet?
Authors: Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H. Laradji, Krishnamurthy Dj Dvijotham, Jason Stanley, Laurent Charlin, Christopher Pal
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Particularly, we find that combining keyword-based and document-embedding-based search improves precision and recall during retrieval by 10% and 30%, respectively, compared to using either of the methods in isolation. |
| Researcher Affiliation | Collaboration | Shubham Agarwal * ServiceNow Research, Mila - Quebec AI Institute, HEC Montreal Gaurav Sahu * ServiceNow Research, University of Waterloo Abhay Puri * ServiceNow Research Issam H. Laradji ServiceNow Research, University of British Columbia Krishnamurthy DJ Dvijotham ServiceNow Research Jason Stanley ServiceNow Research Laurent Charlin Mila - Quebec AI Institute, HEC Montreal, Canada CIFAR AI Chair Christopher Pal ServiceNow Research, Polytechnique Montreal, Mila - Quebec AI Institute, Canada CIFAR AI Chair |
| Pseudocode | Yes | Algorithm 1 Retrieval algorithm Require: Input abstract a 1: keywords = LLMKeywords(a); // Generate keywords from the abstract using an LLM 2: candidate_papers = SearchEngine(keywords); // Query a search engine to retrieve candidates 3: reranked_papers = LLMRerank(candidate_papers, a); // LLM-based reranking of candidates 4: return reranked_papers |
| Open Source Code | Yes | We release both our datasets and our code to the community. Our project page including a demonstration system and toolkit can be accessed here: https://litllm.github.io. Code can be accessed at https://github.com/LitLLM/litllms-for-literature-review-tmlr |
| Open Datasets | Yes | We release both our datasets and our code to the community. We create two datasets that contain papers posted on arXiv in August and December 2023, respectively, starting with 1,000 papers from each month. We use the arXiv wrapper in Python to create Rolling Eval datasets. We use the Multi-XScience corpus (Lu et al., 2020) for our experiments. |
| Dataset Splits | No | The paper creates new datasets (Rolling Eval-Aug and Rolling Eval-Dec) and utilizes the Multi-XScience dataset. It mentions using these for 'extensive retrieval and literature review generation experiments' and discusses 'test set contamination in zero-shot evaluations'. While it describes the creation of a 'test corpus' and a 'subset of 1,000 examples (Rolling Eval-Aug)', it does not provide specific percentages or sample counts for training, validation, or testing splits of these datasets within their experimental framework. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions using 'Hugging Face Transformers', 'PyTorch', 'Hugging Face's evaluate library', 'spaCy' (with 'en_core_web_sm model'), 'Anyscale endpoints', and 'OpenAI API' but does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We find a slight improvement when fine-tuning the Llama 2 7B model for 30k steps with an LR of 5e-6 over 0-shot model (see Table 8), but it quickly overfits as we increase the LR or the number of steps. |
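The three-step retrieval pipeline in Algorithm 1 (keyword generation, search, LLM reranking) can be sketched in Python. This is a hypothetical illustration, not the authors' code: the LLM and search-engine calls are replaced with simple deterministic stand-ins (frequency-based keywords, substring matching, word-overlap reranking) so the control flow is runnable end to end.

```python
# Sketch of Algorithm 1 (retrieval). LLMKeywords, SearchEngine, and
# LLMRerank are mocked with deterministic stand-ins for illustration;
# the real system would call an LLM API and an academic search engine.

def llm_keywords(abstract, k=3):
    """Stand-in for LLMKeywords: return the k most frequent words
    longer than 4 characters (ties broken alphabetically)."""
    words = [w.lower().strip(".,") for w in abstract.split() if len(w) > 4]
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return sorted(counts, key=lambda w: (-counts[w], w))[:k]

def search_engine(keywords, corpus):
    """Stand-in for SearchEngine: keep papers whose abstract
    contains any of the query keywords."""
    return [p for p in corpus
            if any(kw in p["abstract"].lower() for kw in keywords)]

def llm_rerank(candidates, abstract):
    """Stand-in for LLMRerank: order candidates by word overlap
    with the input abstract (highest overlap first)."""
    query = set(abstract.lower().split())
    def overlap(p):
        return len(query & set(p["abstract"].lower().split()))
    return sorted(candidates, key=overlap, reverse=True)

def retrieve(abstract, corpus):
    """Algorithm 1: keywords -> candidate search -> reranking."""
    keywords = llm_keywords(abstract)
    candidates = search_engine(keywords, corpus)
    return llm_rerank(candidates, abstract)
```

In the paper's actual setup, the keyword and reranking steps are LLM prompts, and the search step queries an external engine; combining this keyword-based search with document-embedding-based search is what yields the reported precision and recall gains.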