Augmented Language Models: a Survey
Authors: Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, Thomas Scialom
TMLR 2023 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. ... In this work, after reviewing current advances in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues. ... Table 1: Evaluation of different reasoning methods on GSM8K, a popular reasoning benchmark. FT denotes fine-tuning and CoT denotes chain-of-thought. The reported accuracies are based on [1]: (Wei et al., 2022c); [2]: (Cobbe et al., 2021); [3]: (Chowdhery et al., 2022); and [4]: (Gao et al., 2022). |
| Researcher Affiliation | Industry | Grégoire Mialon EMAIL Roberto Dessì EMAIL Maria Lomeli EMAIL Christoforos Nalmpantis EMAIL Ram Pasunuru EMAIL Roberta Raileanu EMAIL Baptiste Rozière EMAIL Timo Schick EMAIL Jane Dwivedi-Yu EMAIL Asli Celikyilmaz EMAIL Edouard Grave EMAIL Yann LeCun EMAIL Thomas Scialom EMAIL Meta AI Universitat Pompeu Fabra |
| Pseudocode | No | The paper is a survey and does not present a new algorithm or method that would require pseudocode. Figures 4 and 6 show snippets of Python code as examples from other papers being reviewed, not as pseudocode for this survey's own methodology. |
| Open Source Code | No | This paper is a survey of existing works and does not describe a novel methodology requiring its own source code release. There is no statement about the release of code for this survey paper. |
| Open Datasets | Yes | Using few-shot CoT prompting, Minerva (Lewkowycz et al., 2022) achieves excellent performance on math benchmarks such as GSM8K (Cobbe et al., 2021). ... Wei et al. (2022b) demonstrate that LLMs become able to perform some BIG-bench tasks3 via few-shot prompting once a certain scale is attained. 3https://github.com/google/BIG-bench |
| Dataset Splits | No | The paper is a survey and describes various research methodologies, including few-shot and zero-shot settings, which concern how models are evaluated on data. However, it does not provide specific dataset split details (e.g., percentages, counts, or explicit standard splits) for any of the datasets discussed, as it reviews other works rather than conducting its own experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to conduct its own research or analysis. It mentions 'software and hardware innovations' in a general context but no specific models or configurations. |
| Software Dependencies | No | The paper is a survey and does not present an implementation that would require specific software dependencies with version numbers. While it mentions tools like 'python interpreter' and 'faiss' in the context of the reviewed works, it does not specify software dependencies for its own methodology. |
| Experiment Setup | No | The paper is a survey and discusses experimental setups and training procedures of various research papers it reviews, such as 'fine-tuning with behavior cloning' or 'RLHF'. However, it does not provide specific experimental setup details (e.g., hyperparameters, training configurations) for its own research or analysis, as it is a review paper. |