Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Contrastive Search Is What You Need For Neural Text Generation
Authors: Yixuan Su, Nigel Collier
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we first answer the question: Are autoregressive LMs really anisotropic? To this end, we extensively evaluate the isotropy of LMs across 16 languages. Surprisingly, we find that the anisotropic problem only exists in the two specific English GPT2-small/medium models. On the other hand, all other evaluated LMs are isotropic which is in contrast to the conclusion drawn by previous studies (Ethayarajh, 2019; Su et al., 2022b). Based on our findings, we further assess the contrastive search decoding method using off-the-shelf LMs on four generation tasks across 16 languages. Our experimental results demonstrate that contrastive search significantly outperforms previous decoding methods without any additional training. More notably, on 12 out of the 16 evaluated languages, contrastive search performs comparably with human-level performances as judged by human evaluations. |
| Researcher Affiliation | Academia | Yixuan Su (Language Technology Lab, University of Cambridge); Nigel Collier (Language Technology Lab, University of Cambridge) |
| Pseudocode | No | The paper includes mathematical equations for isotropy and contrastive search (e.g., Eq. 1, 2, 3, 5, 6, 7, 8) and descriptions of methods, but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and other related resources are publicly available at https://github.com/yxuansu/Contrastive_Search_Is_What_You_Need. |
| Open Datasets | Yes | To measure the isotropy of LMs from different languages, we use the WIT dataset (Srinivasan et al., 2021) as our text corpus D (see Eq. (2)). We conduct experiments on the WIT dataset (Srinivasan et al., 2021) which consists of general-domain text collected from Wikipedia across 108 languages. We use the widely-used XSum dataset (Narayan et al., 2018) as our test bed which consists of news articles collected from BBC along with the corresponding one-sentence summaries. Following previous studies (Chen et al., 2021; Nijkamp et al., 2022), we use the HumanEval dataset (Chen et al., 2021) as our testbed. Lastly, we conduct experiments on the machine translation task using the IWSLT14 De-En dataset. |
| Dataset Splits | Yes | Following a previous study (Holtzman et al., 2020), we use the large version of GPT-2 (Radford et al., 2019) (i.e. GPT-2-large) to generate texts conditioned on the initial paragraph (restricted to 40 tokens) of documents from the held-out set of WebText (Radford et al., 2019). For each evaluated language, we use the LMs to generate text conditioned on the prefix (restricted to 16 tokens) from the test set of WIT. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the Huggingface library (Wolf et al., 2019) and, implicitly, Python, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We compare various decoding strategies, including (1) greedy search; (2) beam search (b = 4); (3) typical sampling (τ = 0.95); (4) top-k sampling (k = 50); (5) nucleus sampling (p = 0.95); and (6) contrastive search (k = 5, α = 0.6). Specifically, the generation of text ends upon reaching an end-of-document token or a maximum length of 200 tokens. For each evaluated language, we use the LMs to generate text conditioned on the prefix (restricted to 16 tokens) from the test set of WIT. The generation of text ends upon reaching an end-of-document token or a maximum length of 64 tokens. Table 11 presents the details of (i) our evaluated languages; (ii) the link address of assessed LMs; and (iii) the hyperparameters (i.e. k and α) used in contrastive search for our experiments in multilingual open-ended text generation. |
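The Research Type row above quotes the paper's isotropy evaluation across 16 languages (Eq. (2) in the paper), but the formula itself does not appear in this report. As a rough illustration only, assuming isotropy is scored as one minus the average pairwise cosine similarity of token representations (the convention used in Su et al., 2022b; the function name here is our own), a minimal sketch:

```python
import numpy as np

def isotropy(reprs):
    """Score isotropy of a set of token representations as one minus the
    average pairwise cosine similarity (excluding self-similarity).
    Values near 1.0 indicate an isotropic representation space."""
    # L2-normalize rows so that dot products equal cosine similarities.
    h = reprs / np.linalg.norm(reprs, axis=1, keepdims=True)
    sim = h @ h.T
    n = len(h)
    # Average over off-diagonal entries only.
    off_diag_mean = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return 1.0 - off_diag_mean
```

Identical representations (a fully anisotropic, collapsed space) score 0.0; mutually orthogonal representations score 1.0.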
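The Experiment Setup row lists contrastive search with k = 5 and α = 0.6 but does not restate its selection rule. As a hedged sketch (not the authors' implementation; function and argument names are illustrative), assuming the standard contrastive search rule of balancing model confidence against a degeneration penalty, one decoding step can be written as:

```python
import numpy as np

def contrastive_search_step(probs, candidate_reprs, context_reprs, k=5, alpha=0.6):
    """Pick the next token as argmax over the top-k candidates of
    (1 - alpha) * p(v | context) - alpha * max cosine similarity between
    the candidate's representation and those of the previous tokens."""
    top_k = np.argsort(probs)[::-1][:k]
    # Normalize context representations so dot products are cosine similarities.
    ctx = context_reprs / np.linalg.norm(context_reprs, axis=1, keepdims=True)
    best_token, best_score = None, -np.inf
    for v in top_k:
        h = candidate_reprs[v] / np.linalg.norm(candidate_reprs[v])
        penalty = np.max(ctx @ h)  # similarity to the most similar context token
        score = (1 - alpha) * probs[v] - alpha * penalty
        if score > best_score:
            best_token, best_score = v, score
    return best_token, best_score
```

With α = 0 this reduces to greedy search over the top-k candidates; larger α penalizes tokens whose representations closely repeat the context, which is what discourages degenerate repetition.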