Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Contrastive Search Is What You Need For Neural Text Generation
Authors: Yixuan Su, Nigel Collier
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we first answer the question: Are autoregressive LMs really anisotropic? To this end, we extensively evaluate the isotropy of LMs across 16 languages. Surprisingly, we find that the anisotropic problem only exists in the two specific English GPT2-small/medium models. On the other hand, all other evaluated LMs are isotropic which is in contrast to the conclusion drawn by previous studies (Ethayarajh, 2019; Su et al., 2022b). Based on our findings, we further assess the contrastive search decoding method using off-the-shelf LMs on four generation tasks across 16 languages. Our experimental results demonstrate that contrastive search significantly outperforms previous decoding methods without any additional training. More notably, on 12 out of the 16 evaluated languages, contrastive search performs comparably with human-level performances as judged by human evaluations. |
| Researcher Affiliation | Academia | Yixuan Su (Language Technology Lab, University of Cambridge); Nigel Collier (Language Technology Lab, University of Cambridge) |
| Pseudocode | No | The paper includes mathematical equations for isotropy and contrastive search (e.g., Eq. 1, 2, 3, 5, 6, 7, 8) and descriptions of methods, but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and other related resources are publicly available at https://github.com/yxuansu/Contrastive_Search_Is_What_You_Need. |
| Open Datasets | Yes | To measure the isotropy of LMs from different languages, we use the WIT dataset (Srinivasan et al., 2021) as our text corpus D (see Eq. (2)). We conduct experiments on the WIT dataset (Srinivasan et al., 2021) which consists of general-domain text collected from Wikipedia across 108 languages. We use the widely-used XSum dataset (Narayan et al., 2018) as our test bed which consists of news articles collected from BBC along with the corresponding one-sentence summaries. Following previous studies (Chen et al., 2021; Nijkamp et al., 2022), we use the HumanEval dataset (Chen et al., 2021) as our testbed. Lastly, we conduct experiments on the machine translation task using the IWSLT14 De-En dataset. |
| Dataset Splits | Yes | Following a previous study (Holtzman et al., 2020), we use the large version of GPT-2 (Radford et al., 2019) (i.e. GPT-2-large) to generate texts conditioned on the initial paragraph (restricted to 40 tokens) of documents from the held-out set of WebText (Radford et al., 2019). For each evaluated language, we use the LMs to generate text conditioned on the prefix (restricted to 16 tokens) from the test set of WIT. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the Huggingface library (Wolf et al., 2019) and, implicitly, Python, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We compare various decoding strategies, including (1) greedy search; (2) beam search (b = 4); (3) typical sampling (τ = 0.95); (4) top-k sampling (k = 50); (5) nucleus sampling (p = 0.95); and (6) contrastive search (k = 5, α = 0.6). Specifically, the generation of text ends upon reaching an end-of-document token or a maximum length of 200 tokens. For each evaluated language, we use the LMs to generate text conditioned on the prefix (restricted to 16 tokens) from the test set of WIT. The generation of text ends upon reaching an end-of-document token or a maximum length of 64 tokens. Table 11 presents the details of (i) our evaluated languages; (ii) the link address of assessed LMs; and (iii) the hyperparameters (i.e. k and α) used in contrastive search for our experiments in multilingual open-ended text generation. |
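The Research Type row above quotes the paper's isotropy evaluation across 16 languages (Eq. (2) in the paper), but the formula itself does not appear in this report. As a rough illustration only, assuming isotropy is scored as one minus the average pairwise cosine similarity of token representations (the convention used in Su et al., 2022b; the function name here is our own), a minimal sketch:

```python
import numpy as np

def isotropy(reprs):
    """Score isotropy of a set of token representations as one minus the
    average pairwise cosine similarity (excluding self-similarity).
    Values near 1.0 indicate an isotropic representation space."""
    # L2-normalize rows so that dot products equal cosine similarities.
    h = reprs / np.linalg.norm(reprs, axis=1, keepdims=True)
    sim = h @ h.T
    n = len(h)
    # Average over off-diagonal entries only.
    off_diag_mean = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return 1.0 - off_diag_mean
```

Identical representations (a fully anisotropic, collapsed space) score 0.0; mutually orthogonal representations score 1.0.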
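The Experiment Setup row lists contrastive search with k = 5 and α = 0.6 but does not restate its selection rule. As a hedged sketch (not the authors' implementation; function and argument names are illustrative), assuming the standard contrastive search rule of balancing model confidence against a degeneration penalty, one decoding step can be written as:

```python
import numpy as np

def contrastive_search_step(probs, candidate_reprs, context_reprs, k=5, alpha=0.6):
    """Pick the next token as argmax over the top-k candidates of
    (1 - alpha) * p(v | context) - alpha * max cosine similarity between
    the candidate's representation and those of the previous tokens."""
    top_k = np.argsort(probs)[::-1][:k]
    # Normalize context representations so dot products are cosine similarities.
    ctx = context_reprs / np.linalg.norm(context_reprs, axis=1, keepdims=True)
    best_token, best_score = None, -np.inf
    for v in top_k:
        h = candidate_reprs[v] / np.linalg.norm(candidate_reprs[v])
        penalty = np.max(ctx @ h)  # similarity to the most similar context token
        score = (1 - alpha) * probs[v] - alpha * penalty
        if score > best_score:
            best_token, best_score = v, score
    return best_token, best_score
```

With α = 0 this reduces to greedy search over the top-k candidates; larger α penalizes tokens whose representations closely repeat the context, which is what discourages degenerate repetition.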