Generative Monoculture in Large Language Models
Authors: Fan Wu, Emily Black, Varun Chandrasekaran
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate the prevalence of generative monoculture through analysis of book review and code generation tasks, and find that simple countermeasures such as altering sampling or prompting strategies are insufficient to mitigate the behavior. |
| Researcher Affiliation | Academia | Fan Wu (1), Emily Black (2), Varun Chandrasekaran (1); (1) University of Illinois Urbana-Champaign, (2) New York University. Equal advising. EMAIL, EMAIL |
| Pseudocode | No | The paper defines concepts and describes methods in natural language and figures, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open source our code at https://github.com/GeMoLLM/GeMO. |
| Open Datasets | Yes | For D_src, we use the Goodreads dataset (Wan et al., 2019), which contains multiple books with several reviews each. ... For D_src, we chose the Code Contests dataset (Li et al., 2022), a competitive programming problem dataset where each problem comes with multiple correct and incorrect solutions. |
| Dataset Splits | Yes | For D_src, we use the Goodreads dataset (Wan et al., 2019)... and craft a final dataset of N = 742 books with English titles, and ∀i, n_i = 10 reviews per book... For D_src, we chose the Code Contests dataset (Li et al., 2022)... For each problem in the subset, we randomly sampled ∀i, n_i = 20 correct solutions from all of the n_i^correct solutions for that problem. |
| Hardware Specification | Yes | The book review generation (N = 742, n = 10, max new tokens=500) on open-source models took around 10 hours on one H100 card per run, i.e., per combination of sampling parameters (T and p) and prompts. |
| Software Dependencies | No | The paper mentions various tools and models such as Hugging Face sentiment classifier, BERTopic, NLTK library, and COPYDETECT, but it does not specify concrete version numbers for these software components, nor for the programming language used. |
| Experiment Setup | Yes | We performed nucleus sampling (Holtzman et al., 2019) with various sampling parameters: (a) temperature T ∈ {0.5, 0.8, 1.0, 1.2, 1.5}, and (b) top-p p ∈ {0.90, 0.95, 0.98, 1.00}. We also experimented with two candidates for P_task: prompt (1) "Write a personalized review of the book titled {title}:", and prompt (2) "Write a book review for the book titled {title} as if you are {person}:". |
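The experiment setup row describes a grid over nucleus-sampling parameters and two prompt templates. A minimal sketch of that run-configuration grid is below; the helper names (`run_configs`, `PROMPTS`) are illustrative assumptions, not identifiers from the paper's released code:

```python
from itertools import product

# Sampling grid reported in the paper (nucleus sampling).
TEMPERATURES = [0.5, 0.8, 1.0, 1.2, 1.5]
TOP_PS = [0.90, 0.95, 0.98, 1.00]

# The two candidate task prompts; {title} / {person} are filled in per book.
PROMPTS = [
    "Write a personalized review of the book titled {title}:",
    "Write a book review for the book titled {title} as if you are {person}:",
]

def run_configs():
    """Yield one generation config per (temperature, top_p, prompt) combination."""
    for (t, p), template in product(product(TEMPERATURES, TOP_PS), PROMPTS):
        yield {
            "temperature": t,
            "top_p": p,
            "prompt_template": template,
            "max_new_tokens": 500,  # matches the reported generation length
        }

configs = list(run_configs())
print(len(configs))  # 5 temperatures x 4 top-p values x 2 prompts = 40 runs
```

Per the hardware row, each such (T, p, prompt) combination corresponds to roughly one 10-hour H100 run for the book-review task.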