Bridging the Data Provenance Gap Across Text, Speech, and Video
Authors: Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Naana Obeng-Marnu, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad Alghamdi, Minh Chien Vu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester James V. Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella R Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities, covering popular text, speech, and video datasets, from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024... We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem-level... |
| Researcher Affiliation | Collaboration | This research was conducted by the Data Provenance Initiative, a collective of independent and academic researchers volunteering their time to data transparency projects. The Data Provenance Initiative is supported by the Mozilla Data Futures Lab Infrastructure Fund. |
| Pseudocode | No | The paper describes its methodology in prose (e.g., "Annotation Features & Methodology", "Scope & Dataset Selection") but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video. All annotations and analysis code will be made publicly available on release. |
| Open Datasets | Yes | Our manual analysis covers nearly 4000 public datasets between 1990-2024... As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit... All datasets are described, linked, and attributed in Appendix D. |
| Dataset Splits | No | The paper conducts an audit and analysis of datasets rather than a machine learning experiment, so training/validation/test splits are neither provided nor applicable to the methodology described. |
| Hardware Specification | No | The paper describes a large-scale manual audit and data analysis. It does not mention any specific hardware (e.g., GPU/CPU models, memory specifications) used for conducting this research. |
| Software Dependencies | No | The paper mentions that "All annotations and analysis code will be made publicly available on release" but does not specify any particular software or library dependencies with version numbers used for their analysis. |
| Experiment Setup | No | The paper details a methodological approach involving manual audit and data analysis by domain experts. It does not describe an experimental setup with hyperparameters, training configurations, or model-specific settings typical of machine learning experiments. |