Bridging the Data Provenance Gap Across Text, Speech, and Video
Authors: Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Naana Obeng-Marnu, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad Alghamdi, Minh Chien Vu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester James V. Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella R Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities, covering popular text, speech, and video datasets, from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024... We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem-level... |
| Researcher Affiliation | Collaboration | This research was conducted by the Data Provenance Initiative, a collective of independent and academic researchers volunteering their time to data transparency projects. The Data Provenance Initiative is supported by the Mozilla Data Futures Lab Infrastructure Fund. |
| Pseudocode | No | The paper describes its methodology in prose (e.g., "Annotation Features & Methodology", "Scope & Dataset Selection") but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video. All annotations and analysis code will be made publicly available on release. |
| Open Datasets | Yes | Our manual analysis covers nearly 4000 public datasets between 1990-2024... As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit... All datasets are described, linked, and attributed in Appendix D. |
| Dataset Splits | No | The paper conducts an audit and analysis of datasets rather than a machine learning experiment, so training/validation/test splits are neither provided nor applicable to the methodology described. |
| Hardware Specification | No | The paper describes a large-scale manual audit and data analysis. It does not mention any specific hardware (e.g., GPU/CPU models, memory specifications) used for conducting this research. |
| Software Dependencies | No | The paper mentions that "All annotations and analysis code will be made publicly available on release" but does not specify any particular software or library dependencies with version numbers used for their analysis. |
| Experiment Setup | No | The paper details a methodological approach involving manual audit and data analysis by domain experts. It does not describe an experimental setup with hyperparameters, training configurations, or model-specific settings typical of machine learning experiments. |